Case Study: How a 7-Second Disk Latency Caused a Database Outage (Patroni & etcd)
In a High Availability (HA) architecture, the stability of the entire system often depends on the most fundamental layer: disk I/O performance. We recently investigated an incident in which disk write (fdatasync) latencies on our etcd nodes spiked as high as 7 seconds, leading to a "brain death" of the entire database cluster. Here is a technical deep dive into that outage.

The Incident: 120 Seconds of Disk Stalls
Monitoring logs revealed that between 05:26 and 05:28, two etcd nodes (etcd1 and etcd2) began reporting a critical warning: “slow fdatasync”.
Under normal conditions, a durable disk write (fdatasync) completes in milliseconds; etcd's own tuning guidance recommends keeping the 99th percentile of fdatasync duration below 10 ms. During this window, however, latencies surged to between 5 and 15 seconds, roughly three orders of magnitude over that budget.
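A quick way to sanity-check a disk against that budget is to time fdatasync directly. The following is a minimal sketch, not the tooling we used during the incident; the path, block size, and iteration count are arbitrary choices:

```python
import os
import time

# Minimal fdatasync latency probe: append a small block, force it to disk,
# and time each sync. Run it on the volume that will hold etcd's WAL.
PATH = "/var/lib/etcd/fsync-probe.bin"  # hypothetical path, adjust as needed
BLOCK = b"\0" * 2048                    # roughly the size of a small WAL entry
ITERATIONS = 100

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
latencies = []
try:
    for _ in range(ITERATIONS):
        os.write(fd, BLOCK)
        start = time.perf_counter()
        os.fdatasync(fd)                # the same durability syscall etcd relies on
        latencies.append(time.perf_counter() - start)
finally:
    os.close(fd)
    os.remove(PATH)

latencies.sort()
p99 = latencies[int(len(latencies) * 0.99) - 1]
print(f"max={latencies[-1] * 1000:.2f} ms  p99={p99 * 1000:.2f} ms")
```

On a healthy local SSD this typically prints single-digit milliseconds; values in the hundreds of milliseconds or above reproduce the class of stall described here.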
Why did the outage occur?
- Loss of Quorum: etcd ran as a three-node cluster, which needs a majority of two healthy members to commit writes. Because the latency hit two nodes simultaneously, the cluster lost its quorum.
- Service Unavailability: Unable to persist writes or reach internal consensus, etcd stopped responding to external requests, including the leader-lock renewal requests from services like Patroni (illustrated in the sketch after this list).
- Cascading Effect: Upper-layer services such as Patroni, unable to reach etcd, demoted themselves to protect the integrity of the cluster. The safety mechanism worked as designed, but the result was a total, system-wide database outage.
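Patroni's leader election boils down to a TTL key in etcd that the primary must keep refreshing. The loop below is an illustrative sketch of that contract, not Patroni's actual code: the `dcs` client and `demote` callback are hypothetical stand-ins, while the constants mirror Patroni's default ttl (30 s) and loop_wait (10 s) settings.

```python
import time

TTL = 30        # Patroni's default leader-key TTL (seconds)
LOOP_WAIT = 10  # Patroni's default loop_wait between refresh attempts

def leader_loop(dcs, demote):
    """Sketch of a DCS-backed leader lock (hypothetical client interface)."""
    while True:
        try:
            # Refreshing the lease is a write: etcd must fdatasync it and
            # replicate it to a majority. With two of three members stalled
            # on disk, this call hangs or fails.
            dcs.refresh_lease("/service/cluster/leader", ttl=TTL)
        except Exception:
            # The primary can no longer prove it still holds the lock before
            # the TTL expires, so it demotes itself rather than risk running
            # two primaries at once (split-brain).
            demote()
            return
        time.sleep(LOOP_WAIT)
```

This is why the demotions during our incident were not a malfunction: once etcd is unreachable, self-demotion is the only safe move the primary has.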
Root Cause Analysis: Architectural Vulnerabilities
This was not a software bug; it was a fundamental infrastructure architecture failure. Our investigation highlighted three critical points:
- Co-location on Physical Hardware: Two of the three etcd nodes were running on the same physical ESXi host. A disk I/O bottleneck at the host level therefore paralyzed two-thirds of the cluster at once.
- The Thin Provisioning Trap: The disks were Thin Provisioned (space is allocated on demand). This defeats the deterministic write performance that etcd's WAL (Write-Ahead Log) requires: the 7-second hangs occurred precisely when the storage layer struggled to allocate new blocks under heavy writes. The sketch after this list demonstrates the effect at the filesystem level.
- Resource Contention and DRS: Automatic DRS-triggered VM migrations (vMotion) and unpredictable load on shared storage clusters created an environment unsuitable for a latency-sensitive system like etcd.
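The allocation cost is easy to reproduce one layer up, inside a guest OS. This sketch compares worst-case fdatasync latency while a file grows (each sync may allocate new blocks) against a file preallocated with posix_fallocate; thin-provisioned VMDKs impose the same kind of on-demand allocation below the guest, where it cannot be tuned away. File names and write counts are arbitrary.

```python
import os
import time

BLOCK = b"\0" * 4096
WRITES = 256

def worst_fsync(path, preallocate):
    """Return the worst fdatasync latency over WRITES sequential writes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        if preallocate:
            # Reserve all blocks up front, so syncs only overwrite them.
            os.posix_fallocate(fd, 0, len(BLOCK) * WRITES)
        worst = 0.0
        for _ in range(WRITES):
            os.write(fd, BLOCK)
            start = time.perf_counter()
            os.fdatasync(fd)
            worst = max(worst, time.perf_counter() - start)
        return worst
    finally:
        os.close(fd)
        os.remove(path)

print(f"growing file, worst fdatasync: {worst_fsync('grow.bin', False) * 1000:.2f} ms")
print(f"preallocated, worst fdatasync: {worst_fsync('prealloc.bin', True) * 1000:.2f} ms")
```

Notably, etcd already preallocates its WAL segment files to sidestep exactly this filesystem cost; thin provisioning reintroduces it at the datastore layer, out of etcd's reach.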
The Solution: Transitioning to Deterministic Infrastructure
To eliminate these ghost latencies, we reconfigured the infrastructure with a focus on predictability:
- Thick Provision Eager Zeroed: All disks were converted to the "Thick Eager Zeroed" format, in which every block is pre-allocated and zeroed at creation time. This removes block-allocation work from the write path entirely.
- Physical Isolation (Anti-Affinity): We enforced Anti-Affinity rules, ensuring each etcd node runs on a different physical host. Now, a single host failure cannot take down the cluster’s quorum.
- Dedicated Storage: We moved the etcd nodes off the complex, shared datastore clusters and onto individual, high-performance VMFS datastores to guarantee I/O consistency; the monitoring sketch below verifies that consistency continuously.
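Prevention also needs detection. etcd exports the Prometheus histogram etcd_disk_wal_fsync_duration_seconds, the metric behind the "slow fdatasync" warning, so a p99 check against the 10 ms budget makes a good standing alert. A minimal sketch, assuming a Prometheus server (placeholder URL below) that already scrapes the etcd members:

```python
import json
import urllib.parse
import urllib.request

# Query Prometheus for the per-member p99 WAL fsync latency and flag
# anything over etcd's recommended 10 ms budget.
PROMETHEUS = "http://prometheus.example.internal:9090"  # placeholder URL
QUERY = (
    "histogram_quantile(0.99, "
    "rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))"
)
BUDGET_SECONDS = 0.010

url = f"{PROMETHEUS}/api/v1/query?{urllib.parse.urlencode({'query': QUERY})}"
with urllib.request.urlopen(url, timeout=10) as resp:
    result = json.load(resp)["data"]["result"]

for series in result:
    instance = series["metric"].get("instance", "unknown")
    p99 = float(series["value"][1])
    status = "OK" if p99 < BUDGET_SECONDS else "SLOW"
    print(f"{instance}: p99 fdatasync {p99 * 1000:.1f} ms [{status}]")
```

Alerting on this metric would have flagged both affected members minutes before Patroni lost its leader lock.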
Conclusion
This case demonstrates just how sensitive distributed systems like etcd are to underlying hardware performance. If your infrastructure experiences a 7-second disk latency, it doesn’t matter how advanced your upper-layer software (Patroni, Kubernetes, etc.) is; the system will break at its weakest link.
In short: If etcd can’t write to disk, the cluster can’t talk. If the cluster can’t talk, the system can’t survive.