Avoiding Split-Brain Using Watchdog/softdog

Split-brain is a dangerous condition in high availability (HA) systems in which more than one node believes it is the leader at the same time. In systems such as databases, where consistency is critical, this can compromise data integrity. In a PostgreSQL HA cluster, Patroni uses a DCS (Distributed Configuration Store) to determine leadership. In exceptional cases, however, the Patroni process can be suspended or killed, or can lose communication with the DCS. This is where the watchdog comes in. A watchdog is a kernel-level timer: if the process responsible for it (here, Patroni) does not reset the timer at regular intervals, the system is automatically restarted. This guarantees intervention when the node becomes unresponsive.

Why Watchdog?

Patroni relies on a leader key with a TTL (Time to Live) in the DCS (etcd, Consul, ZooKeeper, etc.) to control which node may accept writes as primary. However, in extreme cases (e.g., if the Patroni process hangs or is unexpectedly killed), PostgreSQL might still be running and could keep accepting writes from clients after the leader key has expired. To mitigate this split-brain scenario, Patroni can be integrated with a software watchdog (softdog) as an additional layer of protection. Depending on the configured mode, Patroni can refuse to run PostgreSQL as primary when the watchdog cannot be activated, and the watchdog reboots the node if Patroni stops resetting the timer, which eliminates the risk of split-brain.

Using Watchdog with Patroni

Watchdog support is built into Patroni. However, your system must have the softdog module loaded and active, and Patroni must have access to this watchdog device.

Install Required Packages

On Red Hat-based systems:

sudo dnf install watchdog

How to Enable Softdog

Load the softdog module

sudo modprobe softdog

Allow the Patroni user (e.g., postgres) to access the watchdog device

sudo chown postgres /dev/watchdog
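
Before starting Patroni, it can help to confirm that the device exists and that the Patroni user can write to it. The following is a minimal sketch, not part of Patroni: check_watchdog_device is a hypothetical helper name. It deliberately inspects only file metadata and never opens the device, because opening /dev/watchdog can arm the timer.

```shell
# Hypothetical helper (not part of Patroni): verify that a path is a
# character device the current user may write to. Only metadata is checked;
# the device is never opened, since opening a watchdog device can arm it.
check_watchdog_device() {
    dev="${1:-/dev/watchdog}"
    if [ ! -c "$dev" ]; then
        echo "$dev: not a character device (is softdog loaded?)"
        return 1
    fi
    if [ ! -w "$dev" ]; then
        echo "$dev: not writable by $(id -un)"
        return 1
    fi
    echo "$dev: ok"
}

# Example (run as the Patroni user, e.g. postgres):
#   check_watchdog_device /dev/watchdog
```

Run the check as the same user that runs Patroni; a check run as root will report the device as writable regardless of its ownership.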

Load softdog automatically on reboot

sudo sh -c 'echo "modprobe softdog" >> /etc/rc.modules'
sudo chmod +x /etc/rc.modules
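
On systemd-based releases, /etc/rc.modules may not be executed at boot. There, systemd-modules-load reads module names from /etc/modules-load.d/*.conf instead. The sketch below assumes such a host; write_softdog_conf and the softdog.conf file name are illustrative choices, not something the rest of this guide depends on.

```shell
# Hypothetical helper: write a modules-load.d entry so that
# systemd-modules-load loads softdog at boot.
# Run as root for the default directory; pass another directory to try it safely.
write_softdog_conf() {
    dir="${1:-/etc/modules-load.d}"
    printf 'softdog\n' > "$dir/softdog.conf"
}

# Example (as root):
#   write_softdog_conf
#   cat /etc/modules-load.d/softdog.conf
```

Each line of the file names one kernel module, so the file contains only the word softdog.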

Create a udev rule for persistent device access

sudo sh -c 'echo "KERNEL==\"watchdog\", MODE=\"0666\"" >> /etc/udev/rules.d/61-watchdog.rules'
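
MODE="0666" lets every local user open the watchdog device. If that is broader than you want, a tighter variant is to give the device to the Patroni user only. The sketch below assumes the user is postgres and that a group of the same name exists; watchdog_udev_rule is a hypothetical helper that only prints the rule line.

```shell
# Hypothetical helper: print a udev rule that restricts the watchdog device
# to a single user and group (assumed to share the same name by default).
watchdog_udev_rule() {
    user="${1:-postgres}"
    group="${2:-$user}"
    printf 'KERNEL=="watchdog", OWNER="%s", GROUP="%s", MODE="0600"\n' "$user" "$group"
}

# Example (as root), then ask udev to re-read and apply its rules:
#   watchdog_udev_rule postgres > /etc/udev/rules.d/61-watchdog.rules
#   udevadm control --reload-rules
#   udevadm trigger
```

Use either this rule or the MODE="0666" rule in the rules file, not both.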

Patroni Configuration

Add the following block to your patroni.yml configuration file:

watchdog:
  mode: automatic       # Options: off, automatic, required
  device: /dev/watchdog
  #safety_margin: -1     # Optional: watchdog expires after ttl // 2 (default margin: 5 seconds)

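With mode: automatic, Patroni uses the watchdog if it is available and carries on without it otherwise. With mode: required, a node will not become leader unless Patroni can activate the watchdog, which provides a stricter safeguard against split-brain. With mode: off, the watchdog is disabled.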

What is Safety Margin?

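By default, safety_margin is 5. It is the number of seconds between the watchdog firing and the leader key expiring: Patroni sets the watchdog timeout to ttl - safety_margin, so the node is reset 5 seconds before the TTL runs out. With the default ttl of 30 and loop_wait of 10, the HA loop has at least 15 seconds (ttl - safety_margin - loop_wait) to complete before a forced reset. Setting safety_margin to -1 makes the watchdog expire after half the TTL (ttl // 2). This is the more conservative choice when you need the watchdog to fire before the leader key expires under all circumstances.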
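
To make the arithmetic concrete, here is a rough sketch of how the watchdog timeout follows from ttl and safety_margin, based on the behavior described in the Patroni documentation. The function names are hypothetical, and the exact rounding Patroni applies to an odd ttl may differ from plain integer division.

```shell
# Hypothetical helpers illustrating the documented arithmetic; this is not
# Patroni's own code.

# Watchdog timeout in seconds: ttl - safety_margin, or ttl / 2 when the
# margin is -1 (integer division).
watchdog_timeout() {
    ttl="$1"
    margin="$2"
    if [ "$margin" -eq -1 ]; then
        echo $(( ttl / 2 ))
    else
        echo $(( ttl - margin ))
    fi
}

# Minimum time the HA loop has before a forced reset:
# ttl - safety_margin - loop_wait.
ha_loop_budget() {
    ttl="$1"
    margin="$2"
    loop_wait="$3"
    echo $(( ttl - margin - loop_wait ))
}

# Example:
#   watchdog_timeout 30 5      # default margin
#   watchdog_timeout 30 -1     # half of ttl
#   ha_loop_budget 30 5 10     # defaults: ttl=30, loop_wait=10
```

With ttl=30, the default margin of 5 gives a watchdog timeout of 25 seconds and an HA loop budget of 15 seconds, while a margin of -1 gives a timeout of 15 seconds.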

Conclusion

The watchdog/softdog mechanism provides a fail-safe reset strategy in high availability setups. In PostgreSQL clusters managed by DCS-based tools like Patroni, watchdog integration helps prevent severe issues like split-brain. This guide covered both the general concepts and a step-by-step setup on Red Hat-based systems. If system stability and data integrity are critical to your infrastructure, don’t overlook the importance of setting up a watchdog.