Avoiding Split-Brain Using Watchdog/softdog
Split-brain is a dangerous condition in high availability (HA) systems in which more than one node believes it is the leader at the same time. In systems like databases, where consistency is critical, this can compromise data integrity. In a PostgreSQL HA cluster, Patroni uses a DCS (Distributed Configuration Store) to determine leadership. In exceptional cases, however, the Patroni process can be suspended or killed, or can lose contact with the DCS. This is where the watchdog comes in. A watchdog is a kernel-level timer: if the process responsible for it does not reset the timer at regular intervals, the system is restarted automatically. This guarantees intervention when the node becomes unresponsive.

Why Watchdog?
Patroni holds a leader key with a TTL (Time to Live) in the DCS (etcd, Consul, ZooKeeper, etc.), and only the node holding that key may run as the writable primary. However, in extreme cases (e.g., if the Patroni process hangs or is unexpectedly killed), PostgreSQL might keep running and keep accepting writes from clients after the key has expired and another node has been promoted. To mitigate this split-brain scenario, Patroni can be integrated with a software watchdog (softdog) as an additional layer of protection. When the watchdog is active, Patroni will not start PostgreSQL as primary under unsafe conditions, and the watchdog reboots the system if Patroni stops resetting it, which eliminates the risk of split-brain.
Using Watchdog with Patroni
Watchdog support is built into Patroni. However, your system must have the softdog module loaded and active, and Patroni must have access to this watchdog device.
Install Required Packages
On Red Hat-based systems:
sudo dnf install watchdog
How to Enable Softdog
Load the softdog module
sudo modprobe softdog
Allow Patroni user (e.g., postgres) to access the watchdog device
sudo chown postgres /dev/watchdog
Load softdog automatically on reboot
sudo sh -c 'echo "modprobe softdog" >> /etc/rc.modules'
sudo chmod +x /etc/rc.modules
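On systemd-based releases, modules listed in files under /etc/modules-load.d are loaded at boot, so that directory is an alternative place to register softdog. The sketch below writes such a file. The function name enable_softdog_at_boot and its optional directory argument are illustrative only; the argument exists so the function can be tried against a scratch directory without root.

```shell
# Hypothetical helper: register softdog for loading at boot via modules-load.d.
# $1 is an optional target directory (defaults to /etc/modules-load.d).
enable_softdog_at_boot() {
    conf_dir="${1:-/etc/modules-load.d}"
    # One module name per line is the expected file format.
    printf 'softdog\n' > "$conf_dir/softdog.conf"
}
```

Run as root with no argument, it creates /etc/modules-load.d/softdog.conf containing the single line softdog.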
Create a udev rule for persistent device access
sudo sh -c 'echo "KERNEL==\"watchdog\", MODE=\"0666\"" >> /etc/udev/rules.d/61-watchdog.rules'
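Before configuring Patroni, it can help to confirm that the device exists and that the Patroni user can write to it. The function below is a minimal sketch of such a check. The name check_watchdog_device is hypothetical, and the default path assumes the /dev/watchdog device used above. It only tests file attributes and never opens the device, because opening a watchdog device arms the timer.

```shell
# Hypothetical helper: report whether a watchdog device is usable by the current user.
# $1 is the device path (defaults to /dev/watchdog).
check_watchdog_device() {
    dev="${1:-/dev/watchdog}"
    # The device node must exist and be a character device.
    if [ ! -c "$dev" ]; then
        echo "missing: $dev is not a character device"
        return 1
    fi
    # Test write permission without opening the device.
    if [ ! -w "$dev" ]; then
        echo "no access: $dev is not writable by $(id -un)"
        return 2
    fi
    echo "ok: $dev"
}
```

Running it as the postgres user shows whether the ownership or udev change took effect, and lsmod | grep softdog confirms that the module itself is loaded.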
Patroni Configuration
Add the following block to your patroni.yml configuration file:
watchdog:
  mode: automatic # Options: off, automatic, required
  device: /dev/watchdog
  #safety_margin: -1 # Optional: reduce delay
With mode: automatic, Patroni uses the watchdog if one is available and continues without it otherwise. With mode: required, a node will not become leader unless the watchdog can be activated, which provides a stricter safeguard against split-brain.
What is Safety Margin?
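By default, safety_margin is set to 5. It is the number of seconds between the watchdog firing and the leader key expiring: the watchdog resets the node that many seconds before the TTL runs out, so a stalled primary is taken down before another node can acquire the leader key. Setting it to -1 makes Patroni set the watchdog timeout to half the TTL instead.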
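The arithmetic can be made concrete with a small sketch. The function name watchdog_timeout is hypothetical, and the rule it encodes is an assumption based on the documented behavior of safety_margin (TTL minus the margin, or half the TTL when the margin is -1); it is not Patroni's actual code. The TTL of 30 seconds in the usage note is an assumed value, not one taken from this guide.

```shell
# Hypothetical helper: estimate the watchdog timeout in seconds.
# $1 = DCS ttl in seconds, $2 = safety_margin (defaults to 5).
watchdog_timeout() {
    ttl="$1"
    safety_margin="${2:-5}"
    if [ "$safety_margin" -eq -1 ]; then
        # -1 means "use half the TTL" (integer division).
        echo $(( ttl / 2 ))
    else
        # Otherwise the watchdog fires safety_margin seconds before the TTL expires.
        echo $(( ttl - safety_margin ))
    fi
}
```

With a TTL of 30, watchdog_timeout 30 prints 25 and watchdog_timeout 30 -1 prints 15, so the -1 setting makes the watchdog fire considerably earlier.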
Conclusion
The watchdog/softdog mechanism provides a fail-safe shutdown strategy in high availability setups. In PostgreSQL clusters using DCS-based solutions like Patroni, watchdog integration can effectively prevent severe issues like split-brain. In this guide, we’ve covered both the general concepts and step-by-step implementation on RedHat systems. If system stability and data integrity are critical to your infrastructure, don’t overlook the importance of setting up a watchdog.