PostgreSQL WAL Archiving Bottlenecks: Solving pgBackRest [045] Checksum Mismatch
Introduction
Imagine a stressful morning: your production database logs are screaming "archive command failed with exit code 45". Your pg_wal directory is swelling with 300+ unsent files, and disk space is disappearing fast. This is the story of a critical intervention in a live system and how we cleared the path without losing our Point-in-Time Recovery (PITR) capability.
1. The Bottleneck: A Mismatch at the Gate
PostgreSQL archiving is sequential: finished WAL segments are handed to archive_command in order, one at a time. If a WAL file (e.g., 0000006700003EE500000000) is partially uploaded or corrupted on the Repository (Repo) side, pgBackRest halts the entire pipeline for safety.

- The Error: already exists in the repo1 archive with a different checksum
- The Impact: Because the checksums don't match, pgBackRest refuses to overwrite the file. The "broken" file acts like a dam, causing hundreds of WALs to pile up locally (see the diagnostic sketch below).
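To confirm the diagnosis from the database side, you can ask PostgreSQL which segment the archiver is stuck on and how large the backlog has grown. A minimal sketch; the data directory path is an assumption and will differ per installation:
# On the Database Server: which WAL is failing, and how many times has it failed?
psql -c "SELECT last_archived_wal, last_failed_wal, failed_count FROM pg_stat_archiver;"
# How many segments are queued locally? (data directory path is an assumption)
ls /var/lib/postgresql/13/main/pg_wal | wc -l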
2. Why a Full Backup Isn’t the Answer
The first instinct might be to trigger a fresh full backup. However:
- The Dam Remains: A full backup uses the same transport layer. If the archival gate is locked, the backup might fail or, worse, leave the 323 WAL files clogging your disk.
- Crash Risk: While the backup runs for hours, your disk could hit 100%, causing a database PANIC and shutdown. A quick way to keep an eye on this is sketched below.
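While you work through the fix, it helps to watch both the size of the WAL backlog and the free space left on the volume. A minimal sketch, assuming a Debian-style data directory; adjust the paths to your layout:
# On the Database Server: track WAL backlog size and remaining disk space
du -sh /var/lib/postgresql/13/main/pg_wal
df -h /var/lib/postgresql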

3. The Step-by-Step Rescue Operation
Step 1: Extend the Patience (Timeout Settings)
When hundreds of files are queued, the default 60s timeout isn’t enough. Update your pgbackrest.conf to be more patient:
[global]
archive-timeout=1200
protocol-timeout=1200
process-max=2
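The archive path runs through archive_command: every time PostgreSQL finishes a WAL segment it calls pgBackRest, which re-reads pgbackrest.conf on each invocation, so no PostgreSQL restart is needed for the change above. A typical setup, using the placeholder stanza name from the commands later in this post, looks like this:
# postgresql.conf (typical pgBackRest archiving setup)
archive_mode = on
archive_command = 'pgbackrest --stanza=your_stanza archive-push %p'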
Step 2: Remove the Ghost File from the Repo
Go to your Backup Server (Repo). Find the specific WAL file that pgBackRest is complaining about and remove it. Don’t worry — the healthy original is still on your database server.
# On the Repo Server
# The repo copy carries a checksum suffix (and is usually compressed), hence the trailing wildcard.
find /var/lib/pgbackrest/archive -name "0000006700003EE500000000*" -exec rm -v {} \;
# Or step by step, from inside the archive directory
cd /var/lib/pgbackrest/archive
find . -name "0000006700003EE500000000*"
rm ./prod_backup/13-1/0000006700003EE5/0000006700003EE500000000*
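If you want extra reassurance before (or after) running the removal, confirm that the healthy original really is still sitting in pg_wal on the database server. A minimal sketch; the data directory path is an assumption:
# On the Database Server: the original segment should still be waiting to be archived
ls -lh /var/lib/postgresql/13/main/pg_wal/0000006700003EE500000000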
Step 3: Jumpstart the Engine
Once the corrupted copy is gone, tell pgBackRest to resume and force a WAL switch to wake up the archiver:
# On the Database Server
# "pgbackrest start" is only needed if the stanza was previously stopped with "pgbackrest stop"
pgbackrest --stanza=your_stanza start
# Optionally force a switch to a fresh WAL segment so the archiver wakes up immediately
psql -c "SELECT pg_switch_wal();"
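If you want immediate confirmation that the next push went through, you can tail the pgBackRest log on the database server. The exact file name depends on your log-path and whether asynchronous archiving is enabled, so treat this as a sketch:
# On the Database Server: watch archive-push activity (file name varies by configuration)
tail -f /var/log/pgbackrest/your_stanza-archive-push.log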
4. The Result: Flow Restored
The moment the gatekeeper file is removed, pgBackRest sees the path is clear. It takes the healthy local version, uploads it, and then flushes the remaining 300+ files in minutes. Your disk space returns, and your PITR chain is perfectly intact.
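A final sanity check is to ask pgBackRest itself whether the archive chain is continuous again; the info command reports the stanza status and the min/max WAL segments held in the repository:
# On the Repo or Database Server
pgbackrest --stanza=your_stanza info
# Look for "status: ok" and the "wal archive min/max" line covering the segment that was blocked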
Conclusion
In a live production environment, don't be afraid to remove a corrupted copy in order to keep the original flowing. If you don't remove the blockage at the end of the pipe, the pressure will eventually break the system.