PostgreSQL WAL Archiving Bottlenecks: Solving pgBackRest [045] Checksum Mismatch
Introduction
Imagine a stressful morning: your production database logs are screaming "archive command failed with exit code 45". Your pg_wal directory is swelling with 300+ unsent files, and disk space is disappearing fast. This is the story of a critical intervention in a live system and how we cleared the path without losing our Point-in-Time Recovery (PITR) capability.
1. The Bottleneck: A Mismatch at the Gate
PostgreSQL archiving is sequential: finished WAL segments are handed to archive_command in order, one at a time. If a WAL file (e.g., 0000006700003EE500000000) is partially uploaded or corrupted on the Repository (Repo) side, pgBackRest halts the entire pipeline for safety.

- The Error: already exists in the repo1 archive with a different checksum
- The Impact: Because the checksums don't match, pgBackRest refuses to overwrite the file. The "broken" file acts like a dam, causing hundreds of WALs to pile up locally (see the diagnostic sketch below).
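To confirm the diagnosis from the database side, you can ask PostgreSQL which segment the archiver is stuck on and how large the backlog has grown. A minimal sketch; the data directory path is an assumption and will differ per installation:
# On the Database Server: which WAL is failing, and how many times has it failed?
psql -c "SELECT last_archived_wal, last_failed_wal, failed_count FROM pg_stat_archiver;"
# How many segments are queued locally? (data directory path is an assumption)
ls /var/lib/postgresql/13/main/pg_wal | wc -l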
2. Why a Full Backup Isn’t the Answer
The first instinct might be to trigger a fresh full backup. However:
- The Dam Remains: A full backup uses the same transport layer. If the archival gate is locked, the backup might fail or, worse, leave the 323 WAL files clogging your disk.
- Crash Risk: While the backup runs for hours, your disk could hit 100%, causing a database PANIC and shutdown. A quick way to keep an eye on this is sketched below.
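While you work through the fix, it helps to watch both the size of the WAL backlog and the free space left on the volume. A minimal sketch, assuming a Debian-style data directory; adjust the paths to your layout:
# On the Database Server: track WAL backlog size and remaining disk space
du -sh /var/lib/postgresql/13/main/pg_wal
df -h /var/lib/postgresql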

3. The Step-by-Step Rescue Operation
Step 1: Extend the Patience (Timeout Settings)
When hundreds of files are queued, the default 60s timeout isn’t enough. Update your pgbackrest.conf to be more patient:
[global]
archive-timeout=1200
protocol-timeout=1200
process-max=2
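The archive path runs through archive_command: every time PostgreSQL finishes a WAL segment it calls pgBackRest, which re-reads pgbackrest.conf on each invocation, so no PostgreSQL restart is needed for the change above. A typical setup, using the placeholder stanza name from the commands later in this post, looks like this:
# postgresql.conf (typical pgBackRest archiving setup)
archive_mode = on
archive_command = 'pgbackrest --stanza=your_stanza archive-push %p'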
Step 2: Remove the Ghost File from the Repo
Go to your Backup Server (Repo). Find the specific WAL file that pgBackRest is complaining about and remove it. Don’t worry — the healthy original is still on your database server.
# On the Repo Server
# The repo copy carries a checksum suffix (and is usually compressed), hence the trailing wildcard.
find /var/lib/pgbackrest/archive -name "0000006700003EE500000000*" -exec rm -v {} \;
# Or step by step, from inside the archive directory
cd /var/lib/pgbackrest/archive
find . -name "0000006700003EE500000000*"
rm ./prod_backup/13-1/0000006700003EE5/0000006700003EE500000000*
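If you want extra reassurance before (or after) running the removal, confirm that the healthy original really is still sitting in pg_wal on the database server. A minimal sketch; the data directory path is an assumption:
# On the Database Server: the original segment should still be waiting to be archived
ls -lh /var/lib/postgresql/13/main/pg_wal/0000006700003EE500000000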
Step 3: Jumpstart the Engine
Once the corrupted copy is gone, tell pgBackRest to resume and force a WAL switch to wake up the archiver:
# On the Database Server
# "pgbackrest start" is only needed if the stanza was previously stopped with "pgbackrest stop"
pgbackrest --stanza=your_stanza start
# Optionally force a switch to a fresh WAL segment so the archiver wakes up immediately
psql -c "SELECT pg_switch_wal();"
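If you want immediate confirmation that the next push went through, you can tail the pgBackRest log on the database server. The exact file name depends on your log-path and whether asynchronous archiving is enabled, so treat this as a sketch:
# On the Database Server: watch archive-push activity (file name varies by configuration)
tail -f /var/log/pgbackrest/your_stanza-archive-push.log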
4. The Result: Flow Restored
The moment the gatekeeper file is removed, pgBackRest sees the path is clear. It takes the healthy local version, uploads it, and then flushes the remaining 300+ files in minutes. Your disk space returns, and your PITR chain is perfectly intact.
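A final sanity check is to ask pgBackRest itself whether the archive chain is continuous again; the info command reports the stanza status and the min/max WAL segments held in the repository:
# On the Repo or Database Server
pgbackrest --stanza=your_stanza info
# Look for "status: ok" and the "wal archive min/max" line covering the segment that was blocked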
Conclusion
In a live production environment, don't be afraid to remove a corrupted copy in order to keep the original flowing. If you don't remove the blockage at the end of the pipe, the pressure will eventually break the system.