20TB+ PostgreSQL Architecture
A Deterministic Performance Blueprint from Infrastructure to Hardware
PostgreSQL works exceptionally well out of the box until it does not.
Once a database grows beyond the 20TB threshold, default PostgreSQL settings and general-purpose operating system configurations stop delivering predictable behavior. At this scale, performance issues are rarely caused by SQL or indexing mistakes. They are caused by entropy in storage, memory, kernel behavior, and coordination layers.
This document presents a production-hardened PostgreSQL architecture for RHEL-based systems, designed to achieve deterministic latency, data consistency, and long-term operational sustainability.
The goal is not peak benchmark performance.
The goal is a system that behaves the same way under pressure, every time.
Table of Contents
1. Storage Architecture: Physical Isolation Is Mandatory
   1.1 Primary Data Directory (/pg_data)
   1.2 Write-Ahead Log (/pg_wal)
   1.3 Temporary Files (/pg_temp)
   1.4 Cluster Coordination Storage (/coordination)
2. RHEL Kernel Tuning: Forcing Predictable Behavior
   2.1 Virtual Memory and Swappiness
   2.2 Dirty Page Write-Back Control
   2.3 Transparent Huge Pages
3. Distributed Coordination and Failover Stability
4. Network Topology: As Critical as Storage
5. Operational Reality: LVM and Determinism
   5.1 LVM and Online Growth
   5.2 CPU, Memory, and NUMA Pinning
6. Final Thoughts: Reducing Entropy, Not Chasing Speed
7. Next Steps and Validation Checklist

1. Storage Architecture: Physical Isolation Is Mandatory
In 20TB+ environments, logical separation is an illusion.
If two mount points share the same physical disk, they are competing for the same IOPS and bandwidth, regardless of how clean the directory layout looks.
Every critical PostgreSQL I/O path must be backed by dedicated physical resources.
/pg_data — Primary Data Directory
- Capacity: 20TB+
- Disk Type: NVMe-backed RAID 10
This mount hosts PostgreSQL core data:
- Tables
- Indexes
- System catalogs
- Visibility maps and free space maps
RAID 10 on NVMe provides:
- High sustained IOPS
- Low and stable latency
- Fault tolerance without write penalties
This layout ensures predictable behavior under mixed OLTP workloads, even during sustained write pressure or partial disk failures.
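As a concrete illustration, the volume behind this mount could be created and mounted as follows. The VG/LV names, the choice of XFS, and the noatime option are assumptions for the sketch, not requirements of the blueprint.

```
# Sketch: create and persistently mount the data volume.
# vg_pg/lv_data, XFS, and noatime are illustrative assumptions.
mkfs.xfs /dev/vg_pg/lv_data
mkdir -p /pg_data
mount -o noatime /dev/vg_pg/lv_data /pg_data
echo '/dev/vg_pg/lv_data /pg_data xfs defaults,noatime 0 0' >> /etc/fstab
chown postgres:postgres /pg_data
```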
/pg_wal — Write-Ahead Log
- Capacity: 500GB+
- Disk Type: Dedicated ultra-low-latency NVMe
The WAL is PostgreSQL’s heartbeat.
Every transaction commit is gated by WAL flush latency.
Any delay here directly impacts:
- Commit time
- Replication lag
- Failover duration
Placing WAL on a dedicated NVMe device eliminates I/O contention and keeps transaction throughput smooth and consistent under peak load.
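One way to put WAL on its own device from the start is initdb's --waldir option; the paths below are illustrative.

```
# Sketch: initialize a new cluster with WAL on the dedicated mount (paths are illustrative).
# initdb leaves a pg_wal symlink in the data directory pointing at the WAL location.
initdb -D /pg_data/data --waldir=/pg_wal
# For an existing cluster: stop PostgreSQL, move $PGDATA/pg_wal to /pg_wal,
# and replace the original directory with a symlink before restarting.
```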
/pg_temp — Temporary Files
- Capacity: ~1TB
- Disk Type: Dedicated SSD
Large sorts, hashes, and complex joins inevitably spill to disk.
Isolating temporary files:
- Protects OLTP workloads from analytics-induced I/O spikes
- Prevents reporting queries from saturating the main data volume
In mixed OLTP + analytics systems, this separation is not optional; it is defensive architecture.
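One hedged way to enforce that isolation is to route spill files to a tablespace on the dedicated SSD via temp_tablespaces; the tablespace name and directory below are illustrative.

```
# Sketch: send sort/hash spill files to the dedicated SSD.
mkdir -p /pg_temp/spill && chown postgres:postgres /pg_temp/spill
psql -U postgres -c "CREATE TABLESPACE temp_spill LOCATION '/pg_temp/spill';"
psql -U postgres -c "ALTER SYSTEM SET temp_tablespaces = 'temp_spill';"
psql -U postgres -c "SELECT pg_reload_conf();"
```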
/coordination — Cluster Coordination Storage
- Capacity: ~100GB
- Disk Type: Ultra-low-latency SSD or NVMe
This disk is reserved for Patroni and etcd metadata and logs.
Latency at this layer directly affects:
- Leader elections
- Heartbeat stability
- Failover correctness
Keeping coordination data isolated prevents cluster flapping and ensures clean, deterministic HA behavior.
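As a sketch (the directory layout is an assumption), pointing etcd's data directory and Patroni's logs at this mount keeps heartbeat and election I/O off the data volumes.

```
# Sketch: keep coordination state on the dedicated mount (paths are assumptions).
mkdir -p /coordination/etcd /coordination/patroni/log
# etcd: start with --data-dir=/coordination/etcd, or set ETCD_DATA_DIR in its environment file
# Patroni: in patroni.yml, point log.dir at /coordination/patroni/log
```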
2. RHEL Kernel Tuning: Forcing Predictable Behavior
Linux defaults are optimized for fairness and general workloads — not for large, stateful databases.
At this scale, the kernel must be forced into predictable behavior under sustained I/O pressure.
Virtual Memory and Swappiness
- Default: vm.swappiness = 60
- Recommended: vm.swappiness = 1
High swappiness allows the kernel to evict PostgreSQL buffer cache even when free memory exists, causing sudden and catastrophic latency spikes.
Setting swappiness to 1 keeps the active dataset in memory and eliminates swap-induced stalls.
Dirty Page Write-Back Control
On large-memory systems, uncontrolled dirty page accumulation can lead to massive, blocking flush operations.
Recommended settings:
- vm.dirty_background_ratio = 3
- vm.dirty_ratio = 10
This enforces early, continuous write-back instead of burst flushing, resulting in smoother I/O and more stable fsync latency.
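A minimal sysctl drop-in capturing the swappiness and dirty-page settings above might look like this; the file name is arbitrary.

```
# Sketch: persist the VM settings via a sysctl drop-in (file name is arbitrary).
cat <<'EOF' > /etc/sysctl.d/99-postgresql.conf
vm.swappiness = 1
vm.dirty_background_ratio = 3
vm.dirty_ratio = 10
EOF
sysctl --system   # apply all configured sysctl files now
```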
Transparent Huge Pages
- Setting: disabled
Transparent Huge Pages introduce unpredictable memory allocation latency and fragmentation in PostgreSQL workloads.
This is not a tuning preference.
THP must be disabled.
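On RHEL this is typically enforced both at runtime and at boot; the commands below are one common approach (verify against your kernel and boot configuration).

```
# Sketch: disable THP immediately and persist the setting across reboots.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
grubby --update-kernel=ALL --args="transparent_hugepage=never"
```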
3. Distributed Coordination and Failover Stability
A common failure pattern in large clusters is not database failure but coordination starvation.
If the primary data disk becomes fully saturated:
- Heartbeats may not flush in time
- The leader appears unresponsive
- Unnecessary failovers are triggered
This is how healthy databases get demoted.
Mitigation strategy
- Use a small, ultra-low-latency disk for coordination
- Keep HA metadata completely isolated from data I/O pressure
Additionally:
- Pre-allocate WAL files
- Pre-allocate data files when possible
This avoids filesystem allocation overhead during peak transaction windows.
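In stock PostgreSQL, the WAL part of this recommendation can be approximated by keeping a large pool of recycled WAL segments so commits rarely wait on new file creation; the sizes below are illustrative, not tuned values.

```
# Sketch: keep plenty of recycled WAL segments on hand (sizes are illustrative).
psql -U postgres -c "ALTER SYSTEM SET min_wal_size = '16GB';"
psql -U postgres -c "ALTER SYSTEM SET max_wal_size = '64GB';"
psql -U postgres -c "SELECT pg_reload_conf();"
```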
4. Network Topology: As Critical as Storage
At 20TB scale, the network is no longer just plumbing — it is a failure domain.
Traffic Segregation
- Replication Plane: Dedicated 25GbE+ for streaming replication
- Application Plane: Client connections only
- Management Plane: SSH, monitoring, and HA traffic
On RHEL systems:
- Use tuned-adm with network-latency or throughput-performance
- Layer custom sysctl tuning on top
This separation prevents replication storms or application spikes from destabilizing cluster control traffic.
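A minimal sequence for the tuned-adm step above; which profile fits depends on whether latency or throughput dominates the plane in question.

```
# Sketch: apply a tuned profile, then layer database-specific sysctl files on top.
tuned-adm profile throughput-performance   # or network-latency for latency-sensitive planes
tuned-adm active                           # confirm which profile is in effect
```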
5. Operational Reality: LVM and Determinism
Use LVM Everywhere
- Never mount raw disks
- Online capacity expansion is inevitable
- 20TB becoming 30TB is not a question of if, but when
LVM is not overhead; it is operational insurance.
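Online growth then reduces to extending the logical volume and the filesystem; the names and size below are illustrative.

```
# Sketch: grow /pg_data online (VG/LV names and size are illustrative).
lvextend -L +10T /dev/vg_pg/lv_data
xfs_growfs /pg_data   # XFS grows online; use resize2fs for ext4
```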
Virtualized Environments
In VMware or KVM environments:
- Pin CPUs
- Pin memory
- Respect NUMA boundaries
Noisy neighbors introduce silent latency variance that is almost impossible to debug after the fact.
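A quick way to check whether NUMA boundaries are actually being respected inside the guest (a diagnostic sketch, not a tuning recipe):

```
# Sketch: inspect NUMA topology and memory placement for the postmaster process.
numactl --hardware
numastat -p "$(pgrep -o -x postgres)"   # oldest postgres process = postmaster
```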
6. Final Thoughts: Reducing Entropy, Not Chasing Speed
Designing PostgreSQL at 20TB+ scale is not about finding a magic configuration.
It is about:
- Isolating I/O paths
- Pinning memory behavior
- Forcing deterministic kernel write-back
- Eliminating shared failure domains
When entropy is reduced, operations shift from firefighting to system management.
7. Next Steps
- Run fio benchmarks on each mount point independently (see the sketch after this list)
- Simulate disk hang scenarios and observe Patroni behavior
- Continuously monitor:
  - iowait
  - dirty_bytes
  - WAL fsync latency
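A hedged starting point for the WAL mount, modeling small, fsync-gated sequential writes; adjust the pattern per mount and never run it against a live data directory.

```
# Sketch: approximate WAL write behavior on /pg_wal (scratch file, illustrative parameters).
fio --name=wal-fsync --filename=/pg_wal/fio.test --rw=write --bs=8k \
    --ioengine=libaio --iodepth=1 --fdatasync=1 --size=1G \
    --runtime=60 --time_based
```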