20TB+ PostgreSQL Architecture
A Deterministic Performance Blueprint from Infrastructure to Hardware
PostgreSQL works exceptionally well out of the box until it does not.
Once a database grows beyond the 20TB threshold, default PostgreSQL settings and general-purpose operating system configurations stop delivering predictable behavior. At this scale, performance issues are rarely caused by SQL or indexing mistakes. They are caused by entropy in storage, memory, kernel behavior, and coordination layers.
This document presents a production-hardened PostgreSQL architecture for RHEL-based systems, designed to achieve deterministic latency, data consistency, and long-term operational sustainability.
The goal is not peak benchmark performance.
The goal is a system that behaves the same way under pressure, every time.
Table of Contents
1. Storage Architecture: Physical Isolation Is Mandatory
   1.1 Primary Data Directory (/pg_data)
   1.2 Write-Ahead Log (/pg_wal)
   1.3 Temporary Files (/pg_temp)
   1.4 Cluster Coordination Storage (/coordination)
2. RHEL Kernel Tuning: Forcing Predictable Behavior
   2.1 Virtual Memory and Swappiness
   2.2 Dirty Page Write-Back Control
   2.3 Transparent Huge Pages
3. Distributed Coordination and Failover Stability
4. Network Topology: As Critical as Storage
5. Operational Reality: LVM and Determinism
   5.1 LVM and Online Growth
   5.2 CPU, Memory, and NUMA Pinning
6. Final Thoughts: Reducing Entropy, Not Chasing Speed
7. Next Steps and Validation Checklist

1. Storage Architecture: Physical Isolation Is Mandatory
In 20TB+ environments, logical separation is an illusion.
If two mount points share the same physical disk, they are competing for the same IOPS and bandwidth, regardless of how clean the directory layout looks.
Every critical PostgreSQL I/O path must be backed by dedicated physical resources.
/pg_data — Primary Data Directory
- Capacity: 20TB+
- Disk Type: NVMe-backed RAID 10
This mount hosts PostgreSQL core data:
- Tables
- Indexes
- System catalogs
- Visibility maps and free space maps
RAID 10 on NVMe provides:
- High sustained IOPS
- Low and stable latency
- Fault tolerance without write penalties
This layout ensures predictable behavior under mixed OLTP workloads, even during sustained write pressure or partial disk failures.
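As a concrete illustration, the volume behind this mount could be created and mounted as follows. The VG/LV names, the choice of XFS, and the noatime option are assumptions for the sketch, not requirements of the blueprint.

```
# Sketch: create and persistently mount the data volume.
# vg_pg/lv_data, XFS, and noatime are illustrative assumptions.
mkfs.xfs /dev/vg_pg/lv_data
mkdir -p /pg_data
mount -o noatime /dev/vg_pg/lv_data /pg_data
echo '/dev/vg_pg/lv_data /pg_data xfs defaults,noatime 0 0' >> /etc/fstab
chown postgres:postgres /pg_data
```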
/pg_wal — Write-Ahead Log
- Capacity: 500GB+
- Disk Type: Dedicated ultra-low-latency NVMe
The WAL is PostgreSQL’s heartbeat.
Every transaction commit is gated by WAL flush latency.
Any delay here directly impacts:
- Commit time
- Replication lag
- Failover duration
Placing WAL on a dedicated NVMe device eliminates I/O contention and keeps transaction throughput smooth and consistent under peak load.
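One way to put WAL on its own device from the start is initdb's --waldir option; the paths below are illustrative.

```
# Sketch: initialize a new cluster with WAL on the dedicated mount (paths are illustrative).
# initdb leaves a pg_wal symlink in the data directory pointing at the WAL location.
initdb -D /pg_data/data --waldir=/pg_wal
# For an existing cluster: stop PostgreSQL, move $PGDATA/pg_wal to /pg_wal,
# and replace the original directory with a symlink before restarting.
```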
/pg_temp — Temporary Files
- Capacity: ~1TB
- Disk Type: Dedicated SSD
Large sorts, hashes, and complex joins inevitably spill to disk.
Isolating temporary files:
- Protects OLTP workloads from analytics-induced I/O spikes
- Prevents reporting queries from saturating the main data volume
In mixed OLTP + analytics systems, this separation is not optional; it is defensive architecture.
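One hedged way to enforce that isolation is to route spill files to a tablespace on the dedicated SSD via temp_tablespaces; the tablespace name and directory below are illustrative.

```
# Sketch: send sort/hash spill files to the dedicated SSD.
mkdir -p /pg_temp/spill && chown postgres:postgres /pg_temp/spill
psql -U postgres -c "CREATE TABLESPACE temp_spill LOCATION '/pg_temp/spill';"
psql -U postgres -c "ALTER SYSTEM SET temp_tablespaces = 'temp_spill';"
psql -U postgres -c "SELECT pg_reload_conf();"
```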
/coordination — Cluster Coordination Storage
- Capacity: ~100GB
- Disk Type: Ultra-low-latency SSD or NVMe
This disk is reserved for Patroni and etcd metadata and logs.
Latency at this layer directly affects:
- Leader elections
- Heartbeat stability
- Failover correctness
Keeping coordination data isolated prevents cluster flapping and ensures clean, deterministic HA behavior.
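As a sketch (the directory layout is an assumption), pointing etcd's data directory and Patroni's logs at this mount keeps heartbeat and election I/O off the data volumes.

```
# Sketch: keep coordination state on the dedicated mount (paths are assumptions).
mkdir -p /coordination/etcd /coordination/patroni/log
# etcd: start with --data-dir=/coordination/etcd, or set ETCD_DATA_DIR in its environment file
# Patroni: in patroni.yml, point log.dir at /coordination/patroni/log
```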
2. RHEL Kernel Tuning: Forcing Predictable Behavior
Linux defaults are optimized for fairness and general workloads — not for large, stateful databases.
At this scale, the kernel must be forced into predictable behavior under sustained I/O pressure.
Virtual Memory and Swappiness
- Default: vm.swappiness = 60
- Recommended: vm.swappiness = 1
High swappiness allows the kernel to evict PostgreSQL buffer cache even when free memory exists, causing sudden and catastrophic latency spikes.
Setting swappiness to 1 keeps the active dataset in memory and eliminates swap-induced stalls.
Dirty Page Write-Back Control
On large-memory systems, uncontrolled dirty page accumulation can lead to massive, blocking flush operations.
Recommended settings:
- vm.dirty_background_ratio = 3
- vm.dirty_ratio = 10
This enforces early, continuous write-back instead of burst flushing, resulting in smoother I/O and more stable fsync latency.
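A minimal sysctl drop-in capturing the swappiness and dirty-page settings above might look like this; the file name is arbitrary.

```
# Sketch: persist the VM settings via a sysctl drop-in (file name is arbitrary).
cat <<'EOF' > /etc/sysctl.d/99-postgresql.conf
vm.swappiness = 1
vm.dirty_background_ratio = 3
vm.dirty_ratio = 10
EOF
sysctl --system   # apply all configured sysctl files now
```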
Transparent Huge Pages
- Setting: disabled
Transparent Huge Pages introduce unpredictable memory allocation latency and fragmentation in PostgreSQL workloads.
This is not a tuning preference.
THP must be disabled.
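On RHEL this is typically enforced both at runtime and at boot; the commands below are one common approach (verify against your kernel and boot configuration).

```
# Sketch: disable THP immediately and persist the setting across reboots.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
grubby --update-kernel=ALL --args="transparent_hugepage=never"
```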
3. Distributed Coordination and Failover Stability
A common failure pattern in large clusters is not database failure but coordination starvation.
If the primary data disk becomes fully saturated:
- Heartbeats may not flush in time
- The leader appears unresponsive
- Unnecessary failovers are triggered
This is how healthy databases get demoted.
Mitigation strategy
- Use a small, ultra-low-latency disk for coordination
- Keep HA metadata completely isolated from data I/O pressure
Additionally:
- Pre-allocate WAL files
- Pre-allocate data files when possible
This avoids filesystem allocation overhead during peak transaction windows.
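In stock PostgreSQL, the WAL part of this recommendation can be approximated by keeping a large pool of recycled WAL segments so commits rarely wait on new file creation; the sizes below are illustrative, not tuned values.

```
# Sketch: keep plenty of recycled WAL segments on hand (sizes are illustrative).
psql -U postgres -c "ALTER SYSTEM SET min_wal_size = '16GB';"
psql -U postgres -c "ALTER SYSTEM SET max_wal_size = '64GB';"
psql -U postgres -c "SELECT pg_reload_conf();"
```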
4. Network Topology: As Critical as Storage
At 20TB scale, the network is no longer just plumbing — it is a failure domain.
Traffic Segregation
- Replication Plane: Dedicated 25GbE+ for streaming replication
- Application Plane: Client connections only
- Management Plane: SSH, monitoring, and HA traffic
On RHEL systems:
- Use tuned-adm with network-latency or throughput-performance
- Layer custom sysctl tuning on top
This separation prevents replication storms or application spikes from destabilizing cluster control traffic.
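A minimal sequence for the tuned-adm step above; which profile fits depends on whether latency or throughput dominates the plane in question.

```
# Sketch: apply a tuned profile, then layer database-specific sysctl files on top.
tuned-adm profile throughput-performance   # or network-latency for latency-sensitive planes
tuned-adm active                           # confirm which profile is in effect
```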
5. Operational Reality: LVM and Determinism
Use LVM Everywhere
- Never mount raw disks
- Online capacity expansion is inevitable
- 20TB becoming 30TB is not a question of if, but when
LVM is not overhead; it is operational insurance.
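Online growth then reduces to extending the logical volume and the filesystem; the names and size below are illustrative.

```
# Sketch: grow /pg_data online (VG/LV names and size are illustrative).
lvextend -L +10T /dev/vg_pg/lv_data
xfs_growfs /pg_data   # XFS grows online; use resize2fs for ext4
```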
Virtualized Environments
In VMware or KVM environments:
- Pin CPUs
- Pin memory
- Respect NUMA boundaries
Noisy neighbors introduce silent latency variance that is almost impossible to debug after the fact.
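A quick way to check whether NUMA boundaries are actually being respected inside the guest (a diagnostic sketch, not a tuning recipe):

```
# Sketch: inspect NUMA topology and memory placement for the postmaster process.
numactl --hardware
numastat -p "$(pgrep -o -x postgres)"   # oldest postgres process = postmaster
```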
6. Final Thoughts: Reducing Entropy, Not Chasing Speed
Designing PostgreSQL at 20TB+ scale is not about finding a magic configuration.
It is about:
- Isolating I/O paths
- Pinning memory behavior
- Forcing deterministic kernel write-back
- Eliminating shared failure domains
When entropy is reduced, operations shift from firefighting to system management.
7. Next Steps
- Run fio benchmarks on each mount point independently (see the sketch after this list)
- Simulate disk hang scenarios and observe Patroni behavior
- Continuously monitor:
  - iowait
  - dirty_bytes
  - WAL fsync latency
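A hedged starting point for the WAL mount, modeling small, fsync-gated sequential writes; adjust the pattern per mount and never run it against a live data directory.

```
# Sketch: approximate WAL write behavior on /pg_wal (scratch file, illustrative parameters).
fio --name=wal-fsync --filename=/pg_wal/fio.test --rw=write --bs=8k \
    --ioengine=libaio --iodepth=1 --fdatasync=1 --size=1G \
    --runtime=60 --time_based
```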