PostgreSQL Troubleshooting Cheat Sheet
A practical, copy-paste friendly diagnostic guide for DBAs, SREs, and DevOps Engineers.

Table of Contents
Part 1 The Triage: What is happening right now?
- Check Active vs. Idle Connections
- Find Queries Running Longer Than 5 Minutes
- Count Connections by User and Database
Part 2 The Traffic Jam: Locks & Blocking
- Find Exactly Who is Blocking Whom (The Golden Query)
- Cancel a Stuck Query Safely (pg_cancel_backend)
- Kill a Connection Completely (pg_terminate_backend)
Part 3 The Slowpokes: Query Performance
- Find the Most Frequently Executed Queries
- Find Queries Consuming the Most Total Time
- Detect Missing Indexes (Sequential Scans vs. Index Scans)
Part 4 The Memory Game: Cache Efficiency
- Calculate Buffer Cache Hit Ratio
- Calculate Index Hit Ratio
Part 5 The Disk Eaters: Space & Bloat
- List the Top 10 Largest Tables (Including Indexes)
- List the Top 10 Largest Indexes
- Detect Table Bloat (Wasted Space via Dead Tuples)
Part 6 The Janitor: Autovacuum Health
- Check the Last vacuum and analyze Time for All Tables
- Find Tables with the Most Dead Tuples
Part 7 The Copycats: Replication & WAL Logs
- Check Replication Lag (Bytes and Seconds)
- Identify Disconnected or Lagging Read-Replicas
Part 8 The Nuclear Options (Danger Zone)
- Kill ALL Connections to a Specific Database
- Reset Database Statistics (pg_stat_reset)
Part 1 The Triage: What is happening right now?
When the pager goes off and the database is screaming, your first instinct shouldn’t be to restart the server. Your first step is Triage. You need to understand what the database is doing right at this exact second.
All of the queries in this section rely on pg_stat_activity, which is essentially the task manager (or htop) of PostgreSQL.
1. Check Active vs. Idle Connections
Before you look at specific queries, you need a bird’s-eye view of your connection pool. Are there too many active queries, or are applications just opening connections and leaving them hanging?
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count DESC;
How to read the results:
- active: Queries currently executing, using CPU/disk. If this number matches your CPU core count, your server is working hard. If it's much higher, your CPU is likely the bottleneck.
- idle: The connection is open but doing nothing. This is normal if you use a connection pooler like PgBouncer.
- idle in transaction: The application started a transaction (BEGIN), did some work, but never committed (COMMIT) or rolled back (ROLLBACK). These sessions hold locks, prevent vacuuming, and can bring your database to a halt (a query to hunt them down follows below).
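A minimal sketch for that hunt; the 10-minute threshold is an arbitrary starting point, tighten it to suit your workload:
SELECT pid,
usename,
now() - state_change AS idle_duration,
query AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - state_change > interval '10 minutes'
ORDER BY idle_duration DESC;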
2. Find Queries Running Longer Than 5 Minutes
A query that takes milliseconds in staging can take hours in production if a table grows or an index is missing. This query acts as a radar for the slowest queries currently choking your system.
SELECT pid,
usename,
datname,
now() - query_start AS duration,
query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;
Why this matters:
- You can change '5 minutes' to '1 minute' or '30 seconds' depending on how aggressive your latency requirements are.
- Take note of the pid (Process ID). If this query is destroying your database performance, you will need this pid in Part 2 to kill it.
Pro Tip: If you see autovacuum or backup in the query column, don't panic. These are background maintenance tasks and are usually safe to let run, unless they are aggressively locking a table you need right now.
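If you'd rather hide those background tasks entirely, a variant of the query above can filter on the backend_type column (available since PostgreSQL 10); a sketch:
SELECT pid,
usename,
now() - query_start AS duration,
query
FROM pg_stat_activity
WHERE state = 'active'
AND backend_type = 'client backend'
AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;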
3. Count Connections by User and Database
If you hit the dreaded FATAL: sorry, too many clients already error, you need to know who is spamming the database. This query tells you exactly which user and which database are eating up your connection limits.
SELECT datname AS database_name,
usename AS user_name,
count(*) AS total_connections
FROM pg_stat_activity
GROUP BY datname, usename
ORDER BY total_connections DESC;
Actionable Advice:
- Compare the top number here against your max_connections setting (the default is usually 100). A quick check is shown below.
- If a single microservice (e.g., user: payment_service) has 400 connections and is crashing the DB, you know exactly which dev team to call to fix their connection pooling.
Part 2 The Traffic Jam: Locks & Blocking
A single uncommitted transaction or a massive UPDATE on a heavily used table can cause a serious traffic jam. In PostgreSQL, when one query holds a lock, other queries queue up behind it. If you don't clear the jam, your application will quickly run out of available connections and crash.
Here is how you find the culprit and safely clear the road.
1. Find Exactly Who is Blocking Whom (The Golden Query)
Historically, finding blocking queries required a massive, ugly JOIN between pg_locks and pg_stat_activity. Fortunately, modern PostgreSQL (9.6+) has a built-in function that makes this incredibly easy.
This is the Golden Query. Run this when your database feels completely frozen.
SELECT pid AS blocked_pid,
usename AS blocked_user,
pg_blocking_pids(pid) AS blocking_pids,
now() - query_start AS blocked_duration,
query AS blocked_query
FROM pg_stat_activity
WHERE pg_blocking_pids(pid)::text != '{}';
How to read the results:
- blocked_pid: The victim. This is the query waiting in line.
- blocking_pids: The culprit(s). This is the array of Process IDs (PIDs) holding the lock. This is your target.
- Look at the blocking_pids. If you see a single PID blocking 50 other queries, that single process is the root cause of your outage.
2. Cancel a Stuck Query Safely (pg_cancel_backend)
Once you identify a bad query (either taking too long from Part 1, or blocking others from the query above), you need to stop it. Always try to cancel before you terminate.
pg_cancel_backend acts like hitting Ctrl+C in the terminal. It stops the current query and rolls back the transaction, but it keeps the underlying connection alive so the application doesn't throw a harsh disconnection error.
-- Replace 12345 with the actual PID of the bad query
SELECT pg_cancel_backend(12345);
When to use this:
- When a developer accidentally runs SELECT * FROM multi_billion_row_table without a LIMIT.
- When a read-heavy analytical query is consuming too much CPU.
3. Kill a Connection Completely (pg_terminate_backend)
Sometimes, a query simply refuses to cancel. Or worse, the connection is stuck in an idle in transaction state, meaning it's holding locks while doing absolutely nothing. In this case, you need the hammer.
pg_terminate_backend forcefully severs the connection between the application and the database.
-- Replace 12345 with the actual PID of the blocking process
SELECT pg_terminate_backend(12345);
Actionable Advice & Warnings:
- Use this for stubborn idle in transaction sessions that are blocking your production traffic.
- Warning: The application will receive a fatal connection error. A well-written application with a connection pooler (like PgBouncer) will smoothly reconnect and retry, but poorly written scripts might crash. When the database is locked up, protecting the database takes priority.
Pro Tip: Need to kill multiple blocking queries at once? You can pass the pg_blocking_pids array directly into the terminate function using a subquery (a sketch follows below), but it's much safer to review the PIDs manually and terminate them one by one to avoid accidentally killing a critical background worker.
Part 3 The Slowpokes: Query Performance
Firefighting is over. You killed the blocking queries and the database is breathing again. Now it’s time for Root Cause Analysis. Why did it lock up? Why is CPU usage at 90%?
To answer these questions, you need the pg_stat_statements extension. It records execution statistics of all SQL statements executed.
Prerequisite: If you haven’t already, you must enable this extension by adding it to your shared_preload_libraries in postgresql.conf and running CREATE EXTENSION pg_stat_statements;.
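A minimal setup sketch; the config change requires a server restart before the extension starts collecting data:
-- In postgresql.conf (then restart PostgreSQL):
-- shared_preload_libraries = 'pg_stat_statements'

-- Then, in the database you want to monitor:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;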
1. Find the Most Frequently Executed Queries
Sometimes the database isn’t slow because of one terrible query; it’s slow because a decent query is being called 10,000 times per second. This is often the N+1 query problem in ORMs (like Hibernate or Prisma).
SELECT query,
calls,
round(total_exec_time::numeric, 2) AS total_time_ms,
round(mean_exec_time::numeric, 2) AS avg_time_ms
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
How to read the results:
- calls: How many times this exact query was executed.
- avg_time_ms: How long it takes on average.
- Actionable Advice: If a simple SELECT has millions of calls and a very low average execution time, the database is doing fine, but your application is hammering it. This is a perfect candidate for application-level caching (e.g., Redis or Memcached).
2. Find Queries Consuming the Most Total Time
This is the most important metric for overall system performance. A query that takes 1 second to run but is called 100,000 times hurts the server much more than a 10-second reporting query called only once a day.
SELECT query,
calls,
round(total_exec_time::numeric, 2) AS total_time_ms,
round((100 * total_exec_time / sum(total_exec_time::numeric) OVER ())::numeric, 2) AS percentage_of_overall_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Why this matters:
- percentage_of_overall_time: This column is pure gold. If the top query takes up 45% of your total database execution time, optimizing this single query will cut your server load almost in half.
- Take the query text from the result, run an EXPLAIN ANALYZE on it (see the sketch below), and start tuning.
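For example, with a hypothetical offender from the results (the table and column are made up; paste in your own query text):
-- BUFFERS shows how much data came from cache vs. disk
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;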
3. Detect Missing Indexes (Sequential Scans vs. Index Scans)
When a table doesn’t have an index for a specific WHERE clause, PostgreSQL has to read the entire table from top to bottom. This is called a Sequential Scan (Seq Scan). On a 50-million-row table, a Seq Scan will destroy your I/O performance.
This query finds the tables that are forcing the database to work the hardest.
SELECT relname AS table_name,
seq_scan,
seq_tup_read AS rows_read_via_seq_scan,
idx_scan,
idx_tup_fetch AS rows_fetched_via_index,
round(100.0 * idx_scan / nullif(idx_scan + seq_scan, 0), 2) AS index_usage_percent
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;
Actionable Advice & Warnings:
- rows_read_via_seq_scan: If this number is in the millions or billions, PostgreSQL is desperately scanning disks to find data.
- index_usage_percent: For large tables, you want this number to be above 95%. If it's low, you are likely missing an index on a frequently queried column (a hypothetical fix is sketched below).
- Warning: Small tables (e.g., a status_codes table with 50 rows) will almost always use sequential scans, because loading the index into memory is slower than just reading the 50 rows. Ignore small tables; focus on the massive ones.
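Once you know which column your WHERE clauses filter on, the fix is usually one index. A sketch with made-up table and column names; CONCURRENTLY builds the index without blocking writes, at the cost of a slower build:
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);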
Part 4 The Memory Game: Cache Efficiency
Database performance boils down to one simple rule: Memory is fast, and disk is slow. PostgreSQL tries to keep frequently accessed data and indexes in RAM (specifically in shared_buffers). When a query requests data, PostgreSQL first checks the memory. If it finds it, that's a Hit. If it doesn't, it has to fetch it from the physical disk, which is a Read. Too many disk reads will bring even the most powerful NVMe servers to their knees.
Here is how you measure if your database has enough RAM to do its job.
1. Calculate Buffer Cache Hit Ratio
This query gives you the overall health of your database memory. It calculates the percentage of time PostgreSQL found what it was looking for in RAM instead of going to the disk.
SELECT
datname AS database_name,
sum(blks_hit) AS memory_hits,
sum(blks_read) AS disk_reads,
round(sum(blks_hit) * 100.0 / nullif(sum(blks_hit + blks_read), 0), 2) AS cache_hit_ratio_percent
FROM pg_stat_database
GROUP BY datname
ORDER BY cache_hit_ratio_percent DESC;
How to read the results:
- > 99%: Excellent. Your database is serving almost everything directly from memory.
- 95%-98%: Good for mixed workloads, but keep an eye on it.
- < 90%: Your database is constantly hitting the disk. This is called disk thrashing.
Actionable Advice: If your cache hit ratio is low, you have two choices:
- Optimize your queries so they read less data (Go back to Part 3).
- Throw hardware at the problem: increase RAM and tweak your shared_buffers setting in postgresql.conf (a quick check is sketched below).
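To see the current value; the 25% figure in the comment is a common rule of thumb, not a guarantee for every workload:
SHOW shared_buffers;
-- In postgresql.conf (requires a restart); roughly 25% of system RAM is a common starting point:
-- shared_buffers = 8GB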
2. Calculate Index Hit Ratio
Having an index is great, but if that index is too large to fit into RAM, PostgreSQL still has to read the index from the disk, defeating much of its purpose. This query shows you exactly which indexes are missing the cache.
SELECT
relname AS table_name,
indexrelname AS index_name,
idx_blks_hit AS memory_hits,
idx_blks_read AS disk_reads,
round(100.0 * idx_blks_hit / nullif(idx_blks_hit + idx_blks_read, 0), 2) AS index_hit_ratio_percent
FROM pg_statio_user_indexes
WHERE idx_blks_read > 0
ORDER BY disk_reads DESC
LIMIT 10;
Why this matters:
- The index_hit_ratio_percent should also be extremely close to 99%.
- Look at the disk_reads column. The indexes at the top of this list are the ones forcing PostgreSQL to read from the disk.
Pro Tip: If a specific index has a terrible hit ratio, check its size. If you have a 50GB index but only 32GB of total server RAM, that index will never fully fit in memory. Consider dropping unused indexes, partitioning the table, or using partial indexes (e.g., CREATE INDEX ON users (created_at) WHERE active = true;) to keep the index size small and memory-friendly.
Part 5 The Disk Eaters: Space & Bloat
PostgreSQL uses a system called MVCC (Multi-Version Concurrency Control). When you UPDATE or DELETE a row, PostgreSQL doesn't actually delete the old data immediately. Instead, it creates a new version of the row and marks the old one as a dead tuple. Over time, if these dead rows aren't cleaned up, your tables and indexes will swell up like a balloon. This is called Bloat.
When your disk usage alerts start screaming, these are the queries you need to find out exactly what is eating your storage.
1. List the Top 10 Largest Tables (Including Indexes)
Before you can free up space, you need to know where the space went. This query breaks down the size of the table data versus the size of the indexes attached to it.
SELECT
relname AS table_name,
pg_size_pretty(pg_relation_size(relid)) AS table_data_size,
pg_size_pretty(pg_indexes_size(relid)) AS index_size,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
How to read the results:
- table_data_size: The actual raw data.
- index_size: The total size of all indexes on this table.
- Actionable Advice: If your index_size is significantly larger than your table_data_size, you likely have redundant, overlapping, or unused indexes. Every index takes up disk space and slows down INSERT/UPDATE operations. Drop the ones you don't need (a query to spot candidates is sketched below)!
2. List the Top 10 Largest Indexes
Sometimes, a single massive index is the culprit. This query isolates the indexes so you can target them directly.
SELECT
indexrelname AS index_name,
relname AS table_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;
Why this matters:
- A 100GB index isn’t just a storage problem; it’s a memory problem (as we saw in Part 4). If an index is huge, consider replacing it with a Partial Index (e.g., indexing only status = 'active') to drastically reduce its footprint.
3. Detect Table Bloat (Wasted Space via Dead Tuples)
Getting exact byte-for-byte bloat calculations requires complex scripts or external extensions (like pgstattuple). However, in an emergency, you just need a quick proxy to see which tables are suffering from update/delete heavy workloads. This query calculates the ratio of dead rows to live rows.
SELECT
relname AS table_name,
n_live_tup AS live_rows,
n_dead_tup AS dead_rows,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_row_ratio_percent
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY dead_row_ratio_percent DESC
LIMIT 10;
How to fix the bloat:
- dead_row_ratio_percent: If this number is high (e.g., > 20% on a large table), you have a severe bloat problem. The table is physically much larger on disk than it needs to be.
- Standard VACUUM: Marks dead rows as available for future inserts, but does not return the space to the operating system (see the sketch below).
- VACUUM FULL: Rebuilds the table completely and returns space to the OS.
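A sketch of the safe, online option (the table name is hypothetical; VERBOSE prints progress and ANALYZE refreshes planner statistics in the same pass):
VACUUM (VERBOSE, ANALYZE) your_bloated_table;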
PRO TIP (DANGER): Never run VACUUM FULL on a production table during peak hours. It requires an ACCESS EXCLUSIVE lock, meaning your application will be completely blocked from reading or writing to that table until it finishes (which could take hours). Instead, use the pg_repack extension. It removes bloat and reclaims disk space dynamically without locking the table.
Part 6 The Janitor: Autovacuum Health
PostgreSQL is designed to clean up after itself. The background process responsible for this is called Autovacuum. It has two main jobs:
- Vacuum: Clean up dead rows (tuples) so the space can be reused.
- Analyze: Gather statistics about the data distribution so the Query Planner can choose the fastest execution paths (like deciding whether to use an index or not).
If Autovacuum is misconfigured or blocked, your database will experience severe bloat (Part 5) and terrible query performance (Part 3). Here is how you check if your janitor is actually doing its job.
1. Check the Last vacuum and analyze Time for All Tables
Is Autovacuum actually visiting your most active tables? This query tells you exactly when each table was last cleaned or analyzed, either automatically or manually.
SELECT
relname AS table_name,
last_autovacuum,
last_autoanalyze,
last_vacuum,
last_analyze
FROM pg_stat_user_tables
WHERE last_autovacuum IS NOT NULL
OR last_vacuum IS NOT NULL
ORDER BY last_autovacuum DESC NULLS LAST
LIMIT 15;
How to read the results:
- Look for your most heavily updated/deleted tables. If last_autovacuum is several days ago (or NULL), you have a serious problem. The table is accumulating trash.
- last_autoanalyze is equally important. If data changes constantly but the table isn't being analyzed, the database will start making terrible decisions, like ignoring a perfectly good index.
PRO TIP: Why does Autovacuum stop working? Most of the time, it’s blocked by a Long-Running Transaction or an Abandoned Replication Slot. Autovacuum cannot clean up a dead row if an ancient, uncommitted transaction (from Part 1) might still need to see it. Kill the stuck transaction, and Autovacuum will magically wake up.
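A sketch for finding that ancient transaction (xact_start is NULL for sessions without an open transaction, so they are filtered out):
SELECT pid,
usename,
now() - xact_start AS transaction_age,
state,
query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;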
2. Find Tables with the Most Dead Tuples
While Part 5 looked at dead tuples to calculate wasted space percentage, this query looks at the sheer volume of dead rows to see where Autovacuum is struggling to keep up with your application’s write traffic.
SELECT
relname AS table_name,
n_dead_tup AS dead_tuples,
n_live_tup AS live_tuples,
round(100.0 * n_dead_tup / nullif(n_live_tup, 0), 2) AS dead_ratio_pct,
autovacuum_count
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 10;
Actionable Advice:
- dead_tuples: If you see a table with millions of dead tuples and a low autovacuum_count, your default Autovacuum settings are too lazy for that specific table.
- How to fix it: You don’t need to change the global settings. You can make Autovacuum more aggressive for just that one table by lowering the scale factor threshold:
ALTER TABLE your_heavily_updated_table
SET (autovacuum_vacuum_scale_factor = 0.05); -- Trigger vacuum when 5% of rows change (the default is 0.2, i.e. 20%)
Part 7 The Copycats: Replication & WAL Logs
In modern architectures, you rarely run a single PostgreSQL instance. You usually have a Primary (handling writes) and several Read-Replicas (handling reads). They stay in sync by constantly streaming WAL (Write-Ahead Log) files.
When replication breaks, two terrible things happen:
- Users hitting the Read-Replicas start seeing stale, outdated data.
- If the replica disconnects but leaves its Replication Slot open, the Primary will save all WAL files infinitely until its disk hits 100% and crashes.
Run these queries on your Primary server to check the health of your clones.
1. Check Replication Lag (Bytes and Seconds)
This query gives you a real-time dashboard of all currently connected replicas. It tells you exactly how far behind they are in both bytes (how much data is waiting to be written) and time.
SELECT
application_name,
client_addr AS replica_ip,
state,
sync_state,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS byte_lag,
replay_lag AS time_lag
FROM pg_stat_replication;
How to read the results:
- state: Should be streaming. If it says catchup, the replica fell behind and is furiously trying to download missing WAL files to catch up to the Primary.
- byte_lag: If this number is growing into the gigabytes (GB), your replica's disk I/O or network might be too slow to handle the write volume of the Primary.
- time_lag: How stale is the data? If this says 00:05:00, a user creating an account on the Primary won't be able to log in via the Read-Replica for 5 minutes.
2. Identify Disconnected or Lagging Read-Replicas (The Silent Killer)
If a replica physically crashes or loses network connectivity, it will disappear from the pg_stat_replication query above. You might think everything is fine, but it’s not.
If you use Replication Slots (which you should), the Primary will hold onto every WAL file until that disconnected replica comes back. This query reveals the hidden danger.
SELECT
slot_name,
plugin,
slot_type,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal_size_on_disk
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
Actionable Advice & Warnings:
- active: If this is false, the replica is dead or disconnected.
- retained_wal_size_on_disk: Look at this column very carefully. If it shows 50GB, 100GB, or more, the disconnected replica is eating your Primary's disk space.
- The Fix: If the replica is dead and never coming back, you must drop the slot immediately to allow the Primary to delete the old WAL files and free up disk space:
-- Replace 'my_dead_replica_slot' with the actual slot_name
SELECT pg_drop_replication_slot('my_dead_replica_slot');
PRO TIP: A heavily lagging replica can also cause query cancellations on the replica itself (the dreaded canceling statement due to conflict with recovery error). If you see this, tweak max_standby_streaming_delay on the replica to give read queries a bit more time to finish before the replication process kills them.
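A sketch of that tweak, run on the replica (the 120s value is only an example; a longer delay lets read queries finish but allows the replica to lag further behind):
ALTER SYSTEM SET max_standby_streaming_delay = '120s';
SELECT pg_reload_conf(); -- Applies the change without a restart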
Part 8 The Nuclear Options (Danger Zone)
Sometimes, gentle tuning and polite cancellations aren’t enough. You are doing a major migration, restoring a backup, or the monitoring data is completely skewed by a past event. You need the heavy artillery.
Warning: The queries in this section are destructive to active sessions and monitoring baselines. Use them only when you are absolutely sure of what you are doing.
1. Kill ALL Connections to a Specific Database
Have you ever tried to run DROP DATABASE my_database; only to be greeted with the infuriating error: database is being accessed by other users?
You don’t need to restart the whole PostgreSQL service to fix this. You can forcefully kick everyone out of a specific database with one query.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'your_target_database'
AND pid <> pg_backend_pid(); -- Prevents you from killing your own session!
When to use this:
- Right before dropping or renaming a database.
- During an emergency maintenance window where you need exclusive access to the database to run a complex migration.
- Why pg_backend_pid() is crucial: If you forget AND pid <> pg_backend_pid(), you will terminate your own connection before the query finishes!
2. Reset Database Statistics (pg_stat_reset)
In Part 3, we used pg_stat_statements to find the slowest queries. But what if you just released a massive performance patch yesterday? The statistics from last week will still pollute your results, making it impossible to see if your patch actually worked.
You need to clear the scoreboard.
-- Resets internal PostgreSQL statistics (cache hits, dead tuples counters, etc.)
SELECT pg_stat_reset();
-- Resets the pg_stat_statements extension (clears the slow query history)
SELECT pg_stat_statements_reset();
Actionable Advice:
- Running this does not delete any actual database data. It only resets the monitoring counters back to zero.
- Make it a habit to run pg_stat_statements_reset() immediately after deploying a major database indexing strategy or query optimization. This gives you a clean baseline to measure your success.
Conclusion
PostgreSQL is often treated like a black box by developers. When it slows down, the instinct is to panic, restart the server, or blindly throw more RAM at it. But as you’ve seen in this cheat sheet, PostgreSQL is incredibly transparent; it just waits for you to ask the right questions.