Logo ← PostgreSQL Blog

A Database Engineer’s War Notes

Introduction: The Silence Before the Storm

A Database Engineer’s War Notes

Introduction: The Silence Before the Storm

Let’s be honest: the scariest nightmare for a Database Engineer or Backend Developer isn’t a system crash. Because if the system crashes, the logs are clear, you restart it, and the error message screams, “I’m right here!” You solve it.

The real nightmare is the silence.

You’ve grabbed your morning coffee and are looking at the Grafana dashboard. CPU is at 15%, RAM is spacious, and there’s no spike in Disk I/O. Everything looks “rosy.” But the Mail, Slack, or Teams channels are blowing up: “Client is getting timeouts!”, “The button just keeps spinning!”, “ The Admin panel is locked!”

That is the moment you realize your enemy isn’t hardware insufficiency. Your enemy is the most insidious, misunderstood, and painful mechanism in the database world: Locking and Blocking.

In this article, we are going to pop the hood of PostgreSQL and dive deep into the engine. We won’t just ask “What is a lock?”; we will look at how to coolly find the “Killer” (Blocking Session) during those chaotic Production moments, the limits of the MVCC architecture, and which “innocent” mistakes we make in software architecture that strangle the database.

If you’re ready, open your terminals. We’re starting.

Part 1: The Great Misconception and the MVCC Reality

There is a misconception that newcomers to the PostgreSQL world, and even experienced developers who have been writing code for years, fall into: “PostgreSQL uses MVCC, so readers don’t block writers, and writers don’t block readers. No problem!”

On paper, this is true. Thanks to Multi-Version Concurrency Control (MVCC), when data is being updated (UPDATE), PostgreSQL continues to serve the old version (Snapshot) of that data to those trying to read it. It’s similar to Oracle’s "Undo Tablespace" logic but works differently (versioning within the Heap).

However… (In databases, there is always a “however”).

MVCC only handles row-based conflicts at the DML (Data Manipulation Language — SELECT, UPDATE, INSERT) level. But when DDL (Data Definition Language) commands that change the table structure or incorrect transaction management enters the picture, MVCC steps aside, and the Lock Manager takes the stage.

That is the moment you see thousands of connections switching to “Waiting” status in seconds and hitting the max_connections wall.

Part 2: Lock Hierarchy (Who Beats Whom?)

In PostgreSQL, locks are managed by a “Conflict Matrix.” Think of this like a chessboard; some pieces don’t threaten each other, while others can’t even stand next to one another.

Here are the critical scenarios we engineers need to know by heart:

1. The Innocent-Looking Killer: CREATE INDEX You sent a CREATE INDEX command to a table with millions of rows on a live system. What happens? PostgreSQL locks that table in SHARE mode until the index is created. This means: "People can read (SELECT), but NO ONE can write (INSERT/UPDATE/DELETE)." If creating the index takes 10 minutes, your application goes into "Read-Only" mode for 10 minutes. Customers can’t order, users can’t register. Solution: Always use CREATE INDEX CONCURRENTLY. It takes a bit longer and consumes more CPU, but it doesn't lock the table.

2. The Nuclear Button: ALTER TABLE You want to add a column (ADD COLUMN) or change a data type. PostgreSQL requires an ACCESS EXCLUSIVE LOCK for this. Meaning: "Until I finish my job, NO ONE can touch this table—neither reading nor writing." If your table is large and this operation takes time, the system completely stops. Even worse, this command waits for all existing operations on the table to finish to run (Lock Queue). If there is a long report query running at that moment, the ALTER command waits for it. Because ALTER is waiting, the millisecond-long SELECT queries coming behind it also start waiting for ALTER. Result: A chain-reaction pile-up.

Part 3: The Most Insidious Enemy: “Idle in Transaction”

My Backend Developer friend reading this, I’m talking to you. (And to myself, as an engineer who made this mistake back in the day).

If 40% of database performance issues are bad SQL, the other 40% is “Transaction Management” errors.

Imagine this scenario:

  1. The application s tarts a transaction with BEGIN;.
  2. It says UPDATE products SET stock = stock - 1 WHERE id = 100;. (ID 100 is now locked).

Then what happens?

  1. The application sends an HTTP request to a bank API to receive payment. Or it tries to send a confirmation email to the user.

Here lies the disaster. If the Bank API responds 5 seconds late, that database lock stays open for 5 seconds. During that time, 500 other customers trying to view or buy that product wait for the response of your HTTP request. Database resources are exhausted, and the connection pool fills up.

If you see “idle in transaction” in the state column of the pg_stat_activity table, your code has logic that holds the database hostage while it goes out for coffee.

Golden Rule: NEVER access the outside world (API, Disk I/O, Mail, etc.) inside transaction blocks. Do your database work and get out immediately with COMMIT.

Part 4: Detective Tools — “Wait-For Graph” Analysis

When the system locks up, panicking and randomly restarting services is amateurish. The professional approach is to find the “Root Cause” with surgical precision.

A simple SELECT * FROM pg_stat_activity WHERE state = 'active' query won't suffice. Because that query only shows you the screaming victims. We need the silent killer.

The query below is the one I’ve kept like a “Swiss Army Knife” for years. It joins the pg_locks table with pg_stat_activity to generate a Blocking Tree.

Write this down somewhere; one day it will save your life:

SELECT
        -- BLOCKED (Victim) Process Info
        bl.pid                     AS blocked_pid      ,
        ba.usename                 AS blocked_user     ,
        ba.client_addr             AS blocked_client_ip,
        ba.query                   AS blocked_query    ,
        age(now(), ba.query_start) AS blocked_duration ,
        -- BLOCKING (Culprit) Process Info
        '>>>> BLOCKED BY >>>>'        AS separator     ,
        bl2.pid                       AS blocking_pid  ,
        ba2.usename                   AS blocking_user ,
        ba2.query                     AS blocking_query,
        ba2.state                     AS blocking_state, -- If this is 'idle in transaction', culprit found!
        age(now(), ba2.backend_start) AS blocking_tx_duration
FROM
        pg_catalog.pg_locks bl JOIN pg_catalog.pg_stat_activity ba ON ba.pid = bl.pid JOIN pg_catalog.pg_locks bl2 ON (bl.locktype = bl2.locktype
        AND bl.database IS NOT DISTINCT FROM bl2.database
        AND bl.relation IS NOT DISTINCT FROM bl2.relation
        AND bl.page IS NOT DISTINCT FROM bl2.page
        AND bl.tuple IS NOT DISTINCT FROM bl2.tuple
        AND bl.virtualxid IS NOT DISTINCT FROM bl2.virtualxid
        AND bl.transactionid IS NOT DISTINCT FROM bl2.transactionid
        AND bl.classid IS NOT DISTINCT FROM bl2.classid
        AND bl.objid IS NOT DISTINCT FROM bl2.objid
        AND bl.objsubid IS NOT DISTINCT FROM bl2.objsubid
        AND bl.pid != bl2.pid) JOIN pg_catalog.pg_stat_activity ba2 ON ba2.pid = bl2.pid
WHERE
        NOT bl.granted -- Waiting (Haven't got the lock yet)
AND     bl2.granted;   -- Holding the lock

When you run this query: If you see idle in transaction in the blocking_state section and the blocking_query is empty (or shows the last run query), know that there is a transaction left open on the application side. Killing that blocking_pid with the pg_terminate_backend(PID) command will unclog the blockage instantly like a drain opener.

Part 5: A Little-Known Danger: Foreign Keys and Locking

The topic that gives the biggest headache to those moving from Oracle to PostgreSQL: Foreign Key (FK) indexes.

Let’s say you have an “Orders” table and a connected “OrderItems” table. If you delete a row from the “Orders” table (DELETE), PostgreSQL has to go and check the "OrderItems" table to see if there is a record belonging to this order (Referential Integrity).

If there is no index on the order_id (Foreign Key) column in the "OrderItems" table, PostgreSQL might have to perform a Full Table Scan on "OrderItems" or hold too many row-level locks to perform this check.

In high-traffic systems, when deleting or updating records from the Parent table, if there is no index on the FK column in the Child table, you will experience serious locking issues (and Deadlocks). Advice: Absolutely create an index for every column you define as a Foreign Key. This improves both Join performance and reduces Locking issues.

Part 6: What is a Deadlock and How to Prevent It?

A Deadlock is two stubborn goats meeting on a bridge.

  • Transaction A: Locked Table X, wants Table Y.
  • Transaction B: Locked Table Y, wants Table X.

Since neither steps back, they wait forever. Fortunately, PostgreSQL has a Deadlock Detector mechanism. By default, it waits for 1 second (deadlock_timeout), realizes the situation, and terminates one transaction with an error saying, "One must die so the other can live."

Ways to Protect Against Deadlocks:

  • Standard Ordering: If multiple tables are to be updated in your application, always update them in the same order (e.g., first Master, then Detail table).
  • Short Transactions: The shorter the transaction, the lower the probability of conflict.
  • Use For Update: If you are going to read a row and then update it, declare your intention from the start by saying SELECT ... FOR UPDATE while reading. This prevents others from intervening.

Conclusion: Engineering is about Prevention

Locking issues are inherent to the nature of a database. The problem isn’t the existence of locks, but the failure to manage them.

Before finishing the article, make sure to check these 3 settings (postgresql.conf) for your Production environments:

  • lock_timeout: It is off (infinite) by default. Definitely set this to a value (e.g., 10s or 30s). Do not allow a query to lock the database forever. Let it fail, let it crash, but keep the system standing.
  • log_lock_waits = on: Log queries waiting for locks. This will give you the answer to the question, "At what hours are we experiencing bottlenecks?"
  • statement_timeout: Automatically terminate queries that take too long (e.g., exceeding 1 hour).

Remember; a good Database Engineer isn’t the one who puts out the fire fastest when it breaks out, but the one who prevents that fire from breaking out with their architectural decisions.

I hope this article ensures you look at your screen more calmly and confidently during the next “System is not responding” crisis.

See you in the next in-depth analysis, [Kemal Öz/PostgreSQL Engineer]