Netflix’s $100M Architecture
Notes from a database engineer working on high-performance, high-availability data systems: the dramatic end of vertical scaling, and the hidden architectural costs of horizontal distribution.
Hello.
I am a database engineer who has spent the past several years designing and managing critical Postgres infrastructure for high-traffic systems, often handling thousands of transactions per second (TPS). During this time, I have repeatedly witnessed the dramatic moment when single-server limitations bring massive systems to a grinding halt.
Today, we will discuss the biggest taboo in modern system architecture: Sharding (Horizontal Partitioning). While the idea seems simple, why does the sharding operation turn into a multi-hundred-million-dollar architectural gamble for large technology companies? Why do they risk the very heart of their systems?
This article will not just discuss what we do, but why large-scale systems make such painful decisions, why selecting the sharding key is an engineering and business problem rather than a purely mathematical one, and how a single bad decision can cripple an entire system.

1. The Final Curtain of Vertical Scaling: Why It Fails
For a growing company, the first and easiest solution is always Vertical Scaling: moving to a bigger machine. However, this path eventually leads to a technical and economic cliff.
- Technical Limits and the NUMA Architecture: Even the largest modern servers no longer treat memory and CPU as a single monolithic block. In Non-Uniform Memory Access (NUMA) architecture, a CPU accessing memory attached to another socket incurs additional latency. In a massive single server, this NUMA-related contention between database processes prevents expensive hardware from delivering the expected performance. In PostgreSQL, performance does not scale linearly with the increase in cores and memory.
- Licensing Costs (The Oracle Legacy): Especially in migrations from Oracle to Postgres, the old architecture’s per-core licensing model forced engineers toward more distributed, less core-intensive systems. This was the only way to escape hundreds of millions of dollars in licensing expenses.
When vertical scaling ceases to be a solution, Horizontal Scaling (Sharding) becomes the only option: distributing the load across hundreds of smaller, cheaper, and more manageable servers.
2. The Sharding Key: A Business Riddle, Not Just Mathematics
Sharding means partitioning data. However, the criterion you choose to partition by will determine your architecture’s fate. This decision is called the Sharding (Distribution) Key.
Choosing the wrong key renders the entire Sharding operation, even if technically executed flawlessly, functionally useless.
Real-World Case Study: The Drama of Hot Shards
Consider a large e-commerce platform that chose to partition its data based on geographical regions (Shard Key: region_id).
- Goal: To ensure data localization and isolate performance regionally.
- Result: A major marketing campaign launched, and the traffic load for a single, highly populated region (the Hot Shard) instantly surpassed the combined traffic of all other shards. While this Hot Shard was receiving 50,000 TPS, the remaining 99 shards were running at 100 TPS.
- Cost: The entire system, despite having 100 times the capacity, collapsed due to that single slow Shard. Immense financial and engineering resources were invested in the database infrastructure, but one architectural decision invalidated the entire investment.
Engineering Lesson: The sharding key must be based on future workload predictions and uniform distribution, not just the current data distribution. A proper key must have high cardinality and the potential to evenly distribute access and growth (e.g., a hash of the customer_id or a tenant_id).
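The lesson above can be sketched in a few lines of Python (a hypothetical setup I made up for illustration: 100 shards, synthetic customer IDs). Hashing the key spreads even strictly sequential IDs evenly across shards, which is exactly what a region-based key fails to do:

```python
import hashlib

NUM_SHARDS = 100

def shard_for(customer_id: str) -> int:
    """Map a customer_id to a shard via a stable hash.

    A cryptographic hash is used instead of Python's built-in hash(),
    which is salted per process, so the mapping stays stable across
    application restarts and hosts.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Sequential IDs spread evenly across shards, unlike a region-based
# key where one popular region overloads a single shard.
counts = [0] * NUM_SHARDS
for i in range(100_000):
    counts[shard_for(f"customer-{i}")] += 1

print(min(counts), max(counts))  # both cluster near 1000 per shard
```

The same property lets you add capacity predictably: no single shard's load depends on which customers happen to be popular this week.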
3. Distributed Transactions: Losing the Power of SQL
Sharding turns tasks that were simple within a single monolithic database, such as ACID-compliant transactions and complex JOINs, into a nightmare.
The Distributed JOIN Problem
In a Sharded architecture, the work you once handled with a single SQL JOIN must now be done at the application layer (Application Layer Join):
- The application queries Shard A to fetch a list of necessary IDs.
- The application queries Shards B (and C, D, E) to retrieve the data matching those IDs.
- The application joins this data in its own memory and returns the result.
This cascading operation significantly increases latency and is highly susceptible to network delays. Most complex Stored Procedure and Trigger logic in Postgres is incompatible with sharding and must be migrated to application code. This represents an enormous development and testing cost.
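The three steps above can be sketched as follows. For illustration, the shards are modeled as plain in-memory dicts; in production each would be a separate PostgreSQL connection, and every step would pay a network round trip:

```python
# Hypothetical data: orders live on shard A, customers on shards B and C.
shard_a_orders = [
    {"order_id": 1, "customer_id": "c-17"},
    {"order_id": 2, "customer_id": "c-42"},
]
shard_b_customers = {"c-17": {"name": "Alice"}}
shard_c_customers = {"c-42": {"name": "Bob"}}

def application_layer_join():
    # 1. Query shard A for the list of IDs we need.
    ids = {o["customer_id"] for o in shard_a_orders}
    # 2. Fan out to the customer shards for the matching rows.
    customers = {}
    for shard in (shard_b_customers, shard_c_customers):
        customers.update({cid: c for cid, c in shard.items() if cid in ids})
    # 3. Join the two result sets in application memory.
    return [
        {"order_id": o["order_id"], "name": customers[o["customer_id"]]["name"]}
        for o in shard_a_orders
    ]

print(application_layer_join())
# [{'order_id': 1, 'name': 'Alice'}, {'order_id': 2, 'name': 'Bob'}]
```

What was one declarative SQL JOIN is now fan-out logic, error handling, and memory pressure that the application team owns forever.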
Two-Phase Commit (2PC) Failures
When performing an ACID-compliant transaction (like a money transfer) across different Shards, Distributed Transaction Management is required. The 2PC protocol ensures all Shards either successfully commit the transaction or roll it back entirely.
However, 2PC is notoriously slow, and even the smallest failure in one Shard can cause the entire transaction to lock up (Deadlock) or leave the system in an inconsistent state. In high-traffic systems, 2PC is avoided. Instead, engineers adopt complex application-layer design patterns like the Saga Pattern or rely on Eventual Consistency, which introduces new levels of complexity.
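A minimal sketch of the Saga idea (illustrative only, not a production framework): each step pairs an action with a compensating action, and on failure the already-completed steps are undone in reverse order instead of holding cross-shard locks the way 2PC would:

```python
class SagaFailed(Exception):
    pass

def run_saga(steps):
    """Run (action, compensation) pairs; compensate in reverse on failure."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception as exc:
        for compensation in reversed(completed):
            compensation()  # undo already-committed local transactions
        raise SagaFailed("rolled back via compensations") from exc

# Hypothetical money transfer where the second shard fails mid-flight:
log = []

def debit_shard_a():
    log.append("debit A")

def credit_shard_b():
    raise RuntimeError("shard B unavailable")

try:
    run_saga([
        (debit_shard_a, lambda: log.append("compensate: credit A back")),
        (credit_shard_b, lambda: log.append("compensate: reverse B")),
    ])
except SagaFailed:
    pass

print(log)  # ['debit A', 'compensate: credit A back']
```

Note what the pattern buys and costs: no global lock is ever held, but between the debit and its compensation the system is visibly inconsistent, which is precisely the Eventual Consistency trade-off mentioned above.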
4. PostgreSQL’s Power: Distributed Configuration and Automation
Netflix, after migrating to AWS, invested heavily in automation, building tools such as Spinnaker to manage its distributed systems. A Sharded PostgreSQL architecture demands the same automation discipline.
While extensions like Citus transform PostgreSQL into a logical distributed system, the real challenge is consistently managing the configuration, backup, and updates of hundreds of Worker Nodes (Shards).
- Orchestration with Ansible: Managing the configuration of every Shard (e.g., postgresql.conf settings, users, replication) in an idempotent manner (where the same command run repeatedly yields the same result without unintended side effects) is mandatory for keeping this complex cluster coherent.
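A playbook fragment along these lines illustrates the idempotency point (the host group, paths, and version number here are assumptions for the sketch): re-running it changes nothing unless the rendered file actually differs, and a reload only fires when it does.

```yaml
- hosts: pg_shards
  become: true
  tasks:
    - name: Render postgresql.conf from a single shared template
      ansible.builtin.template:
        src: templates/postgresql.conf.j2
        dest: /etc/postgresql/16/main/postgresql.conf
        owner: postgres
        mode: "0644"
      notify: Reload PostgreSQL
  handlers:
    - name: Reload PostgreSQL
      ansible.builtin.service:
        name: postgresql
        state: reloaded
```

One template, hundreds of shards, zero configuration drift: that is the only sustainable way to run a fleet this size.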
Conclusion: Sharding is an Act of Architectural Will
The transition to Sharding for Netflix and other tech giants was a necessity. This operation, however, fundamentally changes not just the database technology, but also the application code, development processes, and monitoring methodology.
Before you begin sharding:
- Push Vertical Limits: Ensure you have exhausted all solutions that support vertical scaling, including advanced PostgreSQL tuning (autovacuum, shared_buffers, etc.), extensive Read Replicas, and PgBouncer connection pooling.
- The Sharding Key: This is the single most important and irreversible decision for your engineering team. Never make this decision without deeply analyzing your business requirements and growth projections.
- Automation: You cannot manage hundreds of Shards manually. Automation (Ansible, Terraform) and advanced monitoring (with custom metrics in Prometheus/Grafana) to detect anomalous traffic increases on individual Shards are the insurance policy for this architecture’s survival.
Sharding is the final exam for high-traffic systems. Success requires not just writing code, but holistically understanding the economic, architectural, and operational costs involved.
References
To support the claims regarding architectural challenges and major industry shifts, the following publicly available resources and widely recognized concepts were referenced:
- The Netflix Migration Story: The historical context of Netflix moving away from monolithic data centers and Oracle to AWS and distributed systems (primarily Cassandra, eventually incorporating different data stores like CockroachDB and others) is extensively documented in their early blog posts and AWS re:Invent presentations.
- PostgreSQL and NUMA Architecture: The performance implications of NUMA for heavily scaled Postgres instances are a well-documented concern in the database performance community.
- Citus Data Technology: The core concept of extending PostgreSQL for horizontal scaling.
- Distributed Systems Patterns: Concepts like the Saga Pattern and Eventual Consistency are standard in high-availability, microservices-based architectures.
- The Economics of Database Licensing: The financial pressure imposed by traditional database vendors (like Oracle) is a major driving force behind the adoption of open-source and distributed databases.