In the previous article, you learned how caching speeds up systems. But behind every cache, there is a database. The database is where your data lives permanently.

Choosing the right database is one of the most important decisions in system design. It affects performance, scalability, and how easy your system is to maintain.

SQL Databases

SQL (Structured Query Language) databases store data in tables with rows and columns. They follow a fixed schema — you define the structure before inserting data.

Users table:
  +----+---------+-------------------+------------+
  | id | name    | email             | created_at |
  +----+---------+-------------------+------------+
  | 1  | Alex    | alex@example.com  | 2026-01-15 |
  | 2  | Sam     | sam@example.com   | 2026-02-20 |
  | 3  | Jordan  | jordan@example.com| 2026-03-10 |
  +----+---------+-------------------+------------+
  • PostgreSQL — the most feature-rich open-source SQL database. Supports JSON, full-text search, geospatial queries. Used by Instagram, Uber, Stripe.
  • MySQL — the most popular open-source SQL database. Simple, reliable, well-documented. Used by Facebook, Twitter, YouTube.
  • SQLite — a lightweight database stored in a single file. Great for mobile apps and embedded systems.
  • Microsoft SQL Server — enterprise SQL database. Used in corporate environments.

Strengths of SQL Databases

  • ACID compliance — guarantees data integrity (more on this below)
  • Complex queries — JOINs, subqueries, aggregations across multiple tables
  • Fixed schema — enforces data structure, prevents bad data
  • Mature ecosystem — decades of tooling, documentation, and expertise
  • Transactions — group multiple operations into one atomic unit

NoSQL Databases

NoSQL databases store data in flexible formats without a fixed schema. There are several types:

1. Document Databases

Store data as JSON-like documents. Each document can have a different structure.

MongoDB document:
{
  "_id": "user_001",
  "name": "Alex",
  "email": "alex@example.com",
  "address": {
    "city": "Berlin",
    "country": "Germany"
  },
  "hobbies": ["coding", "hiking", "photography"]
}

Popular: MongoDB, CouchDB, Amazon DocumentDB

Best for: content management, user profiles, product catalogs — data with varied structure.

2. Key-Value Databases

The simplest type. Store data as key-value pairs. Very fast for lookups by key.

Key: "session:abc123"   Value: {"user_id": 1, "expires": "2026-05-16T12:00:00"}
Key: "user:1:name"      Value: "Alex"
Key: "cache:popular"    Value: "[1, 5, 12, 8, 3]"

Popular: Redis, Amazon DynamoDB, etcd

Best for: caching, sessions, shopping carts, real-time leaderboards.

3. Column-Family Databases

Store data in columns instead of rows. Optimized for reading and writing large amounts of data.

Row key: "user_001"
  Column family "profile":  name="Alex", email="alex@example.com"
  Column family "activity": last_login="2026-05-16", login_count=150
  Column family "metrics":  posts=42, followers=1200

Popular: Apache Cassandra, HBase, Google Bigtable

Best for: time-series data, IoT data, analytics, write-heavy workloads with billions of rows.

4. Graph Databases

Store data as nodes and relationships. Optimized for queries about connections between data.

Graph structure:
  (Alex) --[FOLLOWS]--> (Sam)
  (Alex) --[FOLLOWS]--> (Jordan)
  (Sam)  --[FOLLOWS]--> (Jordan)
  (Jordan) --[LIKES]--> (Post: "System Design is Fun")

Popular: Neo4j, Amazon Neptune, ArangoDB

Best for: social networks, recommendation engines, fraud detection, knowledge graphs.

ACID Properties

ACID is a set of properties that guarantee database transactions are processed reliably. SQL databases are typically ACID-compliant. Most NoSQL databases sacrifice some ACID properties for performance and scalability.

Atomicity

A transaction is all-or-nothing. Either every operation in the transaction succeeds, or none of them do.

Transfer $100 from Alex to Sam:
  1. Deduct $100 from Alex's account
  2. Add $100 to Sam's account

Atomicity guarantees:
  EITHER both steps happen (success)
  OR neither step happens (failure — rollback)

You never end up with $100 deducted from Alex but not added to Sam.

Consistency

Every transaction brings the database from one valid state to another. Database rules (constraints, foreign keys) are always enforced.

Rule: account balance must be >= 0

Alex has $50. Transfer $100 to Sam.
  Step 1: Deduct $100 from Alex --> balance = -$50 (violates rule!)
  Transaction is rolled back. Database stays consistent.

Isolation

Concurrent transactions do not interfere with each other. Each transaction appears to execute in isolation, as if it were the only one running.

Two transactions running at the same time:
  Transaction A: Read Alex's balance ($500)
  Transaction B: Read Alex's balance ($500)
  Transaction A: Deduct $100 --> write $400
  Transaction B: Deduct $200 --> write $300

Without isolation: Alex's balance is $300 (lost Transaction A's update!)
With isolation: Transactions run one at a time. Final balance = $200 (correct)

Durability

Once a transaction is committed, it stays committed. Even if the server crashes, the data is not lost.

1. Transaction committed: Alex's balance = $400
2. Server crashes
3. Server restarts
4. Alex's balance is still $400 (data was written to disk)

SQL vs NoSQL — Decision Framework

The question is not “which is better?” but “which fits your use case?”

Choose SQL When:

  • Your data has relationships — users have orders, orders have items, items belong to categories. SQL JOINs handle this naturally.
  • You need ACID transactions — financial systems, inventory management, booking systems.
  • Your schema is well-defined — you know the structure of your data upfront.
  • You need complex queries — aggregations, reporting, analytics across multiple tables.

Choose NoSQL When:

  • Your data has no fixed schema — user-generated content, logs, IoT data.
  • You need extreme write throughput — billions of events per day (time-series, analytics).
  • You need horizontal scaling — NoSQL databases are designed to run across many machines.
  • Your access patterns are simple — mostly key-value lookups or document reads.
  • Low latency is critical — key-value stores like Redis return data in under 1 millisecond.

Common Combinations

Most real systems use multiple databases:

Typical e-commerce system:
  [PostgreSQL] — users, orders, payments (needs ACID)
  [Redis]      — sessions, caching, rate limiting
  [MongoDB]    — product catalog (flexible schema)
  [Elasticsearch] — search functionality

This is called polyglot persistence — using the right database for each job.

Database Replication

Replication means keeping copies of data on multiple machines. It improves availability (if one machine dies, others have the data) and read performance (distribute reads across copies).

Leader-Follower Replication

One database (the leader) handles all writes. Copies (followers or replicas) handle reads.

Leader-Follower:

  [App Server] --writes--> [Leader Database]
                                |
                                |--> replication --> [Follower 1] <--reads-- [App Server]
                                |--> replication --> [Follower 2] <--reads-- [App Server]
                                |--> replication --> [Follower 3] <--reads-- [App Server]

How it works:

  1. All writes go to the leader
  2. The leader sends changes to followers (replication log)
  3. Followers apply the changes to their own copy
  4. Read requests go to any follower

Pros: Simple. Scales reads easily by adding more followers. Cons: The leader is a single point of failure for writes. There is a delay (replication lag) between the leader writing data and followers receiving it.

Leader-Leader Replication

Multiple databases accept writes. Each leader replicates to the other leaders.

Leader-Leader:

  [App Server US] --writes--> [Leader US] <--replication--> [Leader EU] <--writes-- [App Server EU]

Pros: No single point of failure for writes. Users can write to the nearest datacenter. Cons: Write conflicts. What if two users update the same record on different leaders at the same time? You need conflict resolution logic.

Replication Lag

Replication is not instant. There is a delay between when data is written to the leader and when it appears on followers. This is called replication lag.

Timeline:
  T=0ms:   User updates profile on leader
  T=0ms:   User reads profile from follower --> sees OLD data
  T=100ms: Replication completes
  T=100ms: User reads profile from follower --> sees NEW data

This is called eventual consistency — followers will eventually have the same data as the leader, but there is a delay. For many applications (social media, content platforms), a few hundred milliseconds of lag is acceptable. For others (banking, inventory), it is not.

Solutions for Replication Lag

1. Read-your-own-writes: After a user writes data, route their reads to the leader (not a follower) for a short period. Other users can still read from followers.

2. Causal consistency: If operation B depends on operation A, ensure B is not visible until A is replicated. This prevents confusing scenarios.

Database Sharding

Sharding is splitting data across multiple database machines. Each machine holds a portion (a shard) of the total data.

Without sharding:
  [Single Database: 1 TB, all data]

With sharding:
  [Shard 1: 250 GB, users A-G]
  [Shard 2: 250 GB, users H-N]
  [Shard 3: 250 GB, users O-T]
  [Shard 4: 250 GB, users U-Z]

Sharding is necessary when a single database cannot handle the data volume or traffic. But it adds significant complexity.

Sharding Strategies

1. Range-Based Sharding

Split data based on ranges of a key value.

Range-based sharding by user ID:
  Shard 1: user_id 1 to 1,000,000
  Shard 2: user_id 1,000,001 to 2,000,000
  Shard 3: user_id 2,000,001 to 3,000,000

Pros: Simple. Range queries are easy (find all users with ID 500-1000). Cons: Uneven distribution. If new users are more active, the last shard gets more traffic (a “hot shard”).

2. Hash-Based Sharding

Apply a hash function to the key and use the result to determine the shard.

Hash-based sharding:
  shard_number = hash(user_id) % number_of_shards

  hash("user_001") % 4 = 2 --> Shard 2
  hash("user_002") % 4 = 0 --> Shard 0
  hash("user_003") % 4 = 3 --> Shard 3

Pros: Even distribution of data across shards. No hot shards. Cons: Range queries are impossible (you cannot find all users in a range because they are scattered). Adding or removing shards requires rehashing all data (this is solved by consistent hashing, covered in a future article).

3. Directory-Based Sharding

A lookup table tells you which shard holds each piece of data.

Directory table:
  user_001 --> Shard 2
  user_002 --> Shard 1
  user_003 --> Shard 3

Lookup:
  1. Check directory: where is user_002?
  2. Directory says: Shard 1
  3. Query Shard 1

Pros: Flexible. You can move data between shards by updating the directory. Cons: The directory is a single point of failure. Every query requires an extra lookup.

Sharding Challenges

Sharding is powerful but complex. Here are the main challenges:

1. Cross-shard queries. JOINs across shards are very expensive because data lives on different machines.

Problem: Find all orders for user_123
  User data: Shard 1
  Order data: Shard 3

  You need to query two different machines and combine results.

2. Rebalancing. When you add or remove shards, you must redistribute data. This is complex and resource-intensive.

3. Referential integrity. Foreign keys do not work across shards. You must enforce data integrity in your application code.

4. Operational complexity. Instead of one database to monitor and back up, you have many.

When to Shard

Sharding should be a last resort. Try these first:

  1. Optimize queries — add indexes, rewrite slow queries
  2. Vertical scaling — upgrade to a bigger database machine
  3. Read replicas — distribute read traffic across replicas
  4. Caching — reduce database load with Redis

If none of these help, then shard.

Database Indexing

An index is a data structure that speeds up data retrieval. Without an index, the database must scan every row to find what you are looking for (a “full table scan”).

Without index:
  Find user with email = "alex@example.com"
  Database scans all 10,000,000 rows
  Time: 5 seconds

With index on email column:
  Database uses index to jump directly to the row
  Time: 1 millisecond

How indexes work: An index is like the index at the back of a book. Instead of reading every page to find “caching,” you look up “caching” in the index and it tells you the page number.

Most SQL databases use B-tree indexes by default. They work well for equality lookups (WHERE email = ‘x’) and range queries (WHERE age BETWEEN 20 AND 30).

Trade-off: Indexes speed up reads but slow down writes. Every time you insert or update a row, the index must also be updated. Do not add indexes on every column — only on columns you query frequently.

For a deep dive into indexing, see the SQL Tutorial #6: Indexes and Performance.

Choosing the Right Database

Here is a decision framework:

Start with these questions:
  1. What does your data look like? (structured, semi-structured, graph)
  2. What are your access patterns? (reads vs writes, simple lookups vs complex queries)
  3. How much data do you have? (GBs, TBs, PBs)
  4. Do you need ACID transactions?
  5. How important is horizontal scaling?

Decision tree:
  Need ACID + complex queries + relationships?
    --> PostgreSQL or MySQL

  Need flexible schema + document storage?
    --> MongoDB

  Need extreme write throughput + horizontal scaling?
    --> Cassandra

  Need fast key-value lookups + caching?
    --> Redis or DynamoDB

  Need full-text search?
    --> Elasticsearch

  Need graph queries (social connections, recommendations)?
    --> Neo4j

  Need time-series data (metrics, IoT)?
    --> TimescaleDB or InfluxDB

Real-World Database Choices

CompanyPrimary DBOther DBsWhy
InstagramPostgreSQLCassandra, RedisACID for users/orders, Cassandra for feed
UberMySQLCassandra, RedisMySQL sharded for trips, Cassandra for analytics
NetflixCassandraMySQL, RedisCassandra for massive scale, MySQL for billing
TwitterMySQLManhattan (custom), RedisMySQL sharded, custom DB for tweets
StripePostgreSQLRedisACID is critical for payments

Notice that most large companies use multiple databases. They pick the best tool for each job.

Interview Tips

When discussing databases in a system design interview:

  1. Always justify your choice. Do not just say “I will use PostgreSQL.” Say “I will use PostgreSQL because this data has relationships and we need ACID transactions for payments.”
  2. Know the trade-offs of SQL vs NoSQL. Be ready to explain when each is better.
  3. Mention replication early. “I will use a leader-follower setup with read replicas to handle read traffic.”
  4. Do not shard unless necessary. “I would start with a single database, add read replicas, and only shard when we exceed the capacity of a single machine.”
  5. Discuss indexing. “I would add an index on the email column since we query by email frequently.”
  6. Combine databases. “For this system, I would use PostgreSQL for user data and Redis for caching and sessions.”

What’s Next?

In the next article, System Design #6: CAP Theorem and Consistency Patterns, you will learn:

  • CAP Theorem: Consistency, Availability, Partition Tolerance
  • Why you must choose between consistency and availability
  • Strong consistency vs eventual consistency
  • Real-world examples of CP and AP systems
  • Quorum reads and writes

This is part 5 of the System Design Tutorial series. Follow along to learn system design from scratch.