System Design #7: Message Queues — Kafka, RabbitMQ, SQS

In the previous article, you learned about the CAP Theorem and consistency patterns. Now let us look at one of the most important building blocks in distributed systems: message queues.

Almost every large-scale system uses message queues. They are the backbone of asynchronous communication between services.

What is a Message Queue?

A message queue is a system that stores messages sent by one service (the producer) and delivers them to another service (the consumer). The producer and consumer do not need to be online at the same time.

Think of it like a mailbox. You put a letter in the mailbox (produce a message). The mail carrier picks it up later (consumes the message). You do not need to wait for the carrier to arrive.

Synchronous communication (without queue):

  [Order Service] --HTTP request--> [Payment Service]
       |                                   |
       |<------HTTP response---------------|
       |
  Order Service WAITS for Payment Service to respond.
  If Payment Service is slow or down, Order Service is stuck.


Asynchronous communication (with queue):

  [Order Service] --publish--> [Message Queue] --consume--> [Payment Service]
       |
  Order Service sends the message and moves on immediately.
  It does not wait for Payment Service.
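
As a rough illustration of the asynchronous flow above, the following sketch uses Python's standard-library `queue.Queue` as an in-process stand-in for a real broker. The names `place_order` and `payment_service` are hypothetical, and a production system would talk to Kafka, RabbitMQ, or SQS over the network instead.

```python
import queue
import threading
import time

# In-process stand-in for a message broker (assumption for illustration).
order_queue = queue.Queue()
processed = []

def payment_service():
    """Consumer: pulls orders at its own pace until it sees the stop marker."""
    while True:
        order = order_queue.get()
        if order is None:            # sentinel: no more messages
            break
        time.sleep(0.01)             # simulate slow payment processing
        processed.append(order["order_id"])

def place_order(order_id):
    """Producer: enqueue the order and return without waiting for payment."""
    order_queue.put({"order_id": order_id})
    return "accepted"

consumer = threading.Thread(target=payment_service)
consumer.start()

responses = [place_order(i) for i in range(5)]   # each call returns right away
order_queue.put(None)                            # tell the consumer to stop
consumer.join()
```

`place_order` returns as soon as the message is enqueued. The simulated payment work happens later on the consumer thread.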

Why Use Message Queues?

1. Decoupling

Services do not need to know about each other. The order service just publishes an “order created” event. It does not care who processes it. Tomorrow, you can add an analytics service that also reads the same events — without changing the order service.

Without queue (tight coupling):

  [Order Service] --> [Payment Service]
  [Order Service] --> [Email Service]
  [Order Service] --> [Analytics Service]

  Order Service knows about ALL downstream services.
  Adding a new service = changing Order Service code.


With queue (loose coupling):

  [Order Service] --> [Message Queue] --> [Payment Service]
                                     --> [Email Service]
                                     --> [Analytics Service]

  Order Service knows only about the queue.
  Adding a new service = subscribe to the queue. No changes to Order Service.

2. Load Leveling

Queues absorb traffic spikes. Suppose a flash sale sends 10,000 orders per second while your payment service can process only 1,000 per second. For every second the spike lasts, the queue holds the extra 9,000 messages, and the payment service works through them at its own pace.

Traffic spike without queue:

  10,000 req/sec --> [Payment Service: max 1,000/sec] --> 9,000 requests DROPPED

Traffic spike with queue:

  10,000 req/sec --> [Queue: holds messages] --> [Payment Service: 1,000/sec]
                     After a one-second spike: 9,000 messages buffered
                     Payment Service drains the backlog over the next 9 seconds
                     (assuming no new traffic arrives in the meantime).
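
The arithmetic behind load leveling can be checked with a small simulation. This is a minimal sketch that assumes a one-second spike followed by no new traffic; `simulate_backlog` is a hypothetical helper, not part of any queue library.

```python
def simulate_backlog(arrivals_per_sec, capacity_per_sec):
    """Return the queue depth at the end of each second."""
    depth = 0
    history = []
    for arrived in arrivals_per_sec:
        depth += arrived                          # new messages enter the queue
        depth -= min(depth, capacity_per_sec)     # consumer drains what it can
        history.append(depth)
    return history

# One second of 10,000 orders, then ten quiet seconds.
arrivals = [10_000] + [0] * 10
history = simulate_backlog(arrivals, capacity_per_sec=1_000)
```

Here `history[0]` is 9,000 (the backlog right after the spike) and the depth reaches 0 nine seconds later. If the spike lasted longer, the backlog would keep growing by 9,000 messages per second until traffic dropped below the consumer's capacity.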

3. Reliability

If a consumer crashes, the message stays in the queue. When the consumer restarts, it picks up where it left off. As long as the broker itself stores messages durably, no data is lost.

4. Ordering

Some queues guarantee message ordering. If you send messages A, B, C in that order, the consumer receives them in the same order.

Two Models: Point-to-Point vs Publish-Subscribe

Point-to-Point (Queue)

One message is delivered to exactly one consumer. If multiple consumers are listening, only one gets each message. This is useful for task distribution.

Point-to-Point:

  Producer --> [Queue] --> Consumer A (gets message 1)
                       --> Consumer B (gets message 2)
                       --> Consumer C (gets message 3)

  Each message is processed by exactly ONE consumer.
  Good for: order processing, task distribution.
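
A minimal sketch of the point-to-point model: each message is handed to exactly one consumer. Round-robin dispatch is an assumption made here for simplicity. Real brokers usually hand the next message to whichever consumer is free. The `distribute` function is hypothetical.

```python
from collections import deque
from itertools import cycle

def distribute(messages, consumer_names):
    """Hand each message to exactly one consumer, in round-robin order."""
    pending = deque(messages)
    received = {name: [] for name in consumer_names}
    for name in cycle(consumer_names):
        if not pending:
            break
        received[name].append(pending.popleft())
    return received

received = distribute(["msg1", "msg2", "msg3", "msg4"], ["A", "B", "C"])
```

No message appears in more than one consumer's list, which is what makes this model suitable for splitting work across workers.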

Publish-Subscribe (Topic)

One message is delivered to ALL subscribers. Every consumer gets a copy of every message. This is useful for broadcasting events.

Publish-Subscribe:

  Producer --> [Topic] --> Subscriber A (gets ALL messages)
                       --> Subscriber B (gets ALL messages)
                       --> Subscriber C (gets ALL messages)

  Every subscriber gets a copy of every message.
  Good for: event notifications, logging, analytics.
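
For contrast with point-to-point, here is a small in-memory sketch of publish-subscribe. The `Topic` class is hypothetical and delivers synchronously. A real broker would store the message and deliver it to each subscriber independently.

```python
class Topic:
    """Toy pub-sub topic: every subscriber receives every message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        for handler in self._subscribers:
            handler(message)      # each subscriber gets its own copy

email_inbox, analytics_inbox = [], []

orders = Topic()
orders.subscribe(email_inbox.append)
orders.subscribe(analytics_inbox.append)

orders.publish("order created: 1001")
orders.publish("order created: 1002")
```

The publisher never references the email or analytics code. Adding a third subscriber is one more `subscribe` call and no change to the publishing side.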

Apache Kafka

Kafka is the most popular distributed event streaming platform. It was created at LinkedIn and open-sourced in 2011. Kafka is designed for high throughput, durability, and horizontal scaling.

How Kafka Works

Kafka organizes messages into topics. A topic is a named stream of messages (like “orders” or “user-events”). Each topic is split into partitions for parallelism.

Kafka Architecture:

  Producers --> [Topic: "orders"]
                  |-- Partition 0: [msg1, msg4, msg7, ...]
                  |-- Partition 1: [msg2, msg5, msg8, ...]
                  |-- Partition 2: [msg3, msg6, msg9, ...]
                                        |
                                   Consumer Group
                                   |-- Consumer A reads Partition 0
                                   |-- Consumer B reads Partition 1
                                   |-- Consumer C reads Partition 2
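
One detail the diagram leaves out is how a producer picks a partition. A common approach is to hash the message key so that all messages with the same key land in the same partition and keep their relative order. The sketch below assumes this key-based scheme. Kafka's default partitioner uses a murmur2 hash, and `zlib.crc32` is used here only as a stable stand-in. `choose_partition` is a hypothetical helper.

```python
import zlib

NUM_PARTITIONS = 3  # matches the three partitions in the diagram above

def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Map a message key to a partition index in a stable way."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

partitions = {p: [] for p in range(NUM_PARTITIONS)}
events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "add_to_cart"),
    ("user-1", "checkout"),
]
for key, value in events:
    partitions[choose_partition(key)].append((key, value))
```

All three `user-1` events end up in one partition in the order they were sent. Events for different users may land in different partitions and can be processed in parallel.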

Key Kafka Concepts

Topics and Partitions: A topic is divided into partitions. Each partition is an ordered, append-only log. Messages within a partition have a sequential ID called an offset.

Consumer Groups: A group of consumers that work together to consume a topic. Each partition is assigned to exactly one consumer in the group. This means you can scale consumers by adding more instances (up to the number of partitions).

Retention: Kafka keeps messages for a configurable period (default 7 days). Consumers can replay messages from any point in the past. This is different from traditional queues that delete messages after consumption.

Replication: Each partition is replicated across multiple Kafka brokers for fault tolerance. If one broker dies, another has a copy.

Kafka Replication (3 brokers, replication factor = 3):

  Partition 0:
    Broker 1: [Leader]    -- handles reads/writes
    Broker 2: [Follower]  -- keeps a copy
    Broker 3: [Follower]  -- keeps a copy

  If Broker 1 dies:
    Broker 2 becomes the new Leader for Partition 0.
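
The offset, replay, and consumer group ideas above can be modeled in a few lines. This is a toy sketch, not Kafka's API: `PartitionLog` and `assign_partitions` are hypothetical names, and round-robin assignment is an assumption (Kafka offers several assignment strategies).

```python
class PartitionLog:
    """Append-only log for one partition; offsets are list indexes."""

    def __init__(self):
        self._records = []

    def append(self, value):
        self._records.append(value)
        return len(self._records) - 1     # offset of the new record

    def read_from(self, offset):
        """Return (offset, value) pairs starting at the given offset."""
        return [(i, v) for i, v in enumerate(self._records) if i >= offset]


def assign_partitions(partition_ids, consumers):
    """Give each partition to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, pid in enumerate(partition_ids):
        assignment[consumers[i % len(consumers)]].append(pid)
    return assignment


log = PartitionLog()
offsets = [log.append(m) for m in ["msg1", "msg4", "msg7"]]
replayed = log.read_from(0)       # reading does not delete anything
from_middle = log.read_from(1)    # a consumer can resume from any offset

three = assign_partitions([0, 1, 2], ["A", "B", "C"])
four = assign_partitions([0, 1, 2], ["A", "B", "C", "D"])
```

With three partitions and four consumers, consumer `D` receives no partition. That is why adding consumers beyond the partition count does not increase throughput.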

When to Use Kafka

- Event streaming: real-time analytics, user activity tracking, metrics collection
- Log aggregation: collecting logs from many services into one place
- Data pipelines: moving data between databases, data warehouses, search engines
- High throughput: millions of messages per second
- Replay capability: you need to reprocess old messages

Used by: LinkedIn, Netflix, Uber, Airbnb, Spotify, Twitter.

RabbitMQ

RabbitMQ is a traditional message broker. It was first released in 2007 and implements AMQP (Advanced Message Queuing Protocol). RabbitMQ is designed for complex routing, low latency, and reliable message delivery.

How RabbitMQ Works

RabbitMQ uses exchanges and queues. Producers send messages to an exchange. The exchange routes messages to one or more queues based on rules (called bindings). Consumers read from queues.

RabbitMQ Architecture:

  Producer --> [Exchange] --> routing rules --> [Queue A] --> Consumer A
                                           --> [Queue B] --> Consumer B
                                           --> [Queue C] --> Consumer C

Exchange types:
  - Direct:  routes to queues by exact routing key match
  - Fanout:  routes to ALL bound queues (broadcast)
  - Topic:   routes by pattern matching on routing key
  - Headers: routes by message header attributes
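
The routing rules for direct, fanout, and topic exchanges can be sketched as plain functions. This is an illustration of the matching logic only, not RabbitMQ's client API. The queue names and the `route` and `topic_matches` helpers are hypothetical. In topic patterns, `*` stands for exactly one dot-separated word and `#` for zero or more words. Headers exchanges are left out to keep the sketch short.

```python
def _match(pattern_words, key_words):
    """Recursive matcher for topic patterns split into words."""
    if not pattern_words:
        return not key_words
    head, rest = pattern_words[0], pattern_words[1:]
    if head == "#":
        # '#' may absorb zero or more words of the routing key
        return any(_match(rest, key_words[i:]) for i in range(len(key_words) + 1))
    if not key_words:
        return False
    if head == "*" or head == key_words[0]:
        return _match(rest, key_words[1:])
    return False

def topic_matches(pattern, routing_key):
    return _match(pattern.split("."), routing_key.split("."))

def route(exchange_type, bindings, routing_key):
    """Return the queues a message reaches. bindings: (queue, binding_key) pairs."""
    if exchange_type == "fanout":
        return [q for q, _ in bindings]
    if exchange_type == "direct":
        return [q for q, key in bindings if key == routing_key]
    if exchange_type == "topic":
        return [q for q, key in bindings if topic_matches(key, routing_key)]
    raise ValueError(f"unsupported exchange type: {exchange_type}")

bindings = [
    ("email_queue", "notify.email"),
    ("audit_queue", "notify.#"),
    ("sms_queue", "notify.sms.*"),
]
direct_hits = route("direct", bindings, "notify.email")
topic_hits = route("topic", bindings, "notify.sms.urgent")
fanout_hits = route("fanout", bindings, "anything")
```

The same set of bindings gives different results depending on the exchange type, which is the main source of RabbitMQ's routing flexibility.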

Key RabbitMQ Concepts

Acknowledgments: When a consumer finishes processing a message, it sends an acknowledgment (ACK) back to RabbitMQ. If the consumer crashes before sending an ACK, RabbitMQ puts the message back on the queue and re-delivers it, possibly to a different consumer.

Message persistence: Messages can be stored on disk to survive broker restarts.

Priority queues: RabbitMQ supports message priorities, so high-priority messages are delivered first.
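
To make acknowledgments and priorities concrete, here is a toy in-memory broker. It is a sketch under simplifying assumptions (single queue, no network, no persistence), and `MiniBroker` and its methods are hypothetical names rather than RabbitMQ's API.

```python
import heapq
import itertools

class MiniBroker:
    """Toy queue with message priorities and manual acknowledgments."""

    def __init__(self):
        self._ready = []                   # heap of (-priority, seq, body)
        self._unacked = {}                 # delivery tag -> heap entry
        self._seq = itertools.count()      # keeps FIFO order within a priority
        self._tags = itertools.count(1)

    def publish(self, body, priority=0):
        heapq.heappush(self._ready, (-priority, next(self._seq), body))

    def deliver(self):
        """Hand the highest-priority ready message to a consumer."""
        entry = heapq.heappop(self._ready)
        tag = next(self._tags)
        self._unacked[tag] = entry         # held until the consumer ACKs
        return tag, entry[2]

    def ack(self, tag):
        del self._unacked[tag]             # processing confirmed, forget it

    def consumer_crashed(self, tag):
        """No ACK arrived: requeue the message so it can be redelivered."""
        heapq.heappush(self._ready, self._unacked.pop(tag))

broker = MiniBroker()
broker.publish("send newsletter", priority=1)
broker.publish("reset password", priority=9)

tag1, first = broker.deliver()     # high priority goes first
broker.consumer_crashed(tag1)      # consumer died before ACK
tag2, second = broker.deliver()    # same message is delivered again
broker.ack(tag2)
tag3, third = broker.deliver()
```

The redelivered message is the reason consumers have to tolerate duplicates: the first consumer may have done part of the work before it crashed.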

When to Use RabbitMQ

- Complex routing: you need to route messages based on content, headers, or patterns
- Request-reply pattern: RPC (remote procedure calls) over messaging
- Low latency: individual message delivery in milliseconds
- Small to medium throughput: thousands to tens of thousands of messages per second
- Traditional messaging: task queues, work distribution, notifications

Used by: Bloomberg, NASA, Instagram (some components), VMware.

Amazon SQS

Amazon SQS (Simple Queue Service) is a fully managed message queue service from AWS. You do not run any infrastructure — AWS handles scaling, availability, and durability.

SQS Queue Types

Standard Queue:

- Nearly unlimited throughput
- At-least-once delivery (messages may be delivered more than once)
- Best-effort ordering (messages may arrive out of order)

FIFO Queue:

- Up to 3,000 messages per second (with batching)
- Exactly-once processing
- Guaranteed message ordering

SQS Standard vs FIFO:

  Standard Queue:
    Sent:     A, B, C, D
    Received: B, A, D, C (order not guaranteed)
    Message D might be delivered twice.

  FIFO Queue:
    Sent:     A, B, C, D
    Received: A, B, C, D (order guaranteed)
    Each message delivered once (duplicate sends are filtered
    within a 5-minute deduplication window).
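
FIFO queues avoid duplicates by tracking a deduplication ID for each message over a 5-minute window. The sketch below models that idea in memory. `FifoQueueSketch` is a hypothetical class, the clock is passed in explicitly to keep the example deterministic, and the boolean return value is for illustration only (it is not how the SQS API reports duplicates).

```python
class FifoQueueSketch:
    """Toy FIFO queue that drops duplicate sends inside a time window."""

    DEDUP_WINDOW_SECONDS = 300   # 5 minutes

    def __init__(self):
        self._messages = []
        self._seen = {}          # dedup id -> time the id was first accepted

    def send(self, body, dedup_id, now):
        """Enqueue the message unless the same dedup id was accepted recently."""
        first_seen = self._seen.get(dedup_id)
        if first_seen is not None and now - first_seen < self.DEDUP_WINDOW_SECONDS:
            return False         # duplicate inside the window: not enqueued
        self._seen[dedup_id] = now
        self._messages.append(body)
        return True

    def receive_all(self):
        """Return messages in the order they were accepted."""
        delivered, self._messages = self._messages, []
        return delivered

fifo = FifoQueueSketch()
results = [
    fifo.send("A", dedup_id="a-1", now=0),
    fifo.send("A", dedup_id="a-1", now=10),    # producer retry, dropped
    fifo.send("B", dedup_id="b-1", now=20),
    fifo.send("A", dedup_id="a-1", now=400),   # window expired, accepted again
]
delivered = fifo.receive_all()
```

The last send shows the limit of the guarantee: a retry that arrives after the window has passed is treated as a new message.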

When to Use SQS

- AWS ecosystem: your system is already on AWS
- Zero operations: you do not want to manage queue infrastructure
- Simple queuing: you need a reliable queue without complex routing
- Auto-scaling: SQS scales automatically with no configuration

Used by: many AWS-based systems. Common in serverless architectures with Lambda.

Kafka vs RabbitMQ vs SQS

| Feature | Kafka | RabbitMQ | SQS |
| --- | --- | --- | --- |
| Model | Distributed log | Message broker | Managed queue |
| Throughput | Millions/sec | Tens of thousands/sec | Unlimited (standard) |
| Ordering | Per partition | Per queue | FIFO queues only |
| Message retention | Configurable (days/weeks) | Until consumed | 14 days max |
| Replay | Yes (read old messages) | No (messages deleted after ACK) | No |
| Routing | Simple (topic + partition) | Complex (exchanges, bindings) | Simple (queue) |
| Latency | Milliseconds (batched) | Sub-millisecond | Milliseconds |
| Operations | Self-managed (or Confluent Cloud) | Self-managed (or CloudAMQP) | Fully managed (AWS) |
| Best for | Event streaming, data pipelines | Complex routing, RPC | Simple queuing on AWS |

Decision Framework

Choosing a message queue:

  Need to replay old messages or build event pipelines?
    --> Kafka

  Need complex routing rules or request-reply patterns?
    --> RabbitMQ

  Already on AWS and want zero operations?
    --> SQS

  Need millions of messages per second?
    --> Kafka

  Need sub-millisecond latency for individual messages?
    --> RabbitMQ

  Need a simple, reliable queue with no infrastructure management?
    --> SQS (or Google Pub/Sub, or Azure Service Bus)

Delivery Guarantees

Message delivery in distributed systems is tricky. There are three levels of guarantees:

At-Most-Once

The message is delivered zero or one time. It may be lost, but it is never duplicated.

At-Most-Once:
  Producer sends message --> Broker receives it --> Consumer processes it

  If the message is lost in transit, it is gone.
  No retries. Simple but unreliable.

  Use case: metrics, logs (losing one data point is OK)

At-Least-Once

The message is delivered one or more times. It is never lost, but it may be duplicated.

At-Least-Once:
  Producer sends message --> Broker receives it --> Consumer processes it
  Consumer crashes before ACK --> Broker re-delivers message
  Consumer processes the same message TWICE.

  Use case: most systems. Handle duplicates with idempotency.

Exactly-Once

The message is delivered exactly one time. It is never lost and never duplicated. This is the hardest guarantee to achieve.

Exactly-Once:
  Producer sends message --> Broker receives it --> Consumer processes it
  Message is processed exactly once, even if failures happen.

  This requires coordination between producer, broker, and consumer.
  Kafka supports exactly-once with idempotent producers + transactions.
  SQS FIFO queues support exactly-once processing.

In practice, most systems use at-least-once delivery with idempotent consumers. An idempotent consumer produces the same result even if it processes the same message twice. For example, “set balance to $100” is idempotent (doing it twice has the same effect), but “add $100 to balance” is not (doing it twice adds $200).
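
An operation like "add $100 to balance" can be made safe under at-least-once delivery by remembering which message IDs have already been applied. This is a minimal in-memory sketch. The `Account` class and the message IDs are hypothetical, and a real system would store the processed IDs in the same database transaction as the balance update.

```python
class Account:
    """Applies each deposit message at most once, keyed by message ID."""

    def __init__(self):
        self.balance = 0
        self._processed_ids = set()

    def apply_deposit(self, message_id, amount):
        if message_id in self._processed_ids:
            return False                  # duplicate delivery: skip it
        self.balance += amount
        self._processed_ids.add(message_id)
        return True

account = Account()
first = account.apply_deposit("msg-1", 100)
duplicate = account.apply_deposit("msg-1", 100)   # broker re-delivered msg-1
second = account.apply_deposit("msg-2", 50)
```

After the duplicate delivery the balance is 150, not 250. The broker still delivers at least once, and the consumer turns that into an effectively-once result.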

Dead Letter Queues

A dead letter queue (DLQ) is a special queue that holds messages that cannot be processed. After a message fails processing a certain number of times (the “max retry count”), it is moved to the DLQ.

Dead Letter Queue Flow:

  [Main Queue] --> Consumer tries to process message
                     |
                     |--> Success: message acknowledged, removed from queue
                     |
                     |--> Failure (retry 1): message goes back to main queue
                     |--> Failure (retry 2): message goes back to main queue
                     |--> Failure (retry 3): message moves to DLQ

  [Dead Letter Queue] --> Engineers investigate why the message failed
                      --> Fix the bug
                      --> Replay messages from DLQ back to main queue

DLQs prevent a single bad message from blocking the entire queue. Without a DLQ, a “poison message” (one that always fails) would be retried forever, blocking other messages behind it.
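
The retry-then-park flow can be sketched with a `deque` as the main queue and a list as the DLQ. This is a simplified model: `drain` and `process` are hypothetical names, messages are plain strings, and retries happen immediately rather than after a delay.

```python
from collections import deque

MAX_ATTEMPTS = 3  # the "max retry count" from the flow above

def process(message):
    """Hypothetical handler that always fails on one poison message."""
    if message == "poison":
        raise ValueError("cannot parse message")

def drain(main_queue, handler, max_attempts=MAX_ATTEMPTS):
    """Process every message; move repeat offenders to a dead letter list."""
    succeeded, dead_letters = [], []
    attempts = {}  # message -> number of failed attempts so far
    while main_queue:
        message = main_queue.popleft()
        try:
            handler(message)
            succeeded.append(message)
        except Exception:
            attempts[message] = attempts.get(message, 0) + 1
            if attempts[message] >= max_attempts:
                dead_letters.append(message)   # give up: park it in the DLQ
            else:
                main_queue.append(message)     # retry later, behind other messages
    return succeeded, dead_letters

succeeded, dead_letters = drain(deque(["order-1", "poison", "order-2"]), process)
```

Both healthy orders are processed even though the poison message sits between them, and the loop terminates because the poison message is parked after its third failure.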

Event-Driven Architecture

Message queues enable event-driven architecture (EDA). Instead of services calling each other directly, services emit events and other services react to them.

Traditional architecture (request-driven):

  [User Service] --"create user"--> [Email Service] --"send welcome email"
                 --"create user"--> [Analytics]     --"track signup"
                 --"create user"--> [Billing]       --"create free plan"

  User Service must know about every downstream service.


Event-driven architecture:

  [User Service] --publishes "UserCreated" event--> [Event Bus / Kafka]
                                                       |
                                                       --> [Email Service]    listens for "UserCreated"
                                                       --> [Analytics]        listens for "UserCreated"
                                                       --> [Billing]          listens for "UserCreated"
                                                       --> [New Service X]    can subscribe anytime

  User Service only knows about the event bus, not about its consumers.

Benefits of Event-Driven Architecture

- Loose coupling: services are independent. Adding or removing a service does not affect others.
- Scalability: each service scales independently based on its own load.
- Auditability: events are a record of everything that happened. You can replay them.
- Resilience: if one service is down, events wait in the queue until it recovers.

Challenges of Event-Driven Architecture

- Debugging is harder: tracing a request across multiple services and queues requires distributed tracing (like OpenTelemetry).
- Eventual consistency: events take time to propagate. The system is not immediately consistent.
- Message ordering: ensuring events are processed in the right order across services is complex.
- Schema evolution: changing the structure of an event (adding/removing fields) must be backward compatible.
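
Of these challenges, schema evolution is the easiest to show in code. One common way to stay backward compatible is for consumers to supply defaults for newly added fields and ignore fields they do not know. The sketch below assumes JSON events. The field names `user_id`, `plan`, and `referrer` are made up for illustration.

```python
import json

def parse_user_created(raw):
    """Read a "UserCreated" event written by an old or a new producer."""
    event = json.loads(raw)
    return {
        "user_id": event["user_id"],          # required in every version
        "plan": event.get("plan", "free"),    # added later; default keeps old events valid
        # unknown fields (such as "referrer") are ignored on purpose
    }

old_event = '{"user_id": 42}'
new_event = '{"user_id": 43, "plan": "pro", "referrer": "newsletter"}'

parsed_old = parse_user_created(old_event)
parsed_new = parse_user_created(new_event)
```

Old events stay readable after the producer starts sending `plan`, and this consumer keeps working when a producer adds a field it has never heard of.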

Practical Example: E-Commerce Order Flow

Let us design an order processing flow using message queues.

E-Commerce Order Flow:

  1. Customer places order
     [API Server] --> publishes "OrderCreated" event to Kafka

  2. Multiple services consume the event:

     [Payment Service]
       - Consumes "OrderCreated"
       - Charges the customer
       - Publishes "PaymentCompleted" or "PaymentFailed"

     [Inventory Service]
       - Consumes "OrderCreated"
       - Reserves items
       - Publishes "InventoryReserved" or "InventoryInsufficient"

     [Email Service]
       - Consumes "OrderCreated"
       - Sends order confirmation email

  3. Order Service listens for downstream events:

     [Order Service]
       - Consumes "PaymentCompleted" + "InventoryReserved"
       - Updates order status to "Confirmed"
       - Publishes "OrderConfirmed"

     [Shipping Service]
       - Consumes "OrderConfirmed"
       - Creates shipping label
       - Publishes "OrderShipped"

  4. If payment fails:
     [Order Service]
       - Consumes "PaymentFailed"
       - Publishes "OrderCancelled"

     [Inventory Service]
       - Consumes "OrderCancelled"
       - Releases reserved items
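
Step 3 is the least obvious part of this flow, because the Order Service has to wait for two independent events that can arrive in either order. A minimal sketch of that logic follows. The `OrderService` class is hypothetical, state is kept in memory, and `publish` stands in for sending an event to Kafka.

```python
class OrderService:
    """Confirms an order once both prerequisite events have arrived."""

    REQUIRED = frozenset({"PaymentCompleted", "InventoryReserved"})

    def __init__(self, publish):
        self._publish = publish   # callable(event_type, order_id)
        self._seen = {}           # order_id -> set of event types received
        self.status = {}          # order_id -> "Confirmed" or "Cancelled"

    def handle(self, event_type, order_id):
        if order_id in self.status:
            return                # already decided; ignore duplicates and late events
        if event_type == "PaymentFailed":
            self.status[order_id] = "Cancelled"
            self._publish("OrderCancelled", order_id)
        elif event_type in self.REQUIRED:
            seen = self._seen.setdefault(order_id, set())
            seen.add(event_type)
            if seen == self.REQUIRED:
                self.status[order_id] = "Confirmed"
                self._publish("OrderConfirmed", order_id)

published = []
service = OrderService(lambda event_type, order_id: published.append((event_type, order_id)))

service.handle("InventoryReserved", 1)   # arrives first, nothing published yet
service.handle("PaymentCompleted", 1)    # second prerequisite: order confirmed
service.handle("PaymentCompleted", 1)    # duplicate delivery, ignored
service.handle("PaymentFailed", 2)       # order 2 is cancelled
```

Because the handler ignores events for orders that are already decided, a re-delivered "PaymentCompleted" does not publish "OrderConfirmed" twice.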

This design is resilient. If the email service is down, orders still process. If the payment service is slow, orders queue up. Each service can be deployed, scaled, and updated independently.

Interview Tips

When discussing message queues in a system design interview:

- Use queues for async operations. “The order confirmation email does not need to happen synchronously. I will put it on a queue so the user gets an immediate response.”
- Mention Kafka for event streaming. “I will use Kafka for the event bus because we need message replay and high throughput.”
- Mention RabbitMQ for routing. “I will use RabbitMQ for the notification service because we need to route messages to different channels (email, SMS, push) based on user preferences.”
- Always mention dead letter queues. “Failed messages go to a dead letter queue so they do not block other messages.”
- Discuss delivery guarantees. “I will use at-least-once delivery with idempotent consumers to avoid message loss without worrying about duplicates.”
- Know the throughput numbers. Kafka handles millions of messages per second. RabbitMQ handles tens of thousands. SQS scales automatically.

Related Articles

- System Design #6: CAP Theorem (consistency, availability, partition tolerance)
- System Design #5: Databases (SQL vs NoSQL, replication)
- Go Tutorial #25: Build a Microservice (building microservices in Go)

What’s Next?

In the next article, System Design #8: API Design, you will learn:

- REST API design and best practices
- GraphQL: when it beats REST
- gRPC: high-performance communication between services
- API gateway pattern
- Webhooks and API versioning

This is part 7 of the System Design Tutorial series. Follow along to learn system design from scratch.
