In the previous article, you learned about the CAP Theorem and consistency patterns. Now let us look at one of the most important building blocks in distributed systems: message queues.
Almost every large-scale system uses message queues. They are the backbone of asynchronous communication between services.
What is a Message Queue?
A message queue is a system that stores messages sent by one service (the producer) and delivers them to another service (the consumer). The producer and consumer do not need to be online at the same time.
Think of it like a mailbox. You put a letter in the mailbox (produce a message). The mail carrier picks it up later (consumes the message). You do not need to wait for the carrier to arrive.
Synchronous communication (without queue):
[Order Service] --HTTP request--> [Payment Service]
| |
|<------HTTP response---------------|
|
Order Service WAITS for Payment Service to respond.
If Payment Service is slow or down, Order Service is stuck.
Asynchronous communication (with queue):
[Order Service] --publish--> [Message Queue] --consume--> [Payment Service]
|
Order Service sends the message and moves on immediately.
It does not wait for Payment Service.
Why Use Message Queues?
1. Decoupling
Services do not need to know about each other. The order service just publishes an “order created” event. It does not care who processes it. Tomorrow, you can add an analytics service that also reads the same events — without changing the order service.
Without queue (tight coupling):
[Order Service] --> [Payment Service]
[Order Service] --> [Email Service]
[Order Service] --> [Analytics Service]
Order Service knows about ALL downstream services.
Adding a new service = changing Order Service code.
With queue (loose coupling):
[Order Service] --> [Message Queue] --> [Payment Service]
--> [Email Service]
--> [Analytics Service]
Order Service knows only about the queue.
Adding a new service = subscribe to the queue. No code changes.
2. Load Leveling
Queues absorb traffic spikes. If you receive 10,000 orders per second during a flash sale, but your payment service can only process 1,000 per second, the queue holds the extra 9,000 messages. The payment service processes them at its own pace.
Traffic spike without queue:
10,000 req/sec --> [Payment Service: max 1,000/sec] --> 9,000 requests DROPPED
Traffic spike with queue:
10,000 req/sec --> [Queue: holds messages] --> [Payment Service: 1,000/sec]
Queue: 9,000 messages buffered
Payment Service catches up over the next 9 seconds.
3. Reliability
If a consumer crashes, the message stays in the queue. When the consumer restarts, it picks up where it left off. No data is lost.
4. Ordering
Some queues guarantee message ordering. If you send messages A, B, C in that order, the consumer receives them in the same order.
Two Models: Point-to-Point vs Publish-Subscribe
Point-to-Point (Queue)
One message is delivered to exactly one consumer. If multiple consumers are listening, only one gets each message. This is useful for task distribution.
Point-to-Point:
Producer --> [Queue] --> Consumer A (gets message 1)
--> Consumer B (gets message 2)
--> Consumer C (gets message 3)
Each message is processed by exactly ONE consumer.
Good for: order processing, task distribution.
Publish-Subscribe (Topic)
One message is delivered to ALL subscribers. Every consumer gets a copy of every message. This is useful for broadcasting events.
Publish-Subscribe:
Producer --> [Topic] --> Subscriber A (gets ALL messages)
--> Subscriber B (gets ALL messages)
--> Subscriber C (gets ALL messages)
Every subscriber gets a copy of every message.
Good for: event notifications, logging, analytics.
Apache Kafka
Kafka is the most popular distributed event streaming platform. It was created at LinkedIn and open-sourced in 2011. Kafka is designed for high throughput, durability, and horizontal scaling.
How Kafka Works
Kafka organizes messages into topics. A topic is a named stream of messages (like “orders” or “user-events”). Each topic is split into partitions for parallelism.
Kafka Architecture:
Producers --> [Topic: "orders"]
|-- Partition 0: [msg1, msg4, msg7, ...]
|-- Partition 1: [msg2, msg5, msg8, ...]
|-- Partition 2: [msg3, msg6, msg9, ...]
|
Consumer Group
|-- Consumer A reads Partition 0
|-- Consumer B reads Partition 1
|-- Consumer C reads Partition 2
Key Kafka Concepts
Topics and Partitions: A topic is divided into partitions. Each partition is an ordered, append-only log. Messages within a partition have a sequential ID called an offset.
Consumer Groups: A group of consumers that work together to consume a topic. Each partition is assigned to exactly one consumer in the group. This means you can scale consumers by adding more instances (up to the number of partitions).
Retention: Kafka keeps messages for a configurable period (default 7 days). Consumers can replay messages from any point in the past. This is different from traditional queues that delete messages after consumption.
Replication: Each partition is replicated across multiple Kafka brokers for fault tolerance. If one broker dies, another has a copy.
Kafka Replication (3 brokers, replication factor = 3):
Partition 0:
Broker 1: [Leader] -- handles reads/writes
Broker 2: [Follower] -- keeps a copy
Broker 3: [Follower] -- keeps a copy
If Broker 1 dies:
Broker 2 becomes the new Leader for Partition 0.
When to Use Kafka
- Event streaming: real-time analytics, user activity tracking, metrics collection
- Log aggregation: collecting logs from many services into one place
- Data pipelines: moving data between databases, data warehouses, search engines
- High throughput: millions of messages per second
- Replay capability: you need to reprocess old messages
Used by: LinkedIn, Netflix, Uber, Airbnb, Spotify, Twitter.
RabbitMQ
RabbitMQ is a traditional message broker. It was created in 2007 and uses the AMQP protocol. RabbitMQ is designed for complex routing, low latency, and reliable message delivery.
How RabbitMQ Works
RabbitMQ uses exchanges and queues. Producers send messages to an exchange. The exchange routes messages to one or more queues based on rules (called bindings). Consumers read from queues.
RabbitMQ Architecture:
Producer --> [Exchange] --> routing rules --> [Queue A] --> Consumer A
--> [Queue B] --> Consumer B
--> [Queue C] --> Consumer C
Exchange types:
- Direct: routes to queues by exact routing key match
- Fanout: routes to ALL bound queues (broadcast)
- Topic: routes by pattern matching on routing key
- Headers: routes by message header attributes
Key RabbitMQ Concepts
Acknowledgments: When a consumer processes a message, it sends an acknowledgment (ACK) back to RabbitMQ. If the consumer crashes before sending an ACK, RabbitMQ re-delivers the message to another consumer.
Message persistence: Messages can be stored on disk to survive broker restarts.
Priority queues: RabbitMQ supports message priorities, so high-priority messages are delivered first.
When to Use RabbitMQ
- Complex routing: you need to route messages based on content, headers, or patterns
- Request-reply pattern: RPC (remote procedure calls) over messaging
- Low latency: individual message delivery in milliseconds
- Small to medium throughput: thousands to tens of thousands of messages per second
- Traditional messaging: task queues, work distribution, notifications
Used by: Bloomberg, NASA, Instagram (some components), VMware.
Amazon SQS
Amazon SQS (Simple Queue Service) is a fully managed message queue service from AWS. You do not run any infrastructure — AWS handles scaling, availability, and durability.
SQS Queue Types
Standard Queue:
- Nearly unlimited throughput
- At-least-once delivery (messages may be delivered more than once)
- Best-effort ordering (messages may arrive out of order)
FIFO Queue:
- Up to 3,000 messages per second (with batching)
- Exactly-once processing
- Guaranteed message ordering
SQS Standard vs FIFO:
Standard Queue:
Sent: A, B, C, D
Received: B, A, D, C (order not guaranteed)
Message D might be delivered twice.
FIFO Queue:
Sent: A, B, C, D
Received: A, B, C, D (order guaranteed)
Each message delivered exactly once.
When to Use SQS
- AWS ecosystem: your system is already on AWS
- Zero operations: you do not want to manage queue infrastructure
- Simple queuing: you need a reliable queue without complex routing
- Auto-scaling: SQS scales automatically with no configuration
Used by: Any AWS-based system. Common in serverless architectures with Lambda.
Kafka vs RabbitMQ vs SQS
| Feature | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| Model | Distributed log | Message broker | Managed queue |
| Throughput | Millions/sec | Tens of thousands/sec | Unlimited (standard) |
| Ordering | Per partition | Per queue | FIFO queues only |
| Message retention | Configurable (days/weeks) | Until consumed | 14 days max |
| Replay | Yes (read old messages) | No (messages deleted after ACK) | No |
| Routing | Simple (topic + partition) | Complex (exchanges, bindings) | Simple (queue) |
| Latency | Milliseconds (batched) | Sub-millisecond | Milliseconds |
| Operations | Self-managed (or Confluent Cloud) | Self-managed (or CloudAMQP) | Fully managed (AWS) |
| Best for | Event streaming, data pipelines | Complex routing, RPC | Simple queuing on AWS |
Decision Framework
Choosing a message queue:
Need to replay old messages or build event pipelines?
--> Kafka
Need complex routing rules or request-reply patterns?
--> RabbitMQ
Already on AWS and want zero operations?
--> SQS
Need millions of messages per second?
--> Kafka
Need sub-millisecond latency for individual messages?
--> RabbitMQ
Need a simple, reliable queue with no infrastructure management?
--> SQS (or Google Pub/Sub, or Azure Service Bus)
Delivery Guarantees
Message delivery in distributed systems is tricky. There are three levels of guarantees:
At-Most-Once
The message is delivered zero or one time. It may be lost, but it is never duplicated.
At-Most-Once:
Producer sends message --> Broker receives it --> Consumer processes it
If the message is lost in transit, it is gone.
No retries. Simple but unreliable.
Use case: metrics, logs (losing one data point is OK)
At-Least-Once
The message is delivered one or more times. It is never lost, but it may be duplicated.
At-Least-Once:
Producer sends message --> Broker receives it --> Consumer processes it
Consumer crashes before ACK --> Broker re-delivers message
Consumer processes the same message TWICE.
Use case: most systems. Handle duplicates with idempotency.
Exactly-Once
The message is delivered exactly one time. It is never lost and never duplicated. This is the hardest guarantee to achieve.
Exactly-Once:
Producer sends message --> Broker receives it --> Consumer processes it
Message is processed exactly once, even if failures happen.
This requires coordination between producer, broker, and consumer.
Kafka supports exactly-once with idempotent producers + transactions.
SQS FIFO queues support exactly-once processing.
In practice, most systems use at-least-once delivery with idempotent consumers. An idempotent consumer produces the same result even if it processes the same message twice. For example, “set balance to $100” is idempotent (doing it twice has the same effect), but “add $100 to balance” is not (doing it twice adds $200).
Dead Letter Queues
A dead letter queue (DLQ) is a special queue that holds messages that cannot be processed. After a message fails processing a certain number of times (the “max retry count”), it is moved to the DLQ.
Dead Letter Queue Flow:
[Main Queue] --> Consumer tries to process message
|
|--> Success: message acknowledged, removed from queue
|
|--> Failure (retry 1): message goes back to main queue
|--> Failure (retry 2): message goes back to main queue
|--> Failure (retry 3): message moves to DLQ
[Dead Letter Queue] --> Engineers investigate why the message failed
--> Fix the bug
--> Replay messages from DLQ back to main queue
DLQs prevent a single bad message from blocking the entire queue. Without a DLQ, a “poison message” (one that always fails) would be retried forever, blocking other messages behind it.
Event-Driven Architecture
Message queues enable event-driven architecture (EDA). Instead of services calling each other directly, services emit events and other services react to them.
Traditional architecture (request-driven):
[User Service] --"create user"--> [Email Service] --"send welcome email"
--"create user"--> [Analytics] --"track signup"
--"create user"--> [Billing] --"create free plan"
User Service must know about every downstream service.
Event-driven architecture:
[User Service] --publishes "UserCreated" event--> [Event Bus / Kafka]
|
--> [Email Service] listens for "UserCreated"
--> [Analytics] listens for "UserCreated"
--> [Billing] listens for "UserCreated"
--> [New Service X] can subscribe anytime
User Service only knows about the event bus. Zero coupling.
Benefits of Event-Driven Architecture
- Loose coupling: services are independent. Adding or removing a service does not affect others.
- Scalability: each service scales independently based on its own load.
- Auditability: events are a record of everything that happened. You can replay them.
- Resilience: if one service is down, events wait in the queue until it recovers.
Challenges of Event-Driven Architecture
- Debugging is harder: tracing a request across multiple services and queues requires distributed tracing (like OpenTelemetry).
- Eventual consistency: events take time to propagate. The system is not immediately consistent.
- Message ordering: ensuring events are processed in the right order across services is complex.
- Schema evolution: changing the structure of an event (adding/removing fields) must be backward compatible.
Practical Example: E-Commerce Order Flow
Let us design an order processing flow using message queues.
E-Commerce Order Flow:
1. Customer places order
[API Server] --> publishes "OrderCreated" event to Kafka
2. Multiple services consume the event:
[Payment Service]
- Consumes "OrderCreated"
- Charges the customer
- Publishes "PaymentCompleted" or "PaymentFailed"
[Inventory Service]
- Consumes "OrderCreated"
- Reserves items
- Publishes "InventoryReserved" or "InventoryInsufficient"
[Email Service]
- Consumes "OrderCreated"
- Sends order confirmation email
3. Order Service listens for downstream events:
[Order Service]
- Consumes "PaymentCompleted" + "InventoryReserved"
- Updates order status to "Confirmed"
- Publishes "OrderConfirmed"
[Shipping Service]
- Consumes "OrderConfirmed"
- Creates shipping label
- Publishes "OrderShipped"
4. If payment fails:
[Order Service]
- Consumes "PaymentFailed"
- Publishes "OrderCancelled"
[Inventory Service]
- Consumes "OrderCancelled"
- Releases reserved items
This design is resilient. If the email service is down, orders still process. If the payment service is slow, orders queue up. Each service can be deployed, scaled, and updated independently.
Interview Tips
When discussing message queues in a system design interview:
Use queues for async operations. “The order confirmation email does not need to happen synchronously. I will put it on a queue so the user gets an immediate response.”
Mention Kafka for event streaming. “I will use Kafka for the event bus because we need message replay and high throughput.”
Mention RabbitMQ for routing. “I will use RabbitMQ for the notification service because we need to route messages to different channels (email, SMS, push) based on user preferences.”
Always mention dead letter queues. “Failed messages go to a dead letter queue so they do not block other messages.”
Discuss delivery guarantees. “I will use at-least-once delivery with idempotent consumers to avoid message loss without worrying about duplicates.”
Know the throughput numbers. Kafka handles millions of messages per second. RabbitMQ handles tens of thousands. SQS scales automatically.
Related Articles
- System Design #6: CAP Theorem — Consistency, availability, partition tolerance
- System Design #5: Databases — SQL vs NoSQL, replication
- Go Tutorial #25: Build a Microservice — Building microservices in Go
What’s Next?
In the next article, System Design #8: API Design, you will learn:
- REST API design and best practices
- GraphQL: when it beats REST
- gRPC: high-performance communication between services
- API gateway pattern
- Webhooks and API versioning
This is part 7 of the System Design Tutorial series. Follow along to learn system design from scratch.