System Design #10: Rate Limiting and Throttling

In the previous article, you learned about microservices and monolith architectures. Now let us talk about protecting your APIs from abuse: rate limiting.

Rate limiting controls how many requests a client can make in a given time period. Without it, a single client can overwhelm your servers, intentionally or by accident.

Why Every API Needs Rate Limiting

1. Prevent Abuse

A malicious user can send thousands of requests per second to overload your servers. Rate limiting stops them before they cause damage.

2. Ensure Fair Usage

Without limits, one heavy user can consume all server resources, making the system slow for everyone else. Rate limiting ensures each user gets a fair share.

3. Control Costs

Every API request costs money (compute, bandwidth, database queries). A runaway script making millions of requests can cause a huge bill. Rate limiting caps this.

4. Protect Downstream Services

Your API might call other services (databases, third-party APIs, payment providers). If your API gets flooded with requests, it floods everything downstream. Rate limiting at the front door protects the entire system.

Without Rate Limiting:

  Attacker sends 100,000 req/sec
    --> [API Server] overloaded
    --> [Database] overloaded
    --> [Payment Service] overloaded
    --> Everything is down

With Rate Limiting:

  Attacker sends 100,000 req/sec
    --> [Rate Limiter] blocks 99,900 requests (429 Too Many Requests)
    --> [API Server] receives 100 req/sec (normal load)
    --> System stays healthy

Rate Limiting Algorithms

There are five common algorithms. Each has different trade-offs between simplicity, accuracy, and memory usage.

1. Token Bucket

One of the most widely used algorithms. AWS API Gateway and Stripe both document using it.

Imagine a bucket that holds tokens. The bucket fills with tokens at a fixed rate. Each request takes one token from the bucket. If the bucket is empty, the request is rejected.

Token Bucket:

  Bucket capacity: 10 tokens
  Refill rate: 2 tokens per second

  Time 0s:  Bucket has 10 tokens
  Request 1: take 1 token  --> 9 tokens left   --> ALLOWED
  Request 2: take 1 token  --> 8 tokens left   --> ALLOWED
  ...
  Request 10: take 1 token --> 0 tokens left   --> ALLOWED
  Request 11: no tokens    --> 0 tokens left   --> REJECTED (429)

  Time 1s:  Bucket refills 2 tokens --> 2 tokens
  Request 12: take 1 token --> 1 token left    --> ALLOWED
  Request 13: take 1 token --> 0 tokens left   --> ALLOWED
  Request 14: no tokens    --> 0 tokens left   --> REJECTED (429)

  The bucket never exceeds capacity (10).
  This allows short BURSTS (up to 10 requests at once)
  but sustains a rate of 2 req/sec over time.

Implementation in pseudocode:

class TokenBucket:
    capacity = 10
    tokens = 10               # instance field, persisted between requests
    refill_rate = 2           # tokens per second
    last_refill_time = now()  # instance field, persisted between requests

    function allow_request():
        # Read the clock once so no elapsed time is lost between two now() calls
        current = now()

        # Refill tokens based on elapsed time, never exceeding capacity
        elapsed = current - this.last_refill_time
        this.tokens = min(this.capacity, this.tokens + elapsed * this.refill_rate)
        this.last_refill_time = current

        if this.tokens >= 1:
            this.tokens = this.tokens - 1
            return true   # ALLOWED
        else:
            return false  # REJECTED

Pros: Simple, memory-efficient (just two values per user: the token count and the last refill time), allows bursts. Cons: Bursts of up to the full bucket capacity get through, which can be a problem if you need strict rate enforcement.

2. Leaky Bucket

Similar to token bucket but requests are processed at a fixed rate. Think of a bucket with a hole at the bottom. Water (requests) flows in at the top and drains out the hole at a constant rate.

Leaky Bucket:

  Queue capacity: 5 requests
  Processing rate: 2 requests per second

  Requests arrive: [R1, R2, R3, R4, R5, R6, R7]

  Queue: [R1, R2, R3, R4, R5]  -- R6 and R7 are REJECTED (queue full)

  Processing:
    T=0.0s: process R1
    T=0.5s: process R2
    T=1.0s: process R3
    T=1.5s: process R4
    T=2.0s: process R5

  Requests are processed at a constant rate.
  No bursts -- output is always smooth.

Pros: Smooth output rate. Good for APIs that need consistent throughput (like sending emails or writing to a database). Cons: Bursts are never served immediately. A legitimate user who has been idle and then sends a burst has those requests queued and delayed, or rejected once the queue is full.

3. Fixed Window Counter

Divide time into fixed windows (like 1-minute intervals). Count requests in each window. If the count exceeds the limit, reject the request.

Fixed Window Counter:

  Limit: 100 requests per minute

  Window: 10:00 - 10:01
    Requests: 0... 50... 98... 100 --> LIMIT REACHED
    Request 101 at 10:00:45 --> REJECTED

  Window: 10:01 - 10:02
    Counter resets to 0
    Requests: 0... 1... 2... --> ALLOWED

  Simple! Just one counter per window.

The edge-case problem:

Fixed Window Edge Case:

  Limit: 100 requests per minute

  10:00:00 - 10:00:29: 0 requests
  10:00:30 - 10:00:59: 100 requests (all at the end of window 1)
  10:01:00 - 10:01:30: 100 requests (all at the start of window 2)

  In the 1-minute period from 10:00:30 to 10:01:30:
    200 requests were processed! (double the limit)

  This happens because the counter resets at the window boundary.

Pros: Very simple, memory-efficient (one counter per window per user). Cons: The boundary problem allows up to 2x the rate limit at window edges.

4. Sliding Window Log

Keep a log of timestamps for each request. When a new request comes in, remove old timestamps (outside the window), then count the remaining ones.

Sliding Window Log:

  Limit: 3 requests per minute

  Request at 10:00:15 --> log: [10:00:15]                    --> ALLOWED (1/3)
  Request at 10:00:30 --> log: [10:00:15, 10:00:30]          --> ALLOWED (2/3)
  Request at 10:00:45 --> log: [10:00:15, 10:00:30, 10:00:45] --> ALLOWED (3/3)
  Request at 10:00:50 --> log has 3 entries in last minute    --> REJECTED

  Request at 10:01:20:
    Remove entries older than 10:00:20
    log: [10:00:30, 10:00:45] (10:00:15 was removed)
    Add new: [10:00:30, 10:00:45, 10:01:20]                 --> ALLOWED (3/3)

Pros: Very accurate. No boundary problems. Cons: Memory-heavy. You store a timestamp for every request. For high-traffic APIs, this uses a lot of memory.

5. Sliding Window Counter

A hybrid of fixed window counter and sliding window log. Combines the accuracy of sliding window with the efficiency of fixed counters.

Sliding Window Counter:

  Limit: 100 requests per minute

  Previous window (10:00 - 10:01): 80 requests
  Current window  (10:01 - 10:02): 30 requests so far

  Current time: 10:01:15 (25% into the current window)

  Weighted count = (previous window count * overlap percentage) + current window count
                 = (80 * 0.75) + 30
                 = 60 + 30
                 = 90

  90 < 100 --> ALLOWED

  This smooths out the boundary problem by considering the overlap
  with the previous window.

Pros: Good accuracy, low memory (just two counters per user). Cons: Not perfectly accurate (it is an approximation), but close enough for most use cases.

Algorithm Comparison

Algorithm               Memory                 Accuracy          Allows Bursts   Complexity
Token Bucket            Low (2 values)         Good              Yes             Low
Leaky Bucket            Low (queue size)       Good              No              Low
Fixed Window            Low (1 counter)        Poor (edge case)  Yes (at edges)  Very Low
Sliding Window Log      High (all timestamps)  Excellent         No              Medium
Sliding Window Counter  Low (2 counters)       Good              Slightly        Low

Most common choices:

Token bucket for general API rate limiting (allows bursts)
Sliding window counter for strict rate enforcement
Leaky bucket for smoothing output to downstream services

Where to Implement Rate Limiting

1. API Gateway (Recommended)

The API gateway is the first layer that receives requests. Rate limiting here protects all backend services.

Rate Limiting at API Gateway:

  [Client] --> [API Gateway + Rate Limiter] --> [Backend Services]
                     |
                     |--> Rate limit exceeded? Return 429
                     |--> Rate limit OK? Forward to backend

  Pros: Centralized, protects all services, easy to manage.
  Tools: Kong, AWS API Gateway, Nginx, Envoy.

2. Application Layer

Each service implements its own rate limiting. Useful for fine-grained control per endpoint.

Rate Limiting at Application Layer:

  [API Gateway] --> [User Service with rate limiter]
                --> [Order Service with rate limiter]
                --> [Payment Service with rate limiter]

  Pros: Per-service control, different limits per endpoint.
  Cons: Duplicated logic, harder to manage globally.

3. Client-Side

The client limits its own requests. This is not a security measure (clients can be modified) but it helps well-behaved clients avoid hitting server limits.

Client-Side Rate Limiting:

  Mobile app: "I will only send 5 requests per second"
  SDK: built-in retry with exponential backoff

  Pros: Reduces unnecessary rejected requests, better user experience.
  Cons: Not a security measure. Malicious clients bypass this.

Best practice: Implement rate limiting at both the API gateway (global protection) and the application layer (fine-grained control).

HTTP Headers for Rate Limiting

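The X-RateLimit-* headers are a widely used convention (not a formal standard) for communicating rate limit information to clients. Retry-After is a standard HTTP header: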

Response Headers:

  HTTP/1.1 200 OK
  X-RateLimit-Limit: 100          # Maximum requests allowed per window
  X-RateLimit-Remaining: 42       # Requests remaining in current window
  X-RateLimit-Reset: 1779523200   # Unix timestamp when the window resets

When rate limited:

  HTTP/1.1 429 Too Many Requests
  X-RateLimit-Limit: 100
  X-RateLimit-Remaining: 0
  X-RateLimit-Reset: 1779523200
  Retry-After: 30                 # Seconds until the client should retry

  Body:
  {
    "error": {
      "code": "RATE_LIMIT_EXCEEDED",
      "message": "Rate limit exceeded. Try again in 30 seconds."
    }
  }

These headers help well-behaved clients manage their request rate and implement proper retry logic.

Distributed Rate Limiting with Redis

When you have multiple API servers, each server needs access to a shared rate limit counter. Redis is the most common solution because it is fast (sub-millisecond) and supports atomic operations.

Distributed Rate Limiting:

  [Client] --> [Load Balancer] --> [API Server 1] --\
                               --> [API Server 2] ---+--> [Redis]
                               --> [API Server 3] --/

  All servers check and update the same counter in Redis.
  This ensures the rate limit applies globally, not per server.

Token Bucket with Redis

Redis commands for token bucket:

  Key: "rate_limit:user:123"
  Value: JSON { tokens: 8, last_refill: 1716451000 }

  Pseudocode:
    1. GET rate_limit:user:123
    2. Calculate tokens to add based on elapsed time
    3. If tokens >= 1: decrement tokens, SET new value, ALLOW
    4. If tokens < 1: REJECT

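  Use a Lua script, or WATCH with MULTI/EXEC, for atomicity. MULTI/EXEC
  alone is not enough: step 3 depends on the value read in step 1, and
  commands queued inside a transaction cannot branch on a read.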

Sliding Window with Redis

Redis sliding window using sorted sets:

  Key: "rate_limit:user:123"
  Score: request timestamp
  Member: unique request ID

  For each request:
    1. ZREMRANGEBYSCORE key 0 (now - window_size)   # remove old entries
    2. ZCARD key                                      # count entries
    3. If count < limit:
         ZADD key (now) (unique_id)                  # add new entry
         ALLOW
    4. Else:
         REJECT
    5. EXPIRE key window_size                         # always reset TTL to prevent memory leak

Redis Lua Script for Atomic Rate Limiting

To avoid race conditions between multiple API servers, use a Redis Lua script that performs the check and update atomically:

-- Redis Lua script: token bucket rate limiter
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('GET', key)
local tokens, last_refill

if data then
    local parsed = cjson.decode(data)
    tokens = parsed.tokens
    last_refill = parsed.last_refill
else
    tokens = capacity
    last_refill = now
end

-- Refill tokens. Clamp elapsed at 0 in case the callers' clocks disagree,
-- so a slightly older timestamp never removes tokens.
local elapsed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + elapsed * refill_rate)
last_refill = math.max(last_refill, now)

local allowed = 0
if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
end

-- Persist the state either way. The 60-second TTL removes idle buckets;
-- it should be at least capacity / refill_rate seconds.
redis.call('SET', key, cjson.encode({tokens=tokens, last_refill=last_refill}), 'EX', 60)
return allowed  -- 1 = ALLOWED, 0 = REJECTED

Rate Limiting Strategies

By IP Address

The simplest approach. Limit requests per IP address.

Rate limit: 100 requests per minute per IP

  IP 192.168.1.1 --> 100 requests --> ALLOWED
  IP 192.168.1.1 --> request 101  --> REJECTED
  IP 192.168.1.2 --> 50 requests  --> ALLOWED (different IP)

Problem: Users behind a shared IP (office, VPN, NAT) all share the same limit. One heavy user blocks everyone.

By User / API Key

Limit requests per authenticated user or API key.

Rate limit: 1000 requests per minute per API key

  API key "abc123" --> 1000 requests --> ALLOWED
  API key "abc123" --> request 1001  --> REJECTED
  API key "xyz789" --> 500 requests  --> ALLOWED (different key)

Better than IP-based because it identifies the actual client. Most APIs use this approach.

Tiered Rate Limits

Different limits for different users or plans.

Tiered Rate Limits:

  Free plan:        100 requests per minute
  Basic plan:       1,000 requests per minute
  Pro plan:         10,000 requests per minute
  Enterprise plan:  100,000 requests per minute (or custom)

  Many SaaS APIs tier their limits by plan in a similar way.

Per-Endpoint Rate Limits

Different limits for different endpoints based on their cost.

Per-Endpoint Rate Limits:

  GET  /users/:id     --> 1000 req/min  (cheap, read-only)
  POST /users         --> 100 req/min   (creates data)
  POST /payments      --> 50 req/min    (expensive operation)
  GET  /search        --> 200 req/min   (compute-heavy)
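In code, per-endpoint limits often reduce to a lookup table keyed by method and route pattern. The sketch below uses the values from the table above; the names endpoint, endpointLimits, limitFor, and the fallback defaultLimit of 100 are hypothetical.

```go
package main

// endpoint identifies a route by HTTP method and route pattern
// (the pattern, not the concrete URL).
type endpoint struct {
    method string
    path   string
}

// endpointLimits holds requests per minute for each route.
var endpointLimits = map[endpoint]int{
    {"GET", "/users/:id"}: 1000, // cheap, read-only
    {"POST", "/users"}:    100,  // creates data
    {"POST", "/payments"}: 50,   // expensive operation
    {"GET", "/search"}:    200,  // compute-heavy
}

// defaultLimit applies to routes without an explicit entry (assumed value).
const defaultLimit = 100

// limitFor returns the per-minute limit for a route.
func limitFor(method, pattern string) int {
    if limit, ok := endpointLimits[endpoint{method, pattern}]; ok {
        return limit
    }
    return defaultLimit
}
```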

DDoS Protection

Rate limiting is an important defense against DDoS (Distributed Denial of Service) attacks, but it is not enough by itself. A sophisticated DDoS attack comes from thousands of different IP addresses, so each source can stay under a per-IP limit while the combined traffic still overwhelms the system.

Layers of DDoS Protection

DDoS Protection Layers:

  Layer 1: CDN / Edge (Cloudflare, AWS CloudFront)
    --> Absorbs volumetric attacks at the edge
    --> Blocks known bad IPs and bot traffic
    --> Handles 10+ Tbps of attack traffic

  Layer 2: Load Balancer
    --> Connection rate limiting
    --> SYN flood protection

  Layer 3: API Gateway
    --> Application-level rate limiting
    --> Token bucket per user/IP

  Layer 4: Application
    --> Per-endpoint rate limiting
    --> Request validation (reject malformed requests early)

Additional DDoS Strategies

1. Geographic blocking: If your service only operates in the US and EU, block traffic from other regions during an attack.

2. CAPTCHA challenges: If you detect suspicious traffic, present a CAPTCHA before processing the request.

3. Adaptive rate limiting: Automatically tighten rate limits when traffic spikes are detected. During normal traffic, allow 100 req/min. During a spike, reduce to 20 req/min.

4. Request prioritization: During overload, prioritize requests from authenticated users and paid customers over anonymous traffic.

Practical Example: Rate Limiter Implementation

Here is a simple token bucket rate limiter in Go:

package ratelimiter

import (
    "sync"
    "time"
)

type TokenBucket struct {
    capacity   float64
    tokens     float64
    refillRate float64 // tokens per second
    lastRefill time.Time
    mu         sync.Mutex
}

func NewTokenBucket(capacity int, refillRate float64) *TokenBucket {
    return &TokenBucket{
        capacity:   float64(capacity),
        tokens:     float64(capacity),
        refillRate: refillRate,
        lastRefill: time.Now(),
    }
}

func (tb *TokenBucket) Allow() bool {
    tb.mu.Lock()
    defer tb.mu.Unlock()

    // Refill tokens based on elapsed time
    now := time.Now()
    elapsed := now.Sub(tb.lastRefill).Seconds()
    tb.tokens += elapsed * tb.refillRate
    if tb.tokens > tb.capacity {
        tb.tokens = tb.capacity
    }
    tb.lastRefill = now

    // Check if we have a token
    if tb.tokens >= 1 {
        tb.tokens--
        return true
    }
    return false
}

// Usage:
// limiter := NewTokenBucket(10, 2.0)  // 10 capacity, 2 tokens/sec
// if limiter.Allow() {
//     // process request
// } else {
//     // return 429 Too Many Requests
// }

For production use, consider the golang.org/x/time/rate package which provides a well-tested token bucket implementation.

Retry Strategies for Clients

When a client receives a 429 response, it should retry intelligently:

Exponential Backoff with Jitter

Exponential Backoff:

  Attempt 1: wait 1 second
  Attempt 2: wait 2 seconds
  Attempt 3: wait 4 seconds
  Attempt 4: wait 8 seconds
  Attempt 5: wait 16 seconds (or give up)

With Jitter (random variation):

  Attempt 1: wait 1 second + random(0, 500ms)
  Attempt 2: wait 2 seconds + random(0, 1000ms)
  Attempt 3: wait 4 seconds + random(0, 2000ms)

Jitter prevents the "thundering herd" problem where all clients
retry at the exact same time and overload the server again.

Respect Retry-After Header

A server that rate limits a client can send a Retry-After header telling it how long to wait, either as a number of seconds or as an HTTP date. Good clients should use this value instead of guessing.

Interview Tips

When discussing rate limiting in a system design interview:

Mention it proactively. “I will add rate limiting at the API gateway — 100 requests per minute per user — to prevent abuse.”
Choose the right algorithm. “I will use a token bucket algorithm because it allows short bursts while enforcing an average rate.”
Mention Redis for distributed systems. “Since we have multiple API servers, I will use Redis to store rate limit counters centrally.”
Discuss the headers. “The API returns X-RateLimit-Remaining and Retry-After headers so clients can manage their request rate.”
Mention DDoS protection. “For DDoS protection, I would use Cloudflare at the edge to absorb volumetric attacks before they reach our servers.”
Talk about tiered limits. “Free users get 100 requests per minute. Premium users get 10,000. Enterprise customers get custom limits.”
Know the trade-offs. Rate limiting too aggressively frustrates legitimate users. Too loosely, and you are not protected. Start conservative and adjust based on usage data.

Related Articles

System Design #9: Microservices vs Monolith (architecture patterns)
System Design #8: API Design (REST, GraphQL, gRPC)
System Design #4: Caching (Redis, Memcached, CDN)
Go Tutorial #21: API Best Practices (rate limiting in Go with golang.org/x/time/rate)

What’s Next?

In the next article, System Design #11: Consistent Hashing, you will learn:

Why simple hashing breaks when you add or remove servers
How consistent hashing solves this problem
Virtual nodes for even distribution
Real-world usage in DynamoDB, Cassandra, and CDNs

This is part 10 of the System Design Tutorial series. Follow along to learn system design from scratch.
