In the previous article, you designed a file storage system. Now let us design a notification system that sends push notifications, emails, and SMS messages to millions of users.

Every large application needs notifications. Whether it is a new message alert, an order confirmation, or a security warning, the notification system is a critical piece of infrastructure.

Step 1: Requirements

Functional Requirements

  1. Send push notifications (iOS and Android)
  2. Send email notifications
  3. Send SMS notifications
  4. Support different notification types: transactional, marketing, system alerts
  5. User preferences: opt-in/opt-out per channel, quiet hours
  6. Template-based notifications
  7. Delivery tracking and analytics

Non-Functional Requirements

  1. Soft real-time: transactional notifications within 30 seconds
  2. At-least-once delivery (no lost notifications)
  3. High throughput: 10 million notifications per minute during peaks
  4. No duplicate notifications (deduplication)
  5. Scalable to billions of notifications per day

Step 2: Estimation

Notifications per day: 5 billion
  Push: 3 billion (60%)
  Email: 1.5 billion (30%)
  SMS: 500 million (10%)

Peak load: 10 million per minute = ~167,000 per second

Per notification:
  Push: ~500 bytes payload
  Email: ~5 KB (with HTML template)
  SMS: ~200 bytes

Storage for notification history:
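  5 billion * 1 KB (average history record, assuming email bodies are stored as a template reference rather than the full 5 KB HTML) = 5 TB/day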
  Retention: 30 days = 150 TB
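
The arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the constant names are made up for illustration, and decimal units (1 KB = 1,000 bytes) are assumed.

```python
# Back-of-envelope check of the estimates above.
# Assumption: decimal units (1 KB = 1,000 bytes, 1 TB = 1e12 bytes).
NOTIFICATIONS_PER_DAY = 5_000_000_000
CHANNEL_SHARE = {"push": 0.60, "email": 0.30, "sms": 0.10}
PEAK_PER_MINUTE = 10_000_000
AVG_RECORD_BYTES = 1_000   # average history record size used in the text
RETENTION_DAYS = 30

per_channel = {
    channel: round(NOTIFICATIONS_PER_DAY * share)
    for channel, share in CHANNEL_SHARE.items()
}
peak_per_second = PEAK_PER_MINUTE / 60
average_per_second = NOTIFICATIONS_PER_DAY / 86_400
storage_tb_per_day = NOTIFICATIONS_PER_DAY * AVG_RECORD_BYTES / 1e12
storage_tb_retained = storage_tb_per_day * RETENTION_DAYS

print(per_channel)
print(f"peak: {peak_per_second:,.0f}/s, average: {average_per_second:,.0f}/s")
print(f"storage: {storage_tb_per_day:.0f} TB/day, {storage_tb_retained:.0f} TB retained")
```

The average rate (about 58,000 per second) is roughly a third of the peak, which is why the queues and workers are sized for the peak, not the daily average.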

Step 3: Notification Types

Different notifications have different priorities and requirements.

Notification Priority Levels:

  Critical (Priority 1):
    - Security alerts (suspicious login, password change)
    - Payment confirmations
    - Two-factor authentication codes
    Requirement: deliver within 10 seconds
    Retry: aggressive (retry 5 times in 1 minute)

  High (Priority 2):
    - New message received
    - Order status update
    - Friend request
    Requirement: deliver within 30 seconds
    Retry: 3 times with exponential backoff

  Normal (Priority 3):
    - Weekly digest
    - Product recommendations
    - Social activity summary
    Requirement: deliver within 5 minutes
    Retry: 2 times with backoff

  Low (Priority 4):
    - Marketing campaigns
    - Feature announcements
    - Newsletter
    Requirement: deliver within 1 hour
    Retry: 1 time, then give up
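
One way to make these levels enforceable is to keep them in a single policy table that every worker reads. The sketch below is illustrative only: `DeliveryPolicy`, `EVENT_PRIORITY`, and `policy_for` are hypothetical names, and defaulting unknown events to Normal is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 1
    HIGH = 2
    NORMAL = 3
    LOW = 4


@dataclass(frozen=True)
class DeliveryPolicy:
    deadline_seconds: int  # delivery target from the list above
    max_retries: int       # retries after the first attempt


POLICIES = {
    Priority.CRITICAL: DeliveryPolicy(deadline_seconds=10, max_retries=5),
    Priority.HIGH: DeliveryPolicy(deadline_seconds=30, max_retries=3),
    Priority.NORMAL: DeliveryPolicy(deadline_seconds=300, max_retries=2),
    Priority.LOW: DeliveryPolicy(deadline_seconds=3600, max_retries=1),
}

# Hypothetical mapping from event type to priority level.
EVENT_PRIORITY = {
    "two_factor_code": Priority.CRITICAL,
    "new_message": Priority.HIGH,
    "weekly_digest": Priority.NORMAL,
    "newsletter": Priority.LOW,
}


def policy_for(event_type: str) -> DeliveryPolicy:
    # Assumption: event types not listed above are treated as NORMAL.
    return POLICIES[EVENT_PRIORITY.get(event_type, Priority.NORMAL)]
```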

Step 4: High-Level Architecture

Architecture:

  [Notification Trigger]
    (any service in the system can trigger a notification)
        |
  [Notification Service API]
        |
  [Validation + Enrichment]
    - Check user preferences (opted out? quiet hours?)
    - Check deduplication (already sent this?)
    - Enrich with user data (name, locale, timezone)
    - Select template and render content
        |
  [Priority Queue (Kafka)]
    - Topic per priority level
    - Topic per channel (push, email, sms)
        |
  +-------+-------+
  |       |       |
  v       v       v
[Push   [Email  [SMS
 Worker] Worker] Worker]
  |       |       |
  v       v       v
[APNs  [SES/   [Twilio/
 FCM]   SendGrid] SNS]
        |
  [Delivery Status Callback]
        |
  [Analytics + Monitoring]
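
The "topic per priority level, topic per channel" idea can be sketched as a small routing function. The `notifications.<channel>.p<priority>` naming scheme is an assumption made for illustration, not something Kafka requires.

```python
CHANNELS = ("push", "email", "sms")
PRIORITIES = (1, 2, 3, 4)


def topic_for(channel: str, priority: int) -> str:
    """Return the topic name for a channel/priority pair (hypothetical naming scheme)."""
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel}")
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    return f"notifications.{channel}.p{priority}"


def partition_key(user_id: str) -> bytes:
    # Keying by user sends one user's notifications to the same partition,
    # which keeps them in order relative to each other.
    return user_id.encode("utf-8")


ALL_TOPICS = [topic_for(c, p) for c in CHANNELS for p in PRIORITIES]
```

With three channels and four priority levels this gives twelve topics, so critical push traffic never waits behind a marketing email backlog.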

Step 5: Push Notifications

Apple Push Notification Service (APNs)

APNs Flow:

  1. Your server sends notification to APNs
  2. APNs delivers to the iPhone

  [Your Server] --HTTPS--> [APNs] --push--> [iPhone]

  Payload (max 4 KB):
  {
    "aps": {
      "alert": {
        "title": "New Message",
        "body": "Sam: Hey, are you coming tonight?"
      },
      "badge": 3,
      "sound": "default"
    },
    "conversation_id": "conv_123"
  }

  Device Token:
    Each device has a unique token (obtained during app registration).
    Store device tokens in a database:
    | user_id | device_token | platform | updated_at |
    | user_1  | abc123...    | ios      | 2026-06-01 |
    | user_1  | def456...    | android  | 2026-06-01 |

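  Token Refresh:
    Tokens can change (app reinstall, OS update, restore to a new device).
    Have the app report its current token on every launch and upsert it.
    Delete a token when APNs rejects it (HTTP 410 "Unregistered" or 400 "BadDeviceToken").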
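
Because APNs rejects oversized payloads, it helps to check the size before sending. The sketch below only builds and measures the JSON body shown above. `build_apns_payload` is a hypothetical helper, and the HTTP/2 request to APNs is left out.

```python
import json

APNS_MAX_PAYLOAD_BYTES = 4096  # the 4 KB limit noted above


def build_apns_payload(title: str, body: str, badge: int, custom: dict = None) -> bytes:
    """Build the JSON body for one alert and enforce the size limit."""
    custom = custom or {}
    if "aps" in custom:
        raise ValueError("custom keys must not override 'aps'")
    payload = {
        "aps": {
            "alert": {"title": title, "body": body},
            "badge": badge,
            "sound": "default",
        }
    }
    payload.update(custom)
    # Compact separators and raw UTF-8 keep the encoded payload as small as possible.
    encoded = json.dumps(payload, separators=(",", ":"), ensure_ascii=False).encode("utf-8")
    if len(encoded) > APNS_MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {len(encoded)} bytes, limit is {APNS_MAX_PAYLOAD_BYTES}")
    return encoded


payload = build_apns_payload(
    "New Message", "Sam: Hey, are you coming tonight?", 3, {"conversation_id": "conv_123"}
)

try:
    build_apns_payload("Too long", "x" * 5000, 1)
    oversized_rejected = False
except ValueError:
    oversized_rejected = True
```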

Firebase Cloud Messaging (FCM)

FCM Flow:

  1. Your server sends notification to FCM
  2. FCM delivers to the Android device (and web browsers)

  [Your Server] --HTTPS--> [FCM] --push--> [Android Phone]

  FCM supports:
    - Notification messages (displayed by OS)
    - Data messages (handled by your app code)
    - Topic messaging (send to all subscribed devices)

  FCM Payload:
  {
    "message": {
      "token": "device_token_here",
      "notification": {
        "title": "Order Shipped",
        "body": "Your order #5001 is on its way!"
      },
      "data": {
        "order_id": "5001",
        "tracking_url": "https://track.example.com/5001"
      }
    }
  }
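
A small builder keeps the message shape above consistent across services. This is a sketch: `build_fcm_message` is a hypothetical helper and the authenticated HTTPS call to FCM is omitted. FCM expects every value in `data` to be a string, so the helper converts them.

```python
def build_fcm_message(token: str, title: str, body: str, data: dict = None) -> dict:
    """Build an FCM message body in the shape shown above."""
    return {
        "message": {
            "token": token,
            "notification": {"title": title, "body": body},
            # FCM data payloads are string-to-string maps, so stringify values.
            "data": {str(key): str(value) for key, value in (data or {}).items()},
        }
    }


message = build_fcm_message(
    "device_token_here",
    "Order Shipped",
    "Your order #5001 is on its way!",
    {"order_id": 5001, "tracking_url": "https://track.example.com/5001"},
)
```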

Step 6: Email Notifications

Email Architecture:

  [Email Worker] --> [Template Engine] --> [Email Service Provider]
                                                |
                                          [SES / SendGrid / Mailgun]
                                                |
                                          [Recipient's email server]
                                                |
                                          [Recipient's inbox]

  Template Engine:
    Templates are stored in a database or file system.
    Variables are replaced at send time.

    Template: "Hi {{name}}, your order #{{order_id}} has shipped!"
    Data: { name: "Alex", order_id: "5001" }
    Result: "Hi Alex, your order #5001 has shipped!"
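
A minimal sketch of the variable replacement, assuming `{{name}}`-style placeholders as in the example. A production system would more likely use an existing template library; `render` here is a hypothetical helper.

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, data: dict) -> str:
    """Replace each {{variable}} with its value, failing loudly if one is missing."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in data:
            raise KeyError(f"missing template variable: {key}")
        return str(data[key])

    return _PLACEHOLDER.sub(substitute, template)


print(render("Hi {{name}}, your order #{{order_id}} has shipped!",
             {"name": "Alex", "order_id": "5001"}))
# prints: Hi Alex, your order #5001 has shipped!
```

Raising on a missing variable is a deliberate choice here: an email that says "Hi {{name}}" is worse than one that is held back and logged. For HTML templates, values should also be escaped (for example with `html.escape`) before substitution.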

  Email Service Provider (ESP) comparison (approximate 2025-2026 rates):
    | Provider    | Price (per 1000 emails) | Best for          |
    |-------------|------------------------|-------------------|
    | AWS SES     | $0.10                  | Transactional     |
    | SendGrid    | $1.00-1.50             | Marketing + Trans |
    | Mailgun     | $1.30-1.80             | Developer-focused |
    | Postmark    | $1.25-1.80             | Transactional     |

  Note: Always verify pricing on provider websites — rates change frequently.

  For high volume (1.5B emails/day), use multiple ESPs:
    - Primary: AWS SES (cheapest)
    - Failover: SendGrid (if SES is down or rate-limited)
    - Different ESPs for different regions
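
The primary/failover idea can be sketched as a loop over providers in preference order. `ProviderError` and the sender callables are hypothetical stand-ins for real ESP clients, which are not shown.

```python
from typing import Callable, Dict, List, Tuple


class ProviderError(Exception):
    """Raised by a sender when the provider is down or rate-limiting."""


def send_with_failover(message: Dict, providers: List[Tuple[str, Callable[[Dict], None]]]) -> str:
    """Try each provider in order and return the name of the one that accepted the message."""
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Demo with fake senders: the primary is rate-limited, the failover accepts.
def rate_limited_primary(message: Dict) -> None:
    raise ProviderError("rate limited")


delivered: List[Dict] = []
used = send_with_failover(
    {"to": "alex@example.com", "subject": "Order shipped"},
    [("ses", rate_limited_primary), ("sendgrid", delivered.append)],
)
print(used)  # sendgrid
```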

Email Deliverability

Deliverability Best Practices:

  1. Authentication:
     - SPF: declare which servers can send email for your domain
     - DKIM: sign emails cryptographically
     - DMARC: policy for failed authentication

  2. Reputation:
     - Warm up new IPs gradually (start with 100 emails/day)
     - Monitor bounce rate (keep below 2%)
     - Monitor spam complaint rate (keep below 0.1%)

  3. Content:
     - Include an unsubscribe link in marketing email (required by laws such as CAN-SPAM)
     - Avoid spam trigger words
     - Test with spam checkers before sending

  4. List hygiene:
     - Remove hard bounces immediately
     - Remove inactive users after 6 months
     - Double opt-in for marketing emails

Step 7: SMS Notifications

SMS Architecture:

  [SMS Worker] --> [SMS Provider API] --> [Carrier Network] --> [Phone]

  SMS Providers (approximate US rates; per-message prices vary widely by destination country):
    | Provider    | Price (per SMS) | Coverage    |
    |-------------|----------------|-------------|
    | Twilio      | $0.0079/msg    | Global      |
    | AWS SNS     | $0.0075/msg    | Global      |
    | Vonage      | $0.0068/msg    | Global      |
    | MessageBird | $0.0065/msg    | Europe/Asia |

  SMS Considerations:
    - SMS is expensive at scale (500M SMS/day * $0.007 = $3.5M/day!)
    - Use SMS only for critical notifications (2FA, security alerts)
    - Prefer push notifications when possible (free)
    - Character limit: 160 chars per segment with GSM-7 (70 with Unicode/UCS-2); longer messages are split into multiple billed segments
    - No formatting (plain text only)

  SMS Fallback Strategy:
    1. Try push notification first (free)
    2. If user has no push token or push fails --> send SMS
    3. For 2FA codes: always send SMS (most reliable)
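
The fallback strategy above can be sketched as an ordered channel plan. `plan_channels` and `deliver` are hypothetical names, and the `send` callable stands in for the real push and SMS workers.

```python
from typing import Callable, List, Optional


def plan_channels(category: str, has_push_token: bool) -> List[str]:
    """Ordered channels to try, following the fallback strategy above."""
    if category == "2fa":
        return ["sms"]           # step 3: 2FA codes always go by SMS
    if has_push_token:
        return ["push", "sms"]   # steps 1-2: push first, SMS only if push fails
    return ["sms"]               # step 2: no push token on file


def deliver(category: str, has_push_token: bool, send: Callable[[str], bool]) -> Optional[str]:
    """Try each planned channel; return the one that succeeded, or None."""
    for channel in plan_channels(category, has_push_token):
        if send(channel):
            return channel
    return None


# Demo: pretend the push attempt fails and SMS succeeds.
attempts: List[str] = []


def fake_send(channel: str) -> bool:
    attempts.append(channel)
    return channel == "sms"


used = deliver("security", True, fake_send)
print(used, attempts)  # sms ['push', 'sms']
```

Given the cost figures above, a real implementation would likely also check that the category is critical before allowing the SMS fallback.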

Step 8: User Preferences

User Preferences:

  Table: notification_preferences
    | user_id | channel | category       | enabled | quiet_start | quiet_end |
    |---------|---------|----------------|---------|-------------|-----------|
    | user_1  | push    | messages       | true    | 22:00       | 08:00     |
    | user_1  | push    | marketing      | false   | null        | null      |
    | user_1  | email   | messages       | true    | null        | null      |
    | user_1  | email   | marketing      | true    | null        | null      |
    | user_1  | sms     | security       | true    | null        | null      |
    | user_1  | sms     | marketing      | false   | null        | null      |

  Quiet Hours:
    User sets "do not disturb" from 22:00 to 08:00 in their timezone.
    During quiet hours:
      - Critical notifications (security, 2FA): send immediately
      - Other notifications: queue and send after quiet hours end

  Preference Check Flow:
    1. Notification arrives for user_1, category "messages", channel "push"
    2. Check: is push enabled for messages? Yes.
    3. Check: is it quiet hours? 23:00 in user's timezone. Yes!
    4. Check: is it critical? No (it is a message).
    5. Queue the notification for 08:00 in user's timezone.
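
The quiet-hours check is easy to get wrong when the window wraps past midnight (22:00 to 08:00). The sketch below handles that case. `next_send_time` is a hypothetical helper, and the demo uses a fixed UTC-4 offset to stay self-contained; in practice you would pass `zoneinfo.ZoneInfo` with the user's stored timezone name.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(local_time: time, start: time, end: time) -> bool:
    if start <= end:
        return start <= local_time < end
    # The window wraps past midnight, e.g. 22:00 -> 08:00.
    return local_time >= start or local_time < end


def next_send_time(now_utc: datetime, user_tz: tzinfo, start: time, end: time,
                   critical: bool) -> datetime:
    """Return when the notification may be sent, as a UTC datetime."""
    local_now = now_utc.astimezone(user_tz)
    if critical or not in_quiet_hours(local_now.time(), start, end):
        return now_utc
    resume = local_now.replace(hour=end.hour, minute=end.minute, second=0, microsecond=0)
    if resume <= local_now:
        resume += timedelta(days=1)  # quiet hours end tomorrow morning
    return resume.astimezone(timezone.utc)


user_tz = timezone(timedelta(hours=-4))                     # assumed fixed offset for the demo
now_utc = datetime(2026, 6, 4, 3, 0, tzinfo=timezone.utc)   # 23:00 local time on June 3
deferred = next_send_time(now_utc, user_tz, time(22, 0), time(8, 0), critical=False)
immediate = next_send_time(now_utc, user_tz, time(22, 0), time(8, 0), critical=True)
print(deferred.isoformat())  # 2026-06-04T12:00:00+00:00 (08:00 local time)
```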

Step 9: Deduplication

Duplicate notifications are a bad user experience. At-least-once delivery means a send may be retried, so deduplicate to make each notification reach the user effectively once.

Deduplication Strategy:

  Generate a unique notification_id for each logical notification.

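  notification_id = hash(user_id + ":" + event_type + ":" + event_id + ":" + timestamp_bucket)

  Derive timestamp_bucket from the event's creation time, not the send time,
  so that a retry of the same event hashes to the same notification_id.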

  Using delimiters prevents hash collisions between different field combinations
  (e.g., user_id="ab", event_type="c" vs user_id="a", event_type="bc").

  Example:
    "New message from Sam to Alex" notification_id:
    hash("user_alex:new_message:msg_5001:2026-06-04T10:00")

  Before sending:
    1. Check Redis: "Has notification {id} been sent?"
       SET notification:{id} 1 NX EX 86400  (NX = only if not exists, EX = 24h TTL)
    2. If SET succeeds (key did not exist): send the notification
    3. If SET fails (key exists): skip (already sent)

  This prevents:
    - Retry storms from sending duplicate notifications
    - Multiple triggers for the same event
    - Race conditions between multiple notification workers
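
A minimal sketch of the ID and the check-and-claim step. SHA-256 is an assumed choice of hash, and `InMemoryDedupStore` is a hypothetical stand-in that mimics `SET key 1 NX EX ttl` so the example runs without a Redis server.

```python
import hashlib
import time


def notification_id(user_id: str, event_type: str, event_id: str, timestamp_bucket: str) -> str:
    """Hash the delimited fields into a stable ID (SHA-256 is an assumed choice)."""
    raw = ":".join([user_id, event_type, event_id, timestamp_bucket])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryDedupStore:
    """Illustrative stand-in for Redis `SET key 1 NX EX ttl`; not for production use."""

    def __init__(self, clock=time.monotonic):
        self._expires_at = {}
        self._clock = clock

    def claim(self, key: str, ttl_seconds: int = 86_400) -> bool:
        """Return True if the key was newly set, False if it already exists."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + ttl_seconds
        return True


store = InMemoryDedupStore()
nid = notification_id("user_alex", "new_message", "msg_5001", "2026-06-04T10:00")
first_attempt = store.claim(f"notification:{nid}")   # True: send the notification
second_attempt = store.claim(f"notification:{nid}")  # False: skip, already sent
```

With the redis-py client, the equivalent call would be `client.set(key, 1, nx=True, ex=86400)`, which returns a truthy value only when the key was newly created. Because the claim and the check are a single atomic command, two workers racing on the same notification cannot both win.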

Step 10: Rate Limiting

Do not overwhelm users with too many notifications.

Rate Limiting:

  Per-User Limits:
    Push: max 50 per hour (except critical)
    Email: max 10 per hour (except transactional)
    SMS: max 5 per hour (except 2FA)

  Per-Channel Limits (protect third-party providers):
    APNs: no published hard cap (tune via connection reuse and error handling)
    FCM: ~600,000 requests per minute (default quota, adjustable)
    SES: account-dependent (sandbox: 1/sec; production: varies by region, request increases)

  Implementation:
    Use token bucket rate limiter per user per channel.
    Stored in Redis for fast lookups.

  When rate limited:
    Critical notifications: send anyway (bypass rate limit)
    High priority: queue for later delivery
    Low priority: drop and log
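
A token bucket can be sketched in a few lines. This version keeps state in process memory and takes an injectable clock so the demo is deterministic. In the design above the bucket state would live in Redis, which is not shown here.

```python
import time


class TokenBucket:
    """In-memory token bucket; one instance per user per channel."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)
        self._updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._updated
        # Refill for the time that has passed, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def should_send(bucket: TokenBucket, critical: bool) -> bool:
    # Critical notifications bypass the limit, as described above.
    return critical or bucket.allow()


# Demo: the SMS limit of 5 per hour, driven by a fake clock.
fake_now = [0.0]
sms_bucket = TokenBucket(capacity=5, refill_per_second=5 / 3600, clock=lambda: fake_now[0])
first_six = [sms_bucket.allow() for _ in range(6)]    # five allowed, the sixth rejected
critical_bypass = should_send(sms_bucket, critical=True)
fake_now[0] = 3600.0                                   # one hour later the bucket has refilled
after_refill = sms_bucket.allow()
```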

Step 11: Retry and Failure Handling

Retry Strategy:

  Transient failures (network timeout, provider busy):
    Retry with exponential backoff:
      Attempt 1: wait 1 second
      Attempt 2: wait 4 seconds
      Attempt 3: wait 16 seconds
      Attempt 4: wait 64 seconds
      Attempt 5: give up, move to dead letter queue

  Permanent failures:
    Invalid device token: remove token from database
    Invalid email address: mark as bounced
    Phone number not in service: mark as invalid

  Dead Letter Queue (DLQ):
    Notifications that fail all retries go to the DLQ.
    Operations team monitors the DLQ.
    DLQ notifications can be manually retried after fixing the issue.
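
The retry schedule and the split between transient and permanent failures can be sketched as follows. `TransientError`, `PermanentError`, and `deliver_with_retry` are hypothetical names, and the returned strings stand in for the real actions (remove the token, publish to the DLQ).

```python
import time
from typing import Callable, List


class TransientError(Exception):
    """Timeout or provider-busy response; worth retrying."""


class PermanentError(Exception):
    """Invalid token, address, or number; retrying cannot help."""


def backoff_delays(retries: int = 4, base: float = 1.0, factor: float = 4.0) -> List[float]:
    """Waits between attempts: 1, 4, 16, 64 seconds with the defaults."""
    return [base * factor ** i for i in range(retries)]


def deliver_with_retry(send: Callable[[], None],
                       sleep: Callable[[float], None] = time.sleep,
                       max_attempts: int = 5) -> str:
    delays = backoff_delays(max_attempts - 1)
    for attempt in range(max_attempts):
        try:
            send()
            return "sent"
        except PermanentError:
            return "permanent_failure"   # caller removes the token or marks the address
        except TransientError:
            if attempt < len(delays):
                sleep(delays[attempt])
    return "dead_letter"                 # caller publishes to the DLQ


# Demo: a provider that is always busy, with sleeps recorded instead of executed.
waits: List[float] = []


def always_busy() -> None:
    raise TransientError("provider busy")


outcome = deliver_with_retry(always_busy, sleep=waits.append)
print(outcome, waits)  # dead_letter [1.0, 4.0, 16.0, 64.0]
```

Not shown in the schedule above, but worth considering: adding random jitter to each wait so that many workers retrying at once do not hit the provider in lockstep.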

  Circuit Breaker:
    If a provider (e.g., APNs) fails > 50% of requests in 1 minute:
      Open the circuit breaker.
      Stop sending to APNs for 60 seconds.
      Queue for later, or redirect to a backup provider on channels that have one (email, SMS).
      After 60 seconds, allow a trial request (half-open state); close the circuit if it succeeds.
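
A minimal circuit breaker along these lines is sketched below. The `min_requests` guard is an added assumption so that a single failed request cannot open the circuit. For simplicity the half-open state lets every request through until one result is recorded.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, window_seconds=60.0, open_seconds=60.0,
                 min_requests=20, clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.window_seconds = window_seconds
        self.open_seconds = open_seconds
        self.min_requests = min_requests   # assumption: ignore tiny samples
        self._clock = clock
        self._events = deque()             # (timestamp, succeeded) pairs inside the window
        self._opened_at = None             # None means the circuit is closed

    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.open_seconds:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        return self.state() != "open"

    def record(self, succeeded: bool) -> None:
        now = self._clock()
        state = self.state()
        if state == "open":
            return                         # late results from in-flight requests
        if state == "half-open":
            if succeeded:
                self._opened_at = None     # trial succeeded: close the circuit
                self._events.clear()
            else:
                self._opened_at = now      # trial failed: stay open for another period
            return
        self._events.append((now, succeeded))
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()
        failures = sum(1 for _, ok in self._events if not ok)
        total = len(self._events)
        if total >= self.min_requests and failures / total > self.failure_ratio:
            self._opened_at = now


# Demo with a fake clock: 3 of 4 requests fail, so the circuit opens.
fake_now = [0.0]
breaker = CircuitBreaker(min_requests=4, clock=lambda: fake_now[0])
for ok in (False, False, False, True):
    breaker.record(ok)
state_after_failures = breaker.state()   # "open"
fake_now[0] = 60.0
state_after_cooldown = breaker.state()   # "half-open"
breaker.record(True)
state_after_trial = breaker.state()      # "closed"
```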

Step 12: Analytics and Monitoring

Delivery Tracking:

  Notification Lifecycle:
    1. CREATED: notification received by the service
    2. VALIDATED: passed preference and dedup checks
    3. QUEUED: placed in priority queue
    4. SENT: handed off to the provider (APNs/FCM/SES)
    5. DELIVERED: provider confirmed delivery
    6. OPENED: user opened/clicked the notification
    7. FAILED: terminal state, reachable from any step before DELIVERED
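
The lifecycle can be enforced as a small state machine so that out-of-order provider callbacks cannot corrupt tracking data. The transition table below is an assumption based on the list above, with FAILED treated as reachable from any state before DELIVERED.

```python
from enum import Enum


class Status(Enum):
    CREATED = "created"
    VALIDATED = "validated"
    QUEUED = "queued"
    SENT = "sent"
    DELIVERED = "delivered"
    OPENED = "opened"
    FAILED = "failed"


# Assumed transition table: each state lists the states it may move to.
TRANSITIONS = {
    Status.CREATED: {Status.VALIDATED, Status.FAILED},
    Status.VALIDATED: {Status.QUEUED, Status.FAILED},
    Status.QUEUED: {Status.SENT, Status.FAILED},
    Status.SENT: {Status.DELIVERED, Status.FAILED},
    Status.DELIVERED: {Status.OPENED},
    Status.OPENED: set(),
    Status.FAILED: set(),
}


def can_transition(current: Status, new: Status) -> bool:
    return new in TRANSITIONS[current]


def advance(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the transition is not allowed."""
    if not can_transition(current, new):
        raise ValueError(f"illegal transition: {current.name} -> {new.name}")
    return new
```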

  Metrics to Track:
    | Metric              | Target    | Alert Threshold |
    |---------------------|-----------|-----------------|
    | Delivery rate       | > 99%     | < 95%           |
    | Push delivery time  | < 5 sec   | > 30 sec        |
    | Email delivery time | < 60 sec  | > 5 min         |
    | Email bounce rate   | < 2%      | > 5%            |
    | Email spam rate     | < 0.1%    | > 0.5%          |
    | SMS delivery rate   | > 95%     | < 90%           |
    | Queue depth         | < 10,000  | > 100,000       |

  Analytics Dashboard:
    - Notifications sent per channel per hour
    - Delivery success rate per channel
    - Average delivery latency
    - Open rate and click-through rate (for email/push)
    - Top notification types by volume

Complete Architecture

                [Any Service] --trigger--> [Notification API]
                                                |
                                          [Validation]
                                          - User preferences
                                          - Deduplication (Redis)
                                          - Rate limiting
                                          - Template rendering
                                                |
                                          [Priority Queue (Kafka)]
                                          /        |        \
                                   [Priority 1] [Priority 2] [Priority 3-4]
                                   (fast lane)  (normal)     (batch)
                                        |
                              +---------+---------+
                              |         |         |
                        [Push      [Email     [SMS
                         Workers]   Workers]   Workers]
                           |          |          |
                        [APNs]     [SES]      [Twilio]
                        [FCM]      [SendGrid]  [SNS]
                           |          |          |
                     [Delivery Callbacks]
                           |
                     [Analytics DB (ClickHouse)]
                           |
                     [Monitoring Dashboard (Grafana)]

  Supporting:
    [Template Store (PostgreSQL)]
    [User Preferences (PostgreSQL + Redis cache)]
    [Device Token Store (PostgreSQL)]
    [Dead Letter Queue (Kafka DLQ)]

Common Mistakes

  1. Sending notifications synchronously. The triggering service should not wait for the notification to be delivered. Use a message queue for async processing.

  2. No deduplication. Without dedup, retries and race conditions cause duplicate notifications. Users get annoyed fast.

  3. Same priority for all notifications. A 2FA code must arrive in seconds. A marketing email can wait hours. Use priority queues.

  4. Not respecting user preferences. Sending notifications to users who opted out violates trust and possibly the law (GDPR, CAN-SPAM). Always check preferences before sending.

  5. No rate limiting per user. Sending 50 push notifications in one minute from different features creates a terrible experience.

Interview Tips

  1. Start with the channels. “I will support three channels: push, email, and SMS. Each has different delivery characteristics and costs.”

  2. Mention priority levels. “Security alerts are critical and bypass quiet hours. Marketing notifications are low priority and can be batched.”

  3. Draw the queue-based architecture. “Notification triggers go to Kafka topics separated by priority and channel. Workers consume and deliver.”

  4. Discuss reliability. “I use at-least-once delivery with deduplication, so every notification reaches the user effectively once.”

  5. Talk about third-party providers. “Push uses APNs and FCM. Email uses SES with SendGrid as failover. SMS uses Twilio.”

  6. Mention analytics. “I track delivery rate, latency, and open rate. Alerts fire if delivery rate drops below 95%.”

What’s Next?

In the next article, System Design #19: Design a Search Engine, you will learn:

  • Web crawling and URL frontier
  • Inverted indexes for fast text search
  • PageRank and ranking algorithms
  • Vector search and semantic search

This is part 18 of the System Design Tutorial series. Follow along to learn system design from scratch.