In the previous article, you designed a file storage system. Now let us design a notification system that sends push notifications, emails, and SMS messages to millions of users.
Every large application needs notifications. Whether it is a new message alert, an order confirmation, or a security warning, the notification system is a critical piece of infrastructure.
Step 1: Requirements
Functional Requirements
- Send push notifications (iOS and Android)
- Send email notifications
- Send SMS notifications
- Support different notification types: transactional, marketing, system alerts
- User preferences: opt-in/opt-out per channel, quiet hours
- Template-based notifications
- Delivery tracking and analytics
Non-Functional Requirements
- Soft real-time: transactional notifications within 30 seconds
- At-least-once delivery (no lost notifications)
- High throughput: 10 million notifications per minute during peaks
- No duplicate notifications (deduplication)
- Scalable to billions of notifications per day
Step 2: Estimation
Notifications per day: 5 billion
Push: 3 billion (60%)
Email: 1.5 billion (30%)
SMS: 500 million (10%)
Peak load: 10 million per minute = ~167,000 per second
Per notification:
Push: ~500 bytes payload
Email: ~5 KB (with HTML template)
SMS: ~200 bytes
Storage for notification history:
5 billion * 1 KB (average stored record; assumes history keeps a compact record rather than every full email body) = 5 TB/day
Retention: 30 days = 150 TB
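The arithmetic above can be checked with a few lines of Python. This is only a sketch of the same back-of-the-envelope numbers; the variable names are illustrative, and decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) are assumed.

```python
# Back-of-the-envelope check of the estimates above.
NOTIFICATIONS_PER_DAY = 5_000_000_000
PEAK_PER_MINUTE = 10_000_000
RECORD_BYTES = 1_000            # 1 KB average history record (decimal units)
RETENTION_DAYS = 30
CHANNEL_PERCENT = {"push": 60, "email": 30, "sms": 10}

# Integer math avoids floating-point rounding on the channel split.
per_channel = {
    name: NOTIFICATIONS_PER_DAY * pct // 100
    for name, pct in CHANNEL_PERCENT.items()
}

average_per_second = NOTIFICATIONS_PER_DAY / 86_400
peak_per_second = PEAK_PER_MINUTE / 60

daily_storage_tb = NOTIFICATIONS_PER_DAY * RECORD_BYTES / 10**12
retained_storage_tb = daily_storage_tb * RETENTION_DAYS

print(per_channel)
print(f"average: {average_per_second:,.0f}/s, peak: {peak_per_second:,.0f}/s")
print(f"history: {daily_storage_tb:.0f} TB/day, {retained_storage_tb:.0f} TB retained")
```

Note that the peak rate is roughly three times the daily average, which is why the queues and workers are sized for bursts rather than for the mean.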
Step 3: Notification Types
Different notifications have different priorities and requirements.
Notification Priority Levels:
Critical (Priority 1):
- Security alerts (suspicious login, password change)
- Payment confirmations
- Two-factor authentication codes
Requirement: deliver within 10 seconds
Retry: aggressive (retry 5 times in 1 minute)
High (Priority 2):
- New message received
- Order status update
- Friend request
Requirement: deliver within 30 seconds
Retry: 3 times with exponential backoff
Normal (Priority 3):
- Weekly digest
- Product recommendations
- Social activity summary
Requirement: deliver within 5 minutes
Retry: 2 times with backoff
Low (Priority 4):
- Marketing campaigns
- Feature announcements
- Newsletter
Requirement: deliver within 1 hour
Retry: 1 time, then give up
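One way to make these levels usable by the rest of the system is a small policy table. The sketch below encodes the deadlines and retry counts listed above; the class names, the `EVENT_PRIORITY` mapping, and the default of "normal" for unknown events are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 1
    HIGH = 2
    NORMAL = 3
    LOW = 4


@dataclass(frozen=True)
class DeliveryPolicy:
    deadline_seconds: int   # target delivery time
    max_retries: int        # retries before giving up


# Values taken from the priority levels described above.
POLICIES = {
    Priority.CRITICAL: DeliveryPolicy(deadline_seconds=10, max_retries=5),
    Priority.HIGH: DeliveryPolicy(deadline_seconds=30, max_retries=3),
    Priority.NORMAL: DeliveryPolicy(deadline_seconds=300, max_retries=2),
    Priority.LOW: DeliveryPolicy(deadline_seconds=3600, max_retries=1),
}

# Hypothetical event names; a real system would define its own catalog.
EVENT_PRIORITY = {
    "two_factor_code": Priority.CRITICAL,
    "new_message": Priority.HIGH,
    "weekly_digest": Priority.NORMAL,
    "marketing_campaign": Priority.LOW,
}


def policy_for(event_type: str) -> DeliveryPolicy:
    """Look up the policy for an event; unknown events default to NORMAL (assumption)."""
    return POLICIES[EVENT_PRIORITY.get(event_type, Priority.NORMAL)]


print(policy_for("two_factor_code"))
```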
Step 4: High-Level Architecture
Architecture:
[Notification Trigger]
(any service in the system can trigger a notification)
|
[Notification Service API]
|
[Validation + Enrichment]
- Check user preferences (opted out? quiet hours?)
- Check deduplication (already sent this?)
- Enrich with user data (name, locale, timezone)
- Select template and render content
|
[Priority Queue (Kafka)]
- Topic per priority level
- Topic per channel (push, email, sms)
|
+----------------+------------------+
|                |                  |
v                v                  v
[Push Worker]    [Email Worker]     [SMS Worker]
|                |                  |
v                v                  v
[APNs / FCM]     [SES / SendGrid]   [Twilio / SNS]
|
[Delivery Status Callback]
|
[Analytics + Monitoring]
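The "topic per priority level, topic per channel" idea in the diagram can be sketched as a simple routing function. The topic naming scheme `notifications.p<priority>.<channel>` is an assumption for illustration, not a Kafka convention.

```python
CHANNELS = ("push", "email", "sms")
PRIORITIES = (1, 2, 3, 4)


def topic_for(priority: int, channel: str) -> str:
    """Return the (hypothetical) topic name for a priority/channel pair."""
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel}")
    return f"notifications.p{priority}.{channel}"


# One topic per combination, so each worker pool can scale independently.
ALL_TOPICS = [topic_for(p, c) for p in PRIORITIES for c in CHANNELS]
print(ALL_TOPICS)
```

Separate topics let critical traffic keep its own consumers, so a large marketing batch cannot delay a 2FA code.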
Step 5: Push Notifications
Apple Push Notification Service (APNs)
APNs Flow:
1. Your server sends notification to APNs
2. APNs delivers to the iPhone
[Your Server] --HTTPS--> [APNs] --push--> [iPhone]
Payload (max 4 KB):
{
"aps": {
"alert": {
"title": "New Message",
"body": "Sam: Hey, are you coming tonight?"
},
"badge": 3,
"sound": "default"
},
"conversation_id": "conv_123"
}
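Because APNs rejects payloads over the limit, it helps to check the size before sending. The following is a minimal sketch that builds the payload shown above and enforces the 4 KB limit; the function name and its parameters are illustrative, and the actual HTTP/2 request to APNs is out of scope here.

```python
import json

APNS_MAX_PAYLOAD_BYTES = 4096  # 4 KB limit for regular notifications


def build_apns_payload(title: str, body: str, badge: int, custom: dict) -> bytes:
    """Serialize an APNs payload and reject it if it exceeds the size limit."""
    if "aps" in custom:
        raise ValueError("custom keys must not override 'aps'")
    payload = {
        "aps": {
            "alert": {"title": title, "body": body},
            "badge": badge,
            "sound": "default",
        }
    }
    payload.update(custom)
    # Compact separators and raw UTF-8 keep the payload as small as possible.
    encoded = json.dumps(payload, separators=(",", ":"), ensure_ascii=False).encode("utf-8")
    if len(encoded) > APNS_MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {len(encoded)} bytes, limit is {APNS_MAX_PAYLOAD_BYTES}")
    return encoded


encoded = build_apns_payload(
    "New Message", "Sam: Hey, are you coming tonight?", 3, {"conversation_id": "conv_123"}
)
print(len(encoded), "bytes")
```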
Device Token:
Each app installation on a device gets a unique token (obtained when the app registers for push notifications).
Store device tokens in a database:
| user_id | device_token | platform | updated_at |
| user_1 | abc123... | ios | 2026-06-01 |
| user_1 | def456... | android | 2026-06-01 |
Token Refresh:
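Tokens can change (app reinstall, OS update, restore to a new device).
Have the app send its current token to your server on launch, and remove tokens that APNs rejects as invalid (for example, HTTP 410 "Unregistered").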
Firebase Cloud Messaging (FCM)
FCM Flow:
1. Your server sends notification to FCM
2. FCM delivers to the Android device (and web browsers)
[Your Server] --HTTPS--> [FCM] --push--> [Android Phone]
FCM supports:
- Notification messages (displayed by OS)
- Data messages (handled by your app code)
- Topic messaging (send to all subscribed devices)
FCM Payload:
{
"message": {
"token": "device_token_here",
"notification": {
"title": "Order Shipped",
"body": "Your order #5001 is on its way!"
},
"data": {
"order_id": "5001",
"tracking_url": "https://track.example.com/5001"
}
}
}
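A small helper can assemble this message body. In the FCM HTTP v1 API the values in `data` must be strings, so the sketch below converts them; the function name is illustrative, and authentication and the HTTP call itself are omitted.

```python
import json


def build_fcm_message(token: str, title: str, body: str, data: dict) -> dict:
    """Build an FCM HTTP v1 request body for a single device token."""
    return {
        "message": {
            "token": token,
            "notification": {"title": title, "body": body},
            # FCM data payloads are string-to-string maps, so coerce values.
            "data": {str(key): str(value) for key, value in data.items()},
        }
    }


message = build_fcm_message(
    "device_token_here",
    "Order Shipped",
    "Your order #5001 is on its way!",
    {"order_id": 5001, "tracking_url": "https://track.example.com/5001"},
)
print(json.dumps(message, indent=2))
```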
Step 6: Email Notifications
Email Architecture:
[Email Worker] --> [Template Engine] --> [Email Service Provider]
|
[SES / SendGrid / Mailgun]
|
[Recipient's email server]
|
[Recipient's inbox]
Template Engine:
Templates are stored in a database or file system.
Variables are replaced at send time.
Template: "Hi {{name}}, your order #{{order_id}} has shipped!"
Data: { name: "Alex", order_id: "5001" }
Result: "Hi Alex, your order #5001 has shipped!"
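The substitution step can be sketched with a regular expression. This is a deliberately tiny stand-in for a real template engine: the `render` function is hypothetical, and raising on a missing variable is a design assumption (silently sending "Hi ," is usually worse than failing).

```python
import re

# Matches {{name}} and tolerates inner whitespace such as {{ name }}.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, data: dict) -> str:
    """Replace {{variable}} placeholders with values from data."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in data:
            raise KeyError(f"missing template variable: {key}")
        return str(data[key])

    return _PLACEHOLDER.sub(substitute, template)


print(render(
    "Hi {{name}}, your order #{{order_id}} has shipped!",
    {"name": "Alex", "order_id": "5001"},
))
# prints: Hi Alex, your order #5001 has shipped!
```

For HTML emails, user-supplied values should also be escaped (for example with `html.escape`) before substitution.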
Email Service Provider (ESP) comparison (approximate 2025-2026 rates):
| Provider | Price (per 1000 emails) | Best for |
|-------------|------------------------|-------------------|
| AWS SES | $0.10 | Transactional |
| SendGrid | $1.00-1.50 | Marketing + Trans |
| Mailgun | $1.30-1.80 | Developer-focused |
| Postmark | $1.25-1.80 | Transactional |
Note: Always verify pricing on provider websites — rates change frequently.
For high volume (1.5B emails/day), use multiple ESPs:
- Primary: AWS SES (cheapest)
- Failover: SendGrid (if SES is down or rate-limited)
- Different ESPs for different regions
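The primary/failover idea can be sketched as trying providers in order. The provider callables here are placeholders, not real SDK calls; `ProviderError` and `send_with_failover` are hypothetical names.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a provider adapter when a send fails (outage, rate limit)."""


def send_with_failover(
    message: dict,
    providers: Sequence[Tuple[str, Callable[[dict], str]]],
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, message_id)."""
    errors = []
    for name, send in providers:
        try:
            return name, send(message)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Fake adapters standing in for SES and SendGrid clients.
def fake_ses(message: dict) -> str:
    raise ProviderError("rate limited")


def fake_sendgrid(message: dict) -> str:
    return "sg-0001"


print(send_with_failover({"to": "alex@example.com"}, [("ses", fake_ses), ("sendgrid", fake_sendgrid)]))
```

In practice the provider order would likely be chosen per region and combined with a circuit breaker, so a failing primary is skipped rather than retried on every message.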
Email Deliverability
Deliverability Best Practices:
1. Authentication:
- SPF: declare which servers can send email for your domain
- DKIM: sign emails cryptographically
- DMARC: policy for failed authentication
2. Reputation:
- Warm up new IPs gradually (start with 100 emails/day)
- Monitor bounce rate (keep below 2%)
- Monitor spam complaint rate (keep below 0.1%)
3. Content:
- Include unsubscribe link (legally required for marketing email, e.g., under CAN-SPAM)
- Avoid spam trigger words
- Test with spam checkers before sending
4. List hygiene:
- Remove hard bounces immediately
- Remove inactive users after 6 months
- Double opt-in for marketing emails
Step 7: SMS Notifications
SMS Architecture:
[SMS Worker] --> [SMS Provider API] --> [Carrier Network] --> [Phone]
SMS Providers (approximate rates; SMS pricing varies widely by destination country, so verify before budgeting):
| Provider | Price (per SMS) | Coverage |
|-------------|----------------|-------------|
| Twilio | $0.0079/msg | Global |
| AWS SNS | $0.0075/msg | Global |
| Vonage | $0.0068/msg | Global |
| MessageBird | $0.0065/msg | Europe/Asia |
SMS Considerations:
- SMS is expensive at scale (500M SMS/day * $0.007 = $3.5M/day!)
- Use SMS only for critical notifications (2FA, security alerts)
- Prefer push notifications when possible (free)
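- Character limit: 160 characters per message with the GSM-7 alphabet (70 if the text needs Unicode/UCS-2); longer texts are split into multiple segments, each billed separately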
- No formatting (plain text only)
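Since providers bill per segment, it is worth estimating how many segments a message will use. The sketch below applies the usual GSM-7 and UCS-2 limits (160/153 and 70/67 characters for single/multipart messages). It is an approximation: the character tables are typed by hand and edge cases, such as an extended character falling on a segment boundary, are ignored.

```python
import math

# GSM 03.38 basic character set (each character costs one septet).
GSM7_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ ÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension table characters cost two septets each.
GSM7_EXTENDED = set("^{}\\[~]|€\f")


def sms_segments(text: str) -> int:
    """Estimate how many SMS segments a text will be split into."""
    if all(ch in GSM7_BASIC or ch in GSM7_EXTENDED for ch in text):
        units = sum(2 if ch in GSM7_EXTENDED else 1 for ch in text)
        single_limit, multipart_limit = 160, 153
    else:
        # UCS-2 counts UTF-16 code units, so most emoji count as two.
        units = len(text.encode("utf-16-be")) // 2
        single_limit, multipart_limit = 70, 67
    if units <= single_limit:
        return 1
    return math.ceil(units / multipart_limit)


print(sms_segments("Your code is 482913"))
print(sms_segments("a" * 161))
```

A single non-GSM character (an emoji, for example) switches the whole message to the 70-character limit, which can double the cost of a template.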
SMS Fallback Strategy:
1. Try push notification first (free)
2. If user has no push token or push fails --> send SMS
3. For 2FA codes: always send SMS (most reliable)
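The fallback steps above can be sketched as a small decision function. The `send_push` and `send_sms` callables are placeholders for real channel clients, and returning the name of the channel used is an illustrative choice.

```python
from typing import Callable, Optional


def deliver_with_fallback(
    is_2fa: bool,
    push_token: Optional[str],
    send_push: Callable[[str], bool],
    send_sms: Callable[[], bool],
) -> str:
    """Return the channel that delivered the message, or 'failed'."""
    if is_2fa:
        # 2FA codes skip push and go straight to SMS.
        return "sms" if send_sms() else "failed"
    if push_token is not None and send_push(push_token):
        return "push"
    # No token, or the push attempt failed: fall back to SMS.
    return "sms" if send_sms() else "failed"


print(deliver_with_fallback(False, "abc123", lambda token: True, lambda: True))
print(deliver_with_fallback(False, None, lambda token: True, lambda: True))
```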
Step 8: User Preferences
User Preferences:
Table: notification_preferences
| user_id | channel | category | enabled | quiet_start | quiet_end |
|---------|---------|----------------|---------|-------------|-----------|
| user_1 | push | messages | true | 22:00 | 08:00 |
| user_1 | push | marketing | false | null | null |
| user_1 | email | messages | true | null | null |
| user_1 | email | marketing | true | null | null |
| user_1 | sms | security | true | null | null |
| user_1 | sms | marketing | false | null | null |
Quiet Hours:
User sets "do not disturb" from 22:00 to 08:00 in their timezone.
During quiet hours:
- Critical notifications (security, 2FA): send immediately
- Other notifications: queue and send after quiet hours end
Preference Check Flow:
1. Notification arrives for user_1, category "messages", channel "push"
2. Check: is push enabled for messages? Yes.
3. Check: is it quiet hours? 23:00 in user's timezone. Yes!
4. Check: is it critical? No (it is a message).
5. Queue the notification for 08:00 in user's timezone.
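The quiet-hours check has one subtle point: a window such as 22:00 to 08:00 wraps past midnight. The sketch below handles that case and computes when a deferred notification should go out. Function names are illustrative; any `tzinfo` works for the user's timezone, including `zoneinfo.ZoneInfo("America/New_York")`, and a fixed UTC-5 offset is used here only to keep the example self-contained.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(local: time, start: time, end: time) -> bool:
    """True if local time falls inside the quiet window [start, end)."""
    if start <= end:
        return start <= local < end
    # Window wraps midnight, e.g. 22:00 -> 08:00.
    return local >= start or local < end


def next_send_time(
    now_utc: datetime, user_tz: tzinfo, start: time, end: time, critical: bool
) -> datetime:
    """Return when to send: now, or the end of the user's quiet hours (in UTC)."""
    local_now = now_utc.astimezone(user_tz)
    if critical or not in_quiet_hours(local_now.time(), start, end):
        return now_utc
    send_local = local_now.replace(
        hour=end.hour, minute=end.minute, second=0, microsecond=0
    )
    if send_local <= local_now:
        send_local += timedelta(days=1)  # quiet hours end tomorrow morning
    return send_local.astimezone(timezone.utc)


user_tz = timezone(timedelta(hours=-5))            # stand-in for a named zone
now = datetime(2026, 6, 4, 4, 0, tzinfo=timezone.utc)  # 23:00 local time
print(next_send_time(now, user_tz, time(22, 0), time(8, 0), critical=False))
```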
Step 9: Deduplication
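Duplicate notifications are a bad user experience. Because delivery is at-least-once, the same notification can be produced more than once; deduplication ensures the user receives it only once.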
Deduplication Strategy:
Generate a unique notification_id for each logical notification.
notification_id = hash(user_id + ":" + event_type + ":" + event_id + ":" + timestamp_bucket)
Using delimiters prevents hash collisions between different field combinations
(e.g., user_id="ab", event_type="c" vs user_id="a", event_type="bc").
Example:
"New message from Sam to Alex" notification_id:
hash("user_alex:new_message:msg_5001:2026-06-04T10:00")
Before sending:
1. Check Redis: "Has notification {id} been sent?"
SET notification:{id} 1 NX EX 86400 (NX = only if not exists, EX = 24h TTL)
2. If SET succeeds (key did not exist): send the notification
3. If SET fails (key exists): skip (already sent)
This prevents:
- Retry storms from sending duplicate notifications
- Multiple triggers for the same event
- Race conditions between multiple notification workers
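The ID generation and the check-and-set step can be sketched as follows. SHA-256 is an assumed choice of hash, and `InMemoryDedupStore` is a hypothetical stand-in that mimics `SET key 1 NX EX ttl` so the example runs without a Redis server. With the redis-py client, the equivalent call would be `client.set(key, 1, nx=True, ex=86400)`, which returns `True` on success and `None` if the key already exists.

```python
import hashlib
import time


def notification_id(user_id: str, event_type: str, event_id: str, timestamp_bucket: str) -> str:
    """Hash the delimited fields into a stable notification ID."""
    # The ':' delimiter keeps ("ab", "c") distinct from ("a", "bc").
    raw = ":".join([user_id, event_type, event_id, timestamp_bucket])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryDedupStore:
    """Single-process stand-in for Redis SET ... NX EX (not thread-safe)."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}
        self._clock = clock

    def set_if_absent(self, key: str, ttl_seconds: int) -> bool:
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False              # key exists: already sent
        self._expiry[key] = now + ttl_seconds
        return True                   # key was absent: caller should send


def should_send(store: InMemoryDedupStore, nid: str, ttl_seconds: int = 86400) -> bool:
    return store.set_if_absent(f"notification:{nid}", ttl_seconds)


store = InMemoryDedupStore()
nid = notification_id("user_alex", "new_message", "msg_5001", "2026-06-04T10:00")
print(should_send(store, nid))  # first attempt
print(should_send(store, nid))  # duplicate attempt
```

The atomic check-and-set is what protects against races: two workers handling the same event both issue the `SET`, and only one of them succeeds.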
Step 10: Rate Limiting
Do not overwhelm users with too many notifications.
Rate Limiting:
Per-User Limits:
Push: max 50 per hour (except critical)
Email: max 10 per hour (except transactional)
SMS: max 5 per hour (except 2FA)
Per-Channel Limits (protect third-party providers):
APNs: no published hard cap (tune via connection reuse and error handling)
FCM: ~600,000 requests per minute (default quota, adjustable)
SES: account-dependent (sandbox: 1/sec; production: varies by region, request increases)
Implementation:
Use token bucket rate limiter per user per channel.
Stored in Redis for fast lookups.
When rate limited:
Critical notifications: send anyway (bypass rate limit)
High priority: queue for later delivery
Low priority: drop and log
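A token bucket for one user and one channel can be sketched in a few lines. This version keeps state in memory for clarity; in the design above that state would live in Redis. The `decide` helper and its handling of normal priority (queued, like high priority) are assumptions, since the text only specifies critical, high, and low.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def decide(bucket: TokenBucket, priority: int) -> str:
    """Return 'send', 'queue', or 'drop' for a notification."""
    if priority == 1:
        return "send"                 # critical bypasses the rate limit
    if bucket.allow():
        return "send"
    # Assumption: normal (3) is queued like high (2); low (4) is dropped.
    return "queue" if priority <= 3 else "drop"


sms_bucket = TokenBucket(capacity=5, refill_per_second=5 / 3600)  # 5 SMS per hour
print([decide(sms_bucket, priority=2) for _ in range(6)])
```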
Step 11: Retry and Failure Handling
Retry Strategy:
Transient failures (network timeout, provider busy):
Retry with exponential backoff:
Retry 1: after waiting 1 second
Retry 2: after waiting 4 seconds
Retry 3: after waiting 16 seconds
Retry 4: after waiting 64 seconds
If retry 4 also fails: give up, move to dead letter queue
Permanent failures:
Invalid device token: remove token from database
Invalid email address: mark as bounced
Phone number not in service: mark as invalid
Dead Letter Queue (DLQ):
Notifications that fail all retries go to the DLQ.
Operations team monitors the DLQ.
DLQ notifications can be manually retried after fixing the issue.
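The retry and dead-letter flow can be sketched as one delivery loop. It reads the schedule above as one initial attempt followed by up to four retries with waits of 1, 4, 16, and 64 seconds; the exception classes, the `deliver` function, and the list used as a DLQ are hypothetical stand-ins.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Network timeout, provider busy: worth retrying."""


class PermanentError(Exception):
    """Invalid token, bounced address: retrying will not help."""


def backoff_delays(base: float = 1.0, factor: float = 4.0, retries: int = 4) -> list:
    """Waits before each retry: 1, 4, 16, 64 seconds with the defaults."""
    return [base * factor ** i for i in range(retries)]


def deliver(
    notification: dict,
    send: Callable[[dict], None],
    dead_letters: list,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    delays = backoff_delays()
    for attempt in range(len(delays) + 1):
        try:
            send(notification)
            return "sent"
        except PermanentError:
            # Caller should clean up (remove token, mark address as bounced).
            return "invalid"
        except TransientError:
            if attempt < len(delays):
                sleep(delays[attempt])
    dead_letters.append(notification)   # all attempts failed
    return "dead_lettered"


def always_busy(notification: dict) -> None:
    raise TransientError("provider busy")


dlq = []
waits = []
print(deliver({"id": "n1"}, always_busy, dlq, sleep=waits.append), waits, dlq)
```

Production systems usually add random jitter to each wait so that many failed notifications do not all retry at the same instant.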
Circuit Breaker:
If a provider (e.g., APNs) fails > 50% of requests in 1 minute:
Open the circuit breaker.
Stop sending to APNs for 60 seconds.
Redirect to backup provider or queue for later.
After 60 seconds, try again (half-open state).
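A minimal circuit breaker along these lines might look like the sketch below. The class and its parameters are illustrative; in particular, `min_requests` is an added assumption so that a single early failure cannot trip the breaker, and the half-open state is simplified to "allow traffic until the first result comes back".

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, window_seconds=60.0,
                 open_seconds=60.0, min_requests=10, clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.window_seconds = window_seconds
        self.open_seconds = open_seconds
        self.min_requests = min_requests
        self._clock = clock
        self._results = deque()      # (timestamp, succeeded) pairs
        self._opened_at = None       # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let trial traffic through (half-open).
        return self._clock() - self._opened_at >= self.open_seconds

    def record(self, succeeded: bool) -> None:
        now = self._clock()
        if self._opened_at is not None:
            if now - self._opened_at < self.open_seconds:
                return               # late result from before the trip
            if succeeded:
                self._opened_at = None   # trial succeeded: close
                self._results.clear()
            else:
                self._opened_at = now    # trial failed: stay open
            return
        self._results.append((now, succeeded))
        while self._results and now - self._results[0][0] > self.window_seconds:
            self._results.popleft()
        failures = sum(1 for _, ok in self._results if not ok)
        total = len(self._results)
        if total >= self.min_requests and failures / total > self.failure_ratio:
            self._opened_at = now


breaker = CircuitBreaker()
for ok in [True] * 4 + [False] * 6:
    breaker.record(ok)
print(breaker.allow_request())
```

While the breaker is open, workers would route to a backup provider or leave messages on the queue instead of calling the failing provider.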
Step 12: Analytics and Monitoring
Delivery Tracking:
Notification Lifecycle:
1. CREATED: notification received by the service
2. VALIDATED: passed preference and dedup checks
3. QUEUED: placed in priority queue
4. SENT: handed off to the provider (APNs/FCM/SES)
5. DELIVERED: provider confirmed delivery
6. OPENED: user opened/clicked the notification
7. FAILED: delivery failed
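The lifecycle above is effectively a small state machine, with FAILED as a terminal state reachable from any step before delivery. The sketch below encodes one plausible set of transitions; the exact transition table is an assumption, not something the lifecycle list specifies.

```python
from enum import Enum


class Status(Enum):
    CREATED = "created"
    VALIDATED = "validated"
    QUEUED = "queued"
    SENT = "sent"
    DELIVERED = "delivered"
    OPENED = "opened"
    FAILED = "failed"


# Assumed transitions: each step moves forward or fails; OPENED and FAILED are terminal.
ALLOWED = {
    Status.CREATED: {Status.VALIDATED, Status.FAILED},
    Status.VALIDATED: {Status.QUEUED, Status.FAILED},
    Status.QUEUED: {Status.SENT, Status.FAILED},
    Status.SENT: {Status.DELIVERED, Status.FAILED},
    Status.DELIVERED: {Status.OPENED},
    Status.OPENED: set(),
    Status.FAILED: set(),
}


def advance(current: Status, new: Status) -> Status:
    """Validate a status change and return the new status."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {new.name}")
    return new


status = Status.CREATED
for step in (Status.VALIDATED, Status.QUEUED, Status.SENT, Status.DELIVERED):
    status = advance(status, step)
print(status.name)
```

Rejecting illegal transitions keeps analytics consistent when provider callbacks arrive late or out of order.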
Metrics to Track:
| Metric | Target | Alert Threshold |
|---------------------|-----------|-----------------|
| Delivery rate | > 99% | < 95% |
| Push delivery time | < 5 sec | > 30 sec |
| Email delivery time | < 60 sec | > 5 min |
| Email bounce rate | < 2% | > 5% |
| Email spam rate | < 0.1% | > 0.5% |
| SMS delivery rate | > 95% | < 90% |
| Queue depth | < 10,000 | > 100,000 |
Analytics Dashboard:
- Notifications sent per channel per hour
- Delivery success rate per channel
- Average delivery latency
- Open rate and click-through rate (for email/push)
- Top notification types by volume
Complete Architecture
[Any Service] --trigger--> [Notification API]
|
[Validation]
- User preferences
- Deduplication (Redis)
- Rate limiting
- Template rendering
|
[Priority Queue (Kafka)]
/             |              \
[Priority 1]  [Priority 2]   [Priority 3-4]
(fast lane)   (normal)       (batch)
|
+----------------+------------------+
|                |                  |
[Push Workers]   [Email Workers]    [SMS Workers]
|                |                  |
[APNs]           [SES]              [Twilio]
[FCM]            [SendGrid]         [SNS]
|                |                  |
[Delivery Callbacks]
|
[Analytics DB (ClickHouse)]
|
[Monitoring Dashboard (Grafana)]
Supporting:
[Template Store (PostgreSQL)]
[User Preferences (PostgreSQL + Redis cache)]
[Device Token Store (PostgreSQL)]
[Dead Letter Queue (Kafka DLQ)]
Common Mistakes
Sending notifications synchronously. The triggering service should not wait for the notification to be delivered. Use a message queue for async processing.
No deduplication. Without dedup, retries and race conditions cause duplicate notifications. Users get annoyed fast.
Same priority for all notifications. A 2FA code must arrive in seconds. A marketing email can wait hours. Use priority queues.
Not respecting user preferences. Sending notifications to users who opted out violates trust and possibly the law (GDPR, CAN-SPAM). Always check preferences before sending.
No rate limiting per user. Sending 50 push notifications in one minute from different features creates a terrible experience.
Interview Tips
Start with the notification types. “I will support three channels: push, email, and SMS. Each has different delivery characteristics and costs.”
Mention priority levels. “Security alerts are critical and bypass quiet hours. Marketing notifications are low priority and can be batched.”
Draw the queue-based architecture. “Notification triggers go to Kafka topics separated by priority and channel. Workers consume and deliver.”
Discuss reliability. “I use at-least-once delivery with deduplication, so no notification is lost and the user receives each one only once.”
Talk about third-party providers. “Push uses APNs and FCM. Email uses SES with SendGrid as failover. SMS uses Twilio.”
Mention analytics. “I track delivery rate, latency, and open rate. Alerts fire if delivery rate drops below 95%.”
Related Articles
- System Design #17: Design a File Storage System — File sync and change notifications
- System Design #7: Message Queues — Kafka for async processing
- System Design #10: Rate Limiting — Protecting users and providers from overload
What’s Next?
In the next article, System Design #19: Design a Search Engine, you will learn:
- Web crawling and URL frontier
- Inverted indexes for fast text search
- PageRank and ranking algorithms
- Vector search and semantic search
This is part 18 of the System Design Tutorial series. Follow along to learn system design from scratch.