In the previous article, you designed a notification system. Now let us tackle one of the most complex systems in computing: a search engine.
Search engines like Google index over 100 billion web pages and serve billions of queries per day. We will design a simplified version that covers the core components: crawling, indexing, ranking, and serving.
Step 1: Requirements
Functional Requirements
- Crawl the web and discover new pages
- Index page content for fast retrieval
- Search by keywords and return ranked results
- Autocomplete (search suggestions as you type)
- Return results in under 500ms
Non-Functional Requirements
- Fresh results — new content indexed within hours
- Relevant results — best pages ranked first
- Scale to 100 billion indexed pages
- Handle 10 billion search queries per day
- High availability — search must always work
Step 2: Estimation
Indexed pages: 100 billion
Average page size: 50 KB (text content after stripping HTML)
Total index storage: 100B * 50 KB = 5 PB (raw text)
Inverted index size: ~20% of raw text = 1 PB
Queries: 10 billion per day
QPS: 10B / 86,400 = ~115,000 queries/sec
Peak QPS: ~350,000 queries/sec
Crawling:
New/updated pages per day: 1 billion
Crawl rate: 1B / 86,400 = ~11,600 pages/sec
Bandwidth: 11,600 * 100 KB (full page) = 1.16 GB/sec
Step 3: Web Crawler
The crawler discovers and downloads web pages. It is the first stage of the search engine pipeline.
Architecture
Crawler Architecture:
[Seed URLs]
|
[URL Frontier] (priority queue of URLs to crawl)
|
[DNS Resolver] (resolve domain to IP address, cache results)
|
[Fetcher] (download the page via HTTP)
|
[HTML Parser] (extract text content and links)
|
+----+----+
| |
v v
[Content [Extracted URLs]
Store] |
[URL Filter]
- Already crawled? (deduplicate)
- Robots.txt allows? (politeness)
- Same domain? (rate limit per domain)
|
[URL Frontier] (add new URLs to crawl)
URL Frontier
The URL frontier is a priority queue that decides which URLs to crawl next.
URL Frontier Design:
Two sub-queues:
1. Priority Queue (what to crawl):
- Important pages first (high PageRank, news sites, frequently updated)
- New discoveries get medium priority
- Deep pages get low priority
2. Politeness Queue (when to crawl):
- One queue per domain
- Each domain has a crawl delay (usually 1-10 seconds between requests)
- Respects robots.txt crawl-delay directive
Example:
google.com: crawl 1 page every 2 seconds
smallblog.com: crawl 1 page every 10 seconds
This prevents overwhelming websites with too many requests.
robots.txt example:
User-agent: *
Disallow: /private/
Crawl-delay: 5
This tells crawlers: do not crawl /private/ pages,
and wait at least 5 seconds between requests.
Deduplication
The crawler must avoid crawling the same page twice and avoid storing duplicate content.
URL Deduplication:
Use a Bloom filter to check if a URL has already been crawled.
Bloom filter: probabilistic data structure
- "Definitely not seen" or "probably seen"
- False positives are OK (skip a few pages unnecessarily)
- False negatives never happen (never miss a new page)
- 100 billion URLs in 10 GB of memory
Content Deduplication:
Two different URLs can have the same content (mirrors, redirects).
Use SimHash (locality-sensitive hashing):
1. Compute SimHash of the page content
2. Compare with existing SimHashes
3. If similar (Hamming distance < 3): treat as duplicate
SimHash can detect near-duplicates (pages with minor differences
like different headers/footers but same main content).
Step 4: Indexing
The indexer processes crawled pages and builds a searchable data structure.
Inverted Index
The inverted index is the core data structure of a search engine. It maps every word to the list of documents that contain it.
Inverted Index:
Forward index (what words are in each document):
doc_1: ["system", "design", "tutorial", "learn"]
doc_2: ["design", "patterns", "java"]
doc_3: ["system", "architecture", "scalable"]
Inverted index (which documents contain each word):
"system": [doc_1, doc_3]
"design": [doc_1, doc_2]
"tutorial": [doc_1]
"learn": [doc_1]
"patterns": [doc_2]
"java": [doc_2]
"architecture": [doc_3]
"scalable": [doc_3]
Query: "system design"
Look up "system": [doc_1, doc_3]
Look up "design": [doc_1, doc_2]
Intersection: [doc_1] <-- contains both words
Each entry also stores:
- Position of the word in the document (for phrase matching)
- Term frequency (how often the word appears)
- Document metadata (title, URL, page rank)
Posting List
Each word in the inverted index points to a posting list — a sorted list of document IDs with metadata.
Posting List for "system":
[
{ doc_id: 1, tf: 3, positions: [0, 15, 42] },
{ doc_id: 3, tf: 1, positions: [0] },
{ doc_id: 7, tf: 5, positions: [2, 8, 23, 45, 67] },
...
]
tf (term frequency): how many times "system" appears in the document
positions: where in the document (for phrase matching)
Sorted by doc_id for fast intersection of multiple posting lists.
Compression:
Posting lists can be huge (common words like "the" appear in billions of docs).
Use delta encoding + variable-byte encoding:
Original: [1, 3, 7, 15, 22]
Delta: [1, 2, 4, 8, 7]
VarByte: smaller binary representation
This reduces index size by 50-70%.
Index Building Pipeline
Indexing Pipeline:
[Crawled Pages]
|
[Text Extraction]
- Strip HTML tags
- Extract title, headings, body text, alt text
- Detect language
|
[Tokenization]
- Split text into words
- "System Design Tutorial" --> ["system", "design", "tutorial"]
|
[Normalization]
- Lowercase: "System" --> "system"
- Stemming: "running" --> "run", "designs" --> "design"
- Remove stop words: "the", "is", "a" (optional, modern engines keep them)
|
[Index Builder]
- For each word: add (doc_id, position) to the posting list
- Build in batches, then merge into the main index
|
[Inverted Index Store]
Step 5: Query Processing
When a user searches, the query goes through several stages.
Query Processing Pipeline:
User types: "best system design books 2026"
|
[Query Parser]
- Tokenize: ["best", "system", "design", "books", "2026"]
- Normalize: lowercase, stem
- Detect intent: informational query
- Spell check: "desing" --> "design" (did you mean?)
|
[Query Expansion]
- Synonyms: "books" --> also match "book", "textbook"
- Related terms: "system design" --> also consider "software architecture"
|
[Index Lookup]
- Look up each term in the inverted index
- Get posting lists for each term
- Intersect posting lists (AND) or union (OR)
|
[Scoring / Ranking]
- Score each matching document
- Rank by relevance score
|
[Result Assembly]
- Fetch document metadata (title, URL, snippet)
- Generate snippet (relevant excerpt with query terms highlighted)
- Return top 10 results
|
[User sees results]
Total time: < 500ms
Step 6: Ranking Algorithms
TF-IDF (Classic Approach)
TF-IDF (Term Frequency - Inverse Document Frequency):
TF (Term Frequency):
How often a word appears in a document.
tf("design", doc_1) = 3 occurrences / 100 total words = 0.03
IDF (Inverse Document Frequency):
How rare a word is across all documents.
"the" appears in 99% of documents --> low IDF (not useful for ranking)
"kubernetes" appears in 0.1% --> high IDF (very useful)
idf("design") = log(100B total docs / 500M docs with "design") = 2.3
TF-IDF Score:
score = tf * idf
Higher score = word is frequent in this document AND rare overall
This means the document is very relevant to that word.
Multi-word query:
score(doc, query) = sum of TF-IDF scores for each query term
PageRank (Link Analysis)
PageRank (simplified):
The idea: a page is important if important pages link to it.
Every page starts with a rank of 1.
In each iteration, each page distributes its rank to
all pages it links to.
Example:
Page A links to Page B and Page C.
Page A has rank 1.
Page A gives 0.5 rank to B and 0.5 rank to C.
After many iterations, pages with many inbound links
from important pages have the highest rank.
Formula:
PR(page) = (1-d) + d * sum(PR(linking_page) / outbound_links)
where d = damping factor (~0.85)
Real-world:
Google runs PageRank on the entire web graph (100B+ pages).
This is a massive distributed computation (MapReduce).
PageRank is combined with TF-IDF:
final_score = relevance_score (TF-IDF) * authority_score (PageRank)
Modern Ranking (ML-Based)
Modern Search Ranking (2026):
TF-IDF and PageRank are the foundation, but modern search engines
use machine learning models for ranking.
Two-stage ranking:
Stage 1: Candidate Retrieval (fast, simple)
- Use inverted index + TF-IDF to find top 1000 candidates
- Must be fast: < 50ms
Stage 2: Re-Ranking (slower, more accurate)
- ML model scores each of the 1000 candidates
- Features: TF-IDF score, PageRank, freshness, user location,
click-through rate, dwell time, page quality
- Model: gradient-boosted trees (LambdaMART) or neural network (BERT)
- Return top 10 results
Google uses BERT and MUM for understanding query intent:
Query: "can I use olive oil as medicine"
Old engines: match "olive oil" and "medicine" literally
BERT: understands the semantic meaning, matches relevant health articles
Step 7: Vector Search and Semantic Search
Traditional search matches keywords. Semantic search understands meaning.
Vector Search (2026 approach):
Traditional keyword search:
Query: "how to make my website faster"
Matches documents containing these exact words.
Misses: "web performance optimization techniques"
Semantic search with embeddings:
1. Convert query to a vector (embedding):
"how to make my website faster" --> [0.23, 0.87, -0.12, ...]
2. Compare with pre-computed document vectors:
"web performance optimization" --> [0.25, 0.85, -0.10, ...]
Cosine similarity: 0.97 (very similar!)
3. Return documents with the most similar vectors.
Embedding models: OpenAI text-embedding, Google Gecko, Cohere
Vector databases: Pinecone, Milvus, pgvector (PostgreSQL extension)
Hybrid approach (best of both worlds):
1. Keyword search: get top 1000 candidates from inverted index
2. Vector search: re-rank using semantic similarity
3. Combine scores: final_score = 0.7 * keyword + 0.3 * semantic
Step 8: Autocomplete
Autocomplete Architecture:
User types: "syst"
Suggestions: ["system design", "system design interview", "systems programming"]
Data Structure: Trie (prefix tree)
Trie:
s -> y -> s -> t -> e -> m
|
" " -> d -> e -> s -> i -> g -> n
|
" " -> i -> n -> t ...
Each node stores:
- Top 10 completions (pre-computed)
- Popularity score for each completion
When user types "syst":
Navigate to "syst" node in the trie.
Return the top 10 pre-computed suggestions.
Updating suggestions:
- Collect search query logs
- Count query frequency in the last 7 days
- Rebuild trie weekly with updated frequencies
- Use a two-trie approach: serve from one while rebuilding the other
Scale:
- Trie fits in memory: 5 million unique queries * 100 bytes = 500 MB
- Replicate across multiple servers for availability
- Partition by first letter for larger datasets
Step 9: System Architecture
Complete Architecture:
Crawling Pipeline:
[URL Frontier (Redis)] --> [Crawler Workers (1000s)]
| |
[Politeness Scheduler] [Downloaded Pages]
|
[Content Store (S3)]
|
[Dedup (Bloom Filter + SimHash)]
|
[Indexer Pipeline (Kafka + Spark)]
|
[Inverted Index (distributed)]
Serving Pipeline:
[User Query]
|
[Query Service (stateless)]
|
[Query Parser + Spell Check]
|
[Index Servers (sharded by term range)]
|
[Candidate Retrieval (top 1000)]
|
[Re-Ranker (ML model)]
|
[Result Assembly + Snippet Generation]
|
[Return top 10 results]
Supporting:
[PageRank Computer (batch, daily)]
[Autocomplete Service (Trie in memory)]
[Vector Search (Pinecone/Milvus)]
[Analytics (search logs, click data)]
Index Sharding
Index Sharding:
Option 1: Document-partitioned index
Each shard holds the complete inverted index for a subset of documents.
Query is sent to ALL shards. Results are merged.
Shard 1: documents 1-1M (complete index for these docs)
Shard 2: documents 1M-2M (complete index for these docs)
Pros: each shard is independent, easy to add shards
Cons: query goes to all shards (scatter-gather)
Option 2: Term-partitioned index
Each shard holds the posting lists for a subset of terms.
Query terms are routed to specific shards.
Shard 1: terms starting with A-M
Shard 2: terms starting with N-Z
Pros: query hits fewer shards (only the relevant ones)
Cons: multi-term queries still hit multiple shards
Google uses document-partitioned with 100,000s of shards.
Each shard is replicated for availability and load balancing.
Common Mistakes
Ignoring the crawler. Many candidates jump straight to the index. The crawler is a critical component — discuss URL frontier, politeness, and dedup.
Not mentioning PageRank or link analysis. Keyword matching alone produces poor results. Link analysis is essential for quality ranking.
Forgetting about freshness. A search engine with stale data is useless. Discuss re-crawl scheduling: news sites every hour, blogs every day, static pages every week.
Treating it as a single query to a single index. At web scale, the index is sharded across thousands of machines. Discuss how queries are distributed and results merged.
Interview Tips
Split into three components. “A search engine has three main parts: crawling, indexing, and serving. Let me design each.”
Draw the inverted index. Show a simple example: word to document mapping. This demonstrates you understand the core data structure.
Discuss ranking layers. “First, I use TF-IDF for candidate retrieval. Then PageRank for authority. Then an ML model for final ranking.”
Mention scale numbers. “Google indexes 100B+ pages. The index is sharded across hundreds of thousands of servers.”
Bring up semantic search. “For modern search in 2026, I would add a vector search layer using embeddings for semantic understanding.”
Related Articles
- System Design #12: Data Partitioning — Sharding the index across machines
- System Design #4: Caching — Caching frequent queries
- System Design #3: Load Balancers — Distributing query traffic
What’s Next?
In the final article, System Design #20: Interview Tips and Cheat Sheet, you will learn:
- The 4-step interview framework
- Back-of-the-envelope estimation cheat sheet
- Common mistakes that fail candidates
- How to practice system design effectively
This is part 19 of the System Design Tutorial series. Follow along to learn system design from scratch.