In the previous article, you designed a notification system. Now let us tackle one of the most complex systems in computing: a search engine.

Search engines like Google index over 100 billion web pages and serve billions of queries per day. We will design a simplified version that covers the core components: crawling, indexing, ranking, and serving.

Step 1: Requirements

Functional Requirements

  1. Crawl the web and discover new pages
  2. Index page content for fast retrieval
  3. Search by keywords and return ranked results
  4. Autocomplete (search suggestions as you type)
  5. Return results in under 500ms

Non-Functional Requirements

  1. Fresh results — new content indexed within hours
  2. Relevant results — best pages ranked first
  3. Scale to 100 billion indexed pages
  4. Handle 10 billion search queries per day
  5. High availability — search must always work

Step 2: Estimation

Indexed pages: 100 billion
Average page size: 50 KB (text content after stripping HTML)
Total index storage: 100B * 50 KB = 5 PB (raw text)
Inverted index size: ~20% of raw text = 1 PB

Queries: 10 billion per day
QPS: 10B / 86,400 = ~115,000 queries/sec
Peak QPS: ~350,000 queries/sec

Crawling:
  New/updated pages per day: 1 billion
  Crawl rate: 1B / 86,400 = ~11,600 pages/sec
  Bandwidth: 11,600 * 100 KB (full page) = 1.16 GB/sec

Step 3: Web Crawler

The crawler discovers and downloads web pages. It is the first stage of the search engine pipeline.

Architecture

Crawler Architecture:

  [Seed URLs]
       |
  [URL Frontier] (priority queue of URLs to crawl)
       |
  [DNS Resolver] (resolve domain to IP address, cache results)
       |
  [Fetcher] (download the page via HTTP)
       |
  [HTML Parser] (extract text content and links)
       |
  +----+----+
  |         |
  v         v
[Content   [Extracted URLs]
 Store]         |
               [URL Filter]
               - Already crawled? (deduplicate)
               - Robots.txt allows? (politeness)
               - Same domain? (rate limit per domain)
                    |
               [URL Frontier] (add new URLs to crawl)

URL Frontier

The URL frontier is a priority queue that decides which URLs to crawl next.

URL Frontier Design:

  Two sub-queues:

  1. Priority Queue (what to crawl):
     - Important pages first (high PageRank, news sites, frequently updated)
     - New discoveries get medium priority
     - Deep pages get low priority

  2. Politeness Queue (when to crawl):
     - One queue per domain
     - Each domain has a crawl delay (usually 1-10 seconds between requests)
     - Respects robots.txt crawl-delay directive

  Example:
    google.com: crawl 1 page every 2 seconds
    smallblog.com: crawl 1 page every 10 seconds

  This prevents overwhelming websites with too many requests.

  robots.txt example:
    User-agent: *
    Disallow: /private/
    Crawl-delay: 5

    This tells crawlers: do not crawl /private/ pages,
    and wait at least 5 seconds between requests.

Deduplication

The crawler must avoid crawling the same page twice and avoid storing duplicate content.

URL Deduplication:

  Use a Bloom filter to check if a URL has already been crawled.
  Bloom filter: probabilistic data structure
    - "Definitely not seen" or "probably seen"
    - False positives are OK (skip a few pages unnecessarily)
    - False negatives never happen (never miss a new page)
    - 100 billion URLs in 10 GB of memory

Content Deduplication:

  Two different URLs can have the same content (mirrors, redirects).
  Use SimHash (locality-sensitive hashing):
    1. Compute SimHash of the page content
    2. Compare with existing SimHashes
    3. If similar (Hamming distance < 3): treat as duplicate

  SimHash can detect near-duplicates (pages with minor differences
  like different headers/footers but same main content).

Step 4: Indexing

The indexer processes crawled pages and builds a searchable data structure.

Inverted Index

The inverted index is the core data structure of a search engine. It maps every word to the list of documents that contain it.

Inverted Index:

  Forward index (what words are in each document):
    doc_1: ["system", "design", "tutorial", "learn"]
    doc_2: ["design", "patterns", "java"]
    doc_3: ["system", "architecture", "scalable"]

  Inverted index (which documents contain each word):
    "system":       [doc_1, doc_3]
    "design":       [doc_1, doc_2]
    "tutorial":     [doc_1]
    "learn":        [doc_1]
    "patterns":     [doc_2]
    "java":         [doc_2]
    "architecture": [doc_3]
    "scalable":     [doc_3]

  Query: "system design"
    Look up "system":  [doc_1, doc_3]
    Look up "design":  [doc_1, doc_2]
    Intersection:      [doc_1]  <-- contains both words

  Each entry also stores:
    - Position of the word in the document (for phrase matching)
    - Term frequency (how often the word appears)
    - Document metadata (title, URL, page rank)

Posting List

Each word in the inverted index points to a posting list — a sorted list of document IDs with metadata.

Posting List for "system":

  [
    { doc_id: 1, tf: 3, positions: [0, 15, 42] },
    { doc_id: 3, tf: 1, positions: [0] },
    { doc_id: 7, tf: 5, positions: [2, 8, 23, 45, 67] },
    ...
  ]

  tf (term frequency): how many times "system" appears in the document
  positions: where in the document (for phrase matching)

  Sorted by doc_id for fast intersection of multiple posting lists.

  Compression:
    Posting lists can be huge (common words like "the" appear in billions of docs).
    Use delta encoding + variable-byte encoding:
      Original:   [1, 3, 7, 15, 22]
      Delta:      [1, 2, 4, 8, 7]
      VarByte:    smaller binary representation

    This reduces index size by 50-70%.

Index Building Pipeline

Indexing Pipeline:

  [Crawled Pages]
       |
  [Text Extraction]
    - Strip HTML tags
    - Extract title, headings, body text, alt text
    - Detect language
       |
  [Tokenization]
    - Split text into words
    - "System Design Tutorial" --> ["system", "design", "tutorial"]
       |
  [Normalization]
    - Lowercase: "System" --> "system"
    - Stemming: "running" --> "run", "designs" --> "design"
    - Remove stop words: "the", "is", "a" (optional, modern engines keep them)
       |
  [Index Builder]
    - For each word: add (doc_id, position) to the posting list
    - Build in batches, then merge into the main index
       |
  [Inverted Index Store]

Step 5: Query Processing

When a user searches, the query goes through several stages.

Query Processing Pipeline:

  User types: "best system design books 2026"
       |
  [Query Parser]
    - Tokenize: ["best", "system", "design", "books", "2026"]
    - Normalize: lowercase, stem
    - Detect intent: informational query
    - Spell check: "desing" --> "design" (did you mean?)
       |
  [Query Expansion]
    - Synonyms: "books" --> also match "book", "textbook"
    - Related terms: "system design" --> also consider "software architecture"
       |
  [Index Lookup]
    - Look up each term in the inverted index
    - Get posting lists for each term
    - Intersect posting lists (AND) or union (OR)
       |
  [Scoring / Ranking]
    - Score each matching document
    - Rank by relevance score
       |
  [Result Assembly]
    - Fetch document metadata (title, URL, snippet)
    - Generate snippet (relevant excerpt with query terms highlighted)
    - Return top 10 results
       |
  [User sees results]

  Total time: < 500ms

Step 6: Ranking Algorithms

TF-IDF (Classic Approach)

TF-IDF (Term Frequency - Inverse Document Frequency):

  TF (Term Frequency):
    How often a word appears in a document.
    tf("design", doc_1) = 3 occurrences / 100 total words = 0.03

  IDF (Inverse Document Frequency):
    How rare a word is across all documents.
    "the" appears in 99% of documents --> low IDF (not useful for ranking)
    "kubernetes" appears in 0.1% --> high IDF (very useful)
    idf("design") = log(100B total docs / 500M docs with "design") = 2.3

  TF-IDF Score:
    score = tf * idf
    Higher score = word is frequent in this document AND rare overall
    This means the document is very relevant to that word.

  Multi-word query:
    score(doc, query) = sum of TF-IDF scores for each query term
PageRank (simplified):

  The idea: a page is important if important pages link to it.

  Every page starts with a rank of 1.
  In each iteration, each page distributes its rank to
  all pages it links to.

  Example:
    Page A links to Page B and Page C.
    Page A has rank 1.
    Page A gives 0.5 rank to B and 0.5 rank to C.

  After many iterations, pages with many inbound links
  from important pages have the highest rank.

  Formula:
    PR(page) = (1-d) + d * sum(PR(linking_page) / outbound_links)
    where d = damping factor (~0.85)

  Real-world:
    Google runs PageRank on the entire web graph (100B+ pages).
    This is a massive distributed computation (MapReduce).

  PageRank is combined with TF-IDF:
    final_score = relevance_score (TF-IDF) * authority_score (PageRank)

Modern Ranking (ML-Based)

Modern Search Ranking (2026):

  TF-IDF and PageRank are the foundation, but modern search engines
  use machine learning models for ranking.

  Two-stage ranking:

  Stage 1: Candidate Retrieval (fast, simple)
    - Use inverted index + TF-IDF to find top 1000 candidates
    - Must be fast: < 50ms

  Stage 2: Re-Ranking (slower, more accurate)
    - ML model scores each of the 1000 candidates
    - Features: TF-IDF score, PageRank, freshness, user location,
      click-through rate, dwell time, page quality
    - Model: gradient-boosted trees (LambdaMART) or neural network (BERT)
    - Return top 10 results

  Google uses BERT and MUM for understanding query intent:
    Query: "can I use olive oil as medicine"
    Old engines: match "olive oil" and "medicine" literally
    BERT: understands the semantic meaning, matches relevant health articles

Traditional search matches keywords. Semantic search understands meaning.

Vector Search (2026 approach):

  Traditional keyword search:
    Query: "how to make my website faster"
    Matches documents containing these exact words.
    Misses: "web performance optimization techniques"

  Semantic search with embeddings:
    1. Convert query to a vector (embedding):
       "how to make my website faster" --> [0.23, 0.87, -0.12, ...]

    2. Compare with pre-computed document vectors:
       "web performance optimization" --> [0.25, 0.85, -0.10, ...]
       Cosine similarity: 0.97 (very similar!)

    3. Return documents with the most similar vectors.

  Embedding models: OpenAI text-embedding, Google Gecko, Cohere
  Vector databases: Pinecone, Milvus, pgvector (PostgreSQL extension)

  Hybrid approach (best of both worlds):
    1. Keyword search: get top 1000 candidates from inverted index
    2. Vector search: re-rank using semantic similarity
    3. Combine scores: final_score = 0.7 * keyword + 0.3 * semantic

Step 8: Autocomplete

Autocomplete Architecture:

  User types: "syst"
  Suggestions: ["system design", "system design interview", "systems programming"]

  Data Structure: Trie (prefix tree)

  Trie:
    s -> y -> s -> t -> e -> m
                              |
                         " " -> d -> e -> s -> i -> g -> n
                                                          |
                                                     " " -> i -> n -> t ...

  Each node stores:
    - Top 10 completions (pre-computed)
    - Popularity score for each completion

  When user types "syst":
    Navigate to "syst" node in the trie.
    Return the top 10 pre-computed suggestions.

  Updating suggestions:
    - Collect search query logs
    - Count query frequency in the last 7 days
    - Rebuild trie weekly with updated frequencies
    - Use a two-trie approach: serve from one while rebuilding the other

  Scale:
    - Trie fits in memory: 5 million unique queries * 100 bytes = 500 MB
    - Replicate across multiple servers for availability
    - Partition by first letter for larger datasets

Step 9: System Architecture

Complete Architecture:

  Crawling Pipeline:
    [URL Frontier (Redis)] --> [Crawler Workers (1000s)]
         |                          |
    [Politeness Scheduler]    [Downloaded Pages]
                                    |
                              [Content Store (S3)]
                                    |
                              [Dedup (Bloom Filter + SimHash)]
                                    |
                              [Indexer Pipeline (Kafka + Spark)]
                                    |
                              [Inverted Index (distributed)]

  Serving Pipeline:
    [User Query]
         |
    [Query Service (stateless)]
         |
    [Query Parser + Spell Check]
         |
    [Index Servers (sharded by term range)]
         |
    [Candidate Retrieval (top 1000)]
         |
    [Re-Ranker (ML model)]
         |
    [Result Assembly + Snippet Generation]
         |
    [Return top 10 results]

  Supporting:
    [PageRank Computer (batch, daily)]
    [Autocomplete Service (Trie in memory)]
    [Vector Search (Pinecone/Milvus)]
    [Analytics (search logs, click data)]

Index Sharding

Index Sharding:

  Option 1: Document-partitioned index
    Each shard holds the complete inverted index for a subset of documents.
    Query is sent to ALL shards. Results are merged.

    Shard 1: documents 1-1M (complete index for these docs)
    Shard 2: documents 1M-2M (complete index for these docs)

    Pros: each shard is independent, easy to add shards
    Cons: query goes to all shards (scatter-gather)

  Option 2: Term-partitioned index
    Each shard holds the posting lists for a subset of terms.
    Query terms are routed to specific shards.

    Shard 1: terms starting with A-M
    Shard 2: terms starting with N-Z

    Pros: query hits fewer shards (only the relevant ones)
    Cons: multi-term queries still hit multiple shards

  Google uses document-partitioned with 100,000s of shards.
  Each shard is replicated for availability and load balancing.

Common Mistakes

  1. Ignoring the crawler. Many candidates jump straight to the index. The crawler is a critical component — discuss URL frontier, politeness, and dedup.

  2. Not mentioning PageRank or link analysis. Keyword matching alone produces poor results. Link analysis is essential for quality ranking.

  3. Forgetting about freshness. A search engine with stale data is useless. Discuss re-crawl scheduling: news sites every hour, blogs every day, static pages every week.

  4. Treating it as a single query to a single index. At web scale, the index is sharded across thousands of machines. Discuss how queries are distributed and results merged.

Interview Tips

  1. Split into three components. “A search engine has three main parts: crawling, indexing, and serving. Let me design each.”

  2. Draw the inverted index. Show a simple example: word to document mapping. This demonstrates you understand the core data structure.

  3. Discuss ranking layers. “First, I use TF-IDF for candidate retrieval. Then PageRank for authority. Then an ML model for final ranking.”

  4. Mention scale numbers. “Google indexes 100B+ pages. The index is sharded across hundreds of thousands of servers.”

  5. Bring up semantic search. “For modern search in 2026, I would add a vector search layer using embeddings for semantic understanding.”

What’s Next?

In the final article, System Design #20: Interview Tips and Cheat Sheet, you will learn:

  • The 4-step interview framework
  • Back-of-the-envelope estimation cheat sheet
  • Common mistakes that fail candidates
  • How to practice system design effectively

This is part 19 of the System Design Tutorial series. Follow along to learn system design from scratch.