In the previous article, you designed a notification system. Now let us tackle one of the most complex systems in computing: a search engine.
Search engines like Google index over 100 billion web pages and serve billions of queries per day. We will design a simplified version that covers the core components: crawling, indexing, ranking, and serving.
Step 1: Requirements

Functional Requirements
- Crawl the web and discover new pages
- Index page content for fast retrieval
- Search by keywords and return ranked results
- Autocomplete (search suggestions as you type)
- Return results in under 500 ms

Non-Functional Requirements
- Fresh results: new content indexed within hours
- Relevant results: best pages ranked first
- Scale to 100 billion indexed pages
- Handle 10 billion search queries per day
- High availability: search must always work

Step 2: Estimation

Indexed pages: 100 billion
Average page size: 50 KB (text content after stripping HTML)
Total index storage: 100B * 50 KB = 5 PB (raw text)
Inverted index size: ~20% of raw text = 1 PB

Queries: 10 billion per day
QPS: 10B / 86,400 = ~115,000 queries/sec
Peak QPS: ~350,000 queries/sec

Crawling:
New/updated pages per day: 1 billion
Crawl rate: 1B / 86,400 = ~11,600 pages/sec
Bandwidth: 11,600 * 100 KB (full page) = 1.16 GB/sec

Step 3: Web Crawler

The crawler discovers and downloads web pages. It is the first stage of the search engine pipeline.
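To make "discovers and downloads" concrete, here is a minimal sketch of a breadth-first crawl loop. It is an illustration under simplifying assumptions, not a production design: the names `crawl`, `fetch_links`, `FAKE_WEB`, and `fake_fetch_links` are hypothetical, and the "web" is an in-memory dictionary standing in for real HTTP fetches and HTML link extraction. Politeness, robots.txt handling, retries, and distribution across machines are all left out.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], list[str]],
          max_pages: int) -> list[str]:
    """Breadth-first crawl; returns URLs in the order they were downloaded."""
    # Frontier: URLs discovered but not yet fetched (duplicate seeds dropped).
    frontier = deque(dict.fromkeys(seeds))
    # Every URL ever queued, so no page is scheduled twice.
    seen = set(frontier)
    crawled: list[str] = []

    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        links = fetch_links(url)  # "download" the page and extract its links
        crawled.append(url)
        for link in links:
            if link not in seen:  # a newly discovered page
                seen.add(link)
                frontier.append(link)
    return crawled


# Hypothetical in-memory link graph used instead of real network calls.
FAKE_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["a.example/", "c.example/"],
    "c.example/": [],
}


def fake_fetch_links(url: str) -> list[str]:
    """Return the outgoing links of a page in the fake web."""
    return FAKE_WEB.get(url, [])


if __name__ == "__main__":
    for page in crawl(["a.example/"], fake_fetch_links, max_pages=10):
        print(page)
```

Starting from a single seed, the loop reaches all four pages in the fake graph, and the `seen` set keeps the cycle between `a.example/` and `a.example/about` from being crawled twice. At the ~11,600 pages/sec estimated in Step 2, a single in-process queue like this would not be enough. The sketch only shows the discovery logic.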
...