In the previous tutorial, we built a REST API with FastAPI. Now let’s build a web scraper — a program that extracts data from web pages automatically.
We will use httpx to fetch pages and BeautifulSoup to parse HTML. By the end, you will know how to extract data, save it to JSON and CSV, and scrape responsibly.
When to Scrape (and When Not To)
Web scraping is useful for:
- Collecting data that is not available via an API
- Monitoring prices, job listings, or news
- Research and data analysis
But check these first:
- Check for an API — many websites have a public API. Use it instead of scraping.
- Read the Terms of Service — some websites prohibit scraping.
- Check robots.txt — visit https://example.com/robots.txt to see what paths are allowed.
- Do not scrape personal data — GDPR and similar laws apply.
- Be respectful — add delays between requests. Do not overload servers.
For this tutorial, we use https://books.toscrape.com/ — a website designed specifically for scraping practice.
Install Dependencies
pip install httpx beautifulsoup4 lxml
- httpx — makes HTTP requests (sync and async)
- beautifulsoup4 — parses HTML
- lxml — fast HTML parser (used by BeautifulSoup)
Fetching a Web Page
import httpx
response = httpx.get(
"https://books.toscrape.com/",
timeout=10.0,
headers={"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"},
)
print(response.status_code) # 200
print(len(response.text)) # HTML content
Always set a timeout and a User-Agent header. Some websites block requests without a User-Agent.
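If you plan to make several requests, create a single httpx.Client that carries these defaults, so every request inherits them and connections are reused. A minimal sketch:
import httpx
# The client applies the timeout and User-Agent to every request
# and reuses connections between them.
with httpx.Client(
    timeout=10.0,
    headers={"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"},
) as client:
    response = client.get("https://books.toscrape.com/")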
Parsing HTML with BeautifulSoup
from bs4 import BeautifulSoup
html = response.text
soup = BeautifulSoup(html, "lxml")
# Get the page title
title = soup.find("title").get_text(strip=True)
print(title) # All products | Books to Scrape - Sandbox
BeautifulSoup(html, "lxml") parses the HTML and creates a tree structure you can search.
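Every tag in that tree knows its name, parent, and children, so you can navigate from any element you find:
h1 = soup.find("h1")
print(h1.name)         # h1
print(h1.parent.name)  # the tag that wraps the <h1>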
Finding Elements
find() and find_all()
# Find the first h1 tag
h1 = soup.find("h1")
print(h1.get_text()) # All products
# Find all product titles
titles = soup.find_all("h3")
for title_tag in titles[:5]:
a_tag = title_tag.find("a")
print(a_tag["title"]) # Book title from the title attribute
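find_all() can also filter by attributes. Pass class_ (note the trailing underscore, since class is a Python keyword) to match by CSS class:
# Find all price paragraphs by class
prices = soup.find_all("p", class_="price_color")
print(len(prices))  # 20 on the first page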
CSS Selectors with select()
CSS selectors are often easier than find(). If you know CSS, you already know how to use them:
# Select all product titles
titles = soup.select("article.product_pod h3 a")
for a in titles[:5]:
print(a["title"])
# Select all prices
prices = soup.select("p.price_color")
for p in prices[:5]:
print(p.get_text(strip=True)) # £51.77
# Select by ID
main = soup.select_one("#default")
# Select by attribute
links = soup.select('a[href*="catalogue"]')
Common CSS selectors:
| Selector | Meaning |
|---|---|
| tag | All <tag> elements |
| .class | Elements with class |
| #id | Element with ID |
| tag.class | <tag> with class |
| div > p | Direct child <p> of <div> |
| div p | Any <p> inside <div> |
| a[href] | <a> with href attribute |
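Here are a few of these selectors in action on a small inline snippet:
snippet = '<div id="box"><p class="note">hi</p><a href="/x">link</a></div>'
s = BeautifulSoup(snippet, "lxml")
print(s.select_one("#box p.note").get_text())  # hi
print(s.select("a[href]")[0]["href"])          # /x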
Extracting Book Data
Here is a complete function that extracts book data from a page:
from dataclasses import dataclass, asdict
@dataclass
class Book:
title: str
price: float
rating: int
availability: str
url: str = ""
RATING_MAP = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
def extract_books(html: str) -> list[Book]:
"""Extract book data from HTML."""
soup = BeautifulSoup(html, "lxml")
books = []
for article in soup.select("article.product_pod"):
# Title — from the <a> title attribute
title_tag = article.select_one("h3 a")
title = title_tag.get("title", "") if title_tag else ""
# Price — strip currency symbol
price_tag = article.select_one("p.price_color")
price_text = price_tag.get_text(strip=True) if price_tag else "0"
price = parse_price(price_text)
# Rating — from CSS class name
rating_tag = article.select_one("p.star-rating")
rating = 0
if rating_tag:
for cls in rating_tag.get("class", []):
if cls.lower() in RATING_MAP:
rating = RATING_MAP[cls.lower()]
# Availability
avail_tag = article.select_one("p.availability")
availability = avail_tag.get_text(strip=True) if avail_tag else "Unknown"
books.append(Book(
title=title, price=price, rating=rating, availability=availability
))
return books
def parse_price(text: str) -> float:
"""Parse '$25.99' or '£25.99' to 25.99."""
cleaned = "".join(c for c in text if c.isdigit() or c == ".")
try:
return float(cleaned)
except ValueError:
return 0.0
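Note that extract_books leaves the url field empty. You can fill it from each product's relative link. A minimal helper, assuming the hrefs are relative to the page you fetched (as they are on books.toscrape.com):
from urllib.parse import urljoin
def extract_book_url(article, page_url: str) -> str:
    """Build an absolute URL from a product's relative link (sketch)."""
    a = article.select_one("h3 a")
    href = a.get("href", "") if a else ""
    return urljoin(page_url, href)
Call it inside the loop in extract_books and pass the result as the url argument when constructing each Book.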
Saving Data to JSON
import json
from pathlib import Path
def save_to_json(data: list[dict], filepath: str) -> None:
"""Save data to a JSON file."""
Path(filepath).write_text(json.dumps(data, indent=2))
# Usage
books = extract_books(html)
save_to_json([asdict(b) for b in books], "books.json")
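To verify the file, read it back:
loaded = json.loads(Path("books.json").read_text())
print(len(loaded), "books loaded")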
Saving Data to CSV
import csv
def save_to_csv(data: list[dict], filepath: str) -> None:
"""Save data to a CSV file."""
if not data:
return
fieldnames = list(data[0].keys())
    with open(filepath, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(data)
save_to_csv([asdict(b) for b in books], "books.csv")
Extracting HTML Tables
Many websites display data in HTML tables. Here is a reusable function:
def extract_table(soup: BeautifulSoup, selector: str = "table") -> list[dict]:
"""Extract data from an HTML table into a list of dicts."""
table = soup.select_one(selector)
if table is None:
return []
    # First row = headers, remaining rows = data.
    # Use the full row list so tables wrapped in <thead>/<tbody>
    # are not double-counted (tr:first-child would match the first
    # row of each section).
    all_rows = table.select("tr")
    if not all_rows:
        return []
    headers = [cell.get_text(strip=True) for cell in all_rows[0].select("th, td")]
    rows = []
    for tr in all_rows[1:]:
        cells = [td.get_text(strip=True) for td in tr.select("td")]
        if cells and headers:
            rows.append(dict(zip(headers, cells)))
    return rows
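For example, against an inline table:
sample = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Book A</td><td>£10.00</td></tr>
  <tr><td>Book B</td><td>£12.50</td></tr>
</table>
"""
print(extract_table(BeautifulSoup(sample, "lxml")))
# [{'Name': 'Book A', 'Price': '£10.00'}, {'Name': 'Book B', 'Price': '£12.50'}]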
Rate Limiting
Always add delays between requests to avoid overloading the server:
import time
class ScraperRateLimiter:
"""Enforce a delay between requests."""
def __init__(self, delay_seconds: float = 1.0) -> None:
self.delay = delay_seconds
self._last_time = 0.0
def wait(self) -> None:
elapsed = time.monotonic() - self._last_time
if elapsed < self.delay and self._last_time > 0:
time.sleep(self.delay - elapsed)
self._last_time = time.monotonic()
Usage:
limiter = ScraperRateLimiter(delay_seconds=1.0)
for page in range(1, 6):
limiter.wait()
response = httpx.get(f"https://books.toscrape.com/catalogue/page-{page}.html")
books = extract_books(response.text)
print(f"Page {page}: {len(books)} books")
Checking robots.txt
Always check robots.txt before scraping:
def check_robots_txt(robots_text: str, path: str) -> bool:
"""Check if a path is allowed by robots.txt."""
for line in robots_text.splitlines():
line = line.strip()
if line.lower().startswith("disallow:"):
disallowed = line.split(":", 1)[1].strip()
if disallowed and path.startswith(disallowed):
return False
return True
# Usage
robots = httpx.get("https://books.toscrape.com/robots.txt").text
print(check_robots_txt(robots, "/catalogue/")) # True
For production, use Python’s built-in urllib.robotparser module for a more complete implementation.
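A minimal sketch with the standard library parser (if the site returns 404 for robots.txt, the parser allows everything):
from urllib.robotparser import RobotFileParser
rp = RobotFileParser("https://books.toscrape.com/robots.txt")
rp.read()  # fetches and parses the file
print(rp.can_fetch("MyScraper/1.0", "https://books.toscrape.com/catalogue/"))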
Complete Scraping Script
Here is the full script that scrapes multiple pages:
import httpx
from dataclasses import asdict
# Reuses Book, extract_books, parse_price, ScraperRateLimiter,
# save_to_json, and save_to_csv defined earlier.
def scrape_books(pages: int = 5) -> list[Book]:
"""Scrape books from multiple pages."""
all_books = []
limiter = ScraperRateLimiter(delay_seconds=1.0)
with httpx.Client(
timeout=10.0,
headers={"User-Agent": "PythonTutorial/1.0"},
) as client:
for page in range(1, pages + 1):
limiter.wait()
url = f"https://books.toscrape.com/catalogue/page-{page}.html"
try:
response = client.get(url)
response.raise_for_status()
except httpx.HTTPError as e:
print(f"Error on page {page}: {e}")
continue
books = extract_books(response.text)
all_books.extend(books)
print(f"Page {page}: {len(books)} books")
return all_books
# Run the scraper
books = scrape_books(pages=3)
print(f"\nTotal: {len(books)} books")
# Save results
save_to_json([asdict(b) for b in books], "books.json")
save_to_csv([asdict(b) for b in books], "books.csv")
print("Saved to books.json and books.csv")
Handling Errors
Things will go wrong. Handle them gracefully:
import httpx
def safe_fetch(url: str, client: httpx.Client) -> str | None:
"""Fetch a URL with error handling."""
try:
response = client.get(url)
response.raise_for_status()
return response.text
except httpx.TimeoutException:
print(f"Timeout: {url}")
return None
except httpx.HTTPStatusError as e:
print(f"HTTP error {e.response.status_code}: {url}")
return None
except httpx.RequestError as e:
print(f"Network error: {url} — {e}")
return None
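Used together with the earlier pieces:
with httpx.Client(timeout=10.0, headers={"User-Agent": "PythonTutorial/1.0"}) as client:
    html = safe_fetch("https://books.toscrape.com/", client)
    if html is not None:
        print(len(extract_books(html)), "books found")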
Dynamic Content (When BeautifulSoup is Not Enough)
BeautifulSoup only parses the HTML that the server sends. If a website loads content with JavaScript (like single-page apps), BeautifulSoup will not see that content.
For JavaScript-rendered pages, use one of these tools:
- Playwright — automates a real browser (recommended)
- Selenium — older alternative to Playwright
# Example with Playwright (not covered in this tutorial)
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch()
page = browser.new_page()
page.goto("https://example.com")
page.wait_for_selector(".dynamic-content")
html = page.content()
browser.close()
For most websites (including books.toscrape.com), BeautifulSoup is enough.
Common Mistakes
1. Ignoring robots.txt
# BAD — scraping without checking
httpx.get("https://example.com/private/data")
# GOOD — check first
robots = httpx.get("https://example.com/robots.txt").text
if check_robots_txt(robots, "/private/data"):
httpx.get("https://example.com/private/data")
2. No Rate Limiting
# BAD — hammering the server
for i in range(1000):
httpx.get(f"https://example.com/page/{i}")
# GOOD — add delays
limiter = ScraperRateLimiter(delay_seconds=1.0)
for i in range(1000):
limiter.wait()
httpx.get(f"https://example.com/page/{i}")
3. No Error Handling
# BAD — crashes on missing elements
title = soup.select_one("h3 a")["title"]  # TypeError if no match, KeyError if no title attribute
# GOOD — handle missing elements
title_tag = soup.select_one("h3 a")
title = title_tag.get("title", "") if title_tag else ""
Source Code
You can find all the code from this tutorial on GitHub:
GitHub: python-tutorial/tutorial-24-scraper
What’s Next?
In the next tutorial, we will build an automation script — a file organizer that watches a folder, sorts files by type, sends reports, and runs on a schedule.
Related Articles
- Python Tutorial #19: HTTP and APIs — httpx fundamentals
- Python Tutorial #12: File I/O — reading and writing JSON/CSV
- Python Tutorial #18: Async/Await — async scraping for speed