In the previous tutorial, we built a REST API with FastAPI. Now let’s build a web scraper — a program that extracts data from web pages automatically.
We will use httpx to fetch pages and BeautifulSoup to parse HTML. By the end, you will know how to extract data, save it to JSON and CSV, and scrape responsibly.
When to Scrape (and When Not To)
Web scraping is useful for:
- Collecting data that is not available via an API
- Monitoring prices, job listings, or news
- Research and data analysis
But check these first:
- Check for an API — many websites have a public API. Use it instead of scraping.
- Read the Terms of Service — some websites prohibit scraping.
- Check robots.txt — visit https://example.com/robots.txt to see what paths are allowed.
- Do not scrape personal data — GDPR and similar laws apply.
- Be respectful — add delays between requests. Do not overload servers.
For this tutorial, we use https://books.toscrape.com/ — a website designed specifically for scraping practice.
Install Dependencies
pip install httpx beautifulsoup4 lxml
- httpx — makes HTTP requests (sync and async)
- beautifulsoup4 — parses HTML
- lxml — fast HTML parser (used by BeautifulSoup)
Fetching a Web Page
import httpx
response = httpx.get(
"https://books.toscrape.com/",
timeout=10.0,
headers={"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"},
)
print(response.status_code) # 200
print(len(response.text)) # HTML content
Always set a timeout and a User-Agent header. Some websites block requests without a User-Agent.
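If you plan to make several requests, create a single httpx.Client that carries these defaults, so every request inherits them and connections are reused. A minimal sketch:
import httpx
# The client applies the timeout and User-Agent to every request
# and reuses connections between them.
with httpx.Client(
    timeout=10.0,
    headers={"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"},
) as client:
    response = client.get("https://books.toscrape.com/")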
Parsing HTML with BeautifulSoup
from bs4 import BeautifulSoup
html = response.text
soup = BeautifulSoup(html, "lxml")
# Get the page title
title = soup.find("title").get_text(strip=True)
print(title) # All products | Books to Scrape - Sandbox
BeautifulSoup(html, "lxml") parses the HTML and creates a tree structure you can search.
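Every tag in that tree knows its name, parent, and children, so you can navigate from any element you find:
h1 = soup.find("h1")
print(h1.name)         # h1
print(h1.parent.name)  # the tag that wraps the <h1>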
Finding Elements
find() and find_all()
# Find the first h1 tag
h1 = soup.find("h1")
print(h1.get_text()) # All products
# Find all product titles
titles = soup.find_all("h3")
for title_tag in titles[:5]:
a_tag = title_tag.find("a")
print(a_tag["title"]) # Book title from the title attribute
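find_all() can also filter by attributes. Pass class_ (note the trailing underscore, since class is a Python keyword) to match by CSS class:
# Find all price paragraphs by class
prices = soup.find_all("p", class_="price_color")
print(len(prices))  # 20 on the first page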
CSS Selectors with select()
CSS selectors are often easier than find(). If you know CSS, you already know how to use them:
# Select all product titles
titles = soup.select("article.product_pod h3 a")
for a in titles[:5]:
print(a["title"])
# Select all prices
prices = soup.select("p.price_color")
for p in prices[:5]:
print(p.get_text(strip=True)) # £51.77
# Select by ID
main = soup.select_one("#default")
# Select by attribute
links = soup.select('a[href*="catalogue"]')
Common CSS selectors:
| Selector | Meaning |
|---|---|
| tag | All <tag> elements |
| .class | Elements with class |
| #id | Element with ID |
| tag.class | <tag> with class |
| div > p | Direct child <p> of <div> |
| div p | Any <p> inside <div> |
| a[href] | <a> with href attribute |
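Here are a few of these selectors in action on a small inline snippet:
snippet = '<div id="box"><p class="note">hi</p><a href="/x">link</a></div>'
s = BeautifulSoup(snippet, "lxml")
print(s.select_one("#box p.note").get_text())  # hi
print(s.select("a[href]")[0]["href"])          # /x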
Extracting Book Data
Here is a complete function that extracts book data from a page:
from dataclasses import dataclass, asdict
@dataclass
class Book:
title: str
price: float
rating: int
availability: str
url: str = ""
RATING_MAP = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
def extract_books(html: str) -> list[Book]:
"""Extract book data from HTML."""
soup = BeautifulSoup(html, "lxml")
books = []
for article in soup.select("article.product_pod"):
# Title — from the <a> title attribute
title_tag = article.select_one("h3 a")
title = title_tag.get("title", "") if title_tag else ""
# Price — strip currency symbol
price_tag = article.select_one("p.price_color")
price_text = price_tag.get_text(strip=True) if price_tag else "0"
price = parse_price(price_text)
# Rating — from CSS class name
rating_tag = article.select_one("p.star-rating")
rating = 0
if rating_tag:
for cls in rating_tag.get("class", []):
if cls.lower() in RATING_MAP:
rating = RATING_MAP[cls.lower()]
# Availability
avail_tag = article.select_one("p.availability")
availability = avail_tag.get_text(strip=True) if avail_tag else "Unknown"
books.append(Book(
title=title, price=price, rating=rating, availability=availability
))
return books
def parse_price(text: str) -> float:
"""Parse '$25.99' or '£25.99' to 25.99."""
cleaned = "".join(c for c in text if c.isdigit() or c == ".")
try:
return float(cleaned)
except ValueError:
return 0.0
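Note that extract_books leaves the url field empty. You can fill it from each product's relative link. A minimal helper, assuming the hrefs are relative to the page you fetched (as they are on books.toscrape.com):
from urllib.parse import urljoin
def extract_book_url(article, page_url: str) -> str:
    """Build an absolute URL from a product's relative link (sketch)."""
    a = article.select_one("h3 a")
    href = a.get("href", "") if a else ""
    return urljoin(page_url, href)
Call it inside the loop in extract_books and pass the result as the url argument when constructing each Book.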
Saving Data to JSON
import json
from pathlib import Path
def save_to_json(data: list[dict], filepath: str) -> None:
"""Save data to a JSON file."""
Path(filepath).write_text(json.dumps(data, indent=2))
# Usage
books = extract_books(html)
save_to_json([asdict(b) for b in books], "books.json")
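To verify the file, read it back:
loaded = json.loads(Path("books.json").read_text())
print(len(loaded), "books loaded")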
Saving Data to CSV
import csv
def save_to_csv(data: list[dict], filepath: str) -> None:
"""Save data to a CSV file."""
if not data:
return
fieldnames = list(data[0].keys())
    with open(filepath, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(data)
save_to_csv([asdict(b) for b in books], "books.csv")
Extracting HTML Tables
Many websites display data in HTML tables. Here is a reusable function:
def extract_table(soup: BeautifulSoup, selector: str = "table") -> list[dict]:
"""Extract data from an HTML table into a list of dicts."""
table = soup.select_one(selector)
if table is None:
return []
    # First row = headers, remaining rows = data.
    # Use the full row list so tables wrapped in <thead>/<tbody>
    # are not double-counted (tr:first-child would match the first
    # row of each section).
    all_rows = table.select("tr")
    if not all_rows:
        return []
    headers = [cell.get_text(strip=True) for cell in all_rows[0].select("th, td")]
    rows = []
    for tr in all_rows[1:]:
        cells = [td.get_text(strip=True) for td in tr.select("td")]
        if cells and headers:
            rows.append(dict(zip(headers, cells)))
    return rows
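For example, against an inline table:
sample = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Book A</td><td>£10.00</td></tr>
  <tr><td>Book B</td><td>£12.50</td></tr>
</table>
"""
print(extract_table(BeautifulSoup(sample, "lxml")))
# [{'Name': 'Book A', 'Price': '£10.00'}, {'Name': 'Book B', 'Price': '£12.50'}]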
Rate Limiting
Always add delays between requests to avoid overloading the server:
import time
class ScraperRateLimiter:
"""Enforce a delay between requests."""
def __init__(self, delay_seconds: float = 1.0) -> None:
self.delay = delay_seconds
self._last_time = 0.0
def wait(self) -> None:
elapsed = time.monotonic() - self._last_time
if elapsed < self.delay and self._last_time > 0:
time.sleep(self.delay - elapsed)
self._last_time = time.monotonic()
Usage:
limiter = ScraperRateLimiter(delay_seconds=1.0)
for page in range(1, 6):
limiter.wait()
response = httpx.get(f"https://books.toscrape.com/catalogue/page-{page}.html")
books = extract_books(response.text)
print(f"Page {page}: {len(books)} books")
Checking robots.txt
Always check robots.txt before scraping:
def check_robots_txt(robots_text: str, path: str) -> bool:
"""Check if a path is allowed by robots.txt."""
for line in robots_text.splitlines():
line = line.strip()
if line.lower().startswith("disallow:"):
disallowed = line.split(":", 1)[1].strip()
if disallowed and path.startswith(disallowed):
return False
return True
# Usage
robots = httpx.get("https://books.toscrape.com/robots.txt").text
print(check_robots_txt(robots, "/catalogue/")) # True
For production, use Python’s built-in urllib.robotparser module for a more complete implementation.
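A minimal sketch with the standard library parser (if the site returns 404 for robots.txt, the parser allows everything):
from urllib.robotparser import RobotFileParser
rp = RobotFileParser("https://books.toscrape.com/robots.txt")
rp.read()  # fetches and parses the file
print(rp.can_fetch("MyScraper/1.0", "https://books.toscrape.com/catalogue/"))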
Complete Scraping Script
Here is the full script that scrapes multiple pages:
import httpx
from dataclasses import asdict
# Reuses Book, extract_books, parse_price, ScraperRateLimiter,
# save_to_json, and save_to_csv defined earlier.
def scrape_books(pages: int = 5) -> list[Book]:
"""Scrape books from multiple pages."""
all_books = []
limiter = ScraperRateLimiter(delay_seconds=1.0)
with httpx.Client(
timeout=10.0,
headers={"User-Agent": "PythonTutorial/1.0"},
) as client:
for page in range(1, pages + 1):
limiter.wait()
url = f"https://books.toscrape.com/catalogue/page-{page}.html"
try:
response = client.get(url)
response.raise_for_status()
except httpx.HTTPError as e:
print(f"Error on page {page}: {e}")
continue
books = extract_books(response.text)
all_books.extend(books)
print(f"Page {page}: {len(books)} books")
return all_books
# Run the scraper
books = scrape_books(pages=3)
print(f"\nTotal: {len(books)} books")
# Save results
save_to_json([asdict(b) for b in books], "books.json")
save_to_csv([asdict(b) for b in books], "books.csv")
print("Saved to books.json and books.csv")
Handling Errors
Things will go wrong. Handle them gracefully:
import httpx
def safe_fetch(url: str, client: httpx.Client) -> str | None:
"""Fetch a URL with error handling."""
try:
response = client.get(url)
response.raise_for_status()
return response.text
except httpx.TimeoutException:
print(f"Timeout: {url}")
return None
except httpx.HTTPStatusError as e:
print(f"HTTP error {e.response.status_code}: {url}")
return None
except httpx.RequestError as e:
print(f"Network error: {url} — {e}")
return None
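Used together with the earlier pieces:
with httpx.Client(timeout=10.0, headers={"User-Agent": "PythonTutorial/1.0"}) as client:
    html = safe_fetch("https://books.toscrape.com/", client)
    if html is not None:
        print(len(extract_books(html)), "books found")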
Dynamic Content (When BeautifulSoup is Not Enough)
BeautifulSoup only parses the HTML that the server sends. If a website loads content with JavaScript (like single-page apps), BeautifulSoup will not see that content.
For JavaScript-rendered pages, use one of these tools:
- Playwright — automates a real browser (recommended)
- Selenium — older alternative to Playwright
# Example with Playwright (not covered in this tutorial)
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch()
page = browser.new_page()
page.goto("https://example.com")
page.wait_for_selector(".dynamic-content")
html = page.content()
browser.close()
For most websites (including books.toscrape.com), BeautifulSoup is enough.
Common Mistakes
1. Ignoring robots.txt
# BAD — scraping without checking
httpx.get("https://example.com/private/data")
# GOOD — check first
robots = httpx.get("https://example.com/robots.txt").text
if check_robots_txt(robots, "/private/data"):
httpx.get("https://example.com/private/data")
2. No Rate Limiting
# BAD — hammering the server
for i in range(1000):
httpx.get(f"https://example.com/page/{i}")
# GOOD — add delays
limiter = ScraperRateLimiter(delay_seconds=1.0)
for i in range(1000):
limiter.wait()
httpx.get(f"https://example.com/page/{i}")
3. No Error Handling
# BAD — crashes on missing elements
title = soup.select_one("h3 a")["title"]  # TypeError if no match, KeyError if no title attribute
# GOOD — handle missing elements
title_tag = soup.select_one("h3 a")
title = title_tag.get("title", "") if title_tag else ""
Source Code
You can find all the code from this tutorial on GitHub:
GitHub: python-tutorial/tutorial-24-scraper
What’s Next?
In the next tutorial, we will build an automation script — a file organizer that watches a folder, sorts files by type, sends reports, and runs on a schedule.
Related Articles
- Python Tutorial #19: HTTP and APIs — httpx fundamentals
- Python Tutorial #12: File I/O — reading and writing JSON/CSV
- Python Tutorial #18: Async/Await — async scraping for speed