Claude knows a lot, but it does not know your company documents, your codebase, or your private data. RAG (Retrieval-Augmented Generation) fixes this. You give Claude relevant context from your own documents, and it answers questions based on that context.
This is Article 17 in the Claude AI — From Zero to Power User series. You should know the Messages API before this article.
By the end, you will build a working RAG pipeline that answers questions from your own documents.
What is RAG?
RAG stands for Retrieval-Augmented Generation. The idea is simple:
- Store your documents in a searchable database
- Retrieve the most relevant pieces when a user asks a question
- Generate an answer using Claude with those pieces as context
Without RAG, Claude can only use its training data. With RAG, Claude can answer questions about any document you provide — company policies, product manuals, research papers, or codebases.
Why RAG Instead of Putting Everything in the Prompt?
Claude has a 200K token context window (1M in beta). Why not just send all your documents in every request?
Three reasons:
- Cost — Sending 100K tokens of context on every request is expensive. With RAG, you only send the 2-5 relevant chunks (2,000-5,000 tokens).
- Accuracy — Claude performs better with focused, relevant context than with a massive dump of everything.
- Scale — RAG works with millions of documents. You cannot fit millions of documents in one prompt.
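To make the cost point concrete, here is a rough sketch comparing the context sent per request with and without retrieval. It assumes about 4 characters per token, which is only a rule of thumb for English text and not a real tokenizer. The helper `estimate_tokens` and the corpus sizes are hypothetical.

```python
def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; 4 chars/token is a rule of thumb, not a tokenizer."""
    return round(num_chars / chars_per_token)


# Hypothetical corpus: 50 documents of 20,000 characters each
full_dump_tokens = estimate_tokens(50 * 20_000)

# RAG: 5 retrieved chunks of 2,000 characters each
rag_tokens = estimate_tokens(5 * 2_000)

print(f"Full dump: ~{full_dump_tokens} tokens per request")
print(f"RAG:       ~{rag_tokens} tokens per request")
```

Under these assumptions the full dump would not even fit in a 200K context window, while the retrieved chunks stay in the low thousands of tokens.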
RAG Architecture
A RAG system has two phases:
Ingestion Phase (Once)
Document → Parse → Chunk → Embed → Store in Vector DB
Query Phase (Every Question)
Question → Embed → Search Vector DB → Get Top Chunks → Send to Claude → Answer
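Before using real tools, the two phases can be sketched with nothing but the standard library. This is a toy under stated assumptions: `toy_embed` is a stand-in that turns text into a set of words (a real embedding model produces dense vectors), the "vector DB" is a Python list, and the handbook sentences are made up.

```python
def toy_embed(text: str) -> set[str]:
    """Stand-in for an embedding model: the set of lowercase words."""
    return {word.strip(".,?!") for word in text.lower().split()}


def similarity(a: set[str], b: set[str]) -> int:
    """Stand-in for vector similarity: number of shared words."""
    return len(a & b)


# Ingestion phase (once): chunk -> embed -> store
store: list[tuple[str, set[str]]] = []
for chunk in [
    "Employees get 20 vacation days per year.",
    "The office is closed on public holidays.",
    "Expense reports are due by the fifth of each month.",
]:
    store.append((chunk, toy_embed(chunk)))


# Query phase (every question): embed -> search -> top chunks
def retrieve(question: str, top_k: int = 1) -> list[str]:
    q = toy_embed(question)
    ranked = sorted(store, key=lambda item: similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


top_chunks = retrieve("How many vacation days do employees get?")
print(top_chunks)
```

The retrieved chunks are what would be sent to Claude along with the question. The rest of the article swaps each stand-in for a real component.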
Let us build both phases step by step.
Step 1: Chunking Documents
Large documents need to be split into smaller pieces (chunks). Claude answers better when it gets focused, relevant chunks instead of entire documents.
Python
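def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks by character count."""
    # The sentence-boundary break can shorten a chunk to about half of
    # chunk_size, so a larger overlap could stop the loop from advancing.
    if chunk_size <= 0 or not 0 <= overlap <= chunk_size // 2:
        raise ValueError("overlap must be between 0 and half of chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Try to break at a sentence boundary
        if end < len(text):
            last_period = chunk.rfind(".")
            last_newline = chunk.rfind("\n")
            break_point = max(last_period, last_newline)
            if break_point > chunk_size * 0.5:
                chunk = chunk[: break_point + 1]
                end = start + break_point + 1
        if chunk.strip():
            chunks.append(chunk.strip())
        if end >= len(text):
            break  # Final chunk reached; skip the redundant overlap-only tail
        start = end - overlap
    return chunks


# Example
with open("company-handbook.txt", encoding="utf-8") as f:
    document = f.read()
chunks = chunk_text(document, chunk_size=500, overlap=100)
print(f"Created {len(chunks)} chunks from document")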
TypeScript
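function chunkText(
  text: string,
  chunkSize: number = 500,
  overlap: number = 100
): string[] {
  // The sentence-boundary break can shorten a chunk to about half of chunkSize,
  // so a larger overlap could stop the loop from advancing.
  if (chunkSize <= 0 || overlap < 0 || overlap > Math.floor(chunkSize / 2)) {
    throw new Error("overlap must be between 0 and half of chunkSize");
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = start + chunkSize;
    let chunk = text.slice(start, end);
    // Try to break at a sentence boundary
    if (end < text.length) {
      const lastPeriod = chunk.lastIndexOf(".");
      const lastNewline = chunk.lastIndexOf("\n");
      const breakPoint = Math.max(lastPeriod, lastNewline);
      if (breakPoint > chunkSize * 0.5) {
        chunk = chunk.slice(0, breakPoint + 1);
        end = start + breakPoint + 1;
      }
    }
    const trimmed = chunk.trim();
    if (trimmed) chunks.push(trimmed);
    if (end >= text.length) break; // Final chunk reached; skip the overlap-only tail
    start = end - overlap;
  }
  return chunks;
}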
Chunking Strategy
| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Small chunks | 200-500 chars | 50-100 | Precise Q&A, short documents |
| Medium chunks | 500-1000 chars | 100-200 | General purpose, most documents |
| Large chunks | 1000-2000 chars | 200-400 | Long-form analysis, research papers |
Start with 500 characters and 100 overlap. Adjust based on your results.
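To get a feel for how these settings trade off, the sketch below estimates how many chunks a document produces for a given size and overlap. It assumes fixed-size chunks that advance by `chunk_size - overlap` characters and ignores sentence-boundary breaks, which shorten some chunks and push the real count a little higher. The function name `estimate_chunk_count` is hypothetical.

```python
import math


def estimate_chunk_count(doc_chars: int, chunk_size: int, overlap: int) -> int:
    """Approximate chunk count for fixed-size chunks with overlap."""
    if doc_chars <= chunk_size:
        return 1
    step = chunk_size - overlap  # how far each new chunk advances
    return math.ceil((doc_chars - overlap) / step)


# A 10,000-character document under three settings from the table
for size, overlap in [(300, 50), (500, 100), (1500, 300)]:
    count = estimate_chunk_count(10_000, size, overlap)
    print(f"chunk_size={size}, overlap={overlap}: ~{count} chunks")
```

More chunks mean more embeddings to store and finer-grained retrieval. Fewer, larger chunks carry more surrounding context per hit.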
Step 2: Creating Embeddings
Embeddings convert text into numbers (vectors). Similar text produces similar vectors. This is how we search for relevant chunks later.
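"Similar vectors" is usually measured with cosine similarity: the closer the score is to 1, the more closely two vectors point in the same direction. A minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the variable names here are illustrative only):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embeddings of three pieces of text
vacation_policy = [0.9, 0.1, 0.0]
time_off_rules = [0.8, 0.2, 0.1]
server_config = [0.0, 0.2, 0.9]

related = cosine_similarity(vacation_policy, time_off_rules)
unrelated = cosine_similarity(vacation_policy, server_config)
print(f"related: {related:.2f}, unrelated: {unrelated:.2f}")
```

The two texts about time off score close to 1, while the unrelated pair scores close to 0.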
Python (Using OpenAI Embeddings)
from openai import OpenAI

openai_client = OpenAI()


def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Get embeddings for a list of texts."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]


# Embed all chunks
chunks = chunk_text(document)
embeddings = get_embeddings(chunks)
print(f"Created {len(embeddings)} embeddings, dimension: {len(embeddings[0])}")
TypeScript
import OpenAI from "openai";

const openai = new OpenAI();

async function getEmbeddings(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map((item) => item.embedding);
}
You can also use open-source embedding models like sentence-transformers in Python. OpenAI embeddings are used here because they are simple and widely available.
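Whichever embedding model you choose, large document sets are usually embedded in batches, because providers limit how much one request may carry. Below is a minimal sketch of a batching wrapper. The name `embed_in_batches` and the batch size of 100 are assumptions for illustration, so check your provider's limits. `fake_embed` is a stand-in so the example runs without an API key; in practice you would pass a function such as `get_embeddings`.

```python
from typing import Callable


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Call embed_fn on slices of texts and concatenate the results in order."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start : start + batch_size]))
    return vectors


# Stand-in embedder: one number per text, so the example needs no network access
def fake_embed(batch: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```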
Step 3: Storing in a Vector Database
Vector databases store embeddings and let you search by similarity. We will use ChromaDB for local development and mention pgvector for production.
Python (ChromaDB)
import chromadb

# Create a persistent client
chroma = chromadb.PersistentClient(path="./chroma_db")

# Create a collection
collection = chroma.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"},
)


def ingest_document(doc_path: str):
    """Parse, chunk, and store a document."""
    with open(doc_path) as f:
        text = f.read()
    chunks = chunk_text(text, chunk_size=500, overlap=100)
    embeddings = get_embeddings(chunks)
    # Store chunks with their embeddings
    collection.add(
        ids=[f"{doc_path}_{i}" for i in range(len(chunks))],
        embeddings=embeddings,
        documents=chunks,
        metadatas=[{"source": doc_path, "chunk_index": i} for i in range(len(chunks))],
    )
    print(f"Ingested {len(chunks)} chunks from {doc_path}")


# Ingest your documents
ingest_document("company-handbook.txt")
ingest_document("product-manual.txt")
TypeScript (ChromaDB)
import { ChromaClient } from "chromadb";
import { readFileSync } from "fs";

const chroma = new ChromaClient();

async function ingestDocument(docPath: string): Promise<void> {
  const text = readFileSync(docPath, "utf-8");
  const chunks = chunkText(text, 500, 100);
  const embeddings = await getEmbeddings(chunks);
  const collection = await chroma.getOrCreateCollection({
    name: "documents",
    metadata: { "hnsw:space": "cosine" },
  });
  await collection.add({
    ids: chunks.map((_, i) => `${docPath}_${i}`),
    embeddings,
    documents: chunks,
    metadatas: chunks.map((_, i) => ({ source: docPath, chunk_index: i })),
  });
  console.log(`Ingested ${chunks.length} chunks from ${docPath}`);
}
pgvector for Production
For production, pgvector is a better choice. It runs inside PostgreSQL, so you get a real database with ACID transactions, backups, and SQL queries.
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the table (1536 matches the dimension of text-embedding-3-small)
CREATE TABLE document_chunks (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    source TEXT NOT NULL,
    chunk_index INTEGER,
    embedding vector(1536)
);

-- Create an index for fast similarity search
-- (build it after loading data: ivfflat derives its clusters from existing rows)
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops);

-- Search for similar chunks ($1 is the query embedding; <=> is cosine distance)
SELECT content, source, 1 - (embedding <=> $1) AS similarity
FROM document_chunks
ORDER BY embedding <=> $1
LIMIT 5;
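To run that search from Python, the query embedding has to reach PostgreSQL in a form pgvector understands. One option is pgvector's text form, a bracketed comma-separated list. The sketch below only builds the literal and the SQL string. The helper name `to_pgvector_literal` is hypothetical, and the commented-out `cur.execute` call assumes a psycopg cursor that is not shown here. The pgvector Python package also provides driver adapters if you would rather pass lists directly.

```python
def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a list of floats as pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(str(x) for x in embedding) + "]"


# Same query as above, with driver-style placeholders and an explicit cast
SEARCH_SQL = """
SELECT content, source, 1 - (embedding <=> %s::vector) AS similarity
FROM document_chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""

literal = to_pgvector_literal([0.1, 0.2, 0.3])
print(literal)  # [0.1,0.2,0.3]

# Assumed usage with an open psycopg cursor named cur:
# cur.execute(SEARCH_SQL, (literal, literal, 5))
# rows = cur.fetchall()
```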
Step 4: Querying with Claude
Now we combine retrieval and generation. Search for relevant chunks, then send them to Claude with the question.
Python
import anthropic

claude = anthropic.Anthropic()


def ask_question(question: str, top_k: int = 5) -> str:
    """Ask a question and get an answer based on your documents."""
    # Step 1: Get the question embedding
    question_embedding = get_embeddings([question])[0]

    # Step 2: Search for relevant chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=top_k,
    )
    chunks = results["documents"][0]
    sources = results["metadatas"][0]

    # Step 3: Build the context
    context = "\n\n---\n\n".join(
        [f"[Source: {s['source']}, Chunk {s['chunk_index']}]\n{chunk}"
         for chunk, s in zip(chunks, sources)]
    )

    # Step 4: Ask Claude
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="""You are a helpful assistant that answers questions based on the provided context.
<rules>
- Only answer based on the provided context
- If the context does not contain the answer, say "I could not find this information in the provided documents"
- Cite your sources by mentioning the document name
- Be concise and direct
</rules>""",
        messages=[
            {
                "role": "user",
                "content": f"""<context>
{context}
</context>
<question>
{question}
</question>
Answer this question based on the context above.""",
            }
        ],
    )
    return response.content[0].text


# Ask questions
answer = ask_question("What is the company vacation policy?")
print(answer)
TypeScript
import Anthropic from "@anthropic-ai/sdk";

const claude = new Anthropic();

async function askQuestion(
  question: string,
  topK: number = 5
): Promise<string> {
  // Step 1: Get the question embedding
  const questionEmbedding = (await getEmbeddings([question]))[0];

  // Step 2: Search for relevant chunks
  const collection = await chroma.getCollection({ name: "documents" });
  const results = await collection.query({
    queryEmbeddings: [questionEmbedding],
    nResults: topK,
  });
  const chunks = results.documents[0];
  const sources = results.metadatas[0];

  // Step 3: Build context
  const context = chunks
    .map(
      (chunk, i) =>
        `[Source: ${sources[i]?.source}, Chunk ${sources[i]?.chunk_index}]\n${chunk}`
    )
    .join("\n\n---\n\n");

  // Step 4: Ask Claude
  const response = await claude.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system: `You are a helpful assistant that answers questions based on the provided context.
<rules>
- Only answer based on the provided context
- If the context does not contain the answer, say "I could not find this information in the provided documents"
- Cite your sources by mentioning the document name
- Be concise and direct
</rules>`,
    messages: [
      {
        role: "user",
        content: `<context>\n${context}\n</context>\n\n<question>\n${question}\n</question>\n\nAnswer this question based on the context above.`,
      },
    ],
  });
  if (response.content[0].type === "text") {
    return response.content[0].text;
  }
  return "";
}
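One thing to keep in mind: a vector search always returns `top_k` results, even when nothing in the collection is relevant. ChromaDB query results include a `distances` list alongside the documents, so one option is to drop weak matches before building the context. The sketch below runs on a hand-written dictionary shaped like a query result. The helper name `filter_by_distance` and the 0.6 cutoff are assumptions that you would tune on your own data.

```python
def filter_by_distance(results: dict, max_distance: float = 0.6) -> list[tuple[str, dict]]:
    """Keep (chunk, metadata) pairs whose distance is at most max_distance."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    dists = results["distances"][0]
    return [
        (doc, meta)
        for doc, meta, dist in zip(docs, metas, dists)
        if dist <= max_distance
    ]


# Hand-written stand-in shaped like a ChromaDB query result for one question
sample_results = {
    "documents": [["Employees get 20 vacation days.", "The cafeteria opens at 8am."]],
    "metadatas": [[
        {"source": "handbook.txt", "chunk_index": 3},
        {"source": "handbook.txt", "chunk_index": 41},
    ]],
    "distances": [[0.21, 0.83]],  # cosine distance: lower means more similar
}

kept = filter_by_distance(sample_results)
print(f"Kept {len(kept)} of {len(sample_results['documents'][0])} chunks")
```

If nothing survives the filter, you can skip the Claude call and return the "not found" message directly.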
Contextual Retrieval
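Anthropic published a technique called Contextual Retrieval. In Anthropic's tests, contextual embeddings alone reduced failed retrievals by 35%, and combining them with contextual BM25 keyword search reduced failed retrievals by 49%. The idea is simple: before embedding each chunk, use Claude to add context about where the chunk fits in the document.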
Python
def add_context_to_chunks(document: str, chunks: list[str]) -> list[str]:
    """Add document context to each chunk for better retrieval."""
    contextualized = []
    for chunk in chunks:
        # Note: only the first 10,000 characters of the document are sent, so
        # chunks from later in a long document are described relative to its opening
        response = claude.messages.create(
            model="claude-haiku-4-5",  # Use Haiku for cost efficiency
            max_tokens=200,
            messages=[
                {
                    "role": "user",
                    "content": f"""<document>
{document[:10000]}
</document>
<chunk>
{chunk}
</chunk>
Write a short (1-2 sentence) context that explains where this chunk fits in the document. Start with "This chunk...".""",
                }
            ],
        )
        context = response.content[0].text
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized


# Use contextual chunks for better embeddings
chunks = chunk_text(document)
contextual_chunks = add_context_to_chunks(document, chunks)
embeddings = get_embeddings(contextual_chunks)
This costs more during ingestion (one Haiku call per chunk), but significantly improves retrieval accuracy.
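Because the same document text is sent with every chunk, one way to reduce that ingestion cost is to mark the document block as cacheable, so repeated calls can read it from the prompt cache. The sketch below only builds the `messages` payload and makes no API call. The helper name `build_context_request` is hypothetical, and documents shorter than the model's minimum cacheable length will not be cached.

```python
def build_context_request(document: str, chunk: str) -> list[dict]:
    """Build a messages payload with the document block marked as cacheable."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<document>\n{document}\n</document>",
                    # Identical for every chunk of this document, so it can be cached
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": (
                        f"<chunk>\n{chunk}\n</chunk>\n"
                        "Write a short (1-2 sentence) context that explains where "
                        'this chunk fits in the document. Start with "This chunk...".'
                    ),
                },
            ],
        }
    ]


messages = build_context_request(
    "Full handbook text goes here.", "Employees get 20 vacation days."
)
# Assumed usage: claude.messages.create(model=..., max_tokens=200, messages=messages)
print(messages[0]["content"][0]["cache_control"])
```

Only the chunk block changes between calls, so the document prefix stays identical and remains eligible for cache reads.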
Hybrid Search
Combine embedding search with keyword search (BM25) for better results. Embedding search finds semantically similar content. Keyword search finds exact matches.
from rank_bm25 import BM25Okapi


def hybrid_search(question: str, top_k: int = 5) -> list[str]:
    """Combine vector search with BM25 keyword search."""
    # Get all chunks and their IDs
    # (this reloads the collection and rebuilds the BM25 index on every call;
    # fine for a demo, but build the index once for larger collections)
    all_data = collection.get()
    all_ids = all_data["ids"]
    all_chunks = all_data["documents"]
    if not all_chunks:
        return []  # Empty collection: nothing to rank
    id_to_index = {doc_id: i for i, doc_id in enumerate(all_ids)}

    # Vector search
    question_embedding = get_embeddings([question])[0]
    vector_results = collection.query(
        query_embeddings=[question_embedding],
        n_results=top_k * 2,
    )

    # BM25 keyword search
    tokenized = [chunk.lower().split() for chunk in all_chunks]
    bm25 = BM25Okapi(tokenized)
    bm25_scores = bm25.get_scores(question.lower().split())
    max_bm25 = max(bm25_scores) if max(bm25_scores) > 0 else 1

    # Combine scores (normalize and weight)
    # 0.7 weight for vector (rank-based score), 0.3 for BM25
    combined = {}
    for i, doc_id in enumerate(vector_results["ids"][0]):
        combined[doc_id] = 0.7 * (1 - i / len(vector_results["ids"][0]))
    for i, doc_id in enumerate(all_ids):
        normalized_bm25 = bm25_scores[i] / max_bm25
        combined[doc_id] = combined.get(doc_id, 0) + 0.3 * normalized_bm25

    # Return top-k by combined score
    sorted_ids = sorted(combined, key=combined.get, reverse=True)[:top_k]
    return [all_chunks[id_to_index[doc_id]] for doc_id in sorted_ids]
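The weighted sum above is one way to merge the two result lists. A common alternative, shown here as a swapped-in technique and not as the method used above, is reciprocal rank fusion (RRF). It ignores raw scores and uses only each document's rank in each list, which avoids having to normalize scores that live on different scales. The constant `k = 60` is a commonly used default, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up rankings from the two retrievers, best match first
vector_ranking = ["doc_2", "doc_7", "doc_1"]
bm25_ranking = ["doc_7", "doc_4", "doc_2"]

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)  # ['doc_7', 'doc_2', 'doc_4', 'doc_1']
```

Documents that appear near the top of both lists rise to the top of the fused ranking.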
Prompt Caching for RAG
If your users ask many questions about the same documents, use prompt caching. Cache the system prompt and document context, and only vary the question:
# First request — writes to cache
response = claude.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a document Q&A assistant. Answer only from the provided context.",
        },
        {
            "type": "text",
            "text": f"<documents>\n{all_document_context}\n</documents>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What is the vacation policy?"}],
)

# Subsequent requests — reads from cache (90% cheaper for cached tokens)
response = claude.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a document Q&A assistant. Answer only from the provided context.",
        },
        {
            "type": "text",
            "text": f"<documents>\n{all_document_context}\n</documents>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "How many sick days do employees get?"}],
)
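Cache reads are billed at 10% of the base input price, so the cached tokens cost 90% less on subsequent requests. The first request pays a 25% premium to write the cache, and by default an entry expires after 5 minutes without a hit, so caching pays off when questions about the same context arrive close together. This is very effective for RAG where the document context stays the same.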
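To confirm that caching is actually taking effect, you can inspect the `usage` object on each response, which reports `cache_creation_input_tokens` and `cache_read_input_tokens`. The sketch below uses stand-in objects with made-up numbers in place of real responses. The helper name `describe_cache_usage` is hypothetical.

```python
from types import SimpleNamespace


def describe_cache_usage(usage) -> str:
    """Summarize whether a response wrote to or read from the prompt cache."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read:
        return f"cache hit: {read} tokens read"
    if written:
        return f"cache write: {written} tokens stored"
    return "no caching applied"


# Stand-ins for response.usage from two consecutive requests (numbers are made up)
first = SimpleNamespace(
    input_tokens=12, cache_creation_input_tokens=8000, cache_read_input_tokens=0
)
second = SimpleNamespace(
    input_tokens=14, cache_creation_input_tokens=0, cache_read_input_tokens=8000
)

print(describe_cache_usage(first))   # cache write: 8000 tokens stored
print(describe_cache_usage(second))  # cache hit: 8000 tokens read
```

With a real response you would call `describe_cache_usage(response.usage)`. If every request reports a cache write and never a hit, the cached prefix is probably changing between requests.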
Cost Breakdown
| Operation | Model | Cost |
|---|---|---|
| Embed 100 chunks | text-embedding-3-small | ~$0.002 |
| Contextual retrieval (100 chunks) | Haiku 4.5 | ~$0.10 |
| Single query (5 chunks context) | Sonnet 4.6 | ~$0.02 |
| Single query (cached context) | Sonnet 4.6 | ~$0.005 |
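For a typical document Q&A system with 100 pages and 50 queries per day, the per-query figures above work out to about $1/day without caching (50 × $0.02) and about $0.25/day if every query hits the cache (50 × $0.005). Ingestion is a separate, one-time cost.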
Summary
| Concept | Details |
|---|---|
| RAG | Retrieve relevant context, then generate answers |
| Chunking | Split documents into 500-char pieces with overlap |
| Embeddings | Convert text to vectors for similarity search |
| Vector DB | ChromaDB (local) or pgvector (production) |
| Contextual retrieval | Add document context to chunks before embedding |
| Hybrid search | Combine vector + keyword search |
| Prompt caching | Cache document context so cached input tokens cost 90% less |
What’s Next?
In the next article, we will cover prompt testing and evaluation — how to systematically improve your prompts with metrics and automated tests.
Next: Fine-Tuning Prompts — Evaluation and Testing