Claude knows a lot, but it does not know your company documents, your codebase, or your private data. RAG (Retrieval-Augmented Generation) fixes this. You give Claude relevant context from your own documents, and it answers questions based on that context.

This is Article 17 in the Claude AI — From Zero to Power User series. You should know the Messages API before this article.

By the end, you will build a working RAG pipeline that answers questions from your own documents.


What is RAG?

RAG stands for Retrieval-Augmented Generation. The idea is simple:

  1. Store your documents in a searchable database
  2. Retrieve the most relevant pieces when a user asks a question
  3. Generate an answer using Claude with those pieces as context

Without RAG, Claude can only use its training data. With RAG, Claude can answer questions about any document you provide — company policies, product manuals, research papers, or codebases.

Why RAG Instead of Putting Everything in the Prompt?

Claude has a 200K token context window (1M in beta). Why not just send all your documents in every request?

Three reasons:

  1. Cost: Sending 100K tokens of context on every request is expensive. With RAG, you send only the 2-5 most relevant chunks, typically a few hundred to a few thousand tokens depending on chunk size.
  2. Accuracy: Claude performs better with focused, relevant context than with a massive dump of everything.
  3. Scale: RAG works with millions of documents. You cannot fit millions of documents in one prompt.
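
To make the cost argument concrete, here is a rough back-of-the-envelope sketch. The price per million input tokens is a placeholder assumption, not a quoted rate, so substitute the current price for your model.

```python
# Hypothetical input price in USD per million tokens (an assumption for illustration).
PRICE_PER_MILLION_INPUT_TOKENS = 3.00

def input_cost(tokens_per_request: int, requests: int) -> float:
    """Return the total input-token cost in USD for a number of requests."""
    return tokens_per_request * requests * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000

# Compare sending everything with sending only retrieved chunks, over 1,000 requests.
full_dump = input_cost(tokens_per_request=100_000, requests=1_000)
rag = input_cost(tokens_per_request=5_000, requests=1_000)

print(f"Full context: ${full_dump:.2f}")
print(f"RAG context:  ${rag:.2f}")
print(f"Ratio: {full_dump / rag:.0f}x")
```

Whatever the actual price, the ratio depends only on the token counts, so the saving scales with how much smaller the retrieved context is.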

RAG Architecture

A RAG system has two phases:

Ingestion Phase (Once)

Document → Parse → Chunk → Embed → Store in Vector DB

Query Phase (Every Question)

Question → Embed → Search Vector DB → Get Top Chunks → Send to Claude → Answer
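
The two phases can be sketched end to end without any external services. The version below is a toy: it uses word-count vectors in place of a real embedding model and a Python list in place of a vector database. The names `embed`, `ingest`, and `retrieve` and the sample sentences are made up for this illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Ingestion phase (once): chunk -> embed -> store
store: list[tuple[str, Counter]] = []

def ingest(chunks: list[str]) -> None:
    for chunk in chunks:
        store.append((chunk, embed(chunk)))

# Query phase (every question): embed -> search -> top chunks
def retrieve(question: str, top_k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

ingest([
    "Employees get 20 vacation days per year.",
    "The office is closed on public holidays.",
    "Expense reports are due by the fifth of each month.",
])
top = retrieve("How many vacation days do employees get?", top_k=1)
print(top)
```

In a real system the retrieved chunks would then be placed in the prompt sent to Claude, which is the generation step.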

Let us build both phases step by step.


Step 1: Chunking Documents

Large documents need to be split into smaller pieces (chunks). Claude answers better when it gets focused, relevant chunks instead of entire documents.

Python

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks by character count."""
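    # A chunk may be cut back to just over half of chunk_size at a sentence
    # boundary, so a larger overlap could stop the loop from advancing.
    if overlap * 2 > chunk_size:
        raise ValueError("overlap must be at most half of chunk_size")
    chunks = []
    start = 0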

    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]

        # Try to break at a sentence boundary
        if end < len(text):
            last_period = chunk.rfind(".")
            last_newline = chunk.rfind("\n")
            break_point = max(last_period, last_newline)
            if break_point > chunk_size * 0.5:
                chunk = chunk[: break_point + 1]
                end = start + break_point + 1

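        chunk = chunk.strip()
        if chunk:  # skip chunks that are only whitespace
            chunks.append(chunk)

        # Stop once the end of the text is reached; otherwise the overlap
        # would produce an extra chunk that only repeats the tail
        if end >= len(text):
            break
        start = end - overlap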

    return chunks

# Example
with open("company-handbook.txt") as f:
    document = f.read()
chunks = chunk_text(document, chunk_size=500, overlap=100)
print(f"Created {len(chunks)} chunks from document")

TypeScript

function chunkText(
  text: string,
  chunkSize: number = 500,
  overlap: number = 100
): string[] {
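  // A chunk may be cut back to just over half of chunkSize at a sentence
  // boundary, so a larger overlap could stop the loop from advancing.
  if (overlap * 2 > chunkSize) {
    throw new Error("overlap must be at most half of chunkSize");
  }
  const chunks: string[] = [];
  let start = 0;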

  while (start < text.length) {
    let end = start + chunkSize;
    let chunk = text.slice(start, end);

    // Try to break at a sentence boundary
    if (end < text.length) {
      const lastPeriod = chunk.lastIndexOf(".");
      const lastNewline = chunk.lastIndexOf("\n");
      const breakPoint = Math.max(lastPeriod, lastNewline);
      if (breakPoint > chunkSize * 0.5) {
        chunk = chunk.slice(0, breakPoint + 1);
        end = start + breakPoint + 1;
      }
    }

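    const trimmed = chunk.trim();
    if (trimmed) chunks.push(trimmed); // skip whitespace-only chunks

    // Stop at the end of the text so the overlap does not add a duplicate tail chunk
    if (end >= text.length) break;
    start = end - overlap;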
  }

  return chunks;
}

Chunking Strategy

| Strategy | Chunk Size | Overlap (chars) | Best For |
| --- | --- | --- | --- |
| Small chunks | 200-500 chars | 50-100 | Precise Q&A, short documents |
| Medium chunks | 500-1000 chars | 100-200 | General purpose, most documents |
| Large chunks | 1000-2000 chars | 200-400 | Long-form analysis, research papers |

Start with 500 characters and 100 overlap. Adjust based on your results.
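
As a rough guide to what a given setting produces: each new chunk advances by about `chunk_size - overlap` characters, so the chunk count can be estimated before ingesting anything. The sketch below assumes fixed-size windows and ignores the sentence-boundary adjustment, so real counts will be somewhat higher. The function name `estimate_chunk_count` is made up for this example.

```python
import math

def estimate_chunk_count(text_length: int, chunk_size: int = 500, overlap: int = 100) -> int:
    """Estimate how many fixed-size overlapping chunks cover a text."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if text_length <= chunk_size:
        return 1 if text_length > 0 else 0
    step = chunk_size - overlap  # how far each new chunk advances
    return 1 + math.ceil((text_length - chunk_size) / step)

# Compare three settings for a 50,000-character document.
for size, overlap in [(300, 75), (500, 100), (1500, 300)]:
    print(size, overlap, estimate_chunk_count(50_000, size, overlap))
```

More chunks mean more embedding calls and a larger index, so the estimate is a quick way to see the ingestion cost of a smaller chunk size.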


Step 2: Creating Embeddings

Embeddings convert text into numbers (vectors). Similar text produces similar vectors. This is how we search for relevant chunks later.
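
"Similar" is usually measured with cosine similarity, the cosine of the angle between two vectors. Here is a minimal sketch with hand-made three-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and these numbers are invented for illustration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors standing in for the embeddings of three pieces of text.
vacation_policy = [0.9, 0.1, 0.0]
holiday_rules = [0.8, 0.2, 0.1]
server_setup = [0.0, 0.1, 0.9]

related = cosine_similarity(vacation_policy, holiday_rules)
unrelated = cosine_similarity(vacation_policy, server_setup)
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

The two HR-style vectors point in nearly the same direction and score close to 1, while the unrelated one scores close to 0.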

Python (Using OpenAI Embeddings)

from openai import OpenAI

openai_client = OpenAI()

def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Get embeddings for a list of texts."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

# Embed all chunks
chunks = chunk_text(document)
embeddings = get_embeddings(chunks)
print(f"Created {len(embeddings)} embeddings, dimension: {len(embeddings[0])}")

TypeScript

import OpenAI from "openai";

const openai = new OpenAI();

async function getEmbeddings(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map((item) => item.embedding);
}

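You can also use open-source embedding models, for example through the sentence-transformers library in Python. OpenAI embeddings are used here because they are simple and widely available. Whichever model you choose, use the same one for ingestion and for queries, since vectors from different models are not comparable.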


Step 3: Storing in a Vector Database

Vector databases store embeddings and let you search by similarity. We will use ChromaDB for local development and mention pgvector for production.

Python (ChromaDB)

import chromadb

# Create a persistent client
chroma = chromadb.PersistentClient(path="./chroma_db")

# Create a collection
collection = chroma.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"},
)

def ingest_document(doc_path: str):
    """Parse, chunk, and store a document."""
    with open(doc_path) as f:
        text = f.read()

    chunks = chunk_text(text, chunk_size=500, overlap=100)
    embeddings = get_embeddings(chunks)

    # Store chunks with their embeddings
    collection.add(
        ids=[f"{doc_path}_{i}" for i in range(len(chunks))],
        embeddings=embeddings,
        documents=chunks,
        metadatas=[{"source": doc_path, "chunk_index": i} for i in range(len(chunks))],
    )
    print(f"Ingested {len(chunks)} chunks from {doc_path}")

# Ingest your documents
ingest_document("company-handbook.txt")
ingest_document("product-manual.txt")
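
For large documents, a single embedding request may exceed the provider's per-request input limit, so it can help to embed in batches. This is a sketch: `embed_in_batches` and the batch size of 100 are assumptions for illustration, and `embed_fn` stands in for a function like `get_embeddings` above. Check your provider's documented limits.

```python
from typing import Callable

def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,  # assumed value; tune to your provider's limits
) -> list[list[float]]:
    """Embed texts in fixed-size batches, preserving input order."""
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in embedding function so the example runs offline:
# it maps each text to a one-dimensional vector holding its length.
def fake_embed(batch: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)
```

Inside `ingest_document`, the call to `get_embeddings(chunks)` could be swapped for `embed_in_batches(chunks, get_embeddings)`.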

TypeScript (ChromaDB)

import { ChromaClient } from "chromadb";
import { readFileSync } from "fs";

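// The JavaScript client talks to a running Chroma server
// (by default at http://localhost:8000), so start one first.
const chroma = new ChromaClient();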

async function ingestDocument(docPath: string): Promise<void> {
  const text = readFileSync(docPath, "utf-8");
  const chunks = chunkText(text, 500, 100);
  const embeddings = await getEmbeddings(chunks);

  const collection = await chroma.getOrCreateCollection({
    name: "documents",
    metadata: { "hnsw:space": "cosine" },
  });

  await collection.add({
    ids: chunks.map((_, i) => `${docPath}_${i}`),
    embeddings,
    documents: chunks,
    metadatas: chunks.map((_, i) => ({ source: docPath, chunk_index: i })),
  });

  console.log(`Ingested ${chunks.length} chunks from ${docPath}`);
}

pgvector for Production

For production, pgvector is a better choice. It runs inside PostgreSQL, so you get a real database with ACID transactions, backups, and SQL queries.

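-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the table
CREATE TABLE document_chunks (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    source TEXT NOT NULL,
    chunk_index INTEGER,
    embedding vector(1536)  -- 1536 matches text-embedding-3-small's output size
);

-- Create an index for fast similarity search.
-- IVFFlat derives its clusters from existing rows, so build it after loading data.
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);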

-- Search for similar chunks
SELECT content, source, 1 - (embedding <=> $1) AS similarity
FROM document_chunks
ORDER BY embedding <=> $1
LIMIT 5;
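
From application code, the `$1` parameter is the question embedding. pgvector accepts vectors in a bracketed text form such as `[0.1,0.2,0.3]`, and `<=>` returns cosine distance, which is why the query computes similarity as `1 - distance`. Below is a small sketch of both ideas. The helper names are made up for this example, and in practice a driver adapter (for example the `pgvector` Python package) can handle the conversion for you.

```python
import math

def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a Python list as pgvector's text representation, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in embedding) + "]"

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, the quantity pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

# Two-dimensional stand-ins for a question embedding and a stored chunk embedding.
query = [1.0, 0.0]
stored = [0.6, 0.8]

print(to_pgvector_literal(stored))
distance = cosine_distance(query, stored)
print(f"distance={distance:.2f}, similarity={1 - distance:.2f}")
```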

Step 4: Querying with Claude

Now we combine retrieval and generation. Search for relevant chunks, then send them to Claude with the question.

Python

import anthropic

claude = anthropic.Anthropic()

def ask_question(question: str, top_k: int = 5) -> str:
    """Ask a question and get an answer based on your documents."""
    # Step 1: Get the question embedding
    question_embedding = get_embeddings([question])[0]

    # Step 2: Search for relevant chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=top_k,
    )

    chunks = results["documents"][0]
    sources = results["metadatas"][0]

    # Step 3: Build the context
    context = "\n\n---\n\n".join(
        [f"[Source: {s['source']}, Chunk {s['chunk_index']}]\n{chunk}"
         for chunk, s in zip(chunks, sources)]
    )

    # Step 4: Ask Claude
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="""You are a helpful assistant that answers questions based on the provided context.

<rules>
- Only answer based on the provided context
- If the context does not contain the answer, say "I could not find this information in the provided documents"
- Cite your sources by mentioning the document name
- Be concise and direct
</rules>""",
        messages=[
            {
                "role": "user",
                "content": f"""<context>
{context}
</context>

<question>
{question}
</question>

Answer this question based on the context above.""",
            }
        ],
    )

    return response.content[0].text

# Ask questions
answer = ask_question("What is the company vacation policy?")
print(answer)
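
Top-k search always returns k chunks, even when none of them are relevant. One optional refinement is to drop weak matches before building the prompt. The sketch below assumes the query result has the dictionary shape used above, plus the `distances` list that ChromaDB returns alongside documents. The function name and the `max_distance` threshold of 0.5 are arbitrary choices for illustration, not recommended values.

```python
def filter_by_distance(results: dict, max_distance: float = 0.5) -> list[tuple[str, dict]]:
    """Keep only (chunk, metadata) pairs whose distance is within the threshold."""
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    return [
        (doc, meta)
        for doc, meta, dist in zip(documents, metadatas, distances)
        if dist <= max_distance
    ]

# Hand-written result in the same shape, so the example runs without a database.
sample_results = {
    "documents": [["Vacation: 20 days per year.", "Parking rules for visitors."]],
    "metadatas": [[
        {"source": "handbook.txt", "chunk_index": 3},
        {"source": "handbook.txt", "chunk_index": 41},
    ]],
    "distances": [[0.18, 0.83]],
}

kept = filter_by_distance(sample_results)
print(kept)
```

If `kept` comes back empty, you can skip the Claude call and return the "could not find this information" message directly.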

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const claude = new Anthropic();

async function askQuestion(
  question: string,
  topK: number = 5
): Promise<string> {
  // Step 1: Get the question embedding
  const questionEmbedding = (await getEmbeddings([question]))[0];

  // Step 2: Search for relevant chunks
  const collection = await chroma.getCollection({ name: "documents" });
  const results = await collection.query({
    queryEmbeddings: [questionEmbedding],
    nResults: topK,
  });

  const chunks = results.documents[0];
  const sources = results.metadatas[0];

  // Step 3: Build context
  const context = chunks
    .map(
      (chunk, i) =>
        `[Source: ${sources[i]?.source}, Chunk ${sources[i]?.chunk_index}]\n${chunk}`
    )
    .join("\n\n---\n\n");

  // Step 4: Ask Claude
  const response = await claude.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system: `You are a helpful assistant that answers questions based on the provided context.

<rules>
- Only answer based on the provided context
- If the context does not contain the answer, say "I could not find this information in the provided documents"
- Cite your sources by mentioning the document name
- Be concise and direct
</rules>`,
    messages: [
      {
        role: "user",
        content: `<context>\n${context}\n</context>\n\n<question>\n${question}\n</question>\n\nAnswer this question based on the context above.`,
      },
    ],
  });

  if (response.content[0].type === "text") {
    return response.content[0].text;
  }
  return "";
}

Contextual Retrieval

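Anthropic published a technique called Contextual Retrieval. In Anthropic's reported tests, it reduced failed retrievals by 35% when applied to embeddings alone and by 49% when combined with contextual BM25. The idea is simple: before embedding each chunk, use Claude to add context about where the chunk fits in the document.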

Python

def add_context_to_chunks(document: str, chunks: list[str]) -> list[str]:
    """Add document context to each chunk for better retrieval."""
    contextualized = []

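    for chunk in chunks:
        # Note: only the first 10,000 characters of the document are sent as context,
        # so chunks from later in a long document are described relative to its opening.
        response = claude.messages.create(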
            model="claude-haiku-4-5",  # Use Haiku for cost efficiency
            max_tokens=200,
            messages=[
                {
                    "role": "user",
                    "content": f"""<document>
{document[:10000]}
</document>

<chunk>
{chunk}
</chunk>

Write a short (1-2 sentence) context that explains where this chunk fits in the document. Start with "This chunk...".""",
                }
            ],
        )

        context = response.content[0].text
        contextualized.append(f"{context}\n\n{chunk}")

    return contextualized

# Use contextual chunks for better embeddings
chunks = chunk_text(document)
contextual_chunks = add_context_to_chunks(document, chunks)
embeddings = get_embeddings(contextual_chunks)

This costs more during ingestion (one Haiku call per chunk), but significantly improves retrieval accuracy.


Hybrid Search

Combine embedding search with keyword search (BM25) for better results. Embedding search finds semantically similar content, while keyword search finds exact matches such as product names, error codes, or IDs.

Python

from rank_bm25 import BM25Okapi

def hybrid_search(question: str, top_k: int = 5) -> list[str]:
    """Combine vector search with BM25 keyword search."""
    # Get all chunks and their IDs
    all_data = collection.get()
    all_ids = all_data["ids"]
    all_chunks = all_data["documents"]
    id_to_index = {doc_id: i for i, doc_id in enumerate(all_ids)}

    # Vector search
    question_embedding = get_embeddings([question])[0]
    vector_results = collection.query(
        query_embeddings=[question_embedding],
        n_results=top_k * 2,
    )

    # BM25 keyword search
    tokenized = [chunk.lower().split() for chunk in all_chunks]
    bm25 = BM25Okapi(tokenized)
    bm25_scores = bm25.get_scores(question.lower().split())
    max_bm25 = max(bm25_scores) if max(bm25_scores) > 0 else 1

    # Combine scores (normalize and weight)
    # 0.7 weight for vector, 0.3 for BM25
    combined = {}
    for i, doc_id in enumerate(vector_results["ids"][0]):
        combined[doc_id] = 0.7 * (1 - i / len(vector_results["ids"][0]))

    for i, doc_id in enumerate(all_ids):
        normalized_bm25 = bm25_scores[i] / max_bm25
        combined[doc_id] = combined.get(doc_id, 0) + 0.3 * normalized_bm25

    # Return top-k by combined score
    sorted_ids = sorted(combined, key=combined.get, reverse=True)[:top_k]
    return [all_chunks[id_to_index[doc_id]] for doc_id in sorted_ids]
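
The 0.7/0.3 weighting above mixes a rank-based score with a normalized BM25 score. A commonly used alternative is Reciprocal Rank Fusion (RRF), which works purely from ranks and so avoids score normalization. This is a different technique from the weighted sum above, sketched here with made-up chunk IDs. The constant `k = 60` is the value commonly used with RRF and is an assumption here.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[str]:
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical rankings, best match first, from the two search methods.
vector_ranking = ["doc_2", "doc_7", "doc_1"]
bm25_ranking = ["doc_7", "doc_4", "doc_2"]

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking], top_k=3)
print(fused)
```

IDs that appear near the top of both lists rise to the front, while an ID found by only one method still gets a chance to be included.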

Prompt Caching for RAG

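If your users ask many questions about the same documents, use prompt caching. Cache the system prompt and document context, and vary only the question. Caching matches on an exact prompt prefix, so it helps when the same context (here `all_document_context`, a string holding the shared document text) is sent on every request, not when each question retrieves different chunks: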

# First request — writes to cache
response = claude.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a document Q&A assistant. Answer only from the provided context.",
        },
        {
            "type": "text",
            "text": f"<documents>\n{all_document_context}\n</documents>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What is the vacation policy?"}],
)

# Subsequent requests — reads from cache (90% cheaper for cached tokens)
response = claude.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a document Q&A assistant. Answer only from the provided context.",
        },
        {
            "type": "text",
            "text": f"<documents>\n{all_document_context}\n</documents>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "How many sick days do employees get?"}],
)

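Cached tokens cost 90% less to read on subsequent requests, while the initial cache write costs 25% more than regular input tokens. By default a cache entry expires after 5 minutes without use, and prompts below a model-specific minimum length are not cached. This is very effective for RAG where the document context stays the same.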
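
To see why this pays off, compare the input cost with and without cache hits. The API reports cached tokens in the response's `usage` object (`cache_creation_input_tokens` and `cache_read_input_tokens`). The sketch below works on plain numbers rather than a live response. It assumes cache writes are billed at 1.25x and cache reads at 0.1x the base input price, and it uses a placeholder base price, so check current pricing before relying on it.

```python
BASE_PRICE = 3.00 / 1_000_000   # hypothetical USD per input token
CACHE_WRITE_MULTIPLIER = 1.25   # assumed: cache writes cost 25% more than base
CACHE_READ_MULTIPLIER = 0.10    # assumed: cache reads cost 10% of base

def input_cost(input_tokens: int, cache_write_tokens: int = 0, cache_read_tokens: int = 0) -> float:
    """Input-side cost in USD for one request, split by how tokens were billed."""
    return BASE_PRICE * (
        input_tokens
        + cache_write_tokens * CACHE_WRITE_MULTIPLIER
        + cache_read_tokens * CACHE_READ_MULTIPLIER
    )

# Illustrative sizes: a shared 20,000-token context and ten short questions.
context_tokens, question_tokens, num_questions = 20_000, 50, 10

uncached = num_questions * input_cost(context_tokens + question_tokens)
cached = (
    # First request writes the context to the cache
    input_cost(question_tokens, cache_write_tokens=context_tokens)
    # Remaining requests read the context from the cache
    + (num_questions - 1) * input_cost(question_tokens, cache_read_tokens=context_tokens)
)
print(f"Without caching: ${uncached:.4f}")
print(f"With caching:    ${cached:.4f}")
```

The first request is slightly more expensive than an uncached one, so caching only pays off when the same context is reused before the cache entry expires.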


Cost Breakdown

| Operation | Model | Cost |
| --- | --- | --- |
| Embed 100 chunks | text-embedding-3-small | ~$0.002 |
| Contextual retrieval (100 chunks) | Haiku 4.5 | ~$0.10 |
| Single query (5 chunks context) | Sonnet 4.6 | ~$0.02 |
| Single query (cached context) | Sonnet 4.6 | ~$0.005 |

For a typical document Q&A system with 100 pages and 50 queries per day, expect about $1-2/day with caching enabled.


Summary

| Concept | Details |
| --- | --- |
| RAG | Retrieve relevant context, then generate answers |
| Chunking | Split documents into 500-char pieces with overlap |
| Embeddings | Convert text to vectors for similarity search |
| Vector DB | ChromaDB (local) or pgvector (production) |
| Contextual retrieval | Add document context to chunks before embedding |
| Hybrid search | Combine vector + keyword search |
| Prompt caching | Cache document context for 90% cheaper queries |

What’s Next?

In the next article, we will cover prompt testing and evaluation — how to systematically improve your prompts with metrics and automated tests.

Next: Fine-Tuning Prompts — Evaluation and Testing