Every time you call the Claude API with the same system prompt, you pay full price for those input tokens. Prompt caching fixes this: cache your repeated context once, then pay only 10% for those tokens on every subsequent call.
This is Article 11 in the Claude AI — From Zero to Power User series. You should have completed Article 7: Messages API before this article.
By the end of this article, you will know how prompt caching works, when it pays off, and how to implement it in your applications.
What is Prompt Caching?
Prompt caching lets you mark parts of your request as cacheable. The first time you send those tokens, Claude processes them and stores them in a cache. On subsequent requests with the same prefix, Claude reads from the cache instead of reprocessing.
Result: Cache reads cost only 10% of normal input token pricing.
How It Works
- You add cache_control markers to your messages
- First request: normal cost + small write premium
- Following requests: cached content costs 90% less
- Cache expires after 5 minutes (default) or 1 hour (extended)
Cache Pricing
| Operation | Cost Multiplier |
|---|---|
| Normal input | 1.0x |
| Cache write (5-min TTL) | 1.25x |
| Cache write (1-hour TTL) | 2.0x |
| Cache read | 0.1x |
With Sonnet 4.6 ($3.00/MTok input):
| Operation | Cost per MTok |
|---|---|
| Normal input | $3.00 |
| Cache write (5-min) | $3.75 |
| Cache write (1-hour) | $6.00 |
| Cache read | $0.30 |
The math: A 10,000-token system prompt costs $0.03 per call normally. With caching, the first call costs $0.0375 (5-min write), and every subsequent call costs only $0.003. You come out ahead by the second call.
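The arithmetic above can be checked in a few lines of Python (prices hard-coded from the pricing table; purely illustrative):

```python
MTOK = 1_000_000
INPUT_PRICE = 3.00       # Sonnet, $/MTok normal input
CACHE_WRITE_5M = 3.75    # 1.25x write premium
CACHE_READ = 0.30        # 0.1x read rate

PROMPT_TOKENS = 10_000

def total(n: int, cached: bool) -> float:
    """Cumulative cost of n calls for the system-prompt portion only."""
    per_call = PROMPT_TOKENS * INPUT_PRICE / MTOK          # $0.03 uncached
    if not cached:
        return n * per_call
    first = PROMPT_TOKENS * CACHE_WRITE_5M / MTOK          # $0.0375 write
    rest = (n - 1) * PROMPT_TOKENS * CACHE_READ / MTOK     # $0.003 per read
    return first + rest

print(f"2 calls, no cache:   ${total(2, cached=False):.4f}")  # $0.0600
print(f"2 calls, with cache: ${total(2, cached=True):.4f}")   # $0.0405
```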
When Caching Pays Off
5-Minute TTL (Default)
Break-even: after 1 cache read
Use when:
- You make multiple API calls within a few minutes
- Chatbot conversations (same system prompt for all turns)
- Batch processing with the same context
1-Hour TTL
Break-even: after 2 cache reads
Use when:
- You need the cache to persist longer
- Users return to the same conversation within an hour
- Background jobs that run periodically
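Both break-even points fall straight out of the multipliers. A quick sketch, with costs expressed in multiples of the normal input price (the helper names are made up for illustration):

```python
READ_MULT = 0.1  # a cache read costs 10% of normal input

def cached_cost(reads: int, write_mult: float) -> float:
    """Relative cost of 1 cache write followed by n cache reads."""
    return write_mult + reads * READ_MULT

def uncached_cost(calls: int) -> float:
    """Relative cost of the same number of calls with no caching."""
    return float(calls)

# 5-minute TTL (1.25x write): ahead after a single read
print(cached_cost(1, 1.25), "vs", uncached_cost(2))  # 1.35 vs 2.0

# 1-hour TTL (2.0x write): behind after one read, ahead after two
print(cached_cost(1, 2.0), "vs", uncached_cost(2))   # 2.1 vs 2.0
print(cached_cost(2, 2.0), "vs", uncached_cost(3))   # 2.2 vs 3.0
```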
Implementation: System Prompt Caching
The most common pattern — cache your system prompt.
Python
```python
import anthropic

client = anthropic.Anthropic()

# Define the system prompt with cache_control
system_prompt = [
    {
        "type": "text",
        "text": """You are a senior Python developer. You specialize in FastAPI, SQLAlchemy, and PostgreSQL.

Rules:
- Always include type hints
- Use async/await for database operations
- Follow PEP 8 style
- Include error handling with proper HTTP status codes
- Use Pydantic v2 for request/response models

Your codebase uses:
- FastAPI 0.115
- SQLAlchemy 2.0 (async engine)
- PostgreSQL 16
- Alembic for migrations
- pytest with httpx for testing

[... imagine 2000+ tokens of project context here ...]""",
        "cache_control": {"type": "ephemeral"}  # 5-min TTL
    }
]

# First call — cache write (1.25x cost on system prompt)
response1 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=system_prompt,
    messages=[{"role": "user", "content": "Write a GET endpoint to list all users with pagination"}]
)
print(f"Cache write tokens: {response1.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response1.usage.cache_read_input_tokens}")

# Second call — cache hit (0.1x cost on system prompt)
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=system_prompt,
    messages=[{"role": "user", "content": "Write a POST endpoint to create a new user"}]
)
print(f"Cache write tokens: {response2.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response2.usage.cache_read_input_tokens}")
```
TypeScript
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const systemPrompt: Anthropic.TextBlockParam[] = [
  {
    type: "text",
    text: `You are a senior Python developer. You specialize in FastAPI, SQLAlchemy, and PostgreSQL.

Rules:
- Always include type hints
- Use async/await for database operations
- Follow PEP 8 style

[... project context ...]`,
    cache_control: { type: "ephemeral" }, // 5-min TTL
  },
];

// First call — cache write
const response1 = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  system: systemPrompt,
  messages: [
    { role: "user", content: "Write a GET endpoint to list all users" },
  ],
});
console.log(`Cache write: ${response1.usage.cache_creation_input_tokens}`);
console.log(`Cache read: ${response1.usage.cache_read_input_tokens}`);

// Second call — cache hit
const response2 = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  system: systemPrompt,
  messages: [
    { role: "user", content: "Write a POST endpoint to create a user" },
  ],
});
console.log(`Cache write: ${response2.usage.cache_creation_input_tokens}`);
console.log(`Cache read: ${response2.usage.cache_read_input_tokens}`);
```
On the first call, cache_creation_input_tokens shows how many tokens were cached. On the second call, cache_read_input_tokens shows how many tokens were read from cache.
Implementation: 1-Hour TTL
For longer cache duration, use the ttl field:
```python
system_prompt = [
    {
        "type": "text",
        "text": "Your long system prompt here...",
        "cache_control": {
            "type": "ephemeral",
            "ttl": "1h"  # 1-hour cache lifetime (default is "5m")
        }
    }
]
```
The 1-hour TTL costs 2x for the write instead of 1.25x, but the cache lasts 12 times longer.
What Can Be Cached
You can cache:
- System prompts — Most common use case
- Tool definitions — Cache your tool schemas
- Images — Cache images used in every request
- Conversation prefixes — Cache earlier turns in a conversation
- Long documents — Cache reference material
Caching Tool Definitions
```python
tools = [
    {
        "name": "search_docs",
        "description": "Search documentation",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"]
        },
        "cache_control": {"type": "ephemeral"}
    }
]

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Find docs about authentication"}]
)
```
Caching Conversation History
In multi-turn conversations, cache the earlier turns so you only pay full price for the latest message:
```python
messages = [
    # Earlier turns — cached
    {
        "role": "user",
        "content": "Explain Python decorators"
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "A decorator is a function that wraps another function...",
                "cache_control": {"type": "ephemeral"}
            }
        ]
    },
    # New turn — not cached
    {
        "role": "user",
        "content": "Show me a real-world example of a decorator"
    }
]
```
Place the cache_control marker on the last content block you want cached. Everything up to and including that block gets cached.
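As the conversation grows, you move the breakpoint forward each turn. One way to sketch this (the helper is hypothetical, not part of the SDK) is to strip any old markers, then mark the most recent assistant block:

```python
def move_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Clear old cache_control markers, then place one on the final
    content block of the most recent assistant turn."""
    for msg in messages:
        if isinstance(msg["content"], list):
            for block in msg["content"]:
                block.pop("cache_control", None)
    for msg in reversed(messages):
        if msg["role"] == "assistant":
            if isinstance(msg["content"], str):
                # Normalize string content into a block list so it
                # can carry a cache_control marker
                msg["content"] = [{"type": "text", "text": msg["content"]}]
            msg["content"][-1]["cache_control"] = {"type": "ephemeral"}
            break
    return messages

history = [
    {"role": "user", "content": "Explain Python decorators"},
    {"role": "assistant", "content": "A decorator wraps another function..."},
    {"role": "user", "content": "Show me a real-world example"},
]
history = move_cache_breakpoint(history)
```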
Minimum Cacheable Size
Not all content can be cached. There is a minimum size:
| Model | Minimum Tokens |
|---|---|
| Opus 4.6 | 1,024 tokens |
| Sonnet 4.6 | 1,024 tokens |
| Haiku 4.5 | 2,048 tokens |
If your system prompt is under the minimum, it cannot be cached. In practice, any substantial system prompt (project context, rules, examples) exceeds 1,024 tokens easily.
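Before relying on caching, you can sanity-check that a prompt clears the threshold. Here is a rough sketch using the common ~4-characters-per-token heuristic (an approximation only; the API's token counting endpoint gives exact numbers):

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return len(text) // 4

def is_likely_cacheable(text: str, minimum: int = 1024) -> bool:
    """Check against the model's minimum cacheable size
    (1,024 for Opus/Sonnet, 2,048 for Haiku)."""
    return estimate_tokens(text) >= minimum

short_prompt = "You are a helpful assistant."
long_prompt = "FastAPI project context and rules... " * 300  # ~11,000 chars

print(is_likely_cacheable(short_prompt))  # False
print(is_likely_cacheable(long_prompt))   # True
```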
Monitoring Cache Usage
Track cache performance with the usage fields in every response:
```python
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Output tokens: {usage.output_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
```
| Field | Meaning |
|---|---|
| cache_creation_input_tokens | Tokens written to cache (charged at write rate) |
| cache_read_input_tokens | Tokens read from cache (charged at 0.1x) |
| input_tokens | Non-cached input tokens (charged at normal rate) |
A well-cached application should show high cache_read_input_tokens and zero cache_creation_input_tokens on most calls.
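A minimal tracker for these fields might look like this (the class is a sketch; the usage attribute names are the real ones from the response object):

```python
from types import SimpleNamespace  # stands in for response.usage in the demo

class CacheStats:
    """Aggregate cache usage across calls to spot frequent misses."""

    def __init__(self) -> None:
        self.read = 0
        self.written = 0
        self.uncached = 0

    def record(self, usage) -> None:
        self.read += usage.cache_read_input_tokens or 0
        self.written += usage.cache_creation_input_tokens or 0
        self.uncached += usage.input_tokens or 0

    @property
    def hit_rate(self) -> float:
        cacheable = self.read + self.written
        return self.read / cacheable if cacheable else 0.0

stats = CacheStats()
# First call: cache write; second call: cache hit
stats.record(SimpleNamespace(input_tokens=20, cache_creation_input_tokens=5000, cache_read_input_tokens=0))
stats.record(SimpleNamespace(input_tokens=25, cache_creation_input_tokens=0, cache_read_input_tokens=5000))
print(f"hit rate: {stats.hit_rate:.0%}")  # hit rate: 50%
```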
RAG with Caching
Retrieval-Augmented Generation (RAG) is a perfect use case for prompt caching. Cache the document context, vary only the query.
Python
```python
import anthropic

client = anthropic.Anthropic()

# Load your reference documents (done once)
with open("documentation.txt") as f:
    docs = f.read()

system_prompt = [
    {
        "type": "text",
        "text": f"""You are a documentation assistant. Answer questions based ONLY on the provided documentation.

<documentation>
{docs}
</documentation>

Rules:
- Only answer from the documentation above
- If the answer is not in the documentation, say "I could not find this in the documentation"
- Quote relevant sections when possible""",
        "cache_control": {"type": "ephemeral"}
    }
]

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=system_prompt,
        messages=[{"role": "user", "content": question}]
    )
    return response.content[0].text

# First question — cache write
print(ask("How do I authenticate with the API?"))

# Second question — cache hit (90% cheaper on the docs)
print(ask("What are the rate limits?"))

# Third question — still a cache hit
print(ask("How do I handle errors?"))
```
If your documentation is 50,000 tokens, those tokens cost $0.15 on the first question and only $0.015 on every question after, a 90% saving on the document portion with Sonnet 4.6.
Cost Calculation Example
A real scenario: a chatbot with a 5,000 token system prompt, handling 100 messages per hour.
Without Caching
100 messages x 5,000 tokens x $3.00/MTok = $1.50/hour
With 5-Minute Caching
First message each 5-min window: 5,000 tokens x $3.75/MTok = $0.019
~8 subsequent messages: 5,000 tokens x $0.30/MTok = $0.012
Per 5-min window: $0.031
12 windows per hour: $0.37/hour
Savings: 75% ($1.13/hour saved)
(In practice, the 5-minute TTL is refreshed every time the cache is read, so with steady traffic the cache may never expire and real savings are higher; this estimate is conservative.)
With 1-Hour Caching
First message: 5,000 tokens x $6.00/MTok = $0.030
99 subsequent: 5,000 tokens x $0.30/MTok = $0.149
Per hour: $0.179
Savings: 88% ($1.32/hour saved)
At scale (1 million messages/month), the savings become significant — thousands of dollars per month.
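The scenario above can be reproduced with a short calculator (prices hard-coded from the earlier pricing table; purely illustrative):

```python
MTOK = 1_000_000
INPUT = 3.00       # $/MTok normal input
READ = 0.30        # $/MTok cache read
WRITE_5M = 3.75    # $/MTok 5-minute cache write
WRITE_1H = 6.00    # $/MTok 1-hour cache write

def hourly_prompt_cost(messages: int, prompt_tokens: int, writes: int, write_rate: float) -> float:
    """Hourly cost of the cached system-prompt portion only."""
    reads = messages - writes
    return (writes * write_rate + reads * READ) * prompt_tokens / MTOK

no_cache = 100 * 5000 * INPUT / MTOK                                      # $1.50
five_min = hourly_prompt_cost(100, 5000, writes=12, write_rate=WRITE_5M)  # ~$0.36
one_hour = hourly_prompt_cost(100, 5000, writes=1, write_rate=WRITE_1H)   # ~$0.18

print(f"no cache: ${no_cache:.2f}  5-min: ${five_min:.2f}  1-hour: ${one_hour:.2f}")
```

This models exactly 100 messages per hour with 12 writes; the per-window rounding above gives the slightly higher $0.37 figure for the 5-minute case.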
Caching with Extended Thinking
When you use extended thinking with caching, the thinking tokens from previous turns get cached automatically. This is especially valuable because thinking tokens can be large.
```python
messages = [
    {"role": "user", "content": "Solve this complex math problem: ..."},
    {
        "role": "assistant",
        "content": [
            # Thinking block from previous turn
            {"type": "thinking", "thinking": "Let me work through this step by step..."},
            {
                "type": "text",
                "text": "The answer is 42.",
                "cache_control": {"type": "ephemeral"}
            }
        ]
    },
    # New question — thinking from above is cached
    {"role": "user", "content": "Now explain your reasoning more simply"}
]
```
Multiple Cache Breakpoints
You can have up to 4 cache breakpoints in a single request. Use this to cache different parts independently:
```python
system_prompt = [
    {
        "type": "text",
        "text": "You are a code reviewer...",
        "cache_control": {"type": "ephemeral"}  # Breakpoint 1
    }
]

tools = [
    {
        "name": "report_issue",
        "description": "Report a code issue",
        "input_schema": {"type": "object", "properties": {"issue": {"type": "string"}}, "required": ["issue"]},
        "cache_control": {"type": "ephemeral"}  # Breakpoint 2
    }
]
```
Each breakpoint caches the prefix up to that point, so earlier parts of your request can still be reused when later parts change. This is useful when different parts of your request change at different rates.
Best Practices
1. Put Static Content First
Cache works on prefixes. Put your cacheable content (system prompt, tools, documents) at the beginning of the request. Put variable content (the user’s message) at the end.
2. Keep Cache Content Identical
The cache key is based on the exact token sequence. Even a single character change in your system prompt invalidates the cache and creates a new entry.
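You can see why a single changed character hurts by comparing prompt prefixes the way a cache would, here with a text hash as a stand-in (the real cache key is computed server-side from the token sequence, not a client-side hash):

```python
import hashlib

def prefix_key(text: str) -> str:
    """Illustrative stand-in for a cache key over the prompt prefix."""
    return hashlib.sha256(text.encode()).hexdigest()[:12]

prompt_v1 = "You are a code reviewer. Always check error handling."
prompt_v2 = "You are a code reviewer. Always check error handling!"  # one char changed

print(prefix_key(prompt_v1) == prefix_key(prompt_v1))  # True: identical prefix, cache hit
print(prefix_key(prompt_v1) == prefix_key(prompt_v2))  # False: new cache entry
```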
3. Monitor Cache Hit Rate
Track cache_read_input_tokens vs cache_creation_input_tokens. If you see frequent cache misses, check that your cached content is not changing between requests.
4. Choose the Right TTL
- Use 5-minute TTL for chatbots and real-time applications
- Use 1-hour TTL for batch processing and periodic jobs
- Use 1-hour TTL when users might return to a conversation after a break
5. Combine with Batch API
Prompt caching and the Batch API discounts stack. You can get cache reads at 0.1x AND batch discount at 0.5x, for an effective 0.05x cost on cached batch input tokens.
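The stacking is just multiplication of the two discounts, which a quick check confirms:

```python
cache_read_mult = 0.10  # cache read: 10% of normal input price
batch_mult = 0.50       # Batch API: 50% discount

effective = cache_read_mult * batch_mult
print(f"effective multiplier: {effective:.2f}")  # 0.05

sonnet_input = 3.00  # $/MTok
print(f"cached batch input: ${sonnet_input * effective:.2f}/MTok")  # $0.15/MTok
```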
Summary
| Feature | Details |
|---|---|
| Cache read cost | 10% of normal input cost |
| Cache write cost | 1.25x (5-min) or 2x (1-hour) |
| Break-even | 1 read (5-min) or 2 reads (1-hour) |
| Minimum size | 1,024 tokens (Opus/Sonnet), 2,048 (Haiku) |
| Max breakpoints | 4 per request |
| What to cache | System prompts, tools, images, conversation history, documents |
Prompt caching is the easiest way to reduce Claude API costs. If you have any repeated context — a system prompt, tool definitions, or reference documents — cache it.
What’s Next?
In the next article, we will cover Extended Thinking — Claude’s reasoning mode that improves accuracy on complex problems.
Next: Extended Thinking — Claude’s Reasoning Mode