Every time you call the Claude API with the same system prompt, you pay full price for those input tokens. Prompt caching fixes this: cache your repeated context once, then pay only 10% of the normal input rate for it on every subsequent call.

This is Article 11 in the Claude AI — From Zero to Power User series. You should have completed Article 7: Messages API before this article.

By the end of this article, you will know how prompt caching works, when it pays off, and how to implement it in your applications.


What is Prompt Caching?

Prompt caching lets you mark parts of your request as cacheable. The first time you send those tokens, Claude processes them and stores them in a cache. On subsequent requests with the same prefix, Claude reads from the cache instead of reprocessing.

Result: Cache reads cost only 10% of normal input token pricing.

How It Works

  1. You add cache_control markers to your messages
  2. First request: normal cost + small write premium
  3. Following requests: cached content costs 90% less
  4. Cache expires after 5 minutes (default) or 1 hour (extended)

Cache Pricing

Operation                  Cost Multiplier
-------------------------  ---------------
Normal input               1.0x
Cache write (5-min TTL)    1.25x
Cache write (1-hour TTL)   2.0x
Cache read                 0.1x

With Sonnet 4.6 ($3.00/MTok input):

Operation              Cost per MTok
---------------------  -------------
Normal input           $3.00
Cache write (5-min)    $3.75
Cache write (1-hour)   $6.00
Cache read             $0.30

The math: A 10,000 token system prompt costs $0.03 per call normally. With caching, the first call costs $0.0375 (5-min write), but every subsequent call costs only $0.003. After just 2 calls, you save money.
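That arithmetic can be sketched as a quick cost function, using the Sonnet 4.6 rates from the table above (a minimal estimation helper, not part of any SDK):

```python
# Cost of N calls with a cached system prompt vs. no caching, at Sonnet 4.6 rates.
PRICE_PER_MTOK = 3.00                # normal input price, $/MTok
WRITE_MULT, READ_MULT = 1.25, 0.10   # 5-minute-TTL multipliers

def cost_without_cache(prompt_tokens: int, calls: int) -> float:
    return calls * prompt_tokens / 1e6 * PRICE_PER_MTOK

def cost_with_cache(prompt_tokens: int, calls: int) -> float:
    # First call writes the cache; every later call reads it.
    write = prompt_tokens / 1e6 * PRICE_PER_MTOK * WRITE_MULT
    reads = (calls - 1) * prompt_tokens / 1e6 * PRICE_PER_MTOK * READ_MULT
    return write + reads

# The 10,000-token system prompt from the example above
print(cost_without_cache(10_000, 2))  # about $0.06
print(cost_with_cache(10_000, 2))     # about $0.04 -- already cheaper by call 2
```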


When Caching Pays Off

5-Minute TTL (Default)

Break-even: after 1 cache read

Use when:

  • You make multiple API calls within a few minutes
  • Chatbot conversations (same system prompt for all turns)
  • Batch processing with the same context

1-Hour TTL

Break-even: after 2 cache reads

Use when:

  • You need the cache to persist longer
  • Users return to the same conversation within an hour
  • Background jobs that run periodically

Implementation: System Prompt Caching

The most common pattern — cache your system prompt.

Python

import anthropic

client = anthropic.Anthropic()

# Define the system prompt with cache_control
system_prompt = [
    {
        "type": "text",
        "text": """You are a senior Python developer. You specialize in FastAPI, SQLAlchemy, and PostgreSQL.

Rules:
- Always include type hints
- Use async/await for database operations
- Follow PEP 8 style
- Include error handling with proper HTTP status codes
- Use Pydantic v2 for request/response models

Your codebase uses:
- FastAPI 0.115
- SQLAlchemy 2.0 (async engine)
- PostgreSQL 16
- Alembic for migrations
- pytest with httpx for testing

[... imagine 2000+ tokens of project context here ...]""",
        "cache_control": {"type": "ephemeral"}  # 5-min TTL
    }
]

# First call — cache write (1.25x cost on system prompt)
response1 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=system_prompt,
    messages=[{"role": "user", "content": "Write a GET endpoint to list all users with pagination"}]
)

print(f"Cache write tokens: {response1.usage.cache_creation_input_tokens}")
print(f"Cache read tokens:  {response1.usage.cache_read_input_tokens}")

# Second call — cache hit (0.1x cost on system prompt)
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=system_prompt,
    messages=[{"role": "user", "content": "Write a POST endpoint to create a new user"}]
)

print(f"Cache write tokens: {response2.usage.cache_creation_input_tokens}")
print(f"Cache read tokens:  {response2.usage.cache_read_input_tokens}")

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const systemPrompt: Anthropic.TextBlockParam[] = [
  {
    type: "text",
    text: `You are a senior Python developer. You specialize in FastAPI, SQLAlchemy, and PostgreSQL.

Rules:
- Always include type hints
- Use async/await for database operations
- Follow PEP 8 style

[... project context ...]`,
    cache_control: { type: "ephemeral" }, // 5-min TTL
  },
];

// First call — cache write
const response1 = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  system: systemPrompt,
  messages: [
    { role: "user", content: "Write a GET endpoint to list all users" },
  ],
});

console.log(`Cache write: ${response1.usage.cache_creation_input_tokens}`);
console.log(`Cache read:  ${response1.usage.cache_read_input_tokens}`);

// Second call — cache hit
const response2 = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  system: systemPrompt,
  messages: [
    { role: "user", content: "Write a POST endpoint to create a user" },
  ],
});

console.log(`Cache write: ${response2.usage.cache_creation_input_tokens}`);
console.log(`Cache read:  ${response2.usage.cache_read_input_tokens}`);

On the first call, cache_creation_input_tokens shows how many tokens were cached. On the second call, cache_read_input_tokens shows how many tokens were read from cache.


Implementation: 1-Hour TTL

For longer cache duration, use the ttl field:

system_prompt = [
    {
        "type": "text",
        "text": "Your long system prompt here...",
        "cache_control": {
            "type": "ephemeral",
            "ttl": {"type": "duration", "seconds": 3600}  # 1 hour
        }
    }
]

The 1-hour TTL costs 2x for the write instead of 1.25x, but the cache lasts 12 times longer.


What Can Be Cached

You can cache:

  1. System prompts — Most common use case
  2. Tool definitions — Cache your tool schemas
  3. Images — Cache images used in every request
  4. Conversation prefixes — Cache earlier turns in a conversation
  5. Long documents — Cache reference material

Caching Tool Definitions

tools = [
    {
        "name": "search_docs",
        "description": "Search documentation",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"]
        },
        "cache_control": {"type": "ephemeral"}
    }
]

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Find docs about authentication"}]
)

Caching Conversation History

In multi-turn conversations, cache the earlier turns so you only pay full price for the latest message:

messages = [
    # Earlier turns — cached
    {
        "role": "user",
        "content": "Explain Python decorators"
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "A decorator is a function that wraps another function...",
                "cache_control": {"type": "ephemeral"}
            }
        ]
    },
    # New turn — not cached
    {
        "role": "user",
        "content": "Show me a real-world example of a decorator"
    }
]

Place cache_control on the last content block you want cached. The entire prefix up to and including that block is cached.
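As the conversation grows, you can move the breakpoint forward to the newest assistant turn each time, so the whole history stays inside the cached prefix. A minimal sketch (mark_latest_turn is a hypothetical helper, not an SDK function):

```python
def mark_latest_turn(messages: list[dict]) -> list[dict]:
    """Move the cache breakpoint to the most recent assistant message.

    Clears any old cache_control markers, then marks the final content
    block of the latest assistant turn.
    """
    # Remove stale breakpoints from earlier turns
    for msg in messages:
        if isinstance(msg["content"], list):
            for block in msg["content"]:
                block.pop("cache_control", None)
    # Mark the newest assistant turn
    for msg in reversed(messages):
        if msg["role"] == "assistant":
            if isinstance(msg["content"], str):
                # Normalize a plain string into a content-block list
                msg["content"] = [{"type": "text", "text": msg["content"]}]
            msg["content"][-1]["cache_control"] = {"type": "ephemeral"}
            break
    return messages

history = [
    {"role": "user", "content": "Explain Python decorators"},
    {"role": "assistant", "content": "A decorator wraps another function..."},
    {"role": "user", "content": "Show me a real-world example"},
]
mark_latest_turn(history)
```

Call it once per turn, right before sending `history` to the API.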


Minimum Cacheable Size

Not all content can be cached. There is a minimum size:

Model        Minimum Tokens
-----------  --------------
Opus 4.6     1,024 tokens
Sonnet 4.6   1,024 tokens
Haiku 4.5    2,048 tokens

If your system prompt is under the minimum, it cannot be cached. In practice, any substantial system prompt (project context, rules, examples) exceeds 1,024 tokens easily.


Monitoring Cache Usage

Track cache performance with the usage fields in every response:

usage = response.usage

print(f"Input tokens:          {usage.input_tokens}")
print(f"Output tokens:         {usage.output_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens:     {usage.cache_read_input_tokens}")

Field                        Meaning
---------------------------  ---------------------------------------------------
cache_creation_input_tokens  Tokens written to cache (charged at the write rate)
cache_read_input_tokens      Tokens read from cache (charged at 0.1x)
input_tokens                 Non-cached input tokens (charged at the normal rate)

A well-cached application should show high cache_read_input_tokens and zero cache_creation_input_tokens on most calls.
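These three fields are enough to compute your effective input cost per call. A sketch, with FakeUsage standing in for response.usage and the 5-minute-TTL multipliers at Sonnet 4.6 pricing:

```python
from dataclasses import dataclass

@dataclass
class FakeUsage:  # stands in for response.usage
    input_tokens: int
    cache_creation_input_tokens: int
    cache_read_input_tokens: int

def input_cost(usage, price_per_mtok: float = 3.00) -> float:
    """Effective input cost in dollars, using the 5-minute-TTL multipliers."""
    return (
        usage.input_tokens * 1.0
        + usage.cache_creation_input_tokens * 1.25
        + usage.cache_read_input_tokens * 0.10
    ) / 1e6 * price_per_mtok

# A warm-cache call: a small fresh question, a large cached prefix
warm = FakeUsage(input_tokens=50, cache_creation_input_tokens=0,
                 cache_read_input_tokens=10_000)
print(f"${input_cost(warm):.4f}")
```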


RAG with Caching

Retrieval-Augmented Generation (RAG) is a perfect use case for prompt caching. Cache the document context, vary only the query.

Python

import anthropic

client = anthropic.Anthropic()

# Load your reference documents (done once)
with open("documentation.txt") as f:
    docs = f.read()

system_prompt = [
    {
        "type": "text",
        "text": f"""You are a documentation assistant. Answer questions based ONLY on the provided documentation.

<documentation>
{docs}
</documentation>

Rules:
- Only answer from the documentation above
- If the answer is not in the documentation, say "I could not find this in the documentation"
- Quote relevant sections when possible""",
        "cache_control": {"type": "ephemeral"}
    }
]

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=system_prompt,
        messages=[{"role": "user", "content": question}]
    )
    return response.content[0].text

# First question — cache write
print(ask("How do I authenticate with the API?"))

# Second question — cache hit (90% cheaper on the docs)
print(ask("What are the rate limits?"))

# Third question — still a cache hit
print(ask("How do I handle errors?"))

If your documentation is 50,000 tokens, every question after the first reads those tokens from cache at 90% off: about $0.015 instead of $0.15 per question at Sonnet 4.6 input rates.


Cost Calculation Example

A real scenario: a chatbot with a 5,000 token system prompt, handling 100 messages per hour.

Without Caching

100 messages x 5,000 tokens x $3.00/MTok = $1.50/hour

With 5-Minute Caching

First message each 5-min window:  5,000 tokens x $3.75/MTok = $0.019
~8 subsequent messages:           5,000 tokens x $0.30/MTok = $0.012
Per 5-min window:                 $0.031
12 windows per hour:              $0.37/hour

Savings: 75% ($1.13/hour saved)

With 1-Hour Caching

First message:        5,000 tokens x $6.00/MTok = $0.030
99 subsequent:        5,000 tokens x $0.30/MTok = $0.149
Per hour:             $0.179

Savings: 88% ($1.32/hour saved)

At scale (1 million messages/month), the savings become significant — thousands of dollars per month.
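The scenario above, as a reusable calculation (a sketch; it assumes one cache write per TTL window and cache reads for every other message):

```python
def hourly_cost(msgs_per_hour: int, prompt_tokens: int,
                write_mult: float, windows_per_hour: int,
                price: float = 3.00) -> float:
    """Hourly system-prompt cost with caching: one write per window, reads otherwise."""
    writes = windows_per_hour
    reads = msgs_per_hour - writes
    per_msg = prompt_tokens / 1e6 * price   # uncached cost per message
    return writes * per_msg * write_mult + reads * per_msg * 0.10

no_cache = 100 * 5_000 / 1e6 * 3.00            # $1.50/hour
five_min = hourly_cost(100, 5_000, 1.25, 12)   # about $0.36/hour
one_hour = hourly_cost(100, 5_000, 2.00, 1)    # about $0.18/hour
print(no_cache, five_min, one_hour)
```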


Caching with Extended Thinking

When you use extended thinking with caching, the thinking tokens from previous turns get cached automatically. This is especially valuable because thinking tokens can be large.

messages = [
    {"role": "user", "content": "Solve this complex math problem: ..."},
    {
        "role": "assistant",
        "content": [
        # Thinking block from the previous turn
        # (the signature field is omitted here for brevity)
        {"type": "thinking", "thinking": "Let me work through this step by step..."},
            {
                "type": "text",
                "text": "The answer is 42.",
                "cache_control": {"type": "ephemeral"}
            }
        ]
    },
    # New question — thinking from above is cached
    {"role": "user", "content": "Now explain your reasoning more simply"}
]

Multiple Cache Breakpoints

You can have up to 4 cache breakpoints in a single request. Use this to cache different parts independently:

system_prompt = [
    {
        "type": "text",
        "text": "You are a code reviewer...",
        "cache_control": {"type": "ephemeral"}  # Breakpoint 1
    }
]

tools = [
    {
        "name": "report_issue",
        "description": "Report a code issue",
        "input_schema": {"type": "object", "properties": {"issue": {"type": "string"}}, "required": ["issue"]},
        "cache_control": {"type": "ephemeral"}  # Breakpoint 2
    }
]

Each breakpoint creates an independent cache entry. This is useful when different parts of your request change at different rates.


Best Practices

1. Put Static Content First

Cache works on prefixes. Put your cacheable content (system prompt, tools, documents) at the beginning of the request. Put variable content (the user’s message) at the end.

2. Keep Cache Content Identical

The cache key is based on the exact token sequence. Even a single character change in your system prompt invalidates the cache and creates a new entry.

3. Monitor Cache Hit Rate

Track cache_read_input_tokens vs cache_creation_input_tokens. If you see frequent cache misses, check that your cached content is not changing between requests.
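A per-call hit rate you can log (a hypothetical helper over the usage fields; SimpleNamespace stands in for response.usage):

```python
from types import SimpleNamespace

def cache_hit_rate(usage) -> float:
    """Fraction of cacheable input tokens served from cache on this call."""
    cached = usage.cache_read_input_tokens
    written = usage.cache_creation_input_tokens
    total = cached + written
    return cached / total if total else 0.0

warm = SimpleNamespace(cache_read_input_tokens=10_000, cache_creation_input_tokens=0)
cold = SimpleNamespace(cache_read_input_tokens=0, cache_creation_input_tokens=10_000)
print(cache_hit_rate(warm), cache_hit_rate(cold))  # 1.0 0.0
```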

4. Choose the Right TTL

  • Use 5-minute TTL for chatbots and real-time applications
  • Use 1-hour TTL for batch processing and periodic jobs
  • Use 1-hour TTL when users might return to a conversation after a break

5. Combine with Batch API

Prompt caching and the Batch API discounts stack. You can get cache reads at 0.1x AND batch discount at 0.5x, for an effective 0.05x cost on cached batch input tokens.
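The stacking is just multiplication of the two discounts, at Sonnet 4.6's $3.00/MTok base rate:

```python
base = 3.00          # $/MTok, Sonnet 4.6 normal input
cache_read = 0.10    # cache-read multiplier
batch = 0.50         # Batch API multiplier

effective = base * cache_read * batch
print(f"${effective:.2f}/MTok")  # $0.15/MTok -- 0.05x the normal rate
```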


Summary

Feature           Details
----------------  --------------------------------------------------------------
Cache read cost   10% of normal input cost
Cache write cost  1.25x (5-min) or 2x (1-hour)
Break-even        1 read (5-min) or 2 reads (1-hour)
Minimum size      1,024 tokens (Opus/Sonnet), 2,048 (Haiku)
Max breakpoints   4 per request
What to cache     System prompts, tools, images, conversation history, documents

Prompt caching is the easiest way to reduce Claude API costs. If you have any repeated context — a system prompt, tool definitions, or reference documents — cache it.


What’s Next?

In the next article, we will cover Extended Thinking — Claude’s reasoning mode that improves accuracy on complex problems.

Next: Extended Thinking — Claude’s Reasoning Mode