Claude API costs add up fast if you are not careful. A simple change like enabling prompt caching cuts the price of repeated input tokens by 90%, and the Batch API takes 50% off any asynchronous workload. Stack them together and you can save up to 95%.
This is Article 25 in the Claude AI — From Zero to Power User series. You should be familiar with Understanding Models and Extended Thinking before reading this one.
The Cost Equation
Your Claude API cost is determined by three factors:
Cost = (Input Tokens x Input Price) + (Output Tokens x Output Price)
Each model has different prices per million tokens:
| Model | Input (per 1M) | Output (per 1M) |
|---|---|---|
| Opus 4.6 | $5.00 | $25.00 |
| Sonnet 4.6 | $3.00 | $15.00 |
| Haiku 4.5 | $1.00 | $5.00 |
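Plugging real numbers into that equation is straightforward. Here is a minimal sketch, with prices hardcoded from the table above (the helper name is just for illustration):
PRICES = {
    "claude-opus-4-6": (5.00, 25.00),   # (input, output) per 1M tokens
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in dollars from token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# 2,000 input + 500 output tokens on Sonnet 4.6 ≈ $0.0135
print(f"${estimate_cost('claude-sonnet-4-6', 2000, 500):.4f}")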
Output tokens cost 5x more than input tokens, so trimming unnecessary output is one of the highest-leverage optimizations.
Strategy 1: Model Selection
Use the cheapest model that meets your quality needs.
The Model Ladder
Haiku 4.5 → Sonnet 4.6 → Opus 4.6

| Model | Position | Input / Output (per 1M) |
|---|---|---|
| Haiku 4.5 | Cheap | $1 / $5 |
| Sonnet 4.6 | Sweet spot | $3 / $15 |
| Opus 4.6 | Best quality | $5 / $25 |
When to Use Each Model
| Use Case | Model | Why |
|---|---|---|
| Classification | Haiku 4.5 | Simple yes/no decisions |
| Data extraction | Haiku 4.5 | Structured output from templates |
| Code review triage | Haiku 4.5 | Quick scan, flag files |
| General coding | Sonnet 4.6 | Best price/quality |
| Content writing | Sonnet 4.6 | Good enough for most content |
| Complex algorithms | Opus 4.6 | Needs deep reasoning |
| Legal/medical analysis | Opus 4.6 | Accuracy is critical |
Python: Dynamic Model Selection
import anthropic
client = anthropic.Anthropic()
def select_model(task_type: str, input_length: int) -> str:
"""Select the best model based on task and input size."""
if task_type in ("classify", "extract", "triage"):
return "claude-haiku-4-5"
elif task_type in ("complex", "legal", "algorithm"):
return "claude-opus-4-6"
elif input_length > 100000:
return "claude-sonnet-4-6" # Large context, needs good model
else:
return "claude-sonnet-4-6" # Default
# Example: classify emails with Haiku (cheap)
model = select_model("classify", 500)
response = client.messages.create(
model=model,
max_tokens=100,
messages=[{"role": "user", "content": "Classify: 'Your order has shipped' → spam or not_spam"}],
)
# Cost: ~$0.0005 per classification
Savings Example
Processing 10,000 classifications per day:
| Model | Cost per day | Cost per month |
|---|---|---|
| Opus 4.6 | $5.00 | $150.00 |
| Sonnet 4.6 | $3.00 | $90.00 |
| Haiku 4.5 | $1.00 | $30.00 |
Switching from Opus to Haiku saves $120/month with no quality loss for simple classifications.
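You can sanity-check those numbers yourself. A quick sketch, assuming roughly 100 input tokens per classification and negligible output:
requests_per_day = 10_000
tokens_per_request = 100  # assumption: a short email plus a short prompt

for model, input_price in [("Opus 4.6", 5.00), ("Sonnet 4.6", 3.00), ("Haiku 4.5", 1.00)]:
    daily = requests_per_day * tokens_per_request * input_price / 1_000_000
    print(f"{model}: ${daily:.2f}/day, ${daily * 30:.2f}/month")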
Strategy 2: Prompt Caching
Prompt caching lets you cache the system prompt and other static content. Cached tokens cost 90% less on read.
How It Works
- First request: Write to cache (25% extra cost for writing)
- Subsequent requests: Read from cache (90% cheaper)
- Cache TTL: 5 minutes (resets on each use)
Python
# Large system prompt that stays the same across requests
system_prompt = """You are a code reviewer for a Python FastAPI project.
[... 2000 words of review instructions ...]"""
# First request — writes to cache
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
system=[
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"},
}
],
messages=[{"role": "user", "content": "Review: def foo(): pass"}],
)
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
# Subsequent requests — reads from cache (90% cheaper input)
for diff in diffs:  # diffs: the code changes you want reviewed
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
system=[
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"},
}
],
messages=[{"role": "user", "content": f"Review: {diff}"}],
)
TypeScript
const systemPrompt = `You are a code reviewer...`; // 2000 words
// Cached — 90% cheaper on repeat
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 4096,
system: [
{
type: "text",
text: systemPrompt,
cache_control: { type: "ephemeral" },
},
],
messages: [{ role: "user", content: `Review: ${diff}` }],
});
When Caching Saves Money
Caching is profitable when you make multiple requests with the same cached content within 5 minutes.
| Cached Tokens | Requests | Without Cache | With Cache | Savings |
|---|---|---|---|---|
| 5,000 | 10 | $0.15 | $0.03 | 80% |
| 10,000 | 50 | $1.50 | $0.18 | 88% |
| 50,000 | 100 | $15.00 | $1.65 | 89% |
The breakeven point comes at the second request: the 25% surcharge for writing the cache is more than recovered by the first 90%-discounted read.
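The arithmetic behind the table is simple enough to check: cache writes cost 1.25x the base input price and cache reads cost 0.1x. A sketch, assuming the Sonnet input price:
def input_cost(cached_tokens: int, requests: int, input_price: float = 3.00) -> tuple[float, float]:
    """Input cost in dollars, without and with prompt caching."""
    without = cached_tokens * requests * input_price / 1_000_000
    # First request writes the cache (1.25x), the rest read it (0.1x)
    with_cache = cached_tokens * (1.25 + 0.10 * (requests - 1)) * input_price / 1_000_000
    return without, with_cache

for n in (2, 10, 50):
    without, with_cache = input_cost(10_000, n)
    print(f"{n} requests: ${without:.2f} without cache, ${with_cache:.2f} with cache")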
Strategy 3: Batch API
The Batch API processes requests asynchronously at 50% off. Results come back within 24 hours (usually faster).
Python
import anthropic
import time
client = anthropic.Anthropic()
# Create a batch of requests
batch = client.messages.batches.create(
requests=[
{
"custom_id": f"review-{i}",
"params": {
"model": "claude-sonnet-4-6",
"max_tokens": 2048,
"messages": [
{"role": "user", "content": f"Review this code:\n{code}"}
],
},
}
for i, code in enumerate(code_files)
],
)
print(f"Batch created: {batch.id}")
print(f"Status: {batch.processing_status}")
# Poll for completion
while True:
    batch = client.messages.batches.retrieve(batch.id)
if batch.processing_status == "ended":
break
print(f"Status: {batch.processing_status}...")
time.sleep(30)
# Get results
results = list(client.messages.batches.results(batch.id))
for result in results:
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text[:100]}")
TypeScript
const batch = await client.messages.batches.create({
requests: codeFiles.map((code, i) => ({
custom_id: `review-${i}`,
params: {
model: "claude-sonnet-4-6",
max_tokens: 2048,
messages: [
{ role: "user", content: `Review this code:\n${code}` },
],
},
})),
});
console.log(`Batch created: ${batch.id}`);
When to Use Batch API
- Code reviews that are not time-sensitive
- Content generation (articles, summaries)
- Data processing (classification, extraction)
- Nightly analysis jobs
- Any workload where 24-hour turnaround is acceptable
Strategy 4: Stack Discounts (Batch + Caching)
Combine Batch API (50% off) with prompt caching (90% off cached tokens) for maximum savings:
batch = client.messages.batches.create(
requests=[
{
"custom_id": f"item-{i}",
"params": {
"model": "claude-sonnet-4-6",
"max_tokens": 2048,
"system": [
{
"type": "text",
"text": large_system_prompt, # Cached across batch
"cache_control": {"type": "ephemeral"},
}
],
"messages": [
{"role": "user", "content": f"Process: {item}"}
],
},
}
for i, item in enumerate(items)
],
)
Savings Calculation
Processing 1,000 items with 10K cached tokens and 500 output tokens each:
| Strategy | Input Cost | Output Cost | Total | Savings |
|---|---|---|---|---|
| None (Sonnet) | $30.00 | $7.50 | $37.50 | — |
| Caching only | $3.75 | $7.50 | $11.25 | 70% |
| Batch only | $15.00 | $3.75 | $18.75 | 50% |
| Batch + Cache | $1.88 | $3.75 | $5.63 | 85% |
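The same kind of arithmetic produces the table above. A rough sketch (exact figures vary slightly depending on how the one-time cache write is amortized, which is why the table's numbers are a touch higher):
items, cached_tokens, output_tokens = 1_000, 10_000, 500
input_price, output_price = 3.00, 15.00  # Sonnet, per 1M tokens

baseline = (items * cached_tokens * input_price + items * output_tokens * output_price) / 1_000_000
# Batch halves everything; cache reads cost ~10% of the input price after the first write
stacked_input = items * cached_tokens * input_price * 0.10 * 0.5 / 1_000_000
stacked_output = items * output_tokens * output_price * 0.5 / 1_000_000
print(f"Baseline: ${baseline:.2f}  Batch + Cache: ${stacked_input + stacked_output:.2f}")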
Strategy 5: Control Output Tokens
Output tokens cost 5x more than input. Reduce them:
Set Appropriate max_tokens
# Bad — an oversized max_tokens cap invites long answers you pay for
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096, # You only need 100 tokens for classification
messages=[{"role": "user", "content": "Is this spam? Yes or no."}],
)
# Good — set max_tokens to what you actually need
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=50,
messages=[{"role": "user", "content": "Is this spam? Yes or no."}],
)
Ask for Concise Responses
system = """Be concise. Answer in 1-2 sentences maximum.
Do not explain your reasoning unless asked.
For yes/no questions, respond with just Yes or No."""
Use Structured Output
JSON responses are typically shorter than prose:
system = """Return JSON only. No explanation, no markdown.
Format: {"category": "...", "confidence": 0.95}"""
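Putting these pieces together, here is a minimal sketch of a classification call with a tight max_tokens cap and JSON-only output (the category labels are made up for illustration):
import json

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=50,
    system='Return JSON only. No explanation, no markdown. Format: {"category": "spam" | "not_spam", "confidence": 0.95}',
    messages=[{"role": "user", "content": "Classify: 'Your order has shipped'"}],
)
result = json.loads(response.content[0].text)
print(result["category"], result["confidence"])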
Strategy 6: Manage Context Length
Longer conversations use more input tokens on every request. Manage the context:
def trim_conversation(messages: list, max_tokens: int = 50000) -> list:
"""Keep conversation under token budget."""
    total = sum(len(str(m["content"])) // 4 for m in messages)  # rough estimate: ~4 characters per token
while total > max_tokens and len(messages) > 2:
# Remove oldest messages (keep first for context)
messages.pop(1)
total = sum(len(str(m["content"])) // 4 for m in messages)
return messages
import json

def summarize_conversation(messages: list) -> str:
"""Summarize old messages to save tokens."""
old_messages = messages[:-4] # Keep last 4 messages
summary = client.messages.create(
model="claude-haiku-4-5", # Use Haiku for summarization
max_tokens=500,
messages=[{
"role": "user",
"content": f"Summarize this conversation in 2-3 sentences:\n{json.dumps(old_messages)}",
}],
)
return summary.content[0].text
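One way to wire those two helpers together — a sketch, assuming you keep the last four messages verbatim and fold everything older into a one-message summary:
def compact_history(messages: list) -> list:
    """Summarize old turns, keep recent ones, then enforce the token budget."""
    if len(messages) > 8:
        summary = summarize_conversation(messages)
        messages = [{"role": "user", "content": f"Summary of earlier conversation: {summary}"}] + messages[-4:]
    return trim_conversation(messages)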
Strategy 7: Extended Thinking Budget
Extended thinking tokens cost the same as output tokens. Set an appropriate budget:
# Expensive — large thinking budget for a simple task
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=64000,
thinking={"type": "enabled", "budget_tokens": 50000},
messages=[{"role": "user", "content": "What is 2 + 2?"}],
)
# Better — only use thinking for complex tasks
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=8192,
thinking={"type": "enabled", "budget_tokens": 5000},
messages=[{"role": "user", "content": "Implement a balanced BST with delete operation"}],
)
Only enable extended thinking when the task requires multi-step reasoning.
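A small sketch of that rule of thumb — only attach a thinking block when the task is tagged as complex (the complex_task flag is a stand-in for whatever heuristic you use):
def build_request(prompt: str, complex_task: bool) -> dict:
    """Pay for thinking tokens only when multi-step reasoning is needed."""
    params = {
        "model": "claude-sonnet-4-6",
        "max_tokens": 8192,
        "messages": [{"role": "user", "content": prompt}],
    }
    if complex_task:
        params["thinking"] = {"type": "enabled", "budget_tokens": 5000}
    return params

response = client.messages.create(**build_request("Implement a balanced BST with delete", complex_task=True))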
Monitoring Costs
Track your spending with usage data from each response:
def track_cost(response, model: str) -> dict:
"""Calculate cost from response usage."""
prices = {
"claude-opus-4-6": {"input": 5.0, "output": 25.0},
"claude-sonnet-4-6": {"input": 3.0, "output": 15.0},
"claude-haiku-4-5": {"input": 1.0, "output": 5.0},
}
price = prices.get(model, prices["claude-sonnet-4-6"])
# Cache-aware cost calculation
cache_read = getattr(response.usage, "cache_read_input_tokens", 0)
cache_write = getattr(response.usage, "cache_creation_input_tokens", 0)
    regular_input = response.usage.input_tokens  # input_tokens already excludes cached tokens
input_cost = regular_input * price["input"] / 1_000_000
cache_read_cost = cache_read * price["input"] * 0.1 / 1_000_000 # 90% off
cache_write_cost = cache_write * price["input"] * 1.25 / 1_000_000 # 25% extra
output_cost = response.usage.output_tokens * price["output"] / 1_000_000
total_cost = input_cost + cache_read_cost + cache_write_cost + output_cost
savings = cache_read * price["input"] * 0.9 / 1_000_000 # What you saved
return {
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
"cache_read_tokens": cache_read,
"input_cost": input_cost + cache_read_cost + cache_write_cost,
"output_cost": output_cost,
"total_cost": total_cost,
"cache_savings": savings,
}
# Track every request
cost = track_cost(response, "claude-sonnet-4-6")
print(f"Cost: ${cost['total_cost']:.4f} (saved ${cost['cache_savings']:.4f} from cache)")
Real-World Cost Example
A typical SaaS application processing 10,000 requests per day:
| Request Type | Model | Requests | Cost/Request | Daily Cost |
|---|---|---|---|---|
| Email classification | Haiku 4.5 | 5,000 | $0.001 | $5.00 |
| Content moderation | Haiku 4.5 | 3,000 | $0.001 | $3.00 |
| Code review | Sonnet 4.6 (cached) | 500 | $0.005 | $2.50 |
| Complex analysis | Opus 4.6 | 100 | $0.10 | $10.00 |
| Batch processing | Sonnet 4.6 (batch) | 1,400 | $0.005 | $7.00 |
| Total | — | 10,000 | — | $27.50 |
That is about $825/month for 10,000 requests per day — less than the cost of one developer’s time.
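The totals are just the sum of the rows; a one-line check of the monthly figure:
daily_costs = [5.00, 3.00, 2.50, 10.00, 7.00]  # the five rows above
print(f"${sum(daily_costs):.2f}/day ≈ ${sum(daily_costs) * 30:.0f}/month")  # $27.50/day ≈ $825/month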
Cost Optimization Checklist
- Use Haiku for simple tasks (classification, extraction, triage)
- Use Sonnet for standard tasks (code, content, analysis)
- Use Opus only for complex reasoning
- Enable prompt caching for repeated system prompts
- Use Batch API for non-urgent workloads
- Stack Batch + caching for maximum savings
- Set appropriate max_tokens limits
- Ask for concise, structured output
- Trim conversation history to control context length
- Set appropriate extended thinking budgets
- Monitor costs on every request
- Review and optimize monthly
Summary
| Strategy | Savings | When to Use |
|---|---|---|
| Model selection | Up to 80% | Always — use the cheapest model that works |
| Prompt caching | 90% on cached tokens | Repeated system prompts or context |
| Batch API | 50% | Async workloads (24h turnaround OK) |
| Batch + Cache | Up to 95% | Large batch jobs with shared context |
| Output control | 20-50% | Set max_tokens, ask for concise replies |
| Context management | 30-60% | Long conversations |
What’s Next?
In the next article, we will create a Claude Cheat Sheet — a quick reference for all APIs, models, and tips on one page.
Next: Claude Cheat Sheet — All APIs, Models, Tips