Claude API costs add up fast if you are not careful. A simple change like enabling prompt caching cuts the price of repeated input tokens by 90%, and the Batch API takes 50% off any asynchronous workload. Stack them together and you can save up to 95%.
This is Article 25 in the Claude AI — From Zero to Power User series. You should be familiar with Understanding Models and Extended Thinking before reading this one.
The Cost Equation
Your Claude API cost is determined by three factors:
Cost = (Input Tokens x Input Price) + (Output Tokens x Output Price)
Each model has different prices per million tokens:
| Model | Input (per 1M) | Output (per 1M) |
|---|---|---|
| Opus 4.6 | $5.00 | $25.00 |
| Sonnet 4.6 | $3.00 | $15.00 |
| Haiku 4.5 | $1.00 | $5.00 |
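Plugging real numbers into that equation is straightforward. Here is a minimal sketch, with prices hardcoded from the table above (the helper name is just for illustration):
PRICES = {
    "claude-opus-4-6": (5.00, 25.00),   # (input, output) per 1M tokens
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in dollars from token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# 2,000 input + 500 output tokens on Sonnet 4.6 ≈ $0.0135
print(f"${estimate_cost('claude-sonnet-4-6', 2000, 500):.4f}")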
Output tokens cost 5x more than input tokens, so trimming unnecessary output is one of the highest-leverage optimizations.
Strategy 1: Model Selection
Use the cheapest model that meets your quality needs.
The Model Ladder
Haiku 4.5 → Sonnet 4.6 → Opus 4.6

| Model | Position | Input / Output (per 1M) |
|---|---|---|
| Haiku 4.5 | Cheap | $1 / $5 |
| Sonnet 4.6 | Sweet spot | $3 / $15 |
| Opus 4.6 | Best quality | $5 / $25 |
When to Use Each Model
| Use Case | Model | Why |
|---|---|---|
| Classification | Haiku 4.5 | Simple yes/no decisions |
| Data extraction | Haiku 4.5 | Structured output from templates |
| Code review triage | Haiku 4.5 | Quick scan, flag files |
| General coding | Sonnet 4.6 | Best price/quality |
| Content writing | Sonnet 4.6 | Good enough for most content |
| Complex algorithms | Opus 4.6 | Needs deep reasoning |
| Legal/medical analysis | Opus 4.6 | Accuracy is critical |
Python: Dynamic Model Selection
import anthropic
client = anthropic.Anthropic()
def select_model(task_type: str, input_length: int) -> str:
"""Select the best model based on task and input size."""
if task_type in ("classify", "extract", "triage"):
return "claude-haiku-4-5"
elif task_type in ("complex", "legal", "algorithm"):
return "claude-opus-4-6"
elif input_length > 100000:
return "claude-sonnet-4-6" # Large context, needs good model
else:
return "claude-sonnet-4-6" # Default
# Example: classify emails with Haiku (cheap)
model = select_model("classify", 500)
response = client.messages.create(
model=model,
max_tokens=100,
messages=[{"role": "user", "content": "Classify: 'Your order has shipped' → spam or not_spam"}],
)
# Cost: ~$0.0005 per classification
Savings Example
Processing 10,000 classifications per day:
| Model | Cost per day | Cost per month |
|---|---|---|
| Opus 4.6 | $5.00 | $150.00 |
| Sonnet 4.6 | $3.00 | $90.00 |
| Haiku 4.5 | $1.00 | $30.00 |
Switching from Opus to Haiku saves $120/month with no quality loss for simple classifications.
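You can sanity-check those numbers yourself. A quick sketch, assuming roughly 100 input tokens per classification and negligible output:
requests_per_day = 10_000
tokens_per_request = 100  # assumption: a short email plus a short prompt

for model, input_price in [("Opus 4.6", 5.00), ("Sonnet 4.6", 3.00), ("Haiku 4.5", 1.00)]:
    daily = requests_per_day * tokens_per_request * input_price / 1_000_000
    print(f"{model}: ${daily:.2f}/day, ${daily * 30:.2f}/month")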
Strategy 2: Prompt Caching
Prompt caching lets you cache the system prompt and other static content. Cached tokens cost 90% less on read.
How It Works
- First request: Write to cache (25% extra cost for writing)
- Subsequent requests: Read from cache (90% cheaper)
- Cache TTL: 5 minutes (resets on each use)
Python
# Large system prompt that stays the same across requests
system_prompt = """You are a code reviewer for a Python FastAPI project.
[... 2000 words of review instructions ...]"""
# First request — writes to cache
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
system=[
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"},
}
],
messages=[{"role": "user", "content": "Review: def foo(): pass"}],
)
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
# Subsequent requests — reads from cache (90% cheaper input)
for diff in diffs:  # diffs: the code changes you want reviewed
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
system=[
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"},
}
],
messages=[{"role": "user", "content": f"Review: {diff}"}],
)
TypeScript
const systemPrompt = `You are a code reviewer...`; // 2000 words
// Cached — 90% cheaper on repeat
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 4096,
system: [
{
type: "text",
text: systemPrompt,
cache_control: { type: "ephemeral" },
},
],
messages: [{ role: "user", content: `Review: ${diff}` }],
});
When Caching Saves Money
Caching is profitable when you make multiple requests with the same cached content within 5 minutes.
| Cached Tokens | Requests | Without Cache | With Cache | Savings |
|---|---|---|---|---|
| 5,000 | 10 | $0.15 | $0.03 | 80% |
| 10,000 | 50 | $1.50 | $0.18 | 88% |
| 50,000 | 100 | $15.00 | $1.65 | 89% |
The breakeven point comes at the second request: the 25% surcharge for writing the cache is more than recovered by the first 90%-discounted read.
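The arithmetic behind the table is simple enough to check: cache writes cost 1.25x the base input price and cache reads cost 0.1x. A sketch, assuming the Sonnet input price:
def input_cost(cached_tokens: int, requests: int, input_price: float = 3.00) -> tuple[float, float]:
    """Input cost in dollars, without and with prompt caching."""
    without = cached_tokens * requests * input_price / 1_000_000
    # First request writes the cache (1.25x), the rest read it (0.1x)
    with_cache = cached_tokens * (1.25 + 0.10 * (requests - 1)) * input_price / 1_000_000
    return without, with_cache

for n in (2, 10, 50):
    without, with_cache = input_cost(10_000, n)
    print(f"{n} requests: ${without:.2f} without cache, ${with_cache:.2f} with cache")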
Strategy 3: Batch API
The Batch API processes requests asynchronously at 50% off. Results come back within 24 hours (usually faster).
Python
import anthropic
import time
client = anthropic.Anthropic()
# Create a batch of requests
batch = client.messages.batches.create(
requests=[
{
"custom_id": f"review-{i}",
"params": {
"model": "claude-sonnet-4-6",
"max_tokens": 2048,
"messages": [
{"role": "user", "content": f"Review this code:\n{code}"}
],
},
}
for i, code in enumerate(code_files)
],
)
print(f"Batch created: {batch.id}")
print(f"Status: {batch.processing_status}")
# Poll for completion
while True:
    batch = client.messages.batches.retrieve(batch.id)
if batch.processing_status == "ended":
break
print(f"Status: {batch.processing_status}...")
time.sleep(30)
# Get results
results = list(client.messages.batches.results(batch.id))
for result in results:
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text[:100]}")
TypeScript
const batch = await client.messages.batches.create({
requests: codeFiles.map((code, i) => ({
custom_id: `review-${i}`,
params: {
model: "claude-sonnet-4-6",
max_tokens: 2048,
messages: [
{ role: "user", content: `Review this code:\n${code}` },
],
},
})),
});
console.log(`Batch created: ${batch.id}`);
When to Use Batch API
- Code reviews that are not time-sensitive
- Content generation (articles, summaries)
- Data processing (classification, extraction)
- Nightly analysis jobs
- Any workload where 24-hour turnaround is acceptable
Strategy 4: Stack Discounts (Batch + Caching)
Combine Batch API (50% off) with prompt caching (90% off cached tokens) for maximum savings:
batch = client.messages.batches.create(
requests=[
{
"custom_id": f"item-{i}",
"params": {
"model": "claude-sonnet-4-6",
"max_tokens": 2048,
"system": [
{
"type": "text",
"text": large_system_prompt, # Cached across batch
"cache_control": {"type": "ephemeral"},
}
],
"messages": [
{"role": "user", "content": f"Process: {item}"}
],
},
}
for i, item in enumerate(items)
],
)
Savings Calculation
Processing 1,000 items with 10K cached tokens and 500 output tokens each:
| Strategy | Input Cost | Output Cost | Total | Savings |
|---|---|---|---|---|
| None (Sonnet) | $30.00 | $7.50 | $37.50 | — |
| Caching only | $3.75 | $7.50 | $11.25 | 70% |
| Batch only | $15.00 | $3.75 | $18.75 | 50% |
| Batch + Cache | $1.88 | $3.75 | $5.63 | 85% |
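The same kind of arithmetic produces the table above. A rough sketch (exact figures vary slightly depending on how the one-time cache write is amortized, which is why the table's numbers are a touch higher):
items, cached_tokens, output_tokens = 1_000, 10_000, 500
input_price, output_price = 3.00, 15.00  # Sonnet, per 1M tokens

baseline = (items * cached_tokens * input_price + items * output_tokens * output_price) / 1_000_000
# Batch halves everything; cache reads cost ~10% of the input price after the first write
stacked_input = items * cached_tokens * input_price * 0.10 * 0.5 / 1_000_000
stacked_output = items * output_tokens * output_price * 0.5 / 1_000_000
print(f"Baseline: ${baseline:.2f}  Batch + Cache: ${stacked_input + stacked_output:.2f}")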
Strategy 5: Control Output Tokens
Output tokens cost 5x more than input. Reduce them:
Set Appropriate max_tokens
# Bad — an oversized max_tokens cap invites long answers you pay for
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096, # You only need 100 tokens for classification
messages=[{"role": "user", "content": "Is this spam? Yes or no."}],
)
# Good — set max_tokens to what you actually need
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=50,
messages=[{"role": "user", "content": "Is this spam? Yes or no."}],
)
Ask for Concise Responses
system = """Be concise. Answer in 1-2 sentences maximum.
Do not explain your reasoning unless asked.
For yes/no questions, respond with just Yes or No."""
Use Structured Output
JSON responses are typically shorter than prose:
system = """Return JSON only. No explanation, no markdown.
Format: {"category": "...", "confidence": 0.95}"""
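Putting these pieces together, here is a minimal sketch of a classification call with a tight max_tokens cap and JSON-only output (the category labels are made up for illustration):
import json

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=50,
    system='Return JSON only. No explanation, no markdown. Format: {"category": "spam" | "not_spam", "confidence": 0.95}',
    messages=[{"role": "user", "content": "Classify: 'Your order has shipped'"}],
)
result = json.loads(response.content[0].text)
print(result["category"], result["confidence"])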
Strategy 6: Manage Context Length
Longer conversations use more input tokens on every request. Manage the context:
def trim_conversation(messages: list, max_tokens: int = 50000) -> list:
"""Keep conversation under token budget."""
    total = sum(len(str(m["content"])) // 4 for m in messages)  # rough estimate: ~4 characters per token
while total > max_tokens and len(messages) > 2:
# Remove oldest messages (keep first for context)
messages.pop(1)
total = sum(len(str(m["content"])) // 4 for m in messages)
return messages
import json

def summarize_conversation(messages: list) -> str:
"""Summarize old messages to save tokens."""
old_messages = messages[:-4] # Keep last 4 messages
summary = client.messages.create(
model="claude-haiku-4-5", # Use Haiku for summarization
max_tokens=500,
messages=[{
"role": "user",
"content": f"Summarize this conversation in 2-3 sentences:\n{json.dumps(old_messages)}",
}],
)
return summary.content[0].text
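One way to wire those two helpers together — a sketch, assuming you keep the last four messages verbatim and fold everything older into a one-message summary:
def compact_history(messages: list) -> list:
    """Summarize old turns, keep recent ones, then enforce the token budget."""
    if len(messages) > 8:
        summary = summarize_conversation(messages)
        messages = [{"role": "user", "content": f"Summary of earlier conversation: {summary}"}] + messages[-4:]
    return trim_conversation(messages)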
Strategy 7: Extended Thinking Budget
Extended thinking tokens cost the same as output tokens. Set an appropriate budget:
# Expensive — large thinking budget for a simple task
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=64000,
thinking={"type": "enabled", "budget_tokens": 50000},
messages=[{"role": "user", "content": "What is 2 + 2?"}],
)
# Better — only use thinking for complex tasks
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=8192,
thinking={"type": "enabled", "budget_tokens": 5000},
messages=[{"role": "user", "content": "Implement a balanced BST with delete operation"}],
)
Only enable extended thinking when the task requires multi-step reasoning.
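A small sketch of that rule of thumb — only attach a thinking block when the task is tagged as complex (the complex_task flag is a stand-in for whatever heuristic you use):
def build_request(prompt: str, complex_task: bool) -> dict:
    """Pay for thinking tokens only when multi-step reasoning is needed."""
    params = {
        "model": "claude-sonnet-4-6",
        "max_tokens": 8192,
        "messages": [{"role": "user", "content": prompt}],
    }
    if complex_task:
        params["thinking"] = {"type": "enabled", "budget_tokens": 5000}
    return params

response = client.messages.create(**build_request("Implement a balanced BST with delete", complex_task=True))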
Monitoring Costs
Track your spending with usage data from each response:
def track_cost(response, model: str) -> dict:
"""Calculate cost from response usage."""
prices = {
"claude-opus-4-6": {"input": 5.0, "output": 25.0},
"claude-sonnet-4-6": {"input": 3.0, "output": 15.0},
"claude-haiku-4-5": {"input": 1.0, "output": 5.0},
}
price = prices.get(model, prices["claude-sonnet-4-6"])
# Cache-aware cost calculation
cache_read = getattr(response.usage, "cache_read_input_tokens", 0)
cache_write = getattr(response.usage, "cache_creation_input_tokens", 0)
    regular_input = response.usage.input_tokens  # input_tokens already excludes cached tokens
input_cost = regular_input * price["input"] / 1_000_000
cache_read_cost = cache_read * price["input"] * 0.1 / 1_000_000 # 90% off
cache_write_cost = cache_write * price["input"] * 1.25 / 1_000_000 # 25% extra
output_cost = response.usage.output_tokens * price["output"] / 1_000_000
total_cost = input_cost + cache_read_cost + cache_write_cost + output_cost
savings = cache_read * price["input"] * 0.9 / 1_000_000 # What you saved
return {
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
"cache_read_tokens": cache_read,
"input_cost": input_cost + cache_read_cost + cache_write_cost,
"output_cost": output_cost,
"total_cost": total_cost,
"cache_savings": savings,
}
# Track every request
cost = track_cost(response, "claude-sonnet-4-6")
print(f"Cost: ${cost['total_cost']:.4f} (saved ${cost['cache_savings']:.4f} from cache)")
Real-World Cost Example
A typical SaaS application processing 10,000 requests per day:
| Request Type | Model | Requests | Cost/Request | Daily Cost |
|---|---|---|---|---|
| Email classification | Haiku 4.5 | 5,000 | $0.001 | $5.00 |
| Content moderation | Haiku 4.5 | 3,000 | $0.001 | $3.00 |
| Code review | Sonnet 4.6 (cached) | 500 | $0.005 | $2.50 |
| Complex analysis | Opus 4.6 | 100 | $0.10 | $10.00 |
| Batch processing | Sonnet 4.6 (batch) | 1,400 | $0.005 | $7.00 |
| Total | — | 10,000 | — | $27.50 |
That is about $825/month for 10,000 requests per day — less than the cost of one developer’s time.
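The totals are just the sum of the rows; a one-line check of the monthly figure:
daily_costs = [5.00, 3.00, 2.50, 10.00, 7.00]  # the five rows above
print(f"${sum(daily_costs):.2f}/day ≈ ${sum(daily_costs) * 30:.0f}/month")  # $27.50/day ≈ $825/month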
Cost Optimization Checklist
- Use Haiku for simple tasks (classification, extraction, triage)
- Use Sonnet for standard tasks (code, content, analysis)
- Use Opus only for complex reasoning
- Enable prompt caching for repeated system prompts
- Use Batch API for non-urgent workloads
- Stack Batch + caching for maximum savings
- Set appropriate max_tokens limits
- Ask for concise, structured output
- Trim conversation history to control context length
- Set appropriate extended thinking budgets
- Monitor costs on every request
- Review and optimize monthly
Summary
| Strategy | Savings | When to Use |
|---|---|---|
| Model selection | Up to 80% | Always — use the cheapest model that works |
| Prompt caching | 90% on cached tokens | Repeated system prompts or context |
| Batch API | 50% | Async workloads (24h turnaround OK) |
| Batch + Cache | Up to 95% | Large batch jobs with shared context |
| Output control | 20-50% | Set max_tokens, ask for concise replies |
| Context management | 30-60% | Long conversations |
What’s Next?
In the next article, we will create a Claude Cheat Sheet — a quick reference for all APIs, models, and tips on one page.
Next: Claude Cheat Sheet — All APIs, Models, Tips