Claude API costs add up fast if you are not careful. A simple change like enabling prompt caching cuts the cost of repeated input by 90%. The Batch API takes 50% off. Stack them together and you save up to 95% on cached tokens.

This is Article 25 in the Claude AI — From Zero to Power User series. You should be familiar with Understanding Models and Extended Thinking before reading this one.


The Cost Equation

Your Claude API cost comes down to three things: which model you use (the prices), how many tokens you send, and how many tokens you get back:

Cost = (Input Tokens x Input Price) + (Output Tokens x Output Price)

Each model has different prices per million tokens:

Model        Input (per 1M)   Output (per 1M)
Opus 4.6     $5.00            $25.00
Sonnet 4.6   $3.00            $15.00
Haiku 4.5    $1.00            $5.00

Output tokens cost 5x more than input tokens. This means the biggest savings come from reducing output.
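
Plugging the pricing table into the formula, a quick sketch of the arithmetic (the token counts are made up for illustration):

def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Apply the cost equation. Prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# 2,000 input and 500 output tokens on Sonnet 4.6:
# (2,000 x $3.00 + 500 x $15.00) / 1M = $0.0135
print(estimate_cost(2_000, 500, 3.00, 15.00))  # 0.0135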


Strategy 1: Model Selection

Use the cheapest model that meets your quality needs.

The Model Ladder

Haiku 4.5     →   Sonnet 4.6    →   Opus 4.6
Cheap             Sweet spot        Best quality
$1/$5             $3/$15            $5/$25

When to Use Each Model

Use Case                 Model        Why
Classification           Haiku 4.5    Simple yes/no decisions
Data extraction          Haiku 4.5    Structured output from templates
Code review triage       Haiku 4.5    Quick scan, flag files
General coding           Sonnet 4.6   Best price/quality
Content writing          Sonnet 4.6   Good enough for most content
Complex algorithms       Opus 4.6     Needs deep reasoning
Legal/medical analysis   Opus 4.6     Accuracy is critical

Python: Dynamic Model Selection

import anthropic

client = anthropic.Anthropic()

def select_model(task_type: str, input_length: int) -> str:
    """Select the cheapest model that fits the task and input size."""
    if task_type in ("complex", "legal", "algorithm"):
        return "claude-opus-4-6"  # Deep reasoning, regardless of size
    if task_type in ("classify", "extract", "triage") and input_length <= 100_000:
        return "claude-haiku-4-5"  # Cheap model for simple, modest-sized tasks
    return "claude-sonnet-4-6"  # Default; also handles large contexts well

# Example: classify emails with Haiku (cheap)
model = select_model("classify", 500)
response = client.messages.create(
    model=model,
    max_tokens=100,
    messages=[{"role": "user", "content": "Classify: 'Your order has shipped' → spam or not_spam"}],
)
# Cost: roughly $0.0001 per classification (~100 input tokens at Haiku prices)

Savings Example

Processing 10,000 classifications per day (roughly 1M input tokens):

Model        Cost per day   Cost per month
Opus 4.6     $5.00          $150.00
Sonnet 4.6   $3.00          $90.00
Haiku 4.5    $1.00          $30.00

Switching from Opus to Haiku saves $120/month with no quality loss for simple classifications.


Strategy 2: Prompt Caching

Prompt caching lets you cache the system prompt and other static content. Cached tokens cost 90% less on read.

How It Works

  1. First request: writes the prompt to the cache (25% surcharge on the tokens written)
  2. Subsequent requests: read from the cache (90% cheaper)
  3. Cache TTL: 5 minutes, refreshed on every cache hit

Python

# Large system prompt that stays the same across requests
system_prompt = """You are a code reviewer for a Python FastAPI project.
[... 2000 words of review instructions ...]"""

# First request — writes to cache
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Review: def foo(): pass"}],
)

print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")

# Subsequent requests — reads from cache (90% cheaper input)
for diff in diffs:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": f"Review: {diff}"}],
    )

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const systemPrompt = `You are a code reviewer...`; // 2000 words

// Cached — 90% cheaper on repeat
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  system: [
    {
      type: "text",
      text: systemPrompt,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: `Review: ${diff}` }],
});

When Caching Saves Money

Caching is profitable when you make multiple requests with the same cached content within 5 minutes.

Cached Tokens   Requests   Without Cache   With Cache   Savings
5,000           10         $0.15           $0.03        80%
10,000          50         $1.50           $0.18        88%
50,000          100        $15.00          $1.65        89%

Break-even comes fast: two cached requests cost 1.25x (write) + 0.1x (read) = 1.35x the base input price, versus 2.0x without caching, so caching pays for itself from the second request.
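
A quick sketch of that arithmetic, using the multipliers above (1.25x cache write, 0.1x cache read):

def cached_input_cost(tokens: int, requests: int, price_per_m: float) -> float:
    """Input cost with caching: one 1.25x write, then 0.1x reads."""
    write = tokens * price_per_m * 1.25 / 1_000_000
    reads = tokens * price_per_m * 0.10 * (requests - 1) / 1_000_000
    return write + reads

# 10,000 cached tokens on Sonnet 4.6 ($3/M input), 2 requests:
# without cache: 2 x $0.03 = $0.0600
print(cached_input_cost(10_000, 2, 3.00))  # 0.0405, already cheaper than uncached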


Strategy 3: Batch API

The Batch API processes requests asynchronously at 50% off. Results come back within 24 hours (usually faster).

Python

import anthropic
import time

client = anthropic.Anthropic()

# Create a batch of requests
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"review-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 2048,
                "messages": [
                    {"role": "user", "content": f"Review this code:\n{code}"}
                ],
            },
        }
        for i, code in enumerate(code_files)
    ],
)

print(f"Batch created: {batch.id}")
print(f"Status: {batch.processing_status}")

# Poll for completion
while True:
    batch = client.messages.batches.retrieve(batch.id)
    if batch.processing_status == "ended":
        break
    print(f"Status: {batch.processing_status}...")
    time.sleep(30)

# Get results (available once processing has ended)
results = client.messages.batches.results(batch.id)
for result in results:
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text[:100]}")

TypeScript

const batch = await client.messages.batches.create({
  requests: codeFiles.map((code, i) => ({
    custom_id: `review-${i}`,
    params: {
      model: "claude-sonnet-4-6",
      max_tokens: 2048,
      messages: [
        { role: "user", content: `Review this code:\n${code}` },
      ],
    },
  })),
});

console.log(`Batch created: ${batch.id}`);

When to Use Batch API

  • Code reviews that are not time-sensitive
  • Content generation (articles, summaries)
  • Data processing (classification, extraction)
  • Nightly analysis jobs
  • Any workload where 24-hour turnaround is acceptable

Strategy 4: Stack Discounts (Batch + Caching)

Combine the Batch API (50% off) with prompt caching (90% off cached reads) for maximum savings. Batch requests may process concurrently, so cache hits within a batch are best-effort rather than guaranteed:

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"item-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 2048,
                "system": [
                    {
                        "type": "text",
                        "text": large_system_prompt,  # Cached across batch
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
                "messages": [
                    {"role": "user", "content": f"Process: {item}"}
                ],
            },
        }
        for i, item in enumerate(items)
    ],
)

Savings Calculation

Processing 1,000 items, each with 10K cached input tokens and 500 output tokens, on Sonnet 4.6 (ignoring the negligible one-time cache write):

Strategy        Input Cost   Output Cost   Total    Savings
None (Sonnet)   $30.00       $7.50         $37.50
Caching only    $3.00        $7.50         $10.50   72%
Batch only      $15.00       $3.75         $18.75   50%
Batch + Cache   $1.50        $3.75         $5.25    86%
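
The table is easy to verify in a few lines. This sketch assumes the batch discount also applies to cached reads (0.10 is the cache-read multiplier, 0.50 the batch discount):

# Rough check of the table above
ITEMS, CACHED_TOKENS, OUTPUT_TOKENS = 1_000, 10_000, 500
IN_PRICE, OUT_PRICE = 3.00, 15.00  # Sonnet 4.6, dollars per million tokens

base = (ITEMS * CACHED_TOKENS * IN_PRICE + ITEMS * OUTPUT_TOKENS * OUT_PRICE) / 1_000_000
stacked = (ITEMS * CACHED_TOKENS * IN_PRICE * 0.10 * 0.50          # cached reads at the batch rate
           + ITEMS * OUTPUT_TOKENS * OUT_PRICE * 0.50) / 1_000_000  # batched output
print(base, stacked)  # 37.5 and 5.25, i.e. about 86% off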

Strategy 5: Control Output Tokens

Output tokens cost 5x more than input. Reduce them:

Set Appropriate max_tokens

# Bad — an oversized cap invites answers far longer than you need
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,  # A classification needs only a handful of tokens
    messages=[{"role": "user", "content": "Is this spam? Yes or no."}],
)

# Good — cap output at what the task actually needs
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=50,
    messages=[{"role": "user", "content": "Is this spam? Yes or no."}],
)

Note that max_tokens is a hard cap, not a target: you pay only for tokens actually generated, but a tight cap stops a verbose reply from running long.

Ask for Concise Responses

system = """Be concise. Answer in 1-2 sentences maximum.
Do not explain your reasoning unless asked.
For yes/no questions, respond with just Yes or No."""

Use Structured Output

JSON responses are typically shorter than prose:

system = """Return JSON only. No explanation, no markdown.
Format: {"category": "...", "confidence": 0.95}"""
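
For example, a minimal end-to-end classification call that parses the JSON straight out of the response. The ticket text and category schema are made up, and `client` is the one created earlier:

import json

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=100,
    system='Return JSON only. No explanation, no markdown.\nFormat: {"category": "...", "confidence": 0.95}',
    messages=[{"role": "user", "content": "Classify this ticket: 'Refund not received after 10 days'"}],
)
result = json.loads(response.content[0].text)
print(result["category"], result["confidence"])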

Strategy 6: Manage Context Length

Longer conversations use more input tokens on every request. Manage the context:

import json

def trim_conversation(messages: list, max_tokens: int = 50000) -> list:
    """Keep the conversation under a rough token budget (~4 chars per token)."""
    def estimate(msgs: list) -> int:
        return sum(len(str(m["content"])) // 4 for m in msgs)

    while estimate(messages) > max_tokens and len(messages) > 3:
        # Drop the oldest user/assistant pair after the first message,
        # so roles keep alternating (the API requires it)
        del messages[1:3]

    return messages

def summarize_conversation(messages: list) -> str:
    """Summarize older messages with Haiku to save tokens."""
    old_messages = messages[:-4]  # Everything except the 4 most recent messages
    summary = client.messages.create(
        model="claude-haiku-4-5",  # The cheap model is fine for summarization
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 2-3 sentences:\n{json.dumps(old_messages)}",
        }],
    )
    return summary.content[0].text
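
One way to put the summary to work, a sketch that swaps the older messages for a single summary turn (assumes the two helpers above):

def compact_conversation(messages: list) -> list:
    """Replace older messages with a one-turn summary plus the recent tail."""
    if len(messages) <= 4:
        return messages
    summary = summarize_conversation(messages)
    tail = messages[-4:]
    if tail and tail[0]["role"] == "user":
        tail = tail[1:]  # keep user/assistant roles alternating after the summary turn
    return [
        {"role": "user", "content": f"Summary of the conversation so far: {summary}"},
        *tail,
    ]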

Strategy 7: Extended Thinking Budget

Extended thinking tokens cost the same as output tokens. Set an appropriate budget:

# Expensive — large thinking budget for a simple task
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=64000,
    thinking={"type": "enabled", "budget_tokens": 50000},
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
)

# Better — only use thinking for complex tasks
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8192,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=[{"role": "user", "content": "Implement a balanced BST with delete operation"}],
)

Only enable extended thinking when the task requires multi-step reasoning.
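
A simple way to enforce that rule is to gate the thinking parameter on task type. A sketch with hypothetical task labels:

def thinking_config(task_type: str) -> dict | None:
    """Enable extended thinking only for multi-step reasoning tasks."""
    if task_type in ("algorithm", "proof", "architecture"):
        return {"type": "enabled", "budget_tokens": 5000}
    return None  # simple tasks: no thinking budget at all

config = thinking_config("algorithm")
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8192,
    messages=[{"role": "user", "content": "Implement a balanced BST with delete operation"}],
    **({"thinking": config} if config else {}),
)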


Monitoring Costs

Track your spending with usage data from each response:

def track_cost(response, model: str) -> dict:
    """Calculate cost from response usage (prices per million tokens)."""
    prices = {
        "claude-opus-4-6": {"input": 5.0, "output": 25.0},
        "claude-sonnet-4-6": {"input": 3.0, "output": 15.0},
        "claude-haiku-4-5": {"input": 1.0, "output": 5.0},
    }

    price = prices.get(model, prices["claude-sonnet-4-6"])

    # Cache-aware cost calculation. Note: usage.input_tokens already
    # excludes cache reads and writes — they are reported separately.
    cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(response.usage, "cache_creation_input_tokens", 0) or 0
    regular_input = response.usage.input_tokens

    input_cost = regular_input * price["input"] / 1_000_000
    cache_read_cost = cache_read * price["input"] * 0.1 / 1_000_000  # 90% off
    cache_write_cost = cache_write * price["input"] * 1.25 / 1_000_000  # 25% extra
    output_cost = response.usage.output_tokens * price["output"] / 1_000_000

    total_cost = input_cost + cache_read_cost + cache_write_cost + output_cost
    savings = cache_read * price["input"] * 0.9 / 1_000_000  # What the cache saved

    return {
        "input_tokens": regular_input + cache_read + cache_write,
        "output_tokens": response.usage.output_tokens,
        "cache_read_tokens": cache_read,
        "input_cost": input_cost + cache_read_cost + cache_write_cost,
        "output_cost": output_cost,
        "total_cost": total_cost,
        "cache_savings": savings,
    }

# Track every request
cost = track_cost(response, "claude-sonnet-4-6")
print(f"Cost: ${cost['total_cost']:.4f} (saved ${cost['cache_savings']:.4f} from cache)")

Real-World Cost Example

A typical SaaS application processing 10,000 requests per day:

Request Type           Model                 Requests   Cost/Request   Daily Cost
Email classification   Haiku 4.5             5,000      $0.001         $5.00
Content moderation     Haiku 4.5             3,000      $0.001         $3.00
Code review            Sonnet 4.6 (cached)   500        $0.005         $2.50
Complex analysis       Opus 4.6              100        $0.10          $10.00
Batch processing       Sonnet 4.6 (batch)    1,400      $0.005         $7.00
Total                                        10,000                    $27.50

That is about $825/month for 10,000 requests per day — less than the cost of one developer’s time.


Cost Optimization Checklist

  1. Use Haiku for simple tasks (classification, extraction, triage)
  2. Use Sonnet for standard tasks (code, content, analysis)
  3. Use Opus only for complex reasoning
  4. Enable prompt caching for repeated system prompts
  5. Use Batch API for non-urgent workloads
  6. Stack Batch + caching for maximum savings
  7. Set appropriate max_tokens limits
  8. Ask for concise, structured output
  9. Trim conversation history to control context length
  10. Set appropriate extended thinking budgets
  11. Monitor costs on every request
  12. Review and optimize monthly

Summary

Strategy             Savings               When to Use
Model selection      Up to 80%             Always — use the cheapest model that works
Prompt caching       90% on cached reads   Repeated system prompts or context
Batch API            50%                   Async workloads (24h turnaround OK)
Batch + Cache        95% on cached reads   Large batch jobs with shared context
Output control       20-50%                Set max_tokens, ask for concise replies
Context management   30-60%                Long conversations

What’s Next?

In the next article, we will create a Claude Cheat Sheet — a quick reference for all APIs, models, and tips on one page.

Next: Claude Cheat Sheet — All APIs, Models, Tips