A prompt that works on three examples might fail on the fourth. Prompt engineering without testing is guessing. In this article, you will learn how to test prompts systematically, measure quality, and iterate until you get reliable results.

This is Article 18 in the Claude AI — From Zero to Power User series. You should be comfortable with the material in Prompt Engineering Basics before reading this one.

By the end, you will have an eval harness that tests your prompts automatically and a process for improving them.


Why Test Prompts?

Prompts are code. They break, they regress, and they behave differently with different inputs. Testing prompts gives you three things:

  1. Consistency — The prompt works the same way every time
  2. Regression prevention — Changes to the prompt do not break existing behavior
  3. Measurable improvement — You can prove that version B is better than version A

Without testing, you are flying blind. You change a prompt, test it manually on one or two inputs, and hope for the best. With testing, you run 50 inputs and get a score.


Designing an Evaluation Dataset

An eval dataset is a list of inputs with expected outputs. Build one before you start optimizing.

Python

# eval_dataset.py
eval_cases = [
    {
        "input": "def divide(a, b): return a / b",
        "expected_issues": ["division by zero"],
        "expected_severity": "critical",
    },
    {
        "input": "password = 'admin123'",
        "expected_issues": ["hardcoded password"],
        "expected_severity": "critical",
    },
    {
        "input": "def add(a: int, b: int) -> int: return a + b",
        "expected_issues": [],
        "expected_severity": None,  # No issues expected
    },
    {
        "input": "query = f'SELECT * FROM users WHERE id = {user_id}'",
        "expected_issues": ["SQL injection"],
        "expected_severity": "critical",
    },
    {
        "input": "data = json.loads(response.text)\nname = data['name']",
        "expected_issues": ["KeyError"],
        "expected_severity": "warning",
    },
]

Tips for Good Eval Datasets

  • Include at least 20-30 cases. Fewer cases give unreliable scores.
  • Include edge cases. Empty inputs, very long inputs, unusual formats.
  • Include negative examples. Cases where the correct answer is “no issues found.”
  • Cover all categories. If your prompt handles bugs, security, and performance — include examples of each.
  • Update over time. When you find a failure in production, add it to the dataset.

Building a Simple Eval Harness

Python

import anthropic
import json

client = anthropic.Anthropic()

def run_eval(system_prompt: str, eval_cases: list[dict]) -> dict:
    """Run an evaluation and return scores."""
    results = {"total": len(eval_cases), "passed": 0, "failed": 0, "details": []}

    for i, case in enumerate(eval_cases):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            temperature=0,  # Minimize randomness for reproducible tests
            system=system_prompt,
            messages=[
                {"role": "user", "content": f"Review this code:\n```\n{case['input']}\n```"}
            ],
        )

        output = response.content[0].text.lower()

        # Check if expected issues were found
        passed = True
        for issue in case["expected_issues"]:
            if issue.lower() not in output:
                passed = False
                break

        # Check for false positives (no issues expected but issues reported)
        if not case["expected_issues"] and "no issues" not in output and "looks good" not in output:
            passed = False

        results["details"].append({
            "case": i,
            "passed": passed,
            "input": case["input"][:50],
            "output": output[:200],
        })

        if passed:
            results["passed"] += 1
        else:
            results["failed"] += 1

    results["score"] = results["passed"] / results["total"]
    return results

# Test prompt version A
prompt_a = """You are a code reviewer. Find bugs and security issues.
Rate each issue as critical, warning, or info.
If the code is fine, say "No issues found"."""

results = run_eval(prompt_a, eval_cases)
print(f"Score: {results['score']:.1%} ({results['passed']}/{results['total']})")

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface EvalCase {
  input: string;
  expectedIssues: string[];
  expectedSeverity: string | null;
}

interface EvalResults {
  total: number;
  passed: number;
  failed: number;
  score: number;
  details: Array<{
    case: number;
    passed: boolean;
    input: string;
    output: string;
  }>;
}

async function runEval(
  systemPrompt: string,
  evalCases: EvalCase[]
): Promise<EvalResults> {
  const results: EvalResults = {
    total: evalCases.length,
    passed: 0,
    failed: 0,
    score: 0,
    details: [],
  };

  for (let i = 0; i < evalCases.length; i++) {
    const evalCase = evalCases[i];

    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      temperature: 0,
      system: systemPrompt,
      messages: [
        {
          role: "user",
          content: `Review this code:\n\`\`\`\n${evalCase.input}\n\`\`\``,
        },
      ],
    });

    const output =
      response.content[0].type === "text"
        ? response.content[0].text.toLowerCase()
        : "";

    let passed = true;
    for (const issue of evalCase.expectedIssues) {
      if (!output.includes(issue.toLowerCase())) {
        passed = false;
        break;
      }
    }

    if (
      evalCase.expectedIssues.length === 0 &&
      !output.includes("no issues") &&
      !output.includes("looks good")
    ) {
      passed = false;
    }

    results.details.push({
      case: i,
      passed,
      input: evalCase.input.slice(0, 50),
      output: output.slice(0, 200),
    });

    if (passed) results.passed++;
    else results.failed++;
  }

  results.score = results.passed / results.total;
  return results;
}

A/B Testing Prompts

Test two prompt versions against the same dataset to see which one performs better:

Python

# Prompt version A — basic
prompt_a = """You are a code reviewer. Find bugs and security issues."""

# Prompt version B — structured with XML tags
prompt_b = """You are a senior code reviewer specializing in security.

<rules>
- Focus on: security vulnerabilities, unhandled exceptions, null/undefined risks
- Rate each issue as: CRITICAL, WARNING, or INFO
- If the code has no issues, respond with exactly: "No issues found"
- Do not invent problems — only report real issues
</rules>

<output_format>
- **[SEVERITY] Issue title**
  Problem: description
  Fix: corrected code
</output_format>"""

# Run both
results_a = run_eval(prompt_a, eval_cases)
results_b = run_eval(prompt_b, eval_cases)

print(f"Prompt A: {results_a['score']:.1%}")
print(f"Prompt B: {results_b['score']:.1%}")
print(f"Improvement: {(results_b['score'] - results_a['score']):.1%}")

Typical results: The structured prompt (B) scores 15-30% higher than the basic prompt (A). XML tags, explicit rules, and output format make a big difference.
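
On a 20-30 case dataset, a few points of score difference can be noise. A per-case paired comparison makes the difference concrete — this sketch assumes the `details` lists returned by `run_eval` above, with both runs using the same case order:

```python
def paired_comparison(results_a: dict, results_b: dict) -> dict:
    """Compare two eval runs case by case (assumes identical case order)."""
    counts = {"b_wins": 0, "a_wins": 0, "both_pass": 0, "both_fail": 0}
    for da, db in zip(results_a["details"], results_b["details"]):
        if da["passed"] and db["passed"]:
            counts["both_pass"] += 1
        elif not da["passed"] and not db["passed"]:
            counts["both_fail"] += 1
        elif db["passed"]:
            counts["b_wins"] += 1   # B fixed a case A failed
        else:
            counts["a_wins"] += 1   # B regressed a case A passed
    return counts
```

If `b_wins` exceeds `a_wins` by only a case or two, treat the improvement as unconfirmed and grow the dataset before switching versions.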


Model-Graded Evaluation

For complex outputs where exact string matching is not enough, use a second Claude call to grade the output:

Python

def grade_output(question: str, output: str, reference: str) -> dict:
    """Use Claude to grade another Claude's output."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        temperature=0,
        system="""You are an evaluation judge. Grade the output on:
1. Accuracy (0-10): Is the information correct?
2. Completeness (0-10): Does it cover all expected points?
3. Format (0-10): Does it follow the expected format?

Return JSON: {"accuracy": N, "completeness": N, "format": N, "overall": N, "notes": "..."}""",
        messages=[
            {
                "role": "user",
                "content": f"""<question>{question}</question>
<output>{output}</output>
<reference>{reference}</reference>

Grade the output against the reference answer.""",
            }
        ],
    )

    # The judge should return bare JSON; strip code fences if it adds them
    text = response.content[0].text.strip()
    if text.startswith("`"):
        text = text.strip("`").removeprefix("json").strip()
    return json.loads(text)

# Grade each eval case
for case in eval_cases:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        temperature=0,
        system=ACTIVE_PROMPT,  # the deployed prompt (see Prompt Versioning below)
        messages=[{"role": "user", "content": f"Review this code:\n```\n{case['input']}\n```"}],
    )
    output = response.content[0].text
    grade = grade_output(case["input"], output, str(case["expected_issues"]))
    print(f"Case: {case['input'][:30]}... Score: {grade['overall']}/10")

Model-graded evaluation costs more (two API calls per case), but it catches subtle issues that string matching misses.
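
To turn per-case grades into a single comparable number, average the judge's scores across the dataset — a small sketch assuming the dict shape that `grade_output` requests from the judge:

```python
def aggregate_grades(grades: list[dict]) -> dict:
    """Average judge scores (0-10) across all graded cases."""
    keys = ("accuracy", "completeness", "format", "overall")
    n = len(grades)
    return {k: round(sum(g[k] for g in grades) / n, 2) for k in keys}
```

The aggregated `overall` score plays the same role as the pass rate in the string-matching harness: a single number you can compare across prompt versions.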


Using promptfoo for Prompt Testing

promptfoo is an open-source CLI tool for testing prompts. It supports Claude, runs tests automatically, and generates comparison reports.

Installation

npm install -g promptfoo

Configuration

Create a promptfooconfig.yaml file:

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      temperature: 0

prompts:
  - id: prompt-v1
    raw: |
      You are a code reviewer. Find bugs and security issues.
      Review this code: {{input}}

  - id: prompt-v2
    raw: |
      You are a senior code reviewer specializing in security.
      <rules>
      - Focus on security vulnerabilities and unhandled exceptions
      - Rate severity as CRITICAL, WARNING, or INFO
      - Say "No issues found" if the code is clean
      </rules>
      Review this code: {{input}}

tests:
  - vars:
      input: "def divide(a, b): return a / b"
    assert:
      - type: contains
        value: "division by zero"
      - type: contains
        value: "critical"

  - vars:
      input: "password = 'admin123'"
    assert:
      - type: contains
        value: "hardcoded"

  - vars:
      input: "query = f'SELECT * FROM users WHERE id = {user_id}'"
    assert:
      - type: contains
        value: "SQL injection"

  - vars:
      input: "def add(a: int, b: int) -> int: return a + b"
    assert:
      - type: icontains
        value: "no issues"

Run Tests

# Run all tests
promptfoo eval

# View results in browser
promptfoo view

# Compare two prompt versions
promptfoo eval --output results.json

promptfoo generates a side-by-side comparison showing which prompt version wins on each test case. It also supports:

  • LLM-graded assertions — Use Claude to judge if the output is correct
  • Multiple providers — Compare Claude vs GPT vs Gemini on the same tests
  • CI integration — Run tests on every code change
  • Custom scorers — Write your own evaluation functions
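
A custom scorer can live directly in the config via promptfoo's `javascript` assertion type, where `value` is a JavaScript expression evaluated against the model `output`. The specific check below (severity keyword plus a length cap) is an illustrative assumption:

```yaml
tests:
  - vars:
      input: "def divide(a, b): return a / b"
    assert:
      - type: javascript
        value: "output.toLowerCase().includes('critical') && output.length < 2000"
```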

Temperature Experiments

Temperature affects output consistency: lower temperature produces more deterministic outputs, though even temperature 0 does not guarantee identical responses. Run the same prompt several times at each temperature and count the distinct outputs:

Python

def temperature_experiment(prompt: str, input_text: str, temps: list[float], runs: int = 5):
    """Test how temperature affects output consistency."""
    for temp in temps:
        outputs = []
        for _ in range(runs):
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                temperature=temp,
                system=prompt,
                messages=[{"role": "user", "content": input_text}],
            )
            outputs.append(response.content[0].text)

        # Measure consistency: how many unique outputs?
        unique = len(set(outputs))
        print(f"Temperature {temp}: {unique}/{runs} unique outputs")

temperature_experiment(
    "Classify this code as: bug, security, performance, or clean.",
    "query = f'SELECT * FROM users WHERE id = {user_id}'",
    temps=[0, 0.3, 0.5, 0.8, 1.0],
)

For classification and evaluation tasks, use temperature 0. For creative tasks, use 0.5-0.8.
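
Counting unique strings is a strict consistency measure: two outputs that differ by a single word count as distinct. A softer alternative is average pairwise similarity, sketched here with the standard library's `difflib`:

```python
from difflib import SequenceMatcher
from itertools import combinations

def avg_similarity(outputs: list[str]) -> float:
    """Mean pairwise similarity ratio (1.0 = all outputs identical)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # zero or one output: trivially consistent
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```

A run at temperature 0 should score close to 1.0; watch how quickly the number drops as temperature rises.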


Prompt Versioning

Track your prompts like code. Keep a version history with scores:

# prompt_versions.py
PROMPT_VERSIONS = {
    "v1": {
        "prompt": "You are a code reviewer. Find bugs.",
        "score": 0.60,
        "date": "2026-07-01",
        "notes": "Initial version, too vague",
    },
    "v2": {
        "prompt": """You are a code reviewer. Find bugs and security issues.
Rate as critical, warning, or info.""",
        "score": 0.72,
        "date": "2026-07-02",
        "notes": "Added severity ratings",
    },
    "v3": {
        "prompt": """You are a senior code reviewer specializing in security.
<rules>
- Focus on security vulnerabilities and unhandled exceptions
- Rate as CRITICAL, WARNING, or INFO
- Say "No issues found" if code is clean
</rules>""",
        "score": 0.88,
        "date": "2026-07-03",
        "notes": "Added XML tags and explicit rules",
    },
}

# Always deploy the best version
ACTIVE_PROMPT = PROMPT_VERSIONS["v3"]["prompt"]
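
Rather than hardcoding "v3", the active prompt can be selected by score — a one-line helper over the `PROMPT_VERSIONS` dict above:

```python
def best_version(versions: dict) -> str:
    """Return the version key with the highest recorded eval score."""
    return max(versions, key=lambda v: versions[v]["score"])
```

Then `ACTIVE_PROMPT = PROMPT_VERSIONS[best_version(PROMPT_VERSIONS)]["prompt"]` always deploys the current winner.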

Common Failure Modes

When your eval score is low, check for these common problems:

| Failure mode | Symptom | Fix |
| --- | --- | --- |
| Hallucination | Claude invents issues that do not exist | Add "Do not invent problems" to the prompt |
| Format drift | Output format changes between requests | Add an explicit <output_format> block to the system prompt |
| Instruction ignoring | Claude skips rules in long prompts | Move important rules to the beginning; use XML tags |
| False negatives | Claude misses real issues | Add few-shot examples of the issues |
| Over-reporting | Claude flags everything as critical | Add severity definitions with examples |
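
The false-negative fix — few-shot examples — looks like this in practice. A sketch of a system prompt with one worked example; the example content is illustrative:

```python
# A system prompt with a single few-shot example inside XML tags.
FEW_SHOT_PROMPT = """You are a senior code reviewer specializing in security.

<rules>
- Rate each issue as CRITICAL, WARNING, or INFO
- Say "No issues found" if the code is clean
</rules>

<example>
Input:
    query = f'SELECT * FROM users WHERE id = {user_id}'
Output:
- **[CRITICAL] SQL injection**
  Problem: user_id is interpolated directly into the query string.
  Fix: use a parameterized query instead of string formatting.
</example>"""
```

One example is often enough to lock in the output format; add a second only if the eval shows a category Claude still misses.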

Automating Evals

Run evals automatically on every prompt change:

# eval_runner.py
import sys

# run_eval, eval_cases, and ACTIVE_PROMPT come from the earlier files;
# the eval_harness module name is an assumption about your layout
from eval_dataset import eval_cases
from eval_harness import run_eval
from prompt_versions import ACTIVE_PROMPT

def run_all_evals():
    """Run eval suite and fail if score drops below threshold."""
    threshold = 0.80

    results = run_eval(ACTIVE_PROMPT, eval_cases)

    print(f"\nEval Results:")
    print(f"  Score: {results['score']:.1%}")
    print(f"  Passed: {results['passed']}/{results['total']}")
    print(f"  Threshold: {threshold:.1%}")

    if results["score"] < threshold:
        print(f"\n  FAILED: Score {results['score']:.1%} is below threshold {threshold:.1%}")
        for detail in results["details"]:
            if not detail["passed"]:
                print(f"    Case {detail['case']}: {detail['input']}...")
        sys.exit(1)
    else:
        print(f"\n  PASSED")

if __name__ == "__main__":
    run_all_evals()

Run this in CI/CD to catch prompt regressions before they reach production.
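
One way to wire this up, sketched as a GitHub Actions workflow — the file paths and the `ANTHROPIC_API_KEY` secret name are assumptions about your repository, not part of the article's code:

```yaml
# .github/workflows/prompt-evals.yml
name: Prompt evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install anthropic
      - run: python eval_runner.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Because `eval_runner.py` exits non-zero when the score drops below the threshold, the workflow fails the pull request automatically.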


Real-World Example: Improving a Code Review Prompt

Here is a real improvement process, from 6/10 to 9/10:

  1. Version 1 (Score: 6/10): “Review this code for bugs.” — Too vague. Missed security issues, inconsistent format.

  2. Version 2 (Score: 7/10): Added severity ratings. — Better structure, but still missed some edge cases.

  3. Version 3 (Score: 8/10): Added XML tags, explicit rules, and “say no issues if clean.” — Eliminated false positives.

  4. Version 4 (Score: 9/10): Added one few-shot example. — Claude now matches the exact output format on every input.

The entire process took four iterations over two days. Each iteration was tested on 30 eval cases.


Summary

| Concept | Details |
| --- | --- |
| Eval dataset | 20-30 test cases with expected outputs |
| Eval harness | Automate testing across all cases |
| A/B testing | Compare two prompt versions on the same data |
| Model-graded | Use Claude to judge Claude's output |
| promptfoo | CLI tool for prompt testing at scale |
| Temperature | Use 0 for testing, 0.5-0.8 for creative tasks |
| Versioning | Track prompts with scores and dates |

What’s Next?

In the next article, we will cover Claude for code generation — best practices for getting high-quality code from Claude.

Next: Claude for Code Generation — Best Practices