A prompt that works on three examples might fail on the fourth. Prompt engineering without testing is guessing. In this article, you will learn how to test prompts systematically, measure quality, and iterate until you get reliable results.
This is Article 18 in the Claude AI — From Zero to Power User series. You should be familiar with Prompt Engineering Basics before reading this one.
By the end, you will have an eval harness that tests your prompts automatically and a process for improving them.
Why Test Prompts?
Prompts are code. They break, they regress, and they behave differently with different inputs. Testing prompts gives you three things:
- Consistency — The prompt works the same way every time
- Regression prevention — Changes to the prompt do not break existing behavior
- Measurable improvement — You can prove that version B is better than version A
Without testing, you are flying blind. You change a prompt, test it manually on one or two inputs, and hope for the best. With testing, you run 50 inputs and get a score.
Designing an Evaluation Dataset
An eval dataset is a list of inputs with expected outputs. Build one before you start optimizing.
Python
# eval_dataset.py
eval_cases = [
    {
        "input": "def divide(a, b): return a / b",
        "expected_issues": ["division by zero"],
        "expected_severity": "critical",
    },
    {
        "input": "password = 'admin123'",
        "expected_issues": ["hardcoded password"],
        "expected_severity": "critical",
    },
    {
        "input": "def add(a: int, b: int) -> int: return a + b",
        "expected_issues": [],
        "expected_severity": None,  # No issues expected
    },
    {
        "input": "query = f'SELECT * FROM users WHERE id = {user_id}'",
        "expected_issues": ["SQL injection"],
        "expected_severity": "critical",
    },
    {
        "input": "data = json.loads(response.text)\nname = data['name']",
        "expected_issues": ["KeyError"],
        "expected_severity": "warning",
    },
]
Tips for Good Eval Datasets
- Include at least 20-30 cases. Fewer cases give unreliable scores.
- Include edge cases. Empty inputs, very long inputs, unusual formats.
- Include negative examples. Cases where the correct answer is “no issues found.” A couple of entries along these lines are sketched after this list.
- Cover all categories. If your prompt handles bugs, security, and performance — include examples of each.
- Update over time. When you find a failure in production, add it to the dataset.
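To make those tips concrete, here are a couple of extra entries you might append to the dataset above: one edge case (an empty input) and one negative example. The specific snippets and expected values are assumptions for illustration; adjust them to whatever your prompt is supposed to say.

Python

# Hypothetical extra cases covering an edge case and a negative example
eval_cases += [
    {
        "input": "",  # Edge case: empty input should not produce invented issues
        "expected_issues": [],
        "expected_severity": None,
    },
    {
        # Negative example: idiomatic, safe code that should pass review
        "input": "import hashlib\nchecksum = hashlib.sha256(data).hexdigest()",
        "expected_issues": [],
        "expected_severity": None,
    },
]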
Building a Simple Eval Harness
Python
import anthropic
import json

from eval_dataset import eval_cases  # the dataset defined above (assumed to live in eval_dataset.py)

client = anthropic.Anthropic()

def run_eval(system_prompt: str, eval_cases: list[dict]) -> dict:
    """Run an evaluation and return scores."""
    results = {"total": len(eval_cases), "passed": 0, "failed": 0, "details": []}

    for i, case in enumerate(eval_cases):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            temperature=0,  # Deterministic for testing
            system=system_prompt,
            messages=[
                {"role": "user", "content": f"Review this code:\n```\n{case['input']}\n```"}
            ],
        )
        output = response.content[0].text.lower()

        # Check if expected issues were found
        passed = True
        for issue in case["expected_issues"]:
            if issue.lower() not in output:
                passed = False
                break

        # Check for false positives (no issues expected but issues reported)
        if not case["expected_issues"] and "no issues" not in output and "looks good" not in output:
            passed = False

        results["details"].append({
            "case": i,
            "passed": passed,
            "input": case["input"][:50],
            "output": output[:200],
        })
        if passed:
            results["passed"] += 1
        else:
            results["failed"] += 1

    results["score"] = results["passed"] / results["total"]
    return results

# Test prompt version A
prompt_a = """You are a code reviewer. Find bugs and security issues.
Rate each issue as critical, warning, or info.
If the code is fine, say "No issues found"."""

results = run_eval(prompt_a, eval_cases)
print(f"Score: {results['score']:.1%} ({results['passed']}/{results['total']})")
TypeScript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface EvalCase {
  input: string;
  expectedIssues: string[];
  expectedSeverity: string | null;
}

interface EvalResults {
  total: number;
  passed: number;
  failed: number;
  score: number;
  details: Array<{
    case: number;
    passed: boolean;
    input: string;
    output: string;
  }>;
}

async function runEval(
  systemPrompt: string,
  evalCases: EvalCase[]
): Promise<EvalResults> {
  const results: EvalResults = {
    total: evalCases.length,
    passed: 0,
    failed: 0,
    score: 0,
    details: [],
  };

  for (let i = 0; i < evalCases.length; i++) {
    const evalCase = evalCases[i];
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      temperature: 0,
      system: systemPrompt,
      messages: [
        {
          role: "user",
          content: `Review this code:\n\`\`\`\n${evalCase.input}\n\`\`\``,
        },
      ],
    });

    const output =
      response.content[0].type === "text"
        ? response.content[0].text.toLowerCase()
        : "";

    let passed = true;
    for (const issue of evalCase.expectedIssues) {
      if (!output.includes(issue.toLowerCase())) {
        passed = false;
        break;
      }
    }

    if (
      evalCase.expectedIssues.length === 0 &&
      !output.includes("no issues") &&
      !output.includes("looks good")
    ) {
      passed = false;
    }

    results.details.push({
      case: i,
      passed,
      input: evalCase.input.slice(0, 50),
      output: output.slice(0, 200),
    });

    if (passed) results.passed++;
    else results.failed++;
  }

  results.score = results.passed / results.total;
  return results;
}
A/B Testing Prompts
Test two prompt versions against the same dataset to see which one performs better:
Python
# Prompt version A — basic
prompt_a = """You are a code reviewer. Find bugs and security issues."""
# Prompt version B — structured with XML tags
prompt_b = """You are a senior code reviewer specializing in security.
<rules>
- Focus on: security vulnerabilities, unhandled exceptions, null/undefined risks
- Rate each issue as: CRITICAL, WARNING, or INFO
- If the code has no issues, respond with exactly: "No issues found"
- Do not invent problems — only report real issues
</rules>
<output_format>
- **[SEVERITY] Issue title**
Problem: description
Fix: corrected code
</output_format>"""
# Run both
results_a = run_eval(prompt_a, eval_cases)
results_b = run_eval(prompt_b, eval_cases)
print(f"Prompt A: {results_a['score']:.1%}")
print(f"Prompt B: {results_b['score']:.1%}")
print(f"Improvement: {(results_b['score'] - results_a['score']):.1%}")
Typical results: The structured prompt (B) scores 15-30% higher than the basic prompt (A). XML tags, explicit rules, and output format make a big difference.
Model-Graded Evaluation
For complex outputs where exact string matching is not enough, use a second Claude call to grade the output:
Python
def grade_output(question: str, output: str, reference: str) -> dict:
    """Use Claude to grade another Claude's output."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        temperature=0,
        system="""You are an evaluation judge. Grade the output on:
1. Accuracy (0-10): Is the information correct?
2. Completeness (0-10): Does it cover all expected points?
3. Format (0-10): Does it follow the expected format?
Return JSON: {"accuracy": N, "completeness": N, "format": N, "overall": N, "notes": "..."}""",
        messages=[
            {
                "role": "user",
                "content": f"""<question>{question}</question>
<output>{output}</output>
<reference>{reference}</reference>
Grade the output against the reference answer.""",
            }
        ],
    )
    return json.loads(response.content[0].text)

# Grade each eval case
# ACTIVE_PROMPT is the prompt version under test (defined in the Prompt Versioning section below)
for case in eval_cases:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        temperature=0,
        system=ACTIVE_PROMPT,
        messages=[{"role": "user", "content": f"Review this code:\n```\n{case['input']}\n```"}],
    )
    output = response.content[0].text
    grade = grade_output(case["input"], output, str(case["expected_issues"]))
    print(f"Case: {case['input'][:30]}... Score: {grade['overall']}/10")
Model-graded evaluation costs more (two API calls per case), but it catches subtle issues that string matching misses.
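One practical wrinkle: the judge does not always return bare JSON, so `json.loads(response.content[0].text)` can fail if it adds prose around the object. A small defensive-parsing sketch (the regex fallback is an assumption, not part of the harness above):

Python

import json
import re

def parse_grade(raw: str) -> dict:
    """Parse the judge's reply, tolerating prose around the JSON object."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: pull out the first {...} block if the judge added extra text
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise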
Using promptfoo for Prompt Testing
promptfoo is an open-source CLI tool for testing prompts. It supports Claude, runs tests automatically, and generates comparison reports.
Installation
npm install -g promptfoo
Configuration
Create a promptfooconfig.yaml file:
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      temperature: 0

prompts:
  - id: prompt-v1
    raw: |
      You are a code reviewer. Find bugs and security issues.
      Review this code: {{input}}
  - id: prompt-v2
    raw: |
      You are a senior code reviewer specializing in security.
      <rules>
      - Focus on security vulnerabilities and unhandled exceptions
      - Rate severity as CRITICAL, WARNING, or INFO
      - Say "No issues found" if the code is clean
      </rules>
      Review this code: {{input}}

tests:
  - vars:
      input: "def divide(a, b): return a / b"
    assert:
      - type: contains
        value: "division by zero"
      - type: icontains  # case-insensitive, so "CRITICAL" and "critical" both pass
        value: "critical"
  - vars:
      input: "password = 'admin123'"
    assert:
      - type: contains
        value: "hardcoded"
  - vars:
      input: "query = f'SELECT * FROM users WHERE id = {user_id}'"
    assert:
      - type: contains
        value: "SQL injection"
  - vars:
      input: "def add(a: int, b: int) -> int: return a + b"
    assert:
      - type: icontains
        value: "no issues"
Run Tests
# Run all tests
promptfoo eval
# View results in browser
promptfoo view
# Compare two prompt versions
promptfoo eval --output results.json
promptfoo generates a side-by-side comparison showing which prompt version wins on each test case. It also supports:
- LLM-graded assertions — Use Claude to judge if the output is correct (a sketch follows this list)
- Multiple providers — Compare Claude vs GPT vs Gemini on the same tests
- CI integration — Run tests on every code change
- Custom scorers — Write your own evaluation functions
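For example, an LLM-graded assertion can replace brittle string matching in the config above. This is a sketch using promptfoo's llm-rubric assertion type; check the promptfoo docs for the exact grading options your version supports:

tests:
  - vars:
      input: "def divide(a, b): return a / b"
    assert:
      # Ask an LLM judge to grade the review instead of matching exact strings
      - type: llm-rubric
        value: "The review identifies the possible division-by-zero error and rates it as critical."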
Temperature Experiments
Temperature affects output consistency. Lower temperature means more deterministic outputs. Run the same prompt multiple times at different temperatures:
Python

def temperature_experiment(prompt: str, input_text: str, temps: list[float], runs: int = 5):
    """Test how temperature affects output consistency."""
    for temp in temps:
        outputs = []
        for _ in range(runs):
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                temperature=temp,
                system=prompt,
                messages=[{"role": "user", "content": input_text}],
            )
            outputs.append(response.content[0].text)
        # Measure consistency: how many unique outputs?
        unique = len(set(outputs))
        print(f"Temperature {temp}: {unique}/{runs} unique outputs")

temperature_experiment(
    "Classify this code as: bug, security, performance, or clean.",
    "query = f'SELECT * FROM users WHERE id = {user_id}'",
    temps=[0, 0.3, 0.5, 0.8, 1.0],
)
For classification and evaluation tasks, use temperature 0. For creative tasks, use 0.5-0.8.
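Comparing raw text is a strict measure: two responses that pick the same label but phrase it differently count as "different". If you only care about the decision, normalize the outputs before comparing. A small sketch (the label-extraction rule is an assumption about how your prompt formats its answer):

Python

def extract_label(output: str) -> str:
    """Reduce a full response to just the classification label."""
    # Assumes the prompt asks for one of: bug, security, performance, clean
    for label in ("security", "performance", "bug", "clean"):
        if label in output.lower():
            return label
    return "unknown"

# Count unique labels instead of unique raw strings:
# unique_labels = len({extract_label(o) for o in outputs})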
Prompt Versioning
Track your prompts like code. Keep a version history with scores:
Python

# prompt_versions.py
PROMPT_VERSIONS = {
    "v1": {
        "prompt": "You are a code reviewer. Find bugs.",
        "score": 0.60,
        "date": "2026-07-01",
        "notes": "Initial version, too vague",
    },
    "v2": {
        "prompt": """You are a code reviewer. Find bugs and security issues.
Rate as critical, warning, or info.""",
        "score": 0.72,
        "date": "2026-07-02",
        "notes": "Added severity ratings",
    },
    "v3": {
        "prompt": """You are a senior code reviewer specializing in security.
<rules>
- Focus on security vulnerabilities and unhandled exceptions
- Rate as CRITICAL, WARNING, or INFO
- Say "No issues found" if code is clean
</rules>""",
        "score": 0.88,
        "date": "2026-07-03",
        "notes": "Added XML tags and explicit rules",
    },
}

# Always deploy the best version
ACTIVE_PROMPT = PROMPT_VERSIONS["v3"]["prompt"]
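Hard-coding "v3" works, but it is easy to forget to update it after a new eval run. A small helper can pick the highest-scoring version automatically (a convenience sketch, not part of the file above):

Python

def best_version(versions: dict) -> str:
    """Return the key of the highest-scoring prompt version."""
    return max(versions, key=lambda v: versions[v]["score"])

ACTIVE_VERSION = best_version(PROMPT_VERSIONS)  # "v3" with the data above
ACTIVE_PROMPT = PROMPT_VERSIONS[ACTIVE_VERSION]["prompt"]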
Common Failure Modes
When your eval score is low, check for these common problems:
| Failure Mode | Symptom | Fix |
|---|---|---|
| Hallucination | Claude invents issues that do not exist | Add “Do not invent problems” to the prompt |
| Format drift | Output format changes between requests | Add explicit <output_format> to system prompt |
| Instruction ignoring | Claude skips rules in long prompts | Move important rules to the beginning, use XML tags |
| False negatives | Claude misses real issues | Add few-shot examples of the issues (see the sketch below the table) |
| Over-reporting | Claude flags everything as critical | Add severity definitions with examples |
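As an illustration of the few-shot fix for false negatives, you can append one worked example to the system prompt. The exact wording below is an assumption; the point is to show the model both the kind of issue you care about and the output format you expect:

Python

# Hypothetical few-shot addition appended to the reviewer system prompt
FEW_SHOT_EXAMPLE = """
<example>
Input:
    query = f"SELECT * FROM users WHERE id = {user_id}"
Output:
    **[CRITICAL] SQL injection**
    Problem: user_id is interpolated directly into the SQL string.
    Fix: use a parameterized query, e.g. cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
</example>
"""

prompt_with_example = PROMPT_VERSIONS["v3"]["prompt"] + FEW_SHOT_EXAMPLE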
Automating Evals
Run evals automatically on every prompt change:
Python

# eval_runner.py
import sys

# Assumes the harness, dataset, and prompt versions live in the modules shown
# earlier in this article; adjust the module names to your own layout.
from eval_harness import run_eval
from eval_dataset import eval_cases
from prompt_versions import ACTIVE_PROMPT

def run_all_evals():
    """Run eval suite and fail if score drops below threshold."""
    threshold = 0.80
    results = run_eval(ACTIVE_PROMPT, eval_cases)

    print("\nEval Results:")
    print(f"  Score: {results['score']:.1%}")
    print(f"  Passed: {results['passed']}/{results['total']}")
    print(f"  Threshold: {threshold:.1%}")

    if results["score"] < threshold:
        print(f"\n  FAILED: Score {results['score']:.1%} is below threshold {threshold:.1%}")
        for detail in results["details"]:
            if not detail["passed"]:
                print(f"  Case {detail['case']}: {detail['input']}...")
        sys.exit(1)
    else:
        print("\n  PASSED")

if __name__ == "__main__":
    run_all_evals()
Run this in CI/CD to catch prompt regressions before they reach production.
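What that looks like depends on your CI system. As one possibility, a minimal GitHub Actions job could run the eval script on every push; the workflow name, file layout, and secret name below are assumptions for illustration:

# .github/workflows/prompt-evals.yml (hypothetical)
name: prompt-evals
on: [push, pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install anthropic
      - run: python eval_runner.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}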
Real-World Example: Improving a Code Review Prompt
Here is a real improvement process, from 6/10 to 9/10:
Version 1 (Score: 6/10): “Review this code for bugs.” — Too vague. Missed security issues, inconsistent format.
Version 2 (Score: 7/10): Added severity ratings. — Better structure, but still missed some edge cases.
Version 3 (Score: 8/10): Added XML tags, explicit rules, and “say no issues if clean.” — Eliminated false positives.
Version 4 (Score: 9/10): Added one few-shot example. — Claude now matches the exact output format on every input.
The entire process took four iterations over two days. Each iteration was tested on 30 eval cases.
Summary
| Concept | Details |
|---|---|
| Eval dataset | 20-30 test cases with expected outputs |
| Eval harness | Automate testing across all cases |
| A/B testing | Compare two prompt versions on same data |
| Model-graded | Use Claude to judge Claude’s output |
| promptfoo | CLI tool for prompt testing at scale |
| Temperature | Use 0 for testing, 0.5-0.8 for creative tasks |
| Versioning | Track prompts with scores and dates |
What’s Next?
In the next article, we will cover Claude for code generation — best practices for getting high-quality code from Claude.
Next: Claude for Code Generation — Best Practices