Claude can see your screen and control your computer. It takes screenshots, analyzes what is on screen, then clicks buttons, types text, and navigates applications — just like a human would.
This is Article 16 in the Claude AI — From Zero to Power User series. You should know Tool Use before this article.
By the end, you will build a working desktop automation that fills out a web form automatically.
What is Computer Use?
Computer use is a Claude feature where the model controls a computer through three actions:
- Screenshot — Take a picture of the screen
- Mouse — Click, double-click, drag, scroll
- Keyboard — Type text, press key combinations
Claude sees the screenshot, decides what to do next, and sends a command. The loop repeats until the task is done.
This is different from regular tool use. With regular tools, you define functions. With computer use, Claude interacts with any application — web browsers, desktop apps, terminals — through the visual interface.
Beta Status
Computer use is currently in beta. You need a special header in your API calls:
anthropic-beta: computer-use-2025-11-24
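Supported models: Claude Opus 4.6, Sonnet 4.6, Sonnet 4.5, and Haiku 4.5. The beta header and the tool version are paired, and some models may need an earlier pair, so check the documentation for the model you target.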
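If you call the HTTP API without an SDK, the beta flag travels as an ordinary request header. Here is a minimal sketch, assuming the usual x-api-key and anthropic-version headers of the Messages API. The name build_headers is a hypothetical helper for illustration, not part of any SDK.

```python
def build_headers(api_key: str, beta: str = "computer-use-2025-11-24") -> dict[str, str]:
    """Return HTTP headers for a raw Messages API request with a beta flag."""
    return {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "anthropic-beta": beta,  # the SDK sets this for you when you pass betas=[...]
        "content-type": "application/json",
    }


headers = build_headers("your-key-here")
print(headers["anthropic-beta"])
```

With the Python or TypeScript SDK you do not build these headers yourself. Passing betas=[...] to client.beta.messages.create has the same effect.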
How It Works
The computer use loop has five steps:
1. You send a task to Claude with the computer tool enabled (the tool version must match the beta header; computer_20251124 pairs with the header above)
2. Claude requests a screenshot
3. You take the screenshot and send it back
4. Claude analyzes the image and sends mouse/keyboard actions
5. Repeat from step 2 until the task is complete
Each round trip is one API call. A simple task like clicking a button takes 2-3 calls. A complex task like filling out a form takes 10-20 calls.
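Sending a screenshot back means wrapping it in a tool_result block that echoes the id of Claude's request. The sketch below shows that payload on its own. The name make_tool_result is a hypothetical helper, and the field layout mirrors the full examples later in this article.

```python
import base64


def make_tool_result(tool_use_id: str, png_bytes: bytes) -> dict:
    """Wrap raw PNG bytes in a tool_result block for the next user message."""
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,  # must match the id of the tool_use block
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.standard_b64encode(png_bytes).decode(),
                },
            }
        ],
    }


# Placeholder bytes stand in for a real screenshot here.
result = make_tool_result("toolu_123", b"\x89PNG")
print(result["content"][0]["source"]["media_type"])
```

A helper like this also removes the duplicated dictionary literals you will see in the longer examples.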
Setting Up the Environment
For safety, always run computer use in an isolated environment. Use a Docker container or a virtual machine. Never give Claude access to your main desktop.
Anthropic provides a reference implementation with Docker:
# Pull the reference implementation
git clone https://github.com/anthropics/anthropic-quickstarts.git
cd anthropic-quickstarts/computer-use-demo
# Set your API key
export ANTHROPIC_API_KEY="your-key-here"
# Run the Docker container
docker compose up
This starts a container with a lightweight desktop environment and a web interface at http://localhost:8080.
Basic Computer Use — Python
Here is a minimal example that sends a computer use task:
Python
import anthropic
import base64
import os
import subprocess
import tempfile

client = anthropic.Anthropic()


def take_screenshot() -> str:
    """Take a screenshot and return it as a base64-encoded PNG."""
    # Write to a fresh path each time: some scrot versions refuse to
    # overwrite an existing file, which would leave a stale image behind.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "screenshot.png")
        subprocess.run(["scrot", path], check=True)
        with open(path, "rb") as f:
            return base64.standard_b64encode(f.read()).decode()


def execute_action(action: dict) -> None:
    """Execute a mouse or keyboard action using xdotool."""
    kind = action["type"]
    if kind in ("left_click", "right_click", "double_click"):
        x, y = action["coordinate"]
        button = "3" if kind == "right_click" else "1"
        click_count = "2" if kind == "double_click" else "1"
        subprocess.run(
            ["xdotool", "mousemove", str(x), str(y), "click", "--repeat", click_count, button],
            check=True,
        )
    elif kind == "type":
        # "--" stops xdotool from reading text that starts with "-" as an option
        subprocess.run(["xdotool", "type", "--delay", "50", "--", action["text"]], check=True)
    elif kind == "key":
        subprocess.run(["xdotool", "key", "--", action["text"]], check=True)
    # Other actions (scroll, drag, ...) are not handled in this minimal example


def run_computer_use(task: str) -> None:
    """Run a computer use task to completion."""
    messages = [{"role": "user", "content": task}]
    tools = [
        {
            "type": "computer_20251124",  # tool version must match the beta header below
            "name": "computer",
            "display_width_px": 1920,
            "display_height_px": 1080,
            "display_number": 1,
        }
    ]

    while True:
        response = client.beta.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages,
            betas=["computer-use-2025-11-24"],
        )

        # Stop as soon as Claude no longer asks for a tool
        # (task finished, or the reply was cut off)
        if response.stop_reason != "tool_use":
            for block in response.content:
                if block.type == "text":
                    print(f"Done: {block.text}")
            break

        # Process tool use blocks
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            action = block.input.get("action", "screenshot")
            if action != "screenshot":
                execute_action({"type": action, **block.input})
            # Answer every tool call with a fresh screenshot so Claude sees the result
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": take_screenshot(),
                        },
                    }
                ],
            })

        # Add assistant response and tool results to messages
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})


# Run it
run_computer_use("Open Firefox and go to example.com")
TypeScript
import Anthropic from "@anthropic-ai/sdk";
import { execFileSync } from "child_process";
import { mkdtempSync, readFileSync, rmSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";

const client = new Anthropic();

function takeScreenshot(): string {
  // Fresh directory per capture: some scrot versions will not overwrite a file.
  const dir = mkdtempSync(join(tmpdir(), "shot-"));
  const path = join(dir, "screenshot.png");
  try {
    execFileSync("scrot", [path]);
    return readFileSync(path).toString("base64");
  } finally {
    rmSync(dir, { recursive: true, force: true });
  }
}

function executeAction(action: Record<string, unknown>): void {
  const type = action.type as string;
  // execFileSync passes arguments directly, so text from the model never reaches a shell
  if (type === "left_click" || type === "right_click" || type === "double_click") {
    const [x, y] = action.coordinate as number[];
    const button = type === "right_click" ? "3" : "1";
    const repeat = type === "double_click" ? "2" : "1";
    execFileSync("xdotool", ["mousemove", String(x), String(y), "click", "--repeat", repeat, button]);
  } else if (type === "type") {
    execFileSync("xdotool", ["type", "--delay", "50", "--", String(action.text)]);
  } else if (type === "key") {
    execFileSync("xdotool", ["key", "--", String(action.text)]);
  }
  // Other actions (scroll, drag, ...) are not handled in this minimal example
}

async function runComputerUse(task: string): Promise<void> {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: task },
  ];
  const tools: Anthropic.Tool[] = [
    {
      type: "computer_20251124" as any, // tool version must match the beta header below
      name: "computer",
      display_width_px: 1920,
      display_height_px: 1080,
      display_number: 1,
    } as any,
  ];

  while (true) {
    const response = await client.beta.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 4096,
      tools,
      messages,
      betas: ["computer-use-2025-11-24"],
    });

    // Stop as soon as Claude no longer asks for a tool
    if (response.stop_reason !== "tool_use") {
      for (const block of response.content) {
        if (block.type === "text") {
          console.log(`Done: ${block.text}`);
        }
      }
      break;
    }

    const results: any[] = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      const action = (block.input as any).action || "screenshot";
      if (action !== "screenshot") {
        executeAction({ type: action, ...(block.input as any) });
      }
      // Answer every tool call with a fresh screenshot so Claude sees the result
      results.push({
        type: "tool_result",
        tool_use_id: block.id,
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: "image/png",
              data: takeScreenshot(),
            },
          },
        ],
      });
    }

    messages.push({ role: "assistant", content: response.content as any });
    messages.push({ role: "user", content: results });
  }
}

runComputerUse("Open Firefox and go to example.com");
Practical Example: Fill Out a Web Form
Let us build something useful. This example opens a browser and fills out a contact form:
import anthropic
import base64
import os
import subprocess
import tempfile
import time

client = anthropic.Anthropic()

# execute_action() is the helper defined in the basic Python example above.
# Copy it into this file before running.


def take_screenshot() -> str:
    # Fresh path per capture: some scrot versions refuse to overwrite a file
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "screenshot.png")
        subprocess.run(["scrot", path], check=True)
        with open(path, "rb") as f:
            return base64.standard_b64encode(f.read()).decode()


def run_form_filler() -> None:
    """Fill out a web form automatically."""
    task = """
    1. Open Firefox
    2. Go to https://httpbin.org/forms/post
    3. Fill in the form with this data:
       - Customer name: Alex Johnson
       - Telephone: 555-0123
       - E-mail: alex@example.com
       - Size: Large
       - Topping: Bacon
       - Delivery time: 11:45
       - Comments: Please ring the doorbell
    4. Click the Submit button
    5. Tell me what the response page shows
    """

    # Use the same loop pattern from above
    messages = [{"role": "user", "content": task}]
    tools = [
        {
            "type": "computer_20251124",  # tool version must match the beta header below
            "name": "computer",
            "display_width_px": 1920,
            "display_height_px": 1080,
            "display_number": 1,
        }
    ]

    step = 0
    while True:
        step += 1
        print(f"Step {step}...")
        response = client.beta.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages,
            betas=["computer-use-2025-11-24"],
        )

        # Stop as soon as Claude no longer asks for a tool
        if response.stop_reason != "tool_use":
            for block in response.content:
                if block.type == "text":
                    print(f"\nResult: {block.text}")
            break

        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            action = block.input.get("action", "screenshot")
            if action != "screenshot":
                # Execute the action
                execute_action({"type": action, **block.input})
                time.sleep(0.5)  # Wait for UI to update
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": take_screenshot(),
                        },
                    }
                ],
            })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    print(f"\nCompleted in {step} steps")


run_form_filler()
A form-filling task like this typically takes 15-25 API calls and costs about $0.30-0.50 with Sonnet 4.6.
Building a Web Scraper
Computer use works well for scraping websites that are hard to scrape with traditional tools — like single-page apps with JavaScript rendering:
import anthropic
import json

client = anthropic.Anthropic()


def scrape_with_claude(url: str, instructions: str) -> dict:
    """Use Claude to scrape data from a website visually."""
    task = f"""
    1. Open Firefox and go to {url}
    2. Wait for the page to load completely
    3. {instructions}
    4. Return the extracted data as JSON
    """
    # ... (same computer use loop as above)
    # Claude navigates the page, reads the content visually,
    # and returns structured data


# Example: scrape product information
result = scrape_with_claude(
    "https://example-store.com/products",
    "Extract the name, price, and rating of the first 5 products on the page"
)
This approach is slower than traditional scraping (10-30 seconds per page vs milliseconds). But it works on any website without writing custom selectors.
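Claude's final reply is free text, so the JSON you asked for may arrive wrapped in a sentence or two. The sketch below pulls out the first JSON value it can find. The name extract_json is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json


def extract_json(text: str):
    """Return the first JSON object or array found in text, or None."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch in "{[":
            try:
                value, _ = decoder.raw_decode(text, start)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here; try the next bracket
    return None


reply = 'Here are the products: [{"name": "Mug", "price": 9.5}] Done.'
print(extract_json(reply))
```

Returning None instead of raising keeps the caller in control: it can retry the task or ask Claude to resend the data.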
Safety and Security
Computer use gives Claude real control over a computer. Follow these rules:
Always use a container or VM. Never run computer use on your main desktop. The Docker reference implementation is a good starting point.
Limit network access. The container should only access the websites it needs. Block access to sensitive internal services.
Limit permissions. Run as a non-root user with minimal filesystem access.
Set timeouts. A runaway task can make hundreds of API calls. Set a maximum step count:
MAX_STEPS = 50
step = 0
while step < MAX_STEPS:
    step += 1
    # ... computer use loop; break when the task is done
else:
    # Runs only when the loop ended without a break, i.e. the cap was hit
    print("Task exceeded maximum steps. Stopping.")
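A step cap limits the number of API calls but not wall-clock time, and a single slow page can stall a task for minutes. The sketch below combines both limits in one object. The class name Budget and its default values are assumptions chosen for illustration.

```python
import time


class Budget:
    """Stop a loop after max_steps iterations or max_seconds of wall-clock time."""

    def __init__(self, max_steps: int = 50, max_seconds: float = 300.0, clock=time.monotonic):
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self.clock = clock  # injectable so the class can be tested without waiting
        self.started = clock()
        self.steps = 0

    def allow_step(self) -> bool:
        """Count one step and report whether the loop may continue."""
        if self.steps >= self.max_steps:
            return False
        if self.clock() - self.started >= self.max_seconds:
            return False
        self.steps += 1
        return True


budget = Budget(max_steps=3, max_seconds=60)
while budget.allow_step():
    pass  # one API call plus one action would go here
print(f"Stopped after {budget.steps} steps")
```

In the computer use loop, replace the loop condition with budget.allow_step() and keep the break for the finished case.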
Never use computer use for authentication. Do not let Claude type passwords or access sensitive accounts.
Monitor costs. Each screenshot is roughly 1,500 tokens. A 20-step task with Sonnet 4.6 costs about $0.30-0.50.
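Every call resends the whole conversation, so old screenshots keep costing tokens long after Claude has moved on. One way to limit this is to keep only the most recent screenshots and replace older ones with a short note before each request. This is a sketch under assumptions: the name drop_old_screenshots, the placeholder text, and the default of three kept images are all illustrative choices, and it only touches user messages built from plain dictionaries as in the loops above.

```python
PLACEHOLDER = {"type": "text", "text": "[older screenshot removed]"}


def drop_old_screenshots(messages: list, keep_last: int = 3) -> list:
    """Return a copy of messages where only the newest keep_last screenshots remain.

    Older image blocks inside tool_result blocks become a short text note.
    The input list and its dictionaries are left unmodified.
    """
    seen = 0
    trimmed = []
    for message in reversed(messages):  # newest first, so we count from the end
        content = message["content"]
        if message["role"] != "user" or not isinstance(content, list):
            trimmed.append(message)
            continue
        new_blocks = []
        for block in reversed(content):
            is_result = isinstance(block, dict) and block.get("type") == "tool_result"
            if is_result and isinstance(block.get("content"), list):
                items = []
                for item in reversed(block["content"]):
                    if isinstance(item, dict) and item.get("type") == "image":
                        seen += 1
                        if seen > keep_last:
                            item = dict(PLACEHOLDER)
                    items.append(item)
                block = {**block, "content": items[::-1]}
            new_blocks.append(block)
        trimmed.append({**message, "content": new_blocks[::-1]})
    return trimmed[::-1]


def fake_result(i: int) -> dict:
    """Build a stand-in tool_result with a dummy image, for demonstration only."""
    image = {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "..."}}
    return {"type": "tool_result", "tool_use_id": f"t{i}", "content": [image]}


history = [{"role": "user", "content": "task"}]
for i in range(5):
    history.append({"role": "assistant", "content": f"step {i}"})
    history.append({"role": "user", "content": [fake_result(i)]})

trimmed = drop_old_screenshots(history, keep_last=2)
kinds = [m["content"][0]["content"][0]["type"] for m in trimmed if isinstance(m["content"], list)]
print(kinds)  # ['text', 'text', 'text', 'image', 'image']
```

Call it on the message list just before each request. Whether Claude still performs well with fewer screenshots depends on the task, so test with your own workflow.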
Limitations
Computer use has several limitations you should know about:
| Limitation | Details |
|---|---|
| Latency | Each step takes 2-5 seconds (screenshot + API call + action) |
| Small UI elements | Claude sometimes misclicks on small buttons or checkboxes |
| Dynamic content | Fast-changing content (animations, videos) can confuse Claude |
| Resolution | Higher resolution = more tokens = higher cost |
| Cost | A 20-step task costs $0.30-0.50 with Sonnet 4.6 |
| Beta | The API may change. The beta header is required |
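For the dynamic content problem, a fixed sleep is a guess. An alternative is to capture repeatedly until two consecutive screenshots are identical, with an upper bound on the wait. The sketch below takes the capture function as a parameter. The name wait_until_stable and its default timings are assumptions, and the demo uses strings in place of real screenshots.

```python
import time
from typing import Callable


def wait_until_stable(
    capture: Callable[[], str],
    interval: float = 0.3,
    max_wait: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Capture repeatedly until two consecutive captures match or max_wait elapses.

    Returns the last capture taken.
    """
    previous = capture()
    waited = 0.0
    while waited < max_wait:
        sleep(interval)
        waited += interval
        current = capture()
        if current == previous:
            return current  # nothing changed between two captures
        previous = current
    return previous  # gave up waiting; use the newest frame


# Demo with fake frames and no real sleeping.
frames = iter(["loading", "loading...", "ready", "ready"])
print(wait_until_stable(lambda: next(frames), sleep=lambda _: None))  # ready
```

In the loops above you would call wait_until_stable(take_screenshot) in place of a bare take_screenshot(). An exact comparison never settles on a page with a blinking cursor or a running video, which is why max_wait bounds the loop.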
Tips for Better Accuracy
- Use clear, specific instructions. “Click the blue Submit button at the bottom of the form” is better than “submit the form.”
- Add wait times between actions. Give the UI time to update before taking the next screenshot.
- Use lower resolution when possible. 1280x720 is usually enough and uses fewer tokens.
- Break complex tasks into steps. “Do A, then B, then C” works better than “Do ABC.”
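The lower-resolution tip has one catch. If you downscale screenshots yourself and declare the smaller size in display_width_px and display_height_px, the coordinates Claude returns refer to the smaller image, so they need scaling back before you pass them to xdotool. A minimal sketch, assuming a 1920x1080 screen and 1280x720 screenshots; scale_to_screen is a hypothetical helper name.

```python
def scale_to_screen(
    x: int,
    y: int,
    shot_size: tuple[int, int] = (1280, 720),
    screen_size: tuple[int, int] = (1920, 1080),
) -> tuple[int, int]:
    """Map a point in the downscaled screenshot to real screen pixels."""
    scale_x = screen_size[0] / shot_size[0]
    scale_y = screen_size[1] / shot_size[1]
    return round(x * scale_x), round(y * scale_y)


print(scale_to_screen(640, 360))  # (960, 540)
```

If the desktop itself runs at the lower resolution, no scaling is needed because screenshot and screen share the same coordinate space.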
Cost Breakdown
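| Component | Tokens | Cost (Sonnet 4.6) |
|---|---|---|
| Screenshot (1920x1080) | ~1,500 | $0.0045 input |
| Task description | ~200 | $0.0006 input |
| Claude’s response (per step) | ~300 | $0.0045 output |
| Per step total | ~2,000 | ~$0.01 |
| 20-step task | ~40,000 | ~$0.20 |
The per-step row counts only the tokens that are new in each step. Every call also resends the conversation so far, including earlier screenshots, so real totals run higher than $0.20, which likely accounts for the $0.30-0.50 range quoted elsewhere in this article.
Computer use is more expensive than text-only API calls. Use it when you cannot automate a task any other way.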
When to Use Computer Use
Use computer use when:
- The application has no API
- You need to interact with a visual interface
- Traditional scraping fails (JavaScript-heavy sites)
- You need to automate a desktop application
Do not use computer use when:
- The application has an API (use the API directly)
- You can use a headless browser with Playwright/Puppeteer
- The task only involves text processing
- You need sub-second response times
Summary
| Concept | Details |
|---|---|
| What it does | Claude controls a computer via screenshots + mouse/keyboard |
| Beta header | computer-use-2025-11-24 |
| Supported models | Opus 4.6, Sonnet 4.6, Sonnet 4.5, Haiku 4.5 |
| Safety | Always use Docker container or VM |
| Cost | ~$0.01 per step in new tokens; roughly $0.20-0.50 for a 20-step task |
| Best for | UI automation when no API exists |
What’s Next?
In the next article, we will build a RAG (Retrieval-Augmented Generation) system with Claude — giving Claude knowledge from your own documents.
Next: RAG (Retrieval-Augmented Generation) with Claude