Claude AI Tutorial #16: Computer Use — Desktop Automation with Screenshots and Mouse Control

Claude can see your screen and control your computer. It takes screenshots, analyzes what is on screen, then clicks buttons, types text, and navigates applications — just like a human would.

This is Article 16 in the Claude AI — From Zero to Power User series. You should know Tool Use before this article.

By the end, you will build a working desktop automation that fills out a web form automatically.

What is Computer Use?

Computer use is a Claude feature where the model controls a computer through three actions:

Screenshot — Take a picture of the screen
Mouse — Click, double-click, drag, scroll
Keyboard — Type text, press key combinations

Claude sees the screenshot, decides what to do next, and sends a command. The loop repeats until the task is done.

This is different from regular tool use. With regular tools, you define functions. With computer use, Claude interacts with any application — web browsers, desktop apps, terminals — through the visual interface.

Beta Status

Computer use is currently in beta. You need a special header in your API calls:

anthropic-beta: computer-use-2025-11-24

Supported models: Claude Opus 4.6, Sonnet 4.6, Sonnet 4.5, and Haiku 4.5.

How It Works

The computer use loop has four steps:

You send a task to Claude with the computer_20241022 tool enabled
Claude requests a screenshot
You take the screenshot and send it back
Claude analyzes the image and sends mouse/keyboard actions
Repeat from step 2 until the task is complete

Each round trip is one API call. A simple task like clicking a button takes 2-3 calls. A complex task like filling out a form takes 10-20 calls.

Setting Up the Environment

For safety, always run computer use in an isolated environment. Use a Docker container or a virtual machine. Never give Claude access to your main desktop.

Anthropic provides a reference implementation with Docker:

# Pull the reference implementation
git clone https://github.com/anthropics/anthropic-quickstarts.git
cd anthropic-quickstarts/computer-use-demo

# Set your API key
export ANTHROPIC_API_KEY="your-key-here"

# Run the Docker container
docker compose up

This starts a container with a lightweight desktop environment and a web interface at http://localhost:8080.

Basic Computer Use — Python

Here is a minimal example that sends a computer use task:

Python

import anthropic
import base64
import subprocess

client = anthropic.Anthropic()

def take_screenshot() -> str:
    """Take a screenshot and return base64-encoded image."""
    subprocess.run(["scrot", "/tmp/screenshot.png"], check=True)
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode()

def execute_action(action: dict):
    """Execute a mouse or keyboard action using xdotool."""
    if action["type"] in ("left_click", "right_click", "double_click"):
        x, y = action["coordinate"]
        button = "3" if action["type"] == "right_click" else "1"
        click_count = "2" if action["type"] == "double_click" else "1"
        subprocess.run(["xdotool", "mousemove", str(x), str(y), "click", "--repeat", click_count, button])
    elif action["type"] == "type":
        subprocess.run(["xdotool", "type", "--delay", "50", action["text"]])
    elif action["type"] == "key":
        subprocess.run(["xdotool", "key", action["text"]])
    elif action["type"] == "screenshot":
        pass  # Screenshot is handled separately

def run_computer_use(task: str):
    """Run a computer use task to completion."""
    messages = [{"role": "user", "content": task}]

    tools = [
        {
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1920,
            "display_height_px": 1080,
            "display_number": 1,
        }
    ]

    while True:
        response = client.beta.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages,
            betas=["computer-use-2025-11-24"],
        )

        # Check if Claude is done
        if response.stop_reason == "end_turn":
            # Extract final text response
            for block in response.content:
                if hasattr(block, "text"):
                    print(f"Done: {block.text}")
            break

        # Process tool use blocks
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                action = block.input.get("action", "screenshot")

                if action == "screenshot":
                    screenshot = take_screenshot()
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": [
                            {
                                "type": "image",
                                "source": {
                                    "type": "base64",
                                    "media_type": "image/png",
                                    "data": screenshot,
                                },
                            }
                        ],
                    })
                else:
                    execute_action({"type": action, **block.input})
                    # After action, take a screenshot to show the result
                    screenshot = take_screenshot()
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": [
                            {
                                "type": "image",
                                "source": {
                                    "type": "base64",
                                    "media_type": "image/png",
                                    "data": screenshot,
                                },
                            }
                        ],
                    })

        # Add assistant response and tool results to messages
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

# Run it
run_computer_use("Open Firefox and go to example.com")

TypeScript

import Anthropic from "@anthropic-ai/sdk";
import { execSync } from "child_process";
import { readFileSync } from "fs";

const client = new Anthropic();

function takeScreenshot(): string {
  execSync("scrot /tmp/screenshot.png");
  const buffer = readFileSync("/tmp/screenshot.png");
  return buffer.toString("base64");
}

function executeAction(action: Record<string, unknown>): void {
  const type = action.type as string;
  if (type === "left_click" || type === "right_click" || type === "double_click") {
    const [x, y] = action.coordinate as number[];
    const button = type === "right_click" ? "3" : "1";
    const repeat = type === "double_click" ? "2" : "1";
    execSync(`xdotool mousemove ${x} ${y} click --repeat ${repeat} ${button}`);
  } else if (type === "type") {
    execSync(`xdotool type --delay 50 "${action.text}"`);
  } else if (type === "key") {
    execSync(`xdotool key "${action.text}"`);
  }
}

async function runComputerUse(task: string): Promise<void> {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: task },
  ];

  const tools: Anthropic.Tool[] = [
    {
      type: "computer_20241022" as any,
      name: "computer",
      display_width_px: 1920,
      display_height_px: 1080,
      display_number: 1,
    } as any,
  ];

  while (true) {
    const response = await client.beta.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 4096,
      tools,
      messages,
      betas: ["computer-use-2025-11-24"],
    });

    if (response.stop_reason === "end_turn") {
      for (const block of response.content) {
        if ("text" in block) {
          console.log(`Done: ${block.text}`);
        }
      }
      break;
    }

    const results: any[] = [];

    for (const block of response.content) {
      if (block.type === "tool_use") {
        const action = (block.input as any).action || "screenshot";

        if (action === "screenshot") {
          const screenshot = takeScreenshot();
          results.push({
            type: "tool_result",
            tool_use_id: block.id,
            content: [
              {
                type: "image",
                source: {
                  type: "base64",
                  media_type: "image/png",
                  data: screenshot,
                },
              },
            ],
          });
        } else {
          executeAction({ type: action, ...(block.input as any) });
          const screenshot = takeScreenshot();
          results.push({
            type: "tool_result",
            tool_use_id: block.id,
            content: [
              {
                type: "image",
                source: {
                  type: "base64",
                  media_type: "image/png",
                  data: screenshot,
                },
              },
            ],
          });
        }
      }
    }

    messages.push({ role: "assistant", content: response.content });
    messages.push({ role: "user", content: results });
  }
}

runComputerUse("Open Firefox and go to example.com");

Practical Example: Fill Out a Web Form

Let us build something useful. This example opens a browser and fills out a contact form:

import anthropic
import base64
import subprocess
import time

client = anthropic.Anthropic()

def take_screenshot() -> str:
    subprocess.run(["scrot", "/tmp/screenshot.png"], check=True)
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode()

def run_form_filler():
    """Fill out a web form automatically."""
    task = """
    1. Open Firefox
    2. Go to https://httpbin.org/forms/post
    3. Fill in the form with this data:
       - Customer name: Alex Johnson
       - Telephone: 555-0123
       - E-mail: alex@example.com
       - Size: Large
       - Topping: Bacon
       - Delivery time: 11:45
       - Comments: Please ring the doorbell
    4. Click the Submit button
    5. Tell me what the response page shows
    """

    # Use the same loop pattern from above
    messages = [{"role": "user", "content": task}]
    tools = [
        {
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1920,
            "display_height_px": 1080,
            "display_number": 1,
        }
    ]

    step = 0
    while True:
        step += 1
        print(f"Step {step}...")

        response = client.beta.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages,
            betas=["computer-use-2025-11-24"],
        )

        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    print(f"\nResult: {block.text}")
            break

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                action = block.input.get("action", "screenshot")
                if action != "screenshot":
                    # Execute the action
                    execute_action({"type": action, **block.input})
                    time.sleep(0.5)  # Wait for UI to update

                screenshot = take_screenshot()
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/png",
                                "data": screenshot,
                            },
                        }
                    ],
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    print(f"\nCompleted in {step} steps")

run_form_filler()

A form-filling task like this typically takes 15-25 API calls and costs about $0.30-0.50 with Sonnet 4.6.

Building a Web Scraper

Computer use works well for scraping websites that are hard to scrape with traditional tools — like single-page apps with JavaScript rendering:

import anthropic
import json

client = anthropic.Anthropic()

def scrape_with_claude(url: str, instructions: str) -> dict:
    """Use Claude to scrape data from a website visually."""
    task = f"""
    1. Open Firefox and go to {url}
    2. Wait for the page to load completely
    3. {instructions}
    4. Return the extracted data as JSON
    """

    # ... (same computer use loop as above)
    # Claude navigates the page, reads the content visually,
    # and returns structured data

# Example: scrape product information
result = scrape_with_claude(
    "https://example-store.com/products",
    "Extract the name, price, and rating of the first 5 products on the page"
)

This approach is slower than traditional scraping (10-30 seconds per page vs milliseconds). But it works on any website without writing custom selectors.

Safety and Security

Computer use gives Claude real control over a computer. Follow these rules:

Always use a container or VM. Never run computer use on your main desktop. The Docker reference implementation is a good starting point.
Limit network access. The container should only access the websites it needs. Block access to sensitive internal services.
Limit permissions. Run as a non-root user with minimal filesystem access.
Set timeouts. A runaway task can make hundreds of API calls. Set a maximum step count:

MAX_STEPS = 50

step = 0
while step < MAX_STEPS:
    step += 1
    # ... computer use loop

if step >= MAX_STEPS:
    print("Task exceeded maximum steps. Stopping.")

Never use computer use for authentication. Do not let Claude type passwords or access sensitive accounts.
Monitor costs. Each screenshot is roughly 1,500 tokens. A 20-step task with Sonnet 4.6 costs about $0.30-0.50.

Limitations

Computer use has several limitations you should know about:

Limitation	Details
Latency	Each step takes 2-5 seconds (screenshot + API call + action)
Small UI elements	Claude sometimes misclicks on small buttons or checkboxes
Dynamic content	Fast-changing content (animations, videos) can confuse Claude
Resolution	Higher resolution = more tokens = higher cost
Cost	A 20-step task costs $0.30-0.50 with Sonnet 4.6
Beta	The API may change. The beta header is required

Tips for Better Accuracy

Use clear, specific instructions. “Click the blue Submit button at the bottom of the form” is better than “submit the form.”
Add wait times between actions. Give the UI time to update before taking the next screenshot.
Use lower resolution when possible. 1280x720 is usually enough and uses fewer tokens.
Break complex tasks into steps. “Do A, then B, then C” works better than “Do ABC.”

Cost Breakdown

Component	Tokens	Cost (Sonnet 4.6)
Screenshot (1920x1080)	~1,500	$0.0045 input
Task description	~200	$0.0006 input
Claude’s response (per step)	~300	$0.0045 output
Per step total		~$0.01
20-step task		~$0.20

Computer use is more expensive than text-only API calls. Use it when you cannot automate a task any other way.

When to Use Computer Use

Use computer use when:

The application has no API
You need to interact with a visual interface
Traditional scraping fails (JavaScript-heavy sites)
You need to automate a desktop application

Do not use computer use when:

The application has an API (use the API directly)
You can use a headless browser with Playwright/Puppeteer
The task only involves text processing
You need sub-second response times

Summary

Concept	Details
What it does	Claude controls a computer via screenshots + mouse/keyboard
Beta header	`computer-use-2025-11-24`
Supported models	Opus 4.6, Sonnet 4.6, Sonnet 4.5, Haiku 4.5
Safety	Always use Docker container or VM
Cost	~$0.01 per step, ~$0.20 for a 20-step task
Best for	UI automation when no API exists

What’s Next?

In the next article, we will build a RAG (Retrieval-Augmented Generation) system with Claude — giving Claude knowledge from your own documents.

Next: RAG (Retrieval-Augmented Generation) with Claude

What is Computer Use?#

Beta Status#

How It Works#

Setting Up the Environment#

Basic Computer Use — Python#

Python#

TypeScript#

Practical Example: Fill Out a Web Form#

Building a Web Scraper#

Safety and Security#

Limitations#

Tips for Better Accuracy#

Cost Breakdown#

When to Use Computer Use#

Summary#

What’s Next?#

Related Articles#