Claude can see. Send it a screenshot, a photo of a document, a chart, or a technical diagram — and it will analyze what it sees. Vision turns Claude into a powerful tool for data extraction, UI review, and document processing.

This is Article 9 in the Claude AI — From Zero to Power User series. You should have completed Article 7: Messages API before this article.

By the end of this article, you will know how to send images to Claude, extract data from documents, analyze screenshots, and optimize image costs.


What Claude Can See

Claude supports four image formats:

  • PNG — screenshots, diagrams, charts
  • JPEG — photos, scanned documents
  • GIF — static and animated (first frame only)
  • WebP — web images

Limits:

  • Up to 100 images per API call
  • Maximum resolution: 8000 x 8000 pixels (2000 x 2000 when a single request contains more than 20 images)
  • Optimal size: 1.15 megapixels (about 1072 x 1072 pixels)

Images larger than about 1.15 megapixels are automatically downscaled before Claude processes them. Sending very large images adds upload size and latency without improving quality.
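
To see what that downscaling means for a given image, here is a minimal sketch that computes the dimensions an image would need to fit under the 1.15-megapixel threshold while keeping its aspect ratio. The function name `target_size` is a hypothetical helper, and the exact resizing Claude applies server-side may differ slightly.

```python
import math

MAX_PIXELS = 1_150_000  # ~1.15 megapixels, the optimal size quoted above


def target_size(width: int, height: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return (width, height) scaled down to fit max_pixels, keeping the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    pixels = width * height
    if pixels <= max_pixels:
        return width, height  # already small enough, leave untouched
    # Area scales with the square of the edge scale, hence the square root.
    scale = math.sqrt(max_pixels / pixels)
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


print(target_size(800, 600))    # unchanged: (800, 600)
print(target_size(4000, 4000))  # (1072, 1072)
```

Pass the result to whatever image library you already use for the actual resize.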


Sending Images: Base64

The most common way to send images is base64 encoding. Use it for local files or for any image that is not reachable by a public URL.

Python

import anthropic
import base64

client = anthropic.Anthropic()

# Read and encode the image
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this screenshot."
                }
            ]
        }
    ]
)

print(message.content[0].text)

TypeScript

import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "fs";

const client = new Anthropic();

const imageData = readFileSync("screenshot.png").toString("base64");

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/png",
            data: imageData,
          },
        },
        {
          type: "text",
          text: "Describe what you see in this screenshot.",
        },
      ],
    },
  ],
});

if (message.content[0].type === "text") {
  console.log(message.content[0].text);
}

Notice the message content is now an array of blocks — an image block and a text block. This is how you combine images with text instructions.
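
The examples above hardcode `image/png`. If you send files of mixed types, a small helper can pick the media type from the file extension and build the image block for you. This is a sketch: `image_block` and `MEDIA_TYPES` are hypothetical names, and the extension is trusted without inspecting the file contents.

```python
import base64
from pathlib import Path

# Extension -> media type for the four formats listed earlier.
MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def image_block(path: str) -> dict:
    """Build a base64 image content block for a local file."""
    file = Path(path)
    media_type = MEDIA_TYPES.get(file.suffix.lower())
    if media_type is None:
        raise ValueError(f"unsupported image extension: {file.suffix!r}")
    data = base64.standard_b64encode(file.read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The returned dictionary can be placed directly in the `content` list, in the same position as the hand-written image block.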


Sending Images: URL

You can also send images by URL. The API fetches the image for you, so there is no encoding step.

Python

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/chart.png"
                    }
                },
                {
                    "type": "text",
                    "text": "What data does this chart show? List the key numbers."
                }
            ]
        }
    ]
)

print(message.content[0].text)

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "url",
            url: "https://example.com/chart.png",
          },
        },
        {
          type: "text",
          text: "What data does this chart show? List the key numbers.",
        },
      ],
    },
  ],
});

if (message.content[0].type === "text") {
  console.log(message.content[0].text);
}

URL-based images are simpler but require the image to be publicly accessible.
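
Local paths and `file://` links will not work as URL sources. The sketch below builds the URL image block and rejects anything that is not an http(s) address before you spend an API call. `url_image_block` is a hypothetical helper, and it cannot tell whether the URL is actually publicly reachable.

```python
from urllib.parse import urlparse


def url_image_block(url: str) -> dict:
    """Build a URL image content block, rejecting anything that is not http(s)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a fetchable web URL: {url!r}")
    return {"type": "image", "source": {"type": "url", "url": url}}
```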


Use Case 1: Screenshot Analysis

Claude is excellent at analyzing UI screenshots. Use it for design review, accessibility feedback, or visual bug detection.

Python

import anthropic
import base64

client = anthropic.Anthropic()

with open("app-screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="""You are a UI/UX reviewer. Analyze screenshots and provide actionable feedback.

<rules>
- Focus on usability issues, accessibility problems, and visual inconsistencies
- Rate each issue as: critical, warning, or suggestion
- Suggest specific fixes for each issue
</rules>""",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Review this app screenshot for UI/UX issues."
                }
            ]
        }
    ]
)

print(message.content[0].text)

Claude can flag issues such as:

  • Text that is too small to read
  • Buttons without enough contrast
  • Missing loading states
  • Layout problems on different screen sizes
  • Inconsistent spacing or alignment

Use Case 2: Document Data Extraction

Extract structured data from invoices, receipts, or forms.

Python

import anthropic
import base64
import json

client = anthropic.Anthropic()

with open("invoice.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="""Extract invoice data into JSON. Return ONLY valid JSON, no other text.

Expected format:
{
  "invoice_number": "string",
  "date": "YYYY-MM-DD",
  "vendor": "string",
  "items": [{"description": "string", "quantity": number, "unit_price": number, "total": number}],
  "subtotal": number,
  "tax": number,
  "total": number
}""",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Extract all data from this invoice."
                }
            ]
        }
    ]
)

invoice_data = json.loads(message.content[0].text)
print(json.dumps(invoice_data, indent=2))

TypeScript

import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "fs";

const client = new Anthropic();

const imageData = readFileSync("invoice.png").toString("base64");

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  system: `Extract invoice data into JSON. Return ONLY valid JSON, no other text.

Expected format:
{
  "invoice_number": "string",
  "date": "YYYY-MM-DD",
  "vendor": "string",
  "items": [{"description": "string", "quantity": number, "unit_price": number, "total": number}],
  "subtotal": number,
  "tax": number,
  "total": number
}`,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/png",
            data: imageData,
          },
        },
        {
          type: "text",
          text: "Extract all data from this invoice.",
        },
      ],
    },
  ],
});

if (message.content[0].type === "text") {
  const invoiceData = JSON.parse(message.content[0].text);
  console.log(JSON.stringify(invoiceData, null, 2));
}

For reliable JSON extraction, combine vision with structured output (covered in Article 10).
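
In the meantime, two defensive steps help with the examples above. Even with a "JSON only" instruction, a reply can arrive wrapped in a Markdown code fence, which makes a bare `json.loads` fail. Extracted numbers can also be misread, so it is worth checking that the totals add up. The sketch below does both. `parse_json_reply` and `invoice_problems` are hypothetical helpers, and the arithmetic check assumes invoices without discounts or shipping lines.

```python
import json
import math

FENCE = "`" * 3  # a Markdown code fence marker


def parse_json_reply(text: str) -> dict:
    """Parse a JSON reply, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag.
        _, _, cleaned = cleaned.partition("\n")
        # Drop the closing fence, if present.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)


def invoice_problems(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return arithmetic inconsistencies; an empty list means the totals agree."""
    problems = []
    for i, item in enumerate(invoice["items"]):
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["total"], abs_tol=tolerance):
            problems.append(f"item {i}: quantity x unit_price != {item['total']}")
    items_sum = sum(item["total"] for item in invoice["items"])
    if not math.isclose(items_sum, invoice["subtotal"], abs_tol=tolerance):
        problems.append(f"subtotal {invoice['subtotal']} != sum of items {items_sum}")
    if not math.isclose(invoice["subtotal"] + invoice["tax"], invoice["total"], abs_tol=tolerance):
        problems.append(f"total {invoice['total']} != subtotal + tax")
    return problems
```

A non-empty result does not prove the extraction is wrong, but it is a cheap signal that the invoice deserves a human look.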


Use Case 3: Chart and Graph Analysis

Claude can read data from charts, graphs, and plots.

Python

import anthropic
import base64

client = anthropic.Anthropic()

with open("sales-chart.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": """Analyze this chart and provide:
1. What type of chart is this?
2. What data does it show?
3. What are the key numbers (highest value, lowest value, average)?
4. What trends do you see?
5. Summarize the main insight in one sentence."""
                }
            ]
        }
    ]
)

print(message.content[0].text)

Claude handles bar charts, line charts, pie charts, scatter plots, and most standard chart types. It is less accurate with very small or low-contrast charts.


Comparing Multiple Images

Send multiple images in one request to compare them.

Python

import anthropic
import base64

client = anthropic.Anthropic()

def load_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

before = load_image("before.png")
after = load_image("after.png")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Compare these two screenshots. The first is the old design, the second is the new design."
                },
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": before}
                },
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": after}
                },
                {
                    "type": "text",
                    "text": "List all visual differences between the two designs."
                }
            ]
        }
    ]
)

print(message.content[0].text)

Use cases for multi-image analysis:

  • Before/after comparisons
  • Comparing designs across different themes (light vs dark)
  • Verifying UI consistency across screens
  • Batch document processing
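
When a request carries more than two images, it helps to label each one so that your prompt and Claude's answer can refer to "Image 1", "Image 2", and so on. The sketch below builds such a content list. `labeled_image_content` is a hypothetical helper, and the "Image N:" wording is a convention assumed here, not a requirement of the API.

```python
def labeled_image_content(images: list[tuple[str, str]], question: str) -> list[dict]:
    """Build content blocks from (media_type, base64_data) pairs, each preceded by a label."""
    content = []
    for number, (media_type, data) in enumerate(images, start=1):
        content.append({"type": "text", "text": f"Image {number}:"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        })
    # The question goes last, after all images have been introduced.
    content.append({"type": "text", "text": question})
    return content
```

The returned list is used as the `content` value of the user message.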

OCR: Reading Text from Images

Claude can read text from photos, scanned documents, and screenshots. It works like OCR, but it also interprets what it reads.

Python

import anthropic
import base64

client = anthropic.Anthropic()

with open("whiteboard-photo.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Read all text from this whiteboard photo. Organize it into a structured markdown document."
                }
            ]
        }
    ]
)

print(message.content[0].text)

Unlike traditional OCR tools, Claude understands context. It can:

  • Correct obvious spelling errors in handwritten text
  • Organize messy notes into structured documents
  • Translate text in images
  • Understand diagrams and their labels together

Technical Diagram Analysis

Claude handles UML diagrams, architecture diagrams, flowcharts, and similar technical visuals.

Python

import anthropic
import base64

client = anthropic.Anthropic()

with open("architecture-diagram.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="You are a software architect. Analyze technical diagrams and explain them clearly.",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": """Analyze this architecture diagram:
1. List all components and their roles
2. Describe the data flow between components
3. Identify any potential bottlenecks or single points of failure
4. Suggest improvements"""
                }
            ]
        }
    ]
)

print(message.content[0].text)

Image Token Costs

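Images are converted to tokens for billing, and the cost depends on image size. Anthropic's documentation gives a rule of thumb for images that do not need resizing: tokens ≈ (width x height) / 750.

Image Size | Approximate Tokens
200 x 200 px | ~54 tokens
500 x 500 px | ~330 tokens
1000 x 1000 px | ~1,330 tokens
1500 x 1500 px | ~1,500 to 1,600 tokens (resized to ~1.15 MP)
4000 x 4000 px | ~1,500 to 1,600 tokens (resized to ~1.15 MP)

Images larger than about 1.15 megapixels are resized before processing. There is no benefit to sending a 4K screenshot: it will be resized to approximately 1.15 megapixels and cost the same as an image already at that size.

Cost example: at an input price of $3 per million tokens (an assumption; check the current rate for Sonnet 4.6), one 1000 x 1000 screenshot costs approximately $0.004 in image tokens, plus the cost of the text prompt and output.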
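
A rough estimator for image tokens and cost, as a sketch. It assumes the rule of thumb from Anthropic's vision documentation, tokens ≈ (width x height) / 750, and caps the pixel count at the 1.15-megapixel resize threshold. The price is a parameter because rates differ by model and change over time; the $3 per million tokens used in the example is an assumption. Treat the results as estimates, not billing figures.

```python
import math

MAX_PIXELS = 1_150_000   # resize threshold discussed in this article
PIXELS_PER_TOKEN = 750   # assumption: rule of thumb from Anthropic's vision docs


def estimate_image_tokens(width: int, height: int) -> int:
    """Estimate input tokens for one image, capping pixels at the resize threshold."""
    effective_pixels = min(width * height, MAX_PIXELS)
    return math.ceil(effective_pixels / PIXELS_PER_TOKEN)


def estimate_image_cost(width: int, height: int, usd_per_million_tokens: float) -> float:
    """Estimate the input cost in USD for one image at the given token price."""
    return estimate_image_tokens(width, height) * usd_per_million_tokens / 1_000_000


print(estimate_image_tokens(1000, 1000))  # 1334
print(estimate_image_tokens(4000, 4000))  # 1534, same as any image above the threshold
```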

Optimization Tips

  1. Resize before sending — Crop to the relevant area and resize to under 1.15 megapixels
  2. Use JPEG for photos — JPEG files are smaller than PNG for photographs
  3. Use PNG for screenshots — PNG preserves text clarity better than JPEG
  4. Batch related images — Send multiple images in one request instead of separate calls

What Claude Cannot See

Vision is powerful but has limitations:

  • Very small text — Text under ~12px in a screenshot may be misread
  • Blurry images — Low-resolution or out-of-focus images reduce accuracy
  • Complex spatial reasoning — “Is the red dot above or below the blue line?” can be unreliable
  • Exact pixel measurements — Claude cannot measure exact distances in pixels
  • CAPTCHAs — Claude will not attempt to solve CAPTCHAs
  • Very dense documents — Pages with tiny fonts and hundreds of data points may lose some data

For best results, send clear, well-lit images at a reasonable size. Crop to the area of interest when possible.
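
Some of these problems can be caught before the request is sent. The sketch below reads the width and height from a PNG file header using only the standard library, then returns warnings based on the limits in this article. `png_dimensions` and `preflight` are hypothetical helpers, and the 200-pixel lower bound is an assumption about where small images start to hurt accuracy.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(header: bytes) -> tuple[int, int]:
    """Read (width, height) from the first 24 bytes of a PNG file."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    # Width and height are big-endian 32-bit integers right after the IHDR tag.
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def preflight(width: int, height: int) -> list[str]:
    """Return human-readable warnings for an image of the given size."""
    warnings = []
    if width > 8000 or height > 8000:
        warnings.append("exceeds the 8000 x 8000 px limit; resize before sending")
    elif width * height > 1_150_000:
        warnings.append("larger than ~1.15 MP; it will be downscaled, so consider cropping")
    if min(width, height) < 200:
        warnings.append("very small image; fine text may be unreadable")
    return warnings
```

To check a file, read its first 24 bytes with `open(path, "rb").read(24)` and pass them to `png_dimensions`.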


Real-World Example: Automated Screenshot Testing

Here is a practical example that checks a screenshot against a list of requirements:

Python

import anthropic
import base64
import json

client = anthropic.Anthropic()

def load_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def check_screenshot(screenshot_path: str, requirements: list[str]) -> dict:
    image_data = load_image(screenshot_path)

    requirements_text = "\n".join(f"- {r}" for r in requirements)

    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="""You are a QA tester. Check if a screenshot meets the given requirements.

Return JSON:
{
  "pass": true/false,
  "results": [
    {"requirement": "...", "status": "pass" or "fail", "details": "..."}
  ]
}

Return ONLY valid JSON.""",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": image_data
                        }
                    },
                    {
                        "type": "text",
                        "text": f"Check this screenshot against these requirements:\n{requirements_text}"
                    }
                ]
            }
        ]
    )

    return json.loads(message.content[0].text)

# Use it
result = check_screenshot("login-page.png", [
    "Login form is visible with email and password fields",
    "Submit button says 'Sign In'",
    "There is a 'Forgot Password?' link",
    "Company logo is at the top of the page"
])

print(json.dumps(result, indent=2))

This pattern is useful for visual regression testing in CI/CD pipelines. Take a screenshot of your app, send it to Claude, and verify it matches expectations.
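
To use the result in a test suite, you need to turn the returned dictionary into a pass or fail. A minimal sketch follows, assuming the JSON shape requested in the system prompt above. `failed_requirements` and `assert_screenshot_ok` are hypothetical helpers. The check is derived from the per-requirement statuses instead of relying only on the top-level `pass` flag, since the two can disagree.

```python
def failed_requirements(result: dict) -> list[str]:
    """Return one message per requirement whose status is not 'pass'."""
    failures = []
    for item in result.get("results", []):
        if item.get("status") != "pass":
            failures.append(f"{item.get('requirement', '?')}: {item.get('details', '')}")
    return failures


def assert_screenshot_ok(result: dict) -> None:
    """Raise AssertionError if any requirement failed or the overall flag is false."""
    failures = failed_requirements(result)
    if failures or not result.get("pass", False):
        details = "\n".join(failures or ["top-level 'pass' is false"])
        raise AssertionError("Screenshot check failed:\n" + details)
```

Because model output can vary between runs, consider treating a failure as a prompt for human review before blocking a release on it.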


Summary

Feature | Details
Formats | PNG, JPEG, GIF, WebP
Max images per call | 100
Optimal size | 1.15 megapixels
Send via | Base64 or URL
Cost | ~1,600 tokens for an image near 1.15 megapixels
Best for | Screenshots, documents, charts, diagrams, OCR

Vision makes Claude useful for tasks that were previously impossible with text-only AI. Combine it with tool use and structured output for powerful document processing pipelines.


What’s Next?

In the next article, we will cover Structured Output — getting guaranteed valid JSON from Claude using schemas.

Next: Structured Output — JSON Mode and Schemas