Claude AI Tutorial #9: Vision — Analyzing Images and Documents

Claude can see. Send it a screenshot, a photo of a document, a chart, or a technical diagram — and it will analyze what it sees. Vision turns Claude into a powerful tool for data extraction, UI review, and document processing.

This is Article 9 in the Claude AI — From Zero to Power User series. You should have completed Article 7: Messages API before this article.

By the end of this article, you will know how to send images to Claude, extract data from documents, analyze screenshots, and optimize image costs.

What Claude Can See

Claude supports four image formats:

PNG — screenshots, diagrams, charts
JPEG — photos, scanned documents
GIF — static and animated (first frame only)
WebP — web images

Limits:

Up to 100 images per API call
Maximum resolution: 8000 x 8000 pixels
Optimal size: 1.15 megapixels (about 1072 x 1072 pixels)

Images larger than 1.15 megapixels are automatically resized. Sending very large images adds upload time and latency without improving quality.
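To check a file against these limits before uploading, you can read its dimensions locally. The following is a minimal sketch that parses width and height from a PNG header using only the standard library. The function names (`png_dimensions`, `needs_resize`, `is_rejected`) are illustrative, and the thresholds are simply the figures quoted above.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
RECOMMENDED_MAX_PIXELS = 1_150_000  # ~1.15 megapixels, from the limits above
MAX_EDGE = 8000                     # maximum resolution, from the limits above


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of a PNG file."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a valid PNG header")
    # Width and height are two big-endian 32-bit integers at bytes 16-23.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def needs_resize(width: int, height: int) -> bool:
    """True if the image is above the recommended pixel budget."""
    return width * height > RECOMMENDED_MAX_PIXELS


def is_rejected(width: int, height: int) -> bool:
    """True if either edge exceeds the maximum resolution."""
    return width > MAX_EDGE or height > MAX_EDGE
```

Only the first 24 bytes of the file are needed, so `png_dimensions(open("screenshot.png", "rb").read(24))` is enough for a quick check. Other formats store their dimensions differently and would need their own parser or an imaging library.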

Sending Images: Base64

The most common way to send an image is to base64-encode it and include it in the request body. This works for any file you can read locally.

Python

import anthropic
import base64

client = anthropic.Anthropic()

# Read and encode the image
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this screenshot."
                }
            ]
        }
    ]
)

print(message.content[0].text)

TypeScript

import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "fs";

const client = new Anthropic();

const imageData = readFileSync("screenshot.png").toString("base64");

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/png",
            data: imageData,
          },
        },
        {
          type: "text",
          text: "Describe what you see in this screenshot.",
        },
      ],
    },
  ],
});

if (message.content[0].type === "text") {
  console.log(message.content[0].text);
}

Notice the message content is now an array of blocks — an image block and a text block. This is how you combine images with text instructions.
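The examples above hardcode `image/png`, so the `media_type` must be changed by hand for other formats. As a sketch, a small helper can build the image block from a file path and pick the media type from the extension. The helper name `image_block` and the `MEDIA_TYPES` mapping are assumptions made for this example, not part of the SDK.

```python
import base64
from pathlib import Path

# Extensions for the four supported formats listed earlier.
MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def image_block(path: str) -> dict:
    """Build a base64 image content block for a local file (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in MEDIA_TYPES:
        raise ValueError(f"unsupported image extension: {suffix!r}")
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": MEDIA_TYPES[suffix],
            "data": data,
        },
    }
```

The returned dictionary can be placed directly in the `content` list next to a text block, for example `[image_block("photo.jpg"), {"type": "text", "text": "Describe this photo."}]`.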

Sending Images: URL

You can also send images by URL. The API fetches the image from that address, so you do not need to download and encode it yourself.

Python

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/chart.png"
                    }
                },
                {
                    "type": "text",
                    "text": "What data does this chart show? List the key numbers."
                }
            ]
        }
    ]
)

print(message.content[0].text)

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "url",
            url: "https://example.com/chart.png",
          },
        },
        {
          type: "text",
          text: "What data does this chart show? List the key numbers.",
        },
      ],
    },
  ],
});

if (message.content[0].type === "text") {
  console.log(message.content[0].text);
}

URL-based images are simpler but require the image to be publicly accessible.
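Because the image has to be reachable from outside your network, it can help to reject obviously private addresses before making the request. The sketch below is a rough local check and not a guarantee: it only inspects the URL string and does no DNS lookup. The names `looks_public` and `url_image_block` are hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def looks_public(url: str) -> bool:
    """Heuristic: True if the URL is http(s) and not an obviously private host."""
    parsed = urlparse(url)
    host = parsed.hostname
    if parsed.scheme not in ("http", "https") or not host:
        return False
    if host == "localhost" or host.endswith(".local"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: nothing more can be said without resolving it.
        return True
    return ip.is_global


def url_image_block(url: str) -> dict:
    """Build a URL image content block, refusing URLs that fail the check."""
    if not looks_public(url):
        raise ValueError(f"URL does not look publicly accessible: {url}")
    return {"type": "image", "source": {"type": "url", "url": url}}
```

For images on localhost, an intranet, or behind authentication, fall back to base64 encoding.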

Use Case 1: Screenshot Analysis

Claude is excellent at analyzing UI screenshots. Use it for UI review, design feedback, or bug detection.

Python

import anthropic
import base64

client = anthropic.Anthropic()

with open("app-screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="""You are a UI/UX reviewer. Analyze screenshots and provide actionable feedback.

<rules>
- Focus on usability issues, accessibility problems, and visual inconsistencies
- Rate each issue as: critical, warning, or suggestion
- Suggest specific fixes for each issue
</rules>""",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Review this app screenshot for UI/UX issues."
                }
            ]
        }
    ]
)

print(message.content[0].text)

Claude can identify issues such as:

Text that is too small to read
Buttons without enough contrast
Missing loading states
Layout problems on different screen sizes
Inconsistent spacing or alignment

Use Case 2: Document Data Extraction

Extract structured data from invoices, receipts, or forms.

Python

import anthropic
import base64
import json

client = anthropic.Anthropic()

with open("invoice.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="""Extract invoice data into JSON. Return ONLY valid JSON, no other text.

Expected format:
{
  "invoice_number": "string",
  "date": "YYYY-MM-DD",
  "vendor": "string",
  "items": [{"description": "string", "quantity": number, "unit_price": number, "total": number}],
  "subtotal": number,
  "tax": number,
  "total": number
}""",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Extract all data from this invoice."
                }
            ]
        }
    ]
)

invoice_data = json.loads(message.content[0].text)
print(json.dumps(invoice_data, indent=2))

TypeScript

import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "fs";

const client = new Anthropic();

const imageData = readFileSync("invoice.png").toString("base64");

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  system: `Extract invoice data into JSON. Return ONLY valid JSON, no other text.

Expected format:
{
  "invoice_number": "string",
  "date": "YYYY-MM-DD",
  "vendor": "string",
  "items": [{"description": "string", "quantity": number, "unit_price": number, "total": number}],
  "subtotal": number,
  "tax": number,
  "total": number
}`,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/png",
            data: imageData,
          },
        },
        {
          type: "text",
          text: "Extract all data from this invoice.",
        },
      ],
    },
  ],
});

if (message.content[0].type === "text") {
  const invoiceData = JSON.parse(message.content[0].text);
  console.log(JSON.stringify(invoiceData, null, 2));
}

For reliable JSON extraction, combine vision with structured output (covered in Article 10).
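Even with a "Return ONLY valid JSON" instruction, a reply may occasionally arrive wrapped in a Markdown code fence or with a field missing, and a bare `json.loads` call will then fail or pass bad data along. The following is a defensive sketch, assuming the invoice format shown above. The function name `parse_invoice_json` and the `REQUIRED_KEYS` set are illustrative.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker
REQUIRED_KEYS = {"invoice_number", "date", "vendor", "items", "subtotal", "tax", "total"}


def parse_invoice_json(text: str) -> dict:
    """Parse the model's reply, tolerating an optional Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, then everything from the closing fence on.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    data = json.loads(cleaned)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

In the extraction example you would call `parse_invoice_json(message.content[0].text)` in place of `json.loads(...)` and handle `ValueError` by retrying or flagging the document for manual review.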

Use Case 3: Chart and Graph Analysis

Claude can read data from charts, graphs, and plots.

import anthropic
import base64

client = anthropic.Anthropic()

with open("sales-chart.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": """Analyze this chart and provide:
1. What type of chart is this?
2. What data does it show?
3. What are the key numbers (highest value, lowest value, average)?
4. What trends do you see?
5. Summarize the main insight in one sentence."""
                }
            ]
        }
    ]
)

print(message.content[0].text)

Claude handles bar charts, line charts, pie charts, scatter plots, and most standard chart types. It is less accurate with very small or low-contrast charts.

Comparing Multiple Images

Send multiple images in one request to compare them.

Python

import anthropic
import base64

client = anthropic.Anthropic()

def load_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

before = load_image("before.png")
after = load_image("after.png")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Compare these two screenshots. The first is the old design, the second is the new design."
                },
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": before}
                },
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": after}
                },
                {
                    "type": "text",
                    "text": "List all visual differences between the two designs."
                }
            ]
        }
    ]
)

print(message.content[0].text)

Use cases for multi-image analysis:

Before/after comparisons
Comparing designs across different themes (light vs dark)
Verifying UI consistency across screens
Batch document processing
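When a request contains several images, putting a short text label before each one makes it easier to refer to them in the prompt and in the answer. The sketch below builds such a content list from already-encoded images. The helper name `labeled_image_content` and the "Image N:" label format are choices made for this example, and the limit constant is the per-call figure quoted earlier in this article.

```python
MAX_IMAGES_PER_REQUEST = 100  # limit quoted earlier in this article


def labeled_image_content(images: list[tuple[str, str]], question: str) -> list[dict]:
    """Build a content list where each image follows an 'Image N:' text label.

    `images` holds (media_type, base64_data) pairs in the order they should appear.
    """
    if not images:
        raise ValueError("at least one image is required")
    if len(images) > MAX_IMAGES_PER_REQUEST:
        raise ValueError(f"too many images: {len(images)}")
    content: list[dict] = []
    for index, (media_type, data) in enumerate(images, start=1):
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        })
    # The instruction goes last, after all images.
    content.append({"type": "text", "text": question})
    return content
```

The result can be passed as the `content` of a single user message, and the prompt can then say things like "What changed between Image 1 and Image 2?".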

OCR: Reading Text from Images

Claude can read text from photos, scanned documents, and screenshots. It works like OCR but with understanding.

import anthropic
import base64

client = anthropic.Anthropic()

with open("whiteboard-photo.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Read all text from this whiteboard photo. Organize it into a structured markdown document."
                }
            ]
        }
    ]
)

print(message.content[0].text)

Unlike traditional OCR tools, Claude understands context. It can:

Correct obvious spelling errors in handwritten text
Organize messy notes into structured documents
Translate text in images
Understand diagrams and their labels together

Technical Diagram Analysis

Claude handles UML diagrams, architecture diagrams, flowcharts, and similar technical visuals.

import anthropic
import base64

client = anthropic.Anthropic()

with open("architecture-diagram.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="You are a software architect. Analyze technical diagrams and explain them clearly.",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": """Analyze this architecture diagram:
1. List all components and their roles
2. Describe the data flow between components
3. Identify any potential bottlenecks or single points of failure
4. Suggest improvements"""
                }
            ]
        }
    ]
)

print(message.content[0].text)

Image Token Costs

Images are converted to tokens for billing, and the cost depends on image size. Anthropic's documentation gives a rule of thumb: tokens ≈ (width × height) / 750, applied after any resizing.

Image Size	Approximate Tokens
200 x 200 px	~54 tokens
500 x 500 px	~330 tokens
1000 x 1000 px	~1,330 tokens
1500 x 1500 px	~1,600 tokens (resized to ~1.15MP)
4000 x 4000 px	~1,600 tokens (resized to ~1.15MP)

Images larger than 1.15 megapixels are resized before processing. There is no benefit to sending a 4K screenshot: it will be resized to approximately 1.15 megapixels and cost the same as a smaller copy.

Cost example: Analyzing one 1000x1000 screenshot with Sonnet 4.6 costs approximately $0.004 for the image tokens (assuming $3 per million input tokens) plus the cost of the text output.
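To budget for a batch of images, the estimate can be computed before sending anything. The sketch below uses the rule of thumb from Anthropic's documentation, tokens ≈ (width × height) / 750, and caps the pixel count at the 1.15 megapixel resize threshold described above. The function names and the default price of $3 per million input tokens are assumptions for illustration, so check current pricing for your model.

```python
RESIZE_THRESHOLD_PIXELS = 1_150_000  # ~1.15 megapixels; larger images are resized
PIXELS_PER_TOKEN = 750               # rule-of-thumb divisor


def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate input tokens for one image (an estimate, not an exact bill)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    effective_pixels = min(width * height, RESIZE_THRESHOLD_PIXELS)
    return round(effective_pixels / PIXELS_PER_TOKEN)


def estimate_image_cost(width: int, height: int, usd_per_million_tokens: float = 3.0) -> float:
    """Approximate cost in USD for the image tokens alone (price is an assumption)."""
    return estimate_image_tokens(width, height) * usd_per_million_tokens / 1_000_000
```

Summing `estimate_image_tokens` over a folder of screenshots gives a quick upper bound on the image portion of a job. The text prompt and the generated output are billed separately.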

Optimization Tips

Resize before sending — Crop to the relevant area and resize to under 1.15 megapixels
Use JPEG for photos — JPEG files are smaller than PNG for photographs
Use PNG for screenshots — PNG preserves text clarity better than JPEG
Batch related images — Send multiple images in one request instead of separate calls
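The first tip needs a target size. As a sketch, the function below computes dimensions that keep the aspect ratio and fit under the 1.15 megapixel figure. It only does the arithmetic, and the actual resampling would be done with an imaging library of your choice. The name `fit_within_budget` is hypothetical.

```python
import math

MAX_PIXELS = 1_150_000  # ~1.15 megapixels, the figure used in this article


def fit_within_budget(width: int, height: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return (width, height) scaled down to fit max_pixels, keeping aspect ratio.

    Images already within the budget are returned unchanged.
    """
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    # Scaling both edges by sqrt(ratio) scales the area by ratio.
    scale = math.sqrt(max_pixels / pixels)
    # Rounding down keeps the result at or below the budget.
    return max(1, int(width * scale)), max(1, int(height * scale))
```

For a square image this lands on 1072 x 1072, the "about 1072 x 1072" figure mentioned earlier.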

What Claude Cannot See

Vision is powerful but has limitations:

Very small text — Text under ~12px in a screenshot may be misread
Blurry images — Low-resolution or out-of-focus images reduce accuracy
Complex spatial reasoning — “Is the red dot above or below the blue line?” can be unreliable
Exact pixel measurements — Claude cannot measure exact distances in pixels
CAPTCHAs — Claude will not attempt to solve CAPTCHAs
Very dense documents — Pages with tiny fonts and hundreds of data points may lose some data

For best results, send clear, well-lit images at a reasonable size. Crop to the area of interest when possible.

Real-World Example: Automated Screenshot Testing

Here is a practical example that compares a screenshot against expected behavior:

Python

import anthropic
import base64
import json

client = anthropic.Anthropic()

def load_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def check_screenshot(screenshot_path: str, requirements: list[str]) -> dict:
    image_data = load_image(screenshot_path)

    requirements_text = "\n".join(f"- {r}" for r in requirements)

    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="""You are a QA tester. Check if a screenshot meets the given requirements.

Return JSON:
{
  "pass": true/false,
  "results": [
    {"requirement": "...", "status": "pass" or "fail", "details": "..."}
  ]
}

Return ONLY valid JSON.""",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": image_data
                        }
                    },
                    {
                        "type": "text",
                        "text": f"Check this screenshot against these requirements:\n{requirements_text}"
                    }
                ]
            }
        ]
    )

    return json.loads(message.content[0].text)

# Use it
result = check_screenshot("login-page.png", [
    "Login form is visible with email and password fields",
    "Submit button says 'Sign In'",
    "There is a 'Forgot Password?' link",
    "Company logo is at the top of the page"
])

print(json.dumps(result, indent=2))

This pattern is useful for visual regression testing in CI/CD pipelines. Take a screenshot of your app, send it to Claude, and verify it matches expectations.
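In a pipeline, the returned dictionary still has to be turned into a pass or fail decision. The sketch below assumes the JSON shape requested in the system prompt above and treats the run as failed if any individual requirement failed, even when the top-level `pass` flag says otherwise. The function name `evaluate_result` is illustrative.

```python
def evaluate_result(result: dict) -> tuple[bool, list[str]]:
    """Return (passed, failure_messages) for a check_screenshot-style result."""
    failures = []
    for item in result.get("results", []):
        if item.get("status") != "pass":
            requirement = item.get("requirement", "unknown requirement")
            details = item.get("details", "")
            failures.append(f"{requirement}: {details}")
    # Require both the overall flag and every individual check to agree.
    passed = result.get("pass") is True and not failures
    return passed, failures
```

A CI step could print each failure message and then exit with a non-zero status when `passed` is false. Because model output can vary between runs, treat this as a coarse check to run alongside pixel-based regression tools.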

Summary

Feature	Details
Formats	PNG, JPEG, GIF, WebP
Max images per call	100
Optimal size	1.15 megapixels
Send via	Base64 or URL
Cost	~1,330 tokens for a 1000x1000 image
Best for	Screenshots, documents, charts, diagrams, OCR

Vision makes Claude useful for tasks that were previously impossible with text-only AI. Combine it with tool use and structured output for powerful document processing pipelines.

What’s Next?

In the next article, we will cover Structured Output — getting guaranteed valid JSON from Claude using schemas.

Next: Structured Output — JSON Mode and Schemas
