Claude AI Tutorial #7: Messages API Deep Dive — Roles, Streaming, Multi-Turn

The Messages API is the foundation of everything you build with Claude. Every feature — tool use, vision, streaming, caching — goes through this API. Understanding it deeply will make every other article in this series easier.

This is Article 7 in the Claude AI — From Zero to Power User series. You should have completed Article 2: Getting Started before this article.

By the end of this article, you will know how to build multi-turn conversations, stream responses in real time, and track your token usage.

Messages API Anatomy

Every API call has the same basic structure:

Python

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a helpful coding assistant.",
    messages=[
        {"role": "user", "content": "What is a list comprehension in Python?"}
    ]
)

print(message.content[0].text)

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: "You are a helpful coding assistant.",
  messages: [
    { role: "user", content: "What is a list comprehension in Python?" }
  ],
});

if (message.content[0].type === "text") {
  console.log(message.content[0].text);
}

The required parameters are:

Parameter	Type	Description
`model`	string	Which Claude model to use
`max_tokens`	integer	Maximum tokens in the response
`messages`	array	The conversation history

Optional but common:

Parameter	Type	Description
`system`	string	System prompt (instructions for Claude)
`temperature`	float	Randomness (0.0 to 1.0)
`stop_sequences`	array	Stop generating when these strings appear
`metadata`	object	User ID for tracking and abuse detection

The Response Object

Every API call returns a message object with this structure:

{
  "id": "msg_01ABC123",
  "type": "message",
  "role": "assistant",
  "model": "claude-sonnet-4-6",
  "content": [
    {
      "type": "text",
      "text": "A list comprehension is..."
    }
  ],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 25,
    "output_tokens": 150
  }
}

Key fields:

id — Unique message identifier
content — Array of content blocks (text, tool_use, etc.)
stop_reason — Why Claude stopped generating
usage — Token counts for cost tracking
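Because `content` is an array, one response can hold several blocks, and `content[0].text` reads only the first. The sketch below joins every text block. It works on a plain dict shaped like the JSON above. The helper name `extract_text` is hypothetical, and the sample (a text block followed by a `tool_use` block) is invented for illustration.

```python
def extract_text(message: dict) -> str:
    """Join the text of every text block in a response dict.

    Non-text blocks (for example tool_use) are skipped.
    """
    return "".join(
        block["text"]
        for block in message.get("content", [])
        if block.get("type") == "text"
    )


# Invented sample, shaped like the response object above.
sample = {
    "id": "msg_01ABC123",
    "type": "message",
    "role": "assistant",
    "content": [
        {"type": "text", "text": "Let me look that up."},
        {"type": "tool_use", "id": "toolu_01", "name": "lookup", "input": {}},
    ],
    "stop_reason": "tool_use",
    "usage": {"input_tokens": 25, "output_tokens": 40},
}

print(extract_text(sample))  # Let me look that up.
```

With the Python SDK, where blocks are objects rather than dicts, the same idea reads `"".join(b.text for b in message.content if b.type == "text")`.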

Stop Reasons

Stop Reason	Meaning
`end_turn`	Claude finished its response naturally
`max_tokens`	Hit the max_tokens limit (response may be incomplete)
`stop_sequence`	Hit one of your stop sequences
`tool_use`	Claude wants to call a tool

Always check stop_reason. If it is max_tokens, your response was cut off. Increase max_tokens or continue the conversation.
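Here is a minimal sketch of that check, written against a plain dict shaped like the response object above. The SDKs expose the same fields as attributes, for example `message.stop_reason`. The function name `check_stop_reason` is hypothetical. Unknown values raise an error instead of being treated as success, because the list above may not cover every value the API can return.

```python
def check_stop_reason(message: dict) -> str:
    """Classify why generation stopped, based on the response's stop_reason."""
    reason = message["stop_reason"]
    if reason == "end_turn":
        return "complete"
    if reason == "max_tokens":
        # Cut off: raise max_tokens or ask Claude to continue.
        return "truncated"
    if reason == "stop_sequence":
        # The response's stop_sequence field holds the string that matched.
        return f"stopped at {message.get('stop_sequence')!r}"
    if reason == "tool_use":
        return "tool call requested"
    # Fail loudly on anything else instead of assuming the text is complete.
    raise ValueError(f"Unhandled stop_reason: {reason!r}")


truncated = {"content": [{"type": "text", "text": "def parse("}], "stop_reason": "max_tokens"}
print(check_stop_reason(truncated))  # truncated
```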

Multi-Turn Conversations

Claude does not remember previous conversations. Each API call is independent. To have a multi-turn conversation, you send the entire message history every time.

Python

import anthropic

client = anthropic.Anthropic()

# Start the conversation
conversation = []

def chat(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a Python tutor. Give short, clear explanations with code examples.",
        messages=conversation
    )

    assistant_message = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_message})

    return assistant_message

# Turn 1
print(chat("What is a decorator in Python?"))

# Turn 2 — Claude remembers the context
print(chat("Can you show me a real-world example?"))

# Turn 3 — Claude still has full context
print(chat("How do I add arguments to a decorator?"))

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const conversation: Anthropic.MessageParam[] = [];

async function chat(userMessage: string): Promise<string> {
  conversation.push({ role: "user", content: userMessage });

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: "You are a Python tutor. Give short, clear explanations with code examples.",
    messages: conversation,
  });

  const assistantMessage =
    response.content[0].type === "text" ? response.content[0].text : "";
  conversation.push({ role: "assistant", content: assistantMessage });

  return assistantMessage;
}

// Turn 1
console.log(await chat("What is a decorator in Python?"));

// Turn 2
console.log(await chat("Can you show me a real-world example?"));

// Turn 3
console.log(await chat("How do I add arguments to a decorator?"));

Each API call sends the full conversation. By turn 3, the request contains 5 messages (3 user + 2 assistant), and the history grows to 6 once Claude's reply is appended. This means input token usage grows with every turn.

Cost tip: A 10-turn conversation with Sonnet 4.6 costs approximately $0.03-0.10, depending on response length. Use prompt caching to reduce repeated context costs.
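You can see this growth without calling the API. The sketch below replays the bookkeeping of `chat()` with placeholder strings, then estimates input tokens per turn under a simplifying assumption: every user message and every reply has a fixed token count. Real counts vary, and this ignores any per-message overhead. Both function names are hypothetical.

```python
def messages_sent_per_turn(turns: int) -> list:
    """Replay the chat() bookkeeping and record how many messages each request sends."""
    history = []
    sizes = []
    for k in range(1, turns + 1):
        history.append({"role": "user", "content": f"question {k}"})
        sizes.append(len(history))  # this list is what gets sent as `messages`
        history.append({"role": "assistant", "content": f"answer {k}"})
    return sizes


def estimated_input_tokens(turns: int, user_tokens: int, reply_tokens: int,
                           system_tokens: int = 0) -> list:
    """Rough input size per turn, assuming fixed token counts per message.

    Turn k sends the system prompt, k user messages and k - 1 assistant replies.
    """
    return [system_tokens + k * user_tokens + (k - 1) * reply_tokens
            for k in range(1, turns + 1)]


print(messages_sent_per_turn(3))               # [1, 3, 5]
print(estimated_input_tokens(3, 50, 300, 20))  # [70, 420, 770]
```

Summing the second list gives the input you pay for across the whole conversation: 1,260 tokens over three turns in this example, compared with 770 for the last request alone.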

Streaming Responses

By default, the API waits until Claude finishes generating the entire response before returning. Streaming gives you tokens as they are generated — much better for user-facing applications.

Python

import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain async/await in Python with examples"}
    ]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

print()  # New line at the end

# Get the final message object for usage stats
final_message = stream.get_final_message()
print(f"\nTokens used: {final_message.usage.input_tokens} in, {final_message.usage.output_tokens} out")

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const stream = client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [
    { role: "user", content: "Explain async/await in Python with examples" }
  ],
});

for await (const event of stream) {
  if (
    event.type === "content_block_delta" &&
    event.delta.type === "text_delta"
  ) {
    process.stdout.write(event.delta.text);
  }
}

console.log();

const finalMessage = await stream.finalMessage();
console.log(
  `\nTokens used: ${finalMessage.usage.input_tokens} in, ${finalMessage.usage.output_tokens} out`
);

Stream Event Types

The streaming API sends these events:

Event	When	Contains
`message_start`	First event	Message metadata (id, model)
`content_block_start`	New content block begins	Block type and index
`content_block_delta`	Token generated	The actual text delta
`content_block_stop`	Content block finished	Block index
`message_delta`	Message metadata update	Stop reason, usage
`message_stop`	Last event	Nothing (signals end)

For most use cases, you only need content_block_delta events to get the streaming text.
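The SDK helpers shown earlier accumulate these events for you. To see what they do with the events in the table, here is a sketch that rebuilds the final text and stop reason from a list of already-parsed events. It uses plain dicts with the documented field names, and the sample values are invented. The stream may also contain `ping` events at any point. These events carry no content and can be skipped. The name `collect_stream_text` is hypothetical.

```python
from typing import List, Optional, Tuple


def collect_stream_text(events: List[dict]) -> Tuple[str, Optional[str]]:
    """Rebuild the streamed text and the stop reason from parsed stream events."""
    parts = []
    stop_reason = None
    for event in events:
        kind = event.get("type")
        if kind == "content_block_delta" and event["delta"].get("type") == "text_delta":
            parts.append(event["delta"]["text"])
        elif kind == "message_delta":
            stop_reason = event["delta"].get("stop_reason")
        # message_start, content_block_start/stop, message_stop and ping
        # events carry no text, so they fall through and are ignored.
    return "".join(parts), stop_reason


# Invented sample events, shaped like the decoded server-sent events.
sample_events = [
    {"type": "message_start", "message": {"id": "msg_01ABC123", "model": "claude-sonnet-4-6"}},
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
    {"type": "ping"},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": ", world"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 4}},
    {"type": "message_stop"},
]

print(collect_stream_text(sample_events))  # ('Hello, world', 'end_turn')
```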

Token Usage Tracking

Every response includes token counts. Track them to monitor costs.

Python

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Write a Python function to validate email addresses"}
    ]
)

usage = message.usage
print(f"Input tokens:  {usage.input_tokens}")
print(f"Output tokens: {usage.output_tokens}")
print(f"Total tokens:  {usage.input_tokens + usage.output_tokens}")

# Calculate cost (Sonnet 4.6 pricing)
input_cost = usage.input_tokens * 3.00 / 1_000_000
output_cost = usage.output_tokens * 15.00 / 1_000_000
total_cost = input_cost + output_cost
print(f"Cost: ${total_cost:.6f}")

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [
    { role: "user", content: "Write a Python function to validate email addresses" }
  ],
});

const usage = message.usage;
console.log(`Input tokens:  ${usage.input_tokens}`);
console.log(`Output tokens: ${usage.output_tokens}`);
console.log(`Total tokens:  ${usage.input_tokens + usage.output_tokens}`);

// Calculate cost (Sonnet 4.6 pricing)
const inputCost = (usage.input_tokens * 3.0) / 1_000_000;
const outputCost = (usage.output_tokens * 15.0) / 1_000_000;
const totalCost = inputCost + outputCost;
console.log(`Cost: $${totalCost.toFixed(6)}`);

Pricing Reference (March 2026)

Model	Input ($/MTok)	Output ($/MTok)
Opus 4.6	$5.00	$25.00
Sonnet 4.6	$3.00	$15.00
Haiku 4.5	$1.00	$5.00

MTok = million tokens. A typical coding question uses 200-500 input tokens and 300-1000 output tokens.
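To compare models, you can drive the cost calculation from the pricing table instead of hard-coding Sonnet's rates. In this sketch, the dictionary keys are the display names from the table, not API model IDs, and `estimate_cost` is a hypothetical helper. The example prices the upper end of a typical coding question above: 500 input tokens and 1,000 output tokens.

```python
# USD per million tokens (input, output), copied from the March 2026 table above.
PRICES = {
    "Opus 4.6": (5.00, 25.00),
    "Sonnet 4.6": (3.00, 15.00),
    "Haiku 4.5": (1.00, 5.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


for name in PRICES:
    print(f"{name}: ${estimate_cost(name, 500, 1_000):.4f}")
# Opus 4.6: $0.0275
# Sonnet 4.6: $0.0165
# Haiku 4.5: $0.0055
```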

Setting max_tokens

The max_tokens parameter controls the maximum length of Claude’s response. It does not affect the input — only the output.

Python

import anthropic

client = anthropic.Anthropic()

# Short response: good for classification, extraction
short = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    messages=[{"role": "user", "content": "Is this a bug report or feature request? Text: 'The app crashes when I click save'"}]
)

# Long response: good for code generation, explanations
long = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Write a complete REST API for user management in FastAPI"}]
)

Tips:

Set max_tokens to the maximum you might need, not the exact length you want
You are only charged for tokens actually generated, not the max_tokens limit
If Claude hits the limit, stop_reason will be max_tokens
Maximum values: 128K tokens for Opus 4.6, 64K for Sonnet 4.6 and Haiku 4.5

Stop Sequences

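Stop sequences tell Claude to stop generating as soon as it outputs one of the strings you specify. The matched string is not included in the returned text; the response's `stop_sequence` field tells you which one was hit.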

Python

import anthropic

client = anthropic.Anthropic()

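# "END" rather than a code fence: a fence stop sequence would also match
# the opening fence of the code block and cut the reply off immediately.
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    stop_sequences=["END"],
    messages=[
        {"role": "user", "content": "Write a Python function to sort a list. Reply with only the code, then write END on its own line."}
    ]
)

print(message.content[0].text)
print(f"Stop reason: {message.stop_reason}")  # expected: "stop_sequence"
print(f"Matched: {message.stop_sequence}")    # expected: "END"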

Common uses:

Stop at a closing code block marker
Stop at a specific delimiter in structured output
Stop at “END” or similar markers in formatted responses
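A local approximation helps show what a stop sequence does to the text you get back: the output is cut at the first match, and the match itself is dropped. `apply_stop_sequences` is a hypothetical helper that works on finished text, so it ignores tokenization and overlapping sequences. The second example shows a pitfall. A code-fence stop sequence also matches the opening fence, so a reply that starts with a code block comes back empty.

```python
from typing import List, Optional, Tuple

FENCE = "`" * 3  # a Markdown code fence


def apply_stop_sequences(text: str, stop_sequences: List[str]) -> Tuple[str, Optional[str]]:
    """Cut `text` at the earliest stop sequence.

    Returns (text before the match, matched sequence), or (text, None) if
    nothing matched. The matched string itself is dropped, as in the API.
    """
    best_index, best_seq = None, None
    for seq in stop_sequences:
        i = text.find(seq)
        if i != -1 and (best_index is None or i < best_index):
            best_index, best_seq = i, seq
    if best_index is None:
        return text, None
    return text[:best_index], best_seq


reply = "def sort_list(items):\n    return sorted(items)\nEND\nextra text"
print(apply_stop_sequences(reply, ["END"]))
# ('def sort_list(items):\n    return sorted(items)\n', 'END')

fenced = FENCE + "python\ndef sort_list(items):\n    return sorted(items)\n" + FENCE
text, matched = apply_stop_sequences(fenced, [FENCE])
print(repr(text))  # '' because the opening fence matches first
```

Pick stop strings that cannot appear early by accident. Note that matching is case-sensitive and substring-based, so a short marker like "END" would also match inside a word such as "APPEND".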

Error Handling

The API returns standard HTTP status codes, and both SDKs raise a specific exception class for each common one. Here is how to handle them.

Python

import anthropic
import time

client = anthropic.Anthropic(max_retries=0)  # turn off SDK retries; this function handles them

def call_claude(messages: list, retries: int = 3) -> str:
    for attempt in range(retries):
        last_attempt = attempt == retries - 1
        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages
            )
            return response.content[0].text

        except anthropic.RateLimitError:
            # 429: too many requests
            if last_attempt:
                raise
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, ...
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)

        except anthropic.AuthenticationError:
            # 401: bad API key, retrying will not help
            print("Invalid API key. Check your ANTHROPIC_API_KEY.")
            raise

        except anthropic.BadRequestError as e:
            # 400: invalid request (bad params, too many tokens, etc.)
            print(f"Bad request: {e}")
            raise

        except (anthropic.APIConnectionError, anthropic.APIStatusError) as e:
            # Retry network failures (no status code) and 5xx server errors;
            # re-raise any other 4xx error
            status = getattr(e, "status_code", None)
            if (status is None or status >= 500) and not last_attempt:
                wait = 2 ** attempt
                print(f"Server or network error. Retrying in {wait}s...")
                time.sleep(wait)
            else:
                raise

    raise RuntimeError("Max retries exceeded")

result = call_claude([{"role": "user", "content": "Hello"}])
print(result)

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ maxRetries: 0 }); // turn off SDK retries; this function handles them

async function callClaude(
  messages: Anthropic.MessageParam[],
  retries = 3
): Promise<string> {
  for (let attempt = 0; attempt < retries; attempt++) {
    const lastAttempt = attempt === retries - 1;
    try {
      const response = await client.messages.create({
        model: "claude-sonnet-4-6",
        max_tokens: 1024,
        messages,
      });

      return response.content[0].type === "text"
        ? response.content[0].text
        : "";
    } catch (error) {
      if (error instanceof Anthropic.RateLimitError && !lastAttempt) {
        const wait = Math.pow(2, attempt) * 1000;
        console.log(`Rate limited. Waiting ${wait}ms...`);
        await new Promise((r) => setTimeout(r, wait));
      } else if (error instanceof Anthropic.AuthenticationError) {
        console.log("Invalid API key.");
        throw error;
      } else if (error instanceof Anthropic.BadRequestError) {
        console.log(`Bad request: ${error.message}`);
        throw error;
      } else if (
        !lastAttempt &&
        (error instanceof Anthropic.APIConnectionError ||
          (error instanceof Anthropic.APIError && (error.status ?? 0) >= 500))
      ) {
        // Network failure or 5xx server error: back off and retry
        const wait = Math.pow(2, attempt) * 1000;
        console.log(`Server or network error. Retrying in ${wait}ms...`);
        await new Promise((r) => setTimeout(r, wait));
      } else {
        // Other 4xx errors, the final attempt, or non-API errors
        throw error;
      }
    }
  }
  throw new Error("Max retries exceeded");
}
}

const result = await callClaude([{ role: "user", content: "Hello" }]);
console.log(result);

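Both SDKs also retry on their own: by default, connection errors and 408, 409, 429, and 5xx responses are retried twice with a short exponential backoff. If you write your own retry loop, turn the built-in retries off (`max_retries=0` in Python, `maxRetries: 0` in TypeScript) so the two layers do not multiply. Otherwise, the built-in behavior is often enough on its own.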
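If you keep your own loop, the decision of what to retry can live in one small function. This sketch mirrors the default retry policy the official SDKs document: connection errors and 408, 409, 429, and 5xx responses. The names `is_retryable` and `backoff_delays` are illustrative and not part of either SDK. In Python, you might call `is_retryable(getattr(e, "status_code", None))` inside an `except anthropic.APIError as e:` block.

```python
from typing import List, Optional

RETRYABLE_4XX = {408, 409, 429}  # request timeout, conflict, rate limit


def is_retryable(status_code: Optional[int]) -> bool:
    """Return True if a failed request is worth retrying.

    Pass None for connection errors, where no HTTP response was received.
    """
    if status_code is None:
        return True
    return status_code in RETRYABLE_4XX or status_code >= 500


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Seconds to wait between `attempts` tries: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts - 1)]


print([code for code in (400, 401, 408, 429, 500, 529) if is_retryable(code)])  # [408, 429, 500, 529]
print(backoff_delays(3))  # [1.0, 2.0]
print(backoff_delays(8))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

Three attempts need only two waits, which is why `backoff_delays(3)` returns two values. The cap keeps a long retry chain from sleeping for minutes.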

Metadata for User Tracking

If you are building an app where multiple users call Claude, use the metadata parameter to track which user made each request.

Python

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    metadata={"user_id": "user_12345"},
    messages=[
        {"role": "user", "content": "Help me debug this code"}
    ]
)

Anthropic may use this ID to help detect abuse; it does not affect Claude's response. Use an opaque value such as a UUID or a hash, and never put a name, email address, or phone number in it.
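The `user_id` should not reveal who the user is. Anthropic's API reference asks for an opaque identifier such as a UUID or hash. One way to derive a stable one from whatever you store internally, such as an email or a database key, is a keyed hash. This sketch uses Python's standard `hmac` and `hashlib` modules. The names `opaque_user_id` and `USER_ID_SALT` are hypothetical, and the secret shown is a placeholder you would load from configuration.

```python
import hashlib
import hmac

# Hypothetical server-side secret; load it from config and never send it to the API.
USER_ID_SALT = b"replace-with-a-long-random-secret"


def opaque_user_id(internal_id: str) -> str:
    """Map an internal user ID or email to a stable, non-identifying hex string."""
    digest = hmac.new(USER_ID_SALT, internal_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


uid = opaque_user_id("alice@example.com")
metadata = {"user_id": uid}  # pass as metadata=... to client.messages.create
print(len(uid))  # 64
```

The same input always produces the same ID, so per-user patterns stay visible. Because the hash is keyed, someone who only sees the ID cannot simply hash candidate email addresses to reverse it.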

Long Context (1M Tokens)

Claude supports up to 1 million tokens of input context. This is a beta feature with specific requirements:

Access: Tier 4 API access required
Header: Include anthropic-beta: context-1m-2025-08-07 in your request
Pricing: Requests over 200K input tokens cost 2x for input, 1.5x for output
Models: Opus 4.6 and Sonnet 4.6 only

Python

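import anthropic
from pathlib import Path

client = anthropic.Anthropic()

# Hypothetical input: your codebase concatenated into a single text file
large_codebase = Path("codebase_dump.txt").read_text()

# For requests over 200K tokens, add the beta header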
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    extra_headers={"anthropic-beta": "context-1m-2025-08-07"},
    system="You are a code analyst. Analyze the entire codebase provided.",
    messages=[
        {"role": "user", "content": f"Here is the full codebase:\n\n{large_codebase}\n\nFind all security vulnerabilities."}
    ]
)

For most use cases, the standard 200K context window is enough. Use 1M context only when you genuinely need it — the cost premium adds up.
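To see how quickly the premium adds up, here is a sketch of Sonnet 4.6 request cost under the pricing rules listed above. It assumes the premium applies to every token of a request once its input exceeds 200K, which is how the rule above reads. `sonnet_request_cost` is a hypothetical helper, and prompt caching and other discounts are ignored.

```python
LONG_CONTEXT_THRESHOLD = 200_000  # input tokens


def sonnet_request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one Sonnet 4.6 request, applying the long-context premium."""
    input_price, output_price = 3.00, 15.00  # USD per million tokens
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        input_price *= 2.0   # 2x for input
        output_price *= 1.5  # 1.5x for output
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


print(f"${sonnet_request_cost(150_000, 4_000):.2f}")  # $0.51
print(f"${sonnet_request_cost(500_000, 4_000):.2f}")  # $3.09
print(f"${sonnet_request_cost(199_000, 4_000):.2f}")  # $0.66
print(f"${sonnet_request_cost(201_000, 4_000):.2f}")  # $1.30
```

Note the jump at the threshold. A request with 201,000 input tokens costs roughly twice as much as one with 199,000, so trimming input just below 200K can pay off.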

Practical Pattern: Conversation with Cost Tracking

Here is a complete example that combines multi-turn conversation with running cost tracking:

Python

import anthropic

client = anthropic.Anthropic()

conversation = []
total_input_tokens = 0
total_output_tokens = 0

def chat(user_message: str) -> str:
    global total_input_tokens, total_output_tokens

    conversation.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="You are a senior software engineer. Be concise and practical.",
        messages=conversation
    )

    assistant_text = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_text})

    total_input_tokens += response.usage.input_tokens
    total_output_tokens += response.usage.output_tokens

    cost = (total_input_tokens * 3.00 + total_output_tokens * 15.00) / 1_000_000
    print(f"[Tokens: {total_input_tokens} in, {total_output_tokens} out | Cost: ${cost:.4f}]")

    return assistant_text

# Use it
print(chat("What is the best way to handle errors in a REST API?"))
print(chat("Show me an example in FastAPI"))
print(chat("How do I add logging to this?"))

This pattern is useful for any interactive application where you want to monitor costs in real time.
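The global counters above work for a script, but in a larger app you may want one tracker per conversation. A small dataclass is one way to do that. `UsageTracker` is a hypothetical name, its default prices are the Sonnet 4.6 rates from the pricing table, and the token counts below are sample values.

```python
from dataclasses import dataclass


@dataclass
class UsageTracker:
    input_price: float = 3.00    # USD per million input tokens (Sonnet 4.6)
    output_price: float = 15.00  # USD per million output tokens (Sonnet 4.6)
    input_tokens: int = 0
    output_tokens: int = 0
    turns: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one response's usage to the running totals."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.turns += 1

    @property
    def cost(self) -> float:
        """Total USD spent so far in this conversation."""
        return (self.input_tokens * self.input_price
                + self.output_tokens * self.output_price) / 1_000_000


tracker = UsageTracker()
tracker.record(120, 400)  # sample values for turn 1
tracker.record(560, 350)  # sample values for turn 2
print(tracker.turns, tracker.input_tokens, tracker.output_tokens, f"${tracker.cost:.4f}")
# 2 680 750 $0.0133
```

Inside `chat()`, you would call `tracker.record(response.usage.input_tokens, response.usage.output_tokens)` instead of updating globals.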

Summary

Feature	How
Basic call	`client.messages.create(model, max_tokens, messages)`
Multi-turn	Send full message history in every request
Streaming	Use `client.messages.stream()` (same method name in Python and TypeScript)
Token tracking	Check `response.usage.input_tokens` and `output_tokens`
Error handling	Catch specific error types, retry with backoff
Long context	Beta header + Tier 4 access for 1M tokens

The Messages API is simple but powerful. Master these patterns, and every other Claude feature will be easy to integrate.

What’s Next?

In the next article, we will cover Tool Use — teaching Claude to call your functions, get real data, and take actions.

Next: Tool Use (Function Calling)
