AI Story

AI Story: Streaming

server-sent events, delta accumulation, error handling

7.  Streaming

Without streaming, your code blocks until the model finishes generating the entire response, then receives it all at once. For a 2,000-token response at typical inference speeds that is a 10–30 second wait with no feedback. Streaming delivers tokens as they are generated, which is essential for interactive applications and makes long responses feel responsive.

7.1  How It Works

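The API uses server-sent events (SSE): a persistent HTTP connection over which the server pushes a sequence of typed events. A message_start event opens the stream and reports the input token count. Each content block then arrives as a content_block_start event, a run of content_block_delta events that each carry a small chunk of text (typically 1–5 tokens), and a content_block_stop event. Near the end, a message_delta event carries the stop_reason and the output token usage, and a message_stop event closes the stream.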
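To make the wire format concrete, here is a minimal sketch that parses a hand-written SSE transcript and accumulates its text deltas. The transcript is an abbreviated illustration modeled on the Messages API event types, not a captured response, and parse_sse and accumulate are hypothetical helper names. The SDK normally does this parsing for you.

```python
import json

# Hand-written, abbreviated transcript for illustration; real events carry more fields.
SAMPLE_STREAM = "\n".join([
    'event: message_start',
    'data: {"type": "message_start", "message": {"usage": {"input_tokens": 12}}}',
    '',
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}',
    '',
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": ", world"}}',
    '',
    'event: message_delta',
    'data: {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 4}}',
    '',
    'event: message_stop',
    'data: {"type": "message_stop"}',
    '',
    '',
])


def parse_sse(raw):
    """Yield (event_name, payload) pairs; frames are separated by a blank line."""
    for frame in raw.split("\n\n"):
        name, data = None, None
        for line in frame.splitlines():
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data = json.loads(line[len("data:"):].strip())
        if name is not None and data is not None:
            yield name, data


def accumulate(events):
    """Join text deltas and pick up the stop reason and token usage."""
    parts, stop_reason, usage = [], None, {}
    for name, payload in events:
        if name == "message_start":
            usage.update(payload["message"]["usage"])
        elif name == "content_block_delta" and payload["delta"]["type"] == "text_delta":
            parts.append(payload["delta"]["text"])
        elif name == "message_delta":
            stop_reason = payload["delta"]["stop_reason"]
            usage.update(payload["usage"])
    return "".join(parts), stop_reason, usage


text, stop_reason, usage = accumulate(parse_sse(SAMPLE_STREAM))
print(text, stop_reason, usage)
```

The accumulated text is only complete once the stream has closed; anything read before that point is a prefix of the final response.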

7.2  Basic Streaming

The Python SDK exposes streaming via a context manager. Use client.messages.stream() instead of client.messages.create():
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain the Rust borrow checker."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    print()  # newline after stream ends

final = stream.get_final_message()
print(f"\nTokens: {final.usage.input_tokens} in, {final.usage.output_tokens} out")
stream.text_stream yields each text delta. The get_final_message() call after the context manager exits returns the complete accumulated response with usage data.

7.3  Raw Event Access

If you need the raw event objects (to detect tool use blocks, content block boundaries, or input token counts mid-stream), iterate over stream directly:
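with client.messages.stream(...) as stream:
    for event in stream:
        # Only text deltas have a .text field; tool-use blocks stream
        # input_json_delta chunks instead, so check the delta type first.
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            print(event.delta.text, end="", flush=True)
        elif event.type == "message_delta":
            print(f"\nStop reason: {event.delta.stop_reason}")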
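When a response mixes text and tool use, the raw events have to be grouped by content block. The following is a rough sketch of that bookkeeping, run against stand-in event objects so that it works without a network call. collect_blocks is a hypothetical helper, get_weather is a made-up tool name, and the stand-ins only mimic the fields the sketch reads (type, index, content_block, delta).

```python
import json
from types import SimpleNamespace as NS


def collect_blocks(events):
    """Rebuild content blocks from raw stream events, keyed by block index."""
    blocks = {}
    for event in events:
        if event.type == "content_block_start":
            block = event.content_block
            blocks[event.index] = {
                "type": block.type,
                "name": getattr(block, "name", None),  # set for tool_use blocks only
                "buffer": [],
            }
        elif event.type == "content_block_delta":
            if event.delta.type == "text_delta":
                blocks[event.index]["buffer"].append(event.delta.text)
            elif event.delta.type == "input_json_delta":
                blocks[event.index]["buffer"].append(event.delta.partial_json)
        elif event.type == "content_block_stop":
            block = blocks[event.index]
            joined = "".join(block.pop("buffer"))
            # Tool input is only guaranteed to be valid JSON once the block has closed.
            if block["type"] == "tool_use":
                block["value"] = json.loads(joined or "{}")
            else:
                block["value"] = joined
    return blocks


# Stand-in events for illustration; a real stream yields SDK event objects.
fake_events = [
    NS(type="content_block_start", index=0, content_block=NS(type="text")),
    NS(type="content_block_delta", index=0, delta=NS(type="text_delta", text="Checking the weather.")),
    NS(type="content_block_stop", index=0),
    NS(type="content_block_start", index=1, content_block=NS(type="tool_use", name="get_weather")),
    NS(type="content_block_delta", index=1, delta=NS(type="input_json_delta", partial_json='{"city": ')),
    NS(type="content_block_delta", index=1, delta=NS(type="input_json_delta", partial_json='"Oslo"}')),
    NS(type="content_block_stop", index=1),
]

blocks = collect_blocks(fake_events)
print(blocks[1]["name"], blocks[1]["value"])
```

Buffering the JSON fragments and parsing them only at content_block_stop avoids trying to decode a half-written object.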

7.4  Error Handling

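A network interruption mid-stream leaves the response incomplete. A robust streaming loop accumulates text as it arrives, catches the connection error, and then decides whether to retry the request or surface the partial result:
import anthropic

client = anthropic.Anthropic()
prompt = "Explain the Rust borrow checker."
accumulated = []  # text deltas received so far

try:
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            accumulated.append(text)
            print(text, end="", flush=True)
except anthropic.APIConnectionError:
    # The connection dropped; whatever arrived is still in accumulated.
    print("\n[stream interrupted]")

# Complete on success, a prefix of the response after an interruption.
response_text = "".join(accumulated)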
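If the decision is to retry, one possible shape is a small wrapper that restarts the request from scratch with exponential backoff. This is a sketch under stated assumptions, not an SDK feature: stream_with_retry and its parameters are hypothetical, and the demo uses the built-in ConnectionError with a fake stream so that it runs offline. With the SDK, open_stream would be a generator that opens client.messages.stream(...) and yields from stream.text_stream, and retry_on would be (anthropic.APIConnectionError,).

```python
import time


def stream_with_retry(open_stream, on_text, retry_on=(ConnectionError,),
                      max_attempts=3, base_delay=0.5):
    """Run a text stream to completion, restarting from scratch if it drops."""
    for attempt in range(1, max_attempts + 1):
        accumulated = []  # reset: a dropped response cannot be resumed
        try:
            for text in open_stream():
                accumulated.append(text)
                on_text(text)
            return "".join(accumulated)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts; let the caller handle it
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


# Fake stream for illustration: drops the connection on the first attempt only.
calls = {"n": 0}


def flaky_stream():
    calls["n"] += 1
    yield "Hello"
    if calls["n"] == 1:
        raise ConnectionError("simulated drop")
    yield ", world"


shown = []
result = stream_with_retry(flaky_stream, shown.append, base_delay=0.01)
print(result)
```

Because each retry is a fresh request, the new response may differ from the interrupted one. A user interface that has already displayed partial text should clear it before the retry starts printing; in the demo, shown ends up containing "Hello" twice for exactly that reason.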

7.5  When to Use Streaming

  • Use streaming when a human is watching the output: chat interfaces, terminal agents, progress indicators.
  • Skip streaming for batch processing where the full response is needed before the next step can start (structured output parsing, tool dispatch, evaluation pipelines). Non-streaming is simpler code for those cases.

7.6  References

Resource               Description
Streaming Reference    Complete event schema for the streaming Messages API.
Next: Prompt Caching   How to cache large context blocks to reduce cost and latency.