7. Streaming
Without streaming, your code blocks until the model finishes generating the entire
response, then receives it all at once. For a 2,000-token response at typical
inference speeds, that is a 10–30 second wait with no feedback. Streaming
delivers tokens as they are generated, which is essential for interactive
applications and makes long responses feel responsive.
7.1 How It Works
The API uses server-sent events (SSE): a persistent HTTP connection over which
the server pushes a sequence of event objects. Text arrives in
content_block_delta events, each carrying a delta with a small chunk of text
(typically 1–5 tokens). Near the end of the stream, a message_delta event
carries the stop_reason and usage statistics, and a final message_stop event
closes the stream.
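
To make the wire format concrete, here is a minimal sketch of parsing SSE text into events and collecting the text and stop_reason. The SAMPLE payload is hand-written for illustration, not captured from a real response, and the helper names parse_sse and collect are hypothetical. In practice the SDK does this parsing for you.

```python
import json

# Hand-written illustrative payload: each event is an "event:" line, a "data:"
# line holding JSON, and a blank line that terminates the event.
SAMPLE = (
    "event: content_block_delta\n"
    'data: {"type": "content_block_delta", "index": 0, '
    '"delta": {"type": "text_delta", "text": "Hel"}}\n'
    "\n"
    "event: content_block_delta\n"
    'data: {"type": "content_block_delta", "index": 0, '
    '"delta": {"type": "text_delta", "text": "lo"}}\n'
    "\n"
    "event: message_delta\n"
    'data: {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, '
    '"usage": {"output_tokens": 2}}\n'
    "\n"
)


def parse_sse(raw):
    """Yield (event_name, payload_dict) pairs from SSE-formatted text."""
    event_name, data_lines = None, []
    for line in raw.splitlines():
        if line == "":  # blank line: the current event is complete
            if data_lines:
                yield event_name, json.loads("\n".join(data_lines))
            event_name, data_lines = None, []
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())


def collect(raw):
    """Join all text deltas and pick up the stop_reason from message_delta."""
    text_parts, stop_reason = [], None
    for name, payload in parse_sse(raw):
        if name == "content_block_delta":
            text_parts.append(payload["delta"].get("text", ""))
        elif name == "message_delta":
            stop_reason = payload["delta"].get("stop_reason")
    return "".join(text_parts), stop_reason
```

Calling collect(SAMPLE) joins the two text deltas and returns them together with the stop reason from the message_delta event.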
7.2 Basic Streaming
The Python SDK exposes streaming via a context manager. Use
client.messages.stream() instead of
client.messages.create():
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain the Rust borrow checker."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    print()  # newline after stream ends

final = stream.get_final_message()
print(f"\nTokens: {final.usage.input_tokens} in, {final.usage.output_tokens} out")

stream.text_stream yields each text delta. The
get_final_message() call after the context manager exits returns
the complete accumulated response with usage data.
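
The loop over text deltas can be factored into a small helper that also records how long the first chunk took to arrive, which is the latency a user actually perceives. This is a sketch only: consume_text and fake_stream are hypothetical names, and fake_stream stands in for stream.text_stream so the example runs without network access.

```python
import time


def consume_text(chunks, echo=True):
    """Consume an iterable of text chunks.

    Returns (full_text, seconds_to_first_chunk). The second value is None
    when the iterable yields nothing.
    """
    start = time.monotonic()
    first_chunk_after = None
    parts = []
    for chunk in chunks:
        if first_chunk_after is None:
            first_chunk_after = time.monotonic() - start
        parts.append(chunk)
        if echo:
            print(chunk, end="", flush=True)
    return "".join(parts), first_chunk_after


def fake_stream():
    """Stand-in for stream.text_stream: yields a few text deltas."""
    for piece in ["The borrow ", "checker ", "enforces ownership."]:
        yield piece
```

Inside the with block from the example above, the same helper could be called as consume_text(stream.text_stream), since it only assumes an iterable of strings.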
7.3 Raw Event Access
If you need the raw event objects (to detect tool use blocks, content block
boundaries, or input token counts mid-stream), iterate over
stream directly:
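
with client.messages.stream(...) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            # Only text deltas carry .text; other delta types (such as
            # tool-use input) must not be read as text.
            if event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
        elif event.type == "message_delta":
            print(f"\nStop reason: {event.delta.stop_reason}")
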
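
When the response may contain tool use, it helps to dispatch on the delta type as well as the event type. The sketch below assumes that tool input arrives as input_json_delta deltas with a partial_json field, which is our understanding of the API rather than something stated in this section. It uses SimpleNamespace stand-ins (FAKE_EVENTS, hypothetical) instead of real SDK event objects so it runs offline.

```python
from types import SimpleNamespace as NS


def handle_events(events):
    """Return (text, tool_input_json, stop_reason) from a sequence of events."""
    text_parts, json_parts, stop_reason = [], [], None
    for event in events:
        if event.type == "content_block_delta":
            if event.delta.type == "text_delta":
                text_parts.append(event.delta.text)
            elif event.delta.type == "input_json_delta":
                # Assumed shape: tool input streams as fragments of a JSON string
                json_parts.append(event.delta.partial_json)
        elif event.type == "message_delta":
            stop_reason = event.delta.stop_reason
    return "".join(text_parts), "".join(json_parts), stop_reason


# Hand-built stand-ins for SDK event objects (illustrative only).
FAKE_EVENTS = [
    NS(type="content_block_delta",
       delta=NS(type="text_delta", text="Checking the weather. ")),
    NS(type="content_block_delta",
       delta=NS(type="input_json_delta", partial_json='{"city": ')),
    NS(type="content_block_delta",
       delta=NS(type="input_json_delta", partial_json='"Oslo"}')),
    NS(type="message_delta", delta=NS(stop_reason="tool_use")),
]
```

The JSON fragments are only parseable once the block is complete, so the sketch concatenates them and leaves parsing to the caller.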
7.4 Error Handling
Network interruptions mid-stream leave the response incomplete. A robust
streaming loop catches exceptions and decides whether to retry or fail:

import anthropic

client = anthropic.Anthropic()
prompt = "Explain the Rust borrow checker."  # example prompt; substitute your own
accumulated = []  # text deltas received so far

try:
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            accumulated.append(text)
            print(text, end="", flush=True)
except anthropic.APIConnectionError:
    print("\n[stream interrupted]")
    # The partial result is in accumulated; decide here whether to retry or fail.

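
The retry half of that decision can be sketched as a wrapper with exponential backoff. Everything here is hypothetical scaffolding: stream_with_retry takes any zero-argument callable that returns an iterable of text chunks, and make_flaky simulates a connection that drops mid-stream. The sketch discards partial output from a failed attempt on the assumption that a retried request starts generating from scratch.

```python
import time


def stream_with_retry(open_stream, retry_on, max_attempts=3,
                      base_delay=0.5, sleep=time.sleep):
    """Run open_stream() until one attempt completes; return the full text.

    retry_on is the exception type (or tuple of types) treated as transient.
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        parts = []  # reset: text from a failed attempt is not reused
        try:
            for text in open_stream():
                parts.append(text)
            return "".join(parts)
        except retry_on:
            if attempt == max_attempts:
                raise
            # Back off 0.5 s, 1 s, 2 s, ... with the default base_delay
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky(fail_times):
    """Build a fake stream opener that drops mid-stream fail_times times."""
    state = {"calls": 0}

    def open_stream():
        state["calls"] += 1
        yield "partial "
        if state["calls"] <= fail_times:
            raise ConnectionError("simulated drop")
        yield "complete"

    return open_stream
```

With the SDK, open_stream could be a generator function that wraps the with block above and yields from stream.text_stream, and retry_on could be anthropic.APIConnectionError. Whether a retry is appropriate still depends on whether partial output has already been shown to a user.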
7.5 When to Use Streaming
- Use streaming when a human is watching the output: chat
  interfaces, terminal agents, progress indicators.
- Skip streaming for batch processing where the full response
  is needed before the next step can start (structured output parsing, tool
  dispatch, evaluation pipelines). Non-streaming is simpler code for those cases.
7.6 References