5. Messages API
The Messages API is the primary way to call an LLM from code.
Every interaction — single-turn completions, multi-turn conversations,
tool-using agents — goes through the same endpoint. Knowing the exact
shape of the request and response makes debugging straightforward.
5.1 Minimal Request
The three required fields are model, max_tokens, and
messages. The messages list must contain at least one message,
and each message needs a role and content.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What is a transformer architecture?"}
    ]
)
print(response.content[0].text)
5.2 Request Parameters
model — exact model ID string (pin to a version in production).
max_tokens — maximum output tokens. The model may produce
fewer; it will never produce more.
messages — list of {role, content} objects,
alternating user/assistant.
system — system prompt string (optional). Sent separately
from messages.
temperature — 0.0–1.0. Lower values produce more
deterministic output; higher values more varied. Default is 1.0.
top_p — nucleus sampling threshold. Rarely needed;
adjust temperature instead.
stop_sequences — list of strings that cause the model to
stop generating when encountered.
tools — tool definitions for function calling (Chapter 9).
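As a rough illustration of how the optional parameters sit alongside the required ones, the sketch below assembles the keyword arguments for client.messages.create without sending anything. The helper name build_request, the example system prompt, and the "END" stop sequence are hypothetical and not part of the SDK; the range check simply mirrors the 0.0 to 1.0 temperature range given above.

```python
def build_request(user_text, system=None, temperature=None, stop_sequences=None):
    """Assemble keyword arguments for client.messages.create (hypothetical helper)."""
    params = {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": user_text}],
    }
    if system is not None:
        # The system prompt is a top-level field, not an entry in messages.
        params["system"] = system
    if temperature is not None:
        if not 0.0 <= temperature <= 1.0:
            raise ValueError("temperature must be between 0.0 and 1.0")
        params["temperature"] = temperature
    if stop_sequences:
        params["stop_sequences"] = list(stop_sequences)
    return params


params = build_request(
    "List three sorting algorithms.",
    system="You are a concise technical assistant.",
    temperature=0.2,
    stop_sequences=["END"],
)
# With a configured client: response = client.messages.create(**params)
```

Omitting an optional argument leaves its key out of the dictionary entirely, so the API default applies rather than an explicit None.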
5.3 Response Structure
The response object has these key fields:
response.id # unique message ID
response.model # model that generated the response
response.stop_reason # "end_turn" | "max_tokens" | "stop_sequence" | "tool_use"
response.content # list of content blocks
response.usage.input_tokens
response.usage.output_tokens
# Extract text from the first content block:
text = response.content[0].text
Always check stop_reason. If it is "max_tokens" the
response was truncated — the answer is incomplete. Raise
max_tokens or break the task into smaller calls.
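One way to make the stop_reason check hard to forget is to wrap text extraction in a small helper. The sketch below is illustrative only: extract_text and TruncatedResponseError are hypothetical names, and the response is a stand-in object built with SimpleNamespace so the example runs without an API key. It assumes each content block carries a type field, with text blocks marked "text".

```python
from types import SimpleNamespace


class TruncatedResponseError(RuntimeError):
    """Raised when generation stopped because max_tokens was reached."""


def extract_text(response):
    """Return the concatenated text blocks, refusing truncated responses."""
    if response.stop_reason == "max_tokens":
        raise TruncatedResponseError(
            f"response truncated after {response.usage.output_tokens} output tokens"
        )
    # content is a list of blocks; keep only the text ones.
    return "".join(block.text for block in response.content if block.type == "text")


# Stand-in shaped like the fields listed above, so the sketch runs offline.
fake = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="Hello")],
    usage=SimpleNamespace(input_tokens=12, output_tokens=3),
)
print(extract_text(fake))
```

Raising an exception rather than returning partial text forces the caller to decide explicitly whether to retry with a larger max_tokens or split the task.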
5.4 Multi-Turn Conversations
There is no session state on the server. You build a conversation by resending
the full history of previous turns in the messages list on each call.
messages = []  # full conversation history, resent on every call

def chat(user_text):
    # client is the anthropic.Anthropic() instance created in 5.1
    messages.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=messages
    )
    assistant_text = response.content[0].text
    # Store the reply so the next call sees it as context.
    messages.append({"role": "assistant", "content": assistant_text})
    return assistant_text

print(chat("What is a monad?"))
print(chat("Give me a concrete example in Python."))
The messages list grows with every turn. Long conversations eventually
approach the context limit. Production applications summarise or truncate
old turns to stay within budget.
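A minimal sketch of the truncation approach follows, assuming a fixed message count is an acceptable stand-in for a real token budget. The name trim_history and its max_messages parameter are hypothetical; a production version would more likely count tokens or summarise the dropped turns.

```python
def trim_history(messages, max_messages):
    """Return the most recent messages, starting on a user turn."""
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    trimmed = messages[-max_messages:]
    # Drop leading assistant turns so the trimmed list still opens with a user message.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


history = [
    {"role": "user", "content": "What is a monad?"},
    {"role": "assistant", "content": "A monad is ..."},
    {"role": "user", "content": "Give me a concrete example in Python."},
    {"role": "assistant", "content": "Here is one ..."},
    {"role": "user", "content": "How does that differ from a functor?"},
]
recent = trim_history(history, max_messages=4)
```

The function returns a new list and leaves the original history untouched, so the full transcript can still be logged or summarised separately.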
5.5 References