Code Story: LLM API

Messages API, structured output, streaming, prompt caching

4.  LLM API

Calling Claude directly from code gives you full programmatic control: you choose the model, the system prompt, the context, the output format, and whether to stream. This chapter covers the Anthropic Python SDK from a minimal completion through structured output, streaming, and prompt caching.
Why write code that calls the API?
  1. You can embed AI calls inside larger programs: scripts, batch processors, analysis pipelines.
  2. You control the system prompt precisely: the AI’s persona, constraints, and output format are set by you rather than by a chat interface’s defaults.
  3. Structured output (JSON) lets you parse and act on AI responses programmatically.
  4. Prompt caching reduces cost and latency when the same large context is reused across many calls.

4.1  A Minimal Completion

import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Explain Rust ownership in three sentences."
    }]
)
print(response.content[0].text)
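response.content[0].text assumes the first content block is text and that the answer was not cut off. The sketch below is more defensive. It joins every text block and rejects replies that stopped at max_tokens. It also shows that the Messages API keeps no state between calls, so a follow-up question means resending the whole conversation with alternating user and assistant roles. The helper names response_text and next_turn are hypothetical, and a SimpleNamespace stands in for the SDK's response object so the code runs without an API key.

```python
from types import SimpleNamespace


def response_text(response) -> str:
    """Concatenate the text of every text block in a Messages API response.

    Raises ValueError when generation stopped at max_tokens, because the
    text is then probably cut off.
    """
    if response.stop_reason == "max_tokens":
        raise ValueError("response truncated; raise max_tokens or shorten the request")
    # content is a list of blocks; only blocks of type "text" carry a .text field
    return "".join(block.text for block in response.content if block.type == "text")


def next_turn(history: list, reply: str, question: str) -> list:
    """Return a new message list extended with the assistant reply and a follow-up.

    The API does not remember earlier calls, so each request must resend
    the full conversation, alternating user and assistant roles.
    """
    return history + [
        {"role": "assistant", "content": reply},
        {"role": "user", "content": question},
    ]


# Stand-in shaped like an SDK Message (hypothetical data, no API call made).
fake = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="Every value has a single owner.")],
)
history = [{"role": "user", "content": "Explain Rust ownership in three sentences."}]
history = next_turn(history, response_text(fake), "Now explain borrowing.")
print([m["role"] for m in history])
```

With a real client, passing messages=history to the next client.messages.create call continues the same conversation.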

4.2  Structured Output

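Describe the expected JSON shape in the system prompt and parse the reply with json.loads. The model usually complies, but it can still wrap the object in prose or Markdown fences, so validate the parsed result (for example with pydantic) before acting on it.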
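import json

# client as created in 4.1
code_text = "def add(a, b):\n    return a + b"   # source code to summarize (placeholder)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system='Respond only with valid JSON matching {"summary": str, "complexity": int}.',
    messages=[{"role": "user", "content": code_text}]
)
# json.loads raises json.JSONDecodeError if the reply is not pure JSON
result = json.loads(response.content[0].text)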
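Here is a stdlib-only sketch of the validation step, so it runs without extra dependencies. It slices out the outermost {...} to tolerate surrounding prose or fences, parses it, and checks the two fields the system prompt asks for. CodeSummary and parse_summary are hypothetical names. With pydantic v2, a BaseModel with the same two fields and CodeSummary.model_validate_json(text) would replace the manual type checks, though it expects the whole string to be JSON, so the slicing step would still help.

```python
import json
from dataclasses import dataclass


@dataclass
class CodeSummary:
    summary: str
    complexity: int


def parse_summary(text: str) -> CodeSummary:
    """Extract and validate a {"summary": str, "complexity": int} object."""
    # Tolerate prose or Markdown fences around the object by slicing from
    # the first "{" to the last "}" before parsing.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(text[start:end + 1])  # JSONDecodeError subclasses ValueError
    if not isinstance(data.get("summary"), str):
        raise ValueError("'summary' must be a string")
    complexity = data.get("complexity")
    # bool is rejected explicitly because isinstance(True, int) is True in Python
    if not isinstance(complexity, int) or isinstance(complexity, bool):
        raise ValueError("'complexity' must be an integer")
    return CodeSummary(summary=data["summary"], complexity=complexity)


print(parse_summary('Here is the result: {"summary": "Adds two numbers.", "complexity": 1}'))
```

Every failure surfaces as ValueError. A caller can therefore catch one exception type and decide whether to retry the request or give up.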

4.3  Streaming

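Streaming prints text as the model generates it instead of waiting for the whole response, which keeps long answers and interactive tools responsive.
prompt = "Explain Rust ownership in three sentences."

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}]
) as stream:
    for text in stream.text_stream:   # each item is the text generated since the previous one
        print(text, end="", flush=True)
print()   # end the line once the stream has finished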
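Each item from stream.text_stream is only the newly generated text, so concatenating the items reproduces the full reply. The hypothetical helper below echoes the chunks as they arrive and also keeps them. It accepts any iterable of strings, so it can be tried with a plain list instead of a live stream.

```python
import io
import sys
from typing import Iterable, TextIO


def print_and_collect(chunks: Iterable[str], out: TextIO = sys.stdout) -> str:
    """Echo streamed text chunks as they arrive and return the full reply."""
    parts = []
    for chunk in chunks:
        out.write(chunk)
        out.flush()          # show partial output immediately
        parts.append(chunk)
    out.write("\n")          # finish the line after the last chunk
    return "".join(parts)


buffer = io.StringIO()       # capture output instead of printing to the terminal
reply = print_and_collect(["Each value ", "has one ", "owner."], out=buffer)
print(repr(reply))
```

Inside the with block above, reply = print_and_collect(stream.text_stream) would replace the explicit loop. The SDK's stream object also offers get_final_message(), which returns the complete message, including token usage, once the stream ends.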

4.4  Prompt Caching

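Mark a large, reused context block with cache_control so the API can reuse the already-processed prompt prefix (everything up to and including that block) instead of reprocessing it on every call. Cache reads are billed at about one tenth of the normal input-token price (roughly 10× cheaper) and return faster. The first call, which writes the cache, costs somewhat more than an uncached call. An ephemeral cache entry expires after about five minutes without use, so caching pays off when many calls in a session load the same large file in quick succession. Blocks shorter than a model-specific minimum length are not cached.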
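from pathlib import Path

large_context = Path("design_doc.md").read_text(encoding="utf-8")   # hypothetical large file reused on every call

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": large_context,
                "cache_control": {"type": "ephemeral"}   # cache the prompt prefix up to and including this block
            },
            {"type": "text", "text": "Summarize the above."}
        ]
    }]
)
# A non-zero cache_read_input_tokens on a repeat call shows the cache was hit
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)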
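A back-of-envelope sketch shows why caching pays off after only a few calls. It assumes Anthropic's published multipliers for the default five-minute cache: writes are billed at 1.25× and reads at 0.1× the base input price. It also assumes every call after the first hits the cache before it expires. Check current pricing before relying on the numbers. cached_cost_ratio is a hypothetical helper that counts only the cached prefix and ignores the short per-call question.

```python
def cached_cost_ratio(calls: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Input cost of `calls` requests sharing one cached prefix, relative to no caching.

    The first call writes the cache (write_mult x base price); each later
    call reads it (read_mult x base price). Without caching, every call
    pays 1.0 x base price for the same prefix.
    """
    if calls < 1:
        raise ValueError("calls must be >= 1")
    with_cache = write_mult + (calls - 1) * read_mult
    return with_cache / calls


for n in (1, 2, 10, 100):
    print(n, round(cached_cost_ratio(n), 3))
```

Under these assumptions a single call costs 25% more than no caching. Two calls already come to about 0.68 of the uncached price (1.35 / 2), and ten calls to about 0.22.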

4.5  References

Resource | Description
Anthropic API Docs | Messages API reference, models, rate limits, and SDK guides.
CodeBites: LLM API | Track page with extended API examples and patterns.