AI Story

AI Story: Prompt Caching

cache_control, what gets cached, cost patterns, hit rate

8.  Prompt Caching

Prompt caching lets Anthropic’s infrastructure reuse the KV-cache (key-value pairs computed during the attention pass) for the prefix of a prompt that stays the same across repeated requests. Cache hits are charged at roughly 10% of the normal input token price, reducing cost significantly for workloads with large, stable context.

8.1  How It Works

You mark a content block with cache_control: {"type": "ephemeral"} to tell the API “cache everything up to and including this block.” On subsequent requests with the same prefix, the model skips recomputing the attention for that prefix and reads the stored KV cache instead. The cache has a 5-minute TTL. If you send the same prefix again within 5 minutes you get a cache hit; after 5 minutes the cache expires and the next request is a full-cost cache write.

8.2  Basic Usage

Cache the system prompt when it is large and reused across many requests:
import anthropic

client = anthropic.Anthropic()

LARGE_SYSTEM = "..." * 1000   # imagine 10,000+ tokens of standing instructions

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Summarise section 3."}]
)

u = response.usage
print(f"cache write: {u.cache_creation_input_tokens}")
print(f"cache read:  {u.cache_read_input_tokens}")
print(f"uncached:    {u.input_tokens}")
On the first call cache_creation_input_tokens is non-zero (cache miss, written at 1.25× normal cost). On subsequent calls within 5 minutes, cache_read_input_tokens is non-zero instead (cache hit, charged at 0.1× normal cost).

8.3  Caching Multiple Blocks

You can place cache breakpoints at the end of each major stable section. Anthropic supports up to 4 cache breakpoints per request. Mark each one at the last content block you want included in that cache segment:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": large_document,
             "cache_control": {"type": "ephemeral"}},   # cache the document
            {"type": "text", "text": "Answer this question: " + question}
        ]
    }
]

8.4  When Caching Helps

  • Large system prompts.   A 10,000-token standing instruction block cached across 100 API calls reduces that component’s cost by 90%.
  • Document analysis.   Load a large document once with a cache breakpoint, then ask many questions in the same session.
  • Agent loops.   The tool definitions and conversation history up to a stable point can be cached; only the latest tool result needs full processing.
  • Not helpful for one-shot requests.   If each request has a different prefix, there are no cache hits and you pay the write premium (1.25×) for nothing.

8.5  References

ResourceDescription
Prompt Caching Guide Full documentation including supported models and pricing details.
Next: Tool Use Tool definitions, the tool-use loop, and multi-tool patterns.