8. Prompt Caching
Prompt caching lets Anthropic’s infrastructure reuse the KV-cache
(key-value pairs computed during the attention pass) for the prefix of a prompt
that stays the same across repeated requests. Cache hits are charged at roughly
10% of the normal input token price, reducing cost significantly for workloads
with large, stable context.
8.1 How It Works
You mark a content block with cache_control: {"type": "ephemeral"}
to tell the API “cache everything up to and including this block.”
On subsequent requests with the same prefix, the model skips recomputing the
attention for that prefix and reads the stored KV cache instead.
The cache has a 5-minute TTL by default, and each cache hit refreshes it. If you
send the same prefix again within 5 minutes of its last use you get a cache hit;
once 5 minutes pass without a hit, the cache expires and the next request is
a full-cost cache write.
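The hit-or-miss behaviour can be illustrated with a toy model. This is only a sketch, not how Anthropic implements the cache: `ToyPrefixCache`, its `request` method, and the timestamps are hypothetical, and it assumes that a hit refreshes the TTL.

```python
import hashlib

TTL_SECONDS = 5 * 60  # default 5-minute TTL


class ToyPrefixCache:
    """Toy stand-in for a server-side prefix cache (illustration only)."""

    def __init__(self, ttl=TTL_SECONDS):
        self.ttl = ttl
        self._expires_at = {}  # prefix hash -> expiry time in seconds

    def request(self, prefix, now):
        """Return "read" on a cache hit and "write" on a miss."""
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        hit = self._expires_at.get(key, float("-inf")) > now
        # Assumption: a hit refreshes the TTL; a miss starts a fresh one.
        self._expires_at[key] = now + self.ttl
        return "read" if hit else "write"


cache = ToyPrefixCache()
events = [
    cache.request("system prompt v1", now=0),    # first call: miss
    cache.request("system prompt v1", now=200),  # 200 s later: hit
    cache.request("system prompt v1", now=450),  # 250 s after last use: hit
    cache.request("system prompt v1", now=800),  # 350 s after last use: miss
    cache.request("system prompt v2", now=810),  # different prefix: miss
]
print(events)
```

Any change to the prefix text produces a different key, which is why the last request is a write even though it arrives only seconds after the previous one.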
8.2 Basic Usage
Cache the system prompt when it is large and reused across many requests:
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    LARGE_SYSTEM = "..." * 1000  # imagine 10,000+ tokens of standing instructions

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": LARGE_SYSTEM,
                # Breakpoint: cache everything up to and including this block.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "Summarise section 3."}],
    )

    # The usage object reports how the input tokens were billed.
    u = response.usage
    print(f"cache write: {u.cache_creation_input_tokens}")
    print(f"cache read: {u.cache_read_input_tokens}")
    print(f"uncached: {u.input_tokens}")
On the first call cache_creation_input_tokens is non-zero (cache miss,
written at 1.25× normal cost). On subsequent calls within 5 minutes,
cache_read_input_tokens is non-zero instead (cache hit, charged at
0.1× normal cost). In both cases input_tokens counts only the tokens that
follow the last cache breakpoint.
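To see what those multipliers mean for a bill, the three usage fields can be combined into one estimate. The sketch below is illustrative: `Usage` is a hypothetical stand-in for the SDK's usage object, `input_cost` is not part of the SDK, and the price of 3.0 per million tokens is a made-up figure, not a quoted rate.

```python
from dataclasses import dataclass

CACHE_WRITE_MULTIPLIER = 1.25  # cache miss: prefix is written to the cache
CACHE_READ_MULTIPLIER = 0.10   # cache hit: prefix is read from the cache


@dataclass
class Usage:
    """Hypothetical stand-in mirroring the three usage fields printed above."""
    input_tokens: int
    cache_creation_input_tokens: int
    cache_read_input_tokens: int


def input_cost(usage, price_per_mtok):
    """Estimate input cost, given the base price per million input tokens."""
    weighted_tokens = (
        (usage.input_tokens or 0)
        + (usage.cache_creation_input_tokens or 0) * CACHE_WRITE_MULTIPLIER
        + (usage.cache_read_input_tokens or 0) * CACHE_READ_MULTIPLIER
    )
    return weighted_tokens * price_per_mtok / 1_000_000


PRICE = 3.0  # assumed base price per million input tokens (illustrative)

first_call = Usage(input_tokens=20, cache_creation_input_tokens=10_000,
                   cache_read_input_tokens=0)
later_call = Usage(input_tokens=20, cache_creation_input_tokens=0,
                   cache_read_input_tokens=10_000)

print(f"first call: {input_cost(first_call, PRICE):.5f}")
print(f"later call: {input_cost(later_call, PRICE):.5f}")
```

Because the function only reads attributes, it should also accept a real `response.usage` object, provided that object exposes the same three fields.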
8.3 Caching Multiple Blocks
You can place cache breakpoints at the end of each major stable section.
Anthropic supports up to 4 cache breakpoints per request. Mark each one on the
last content block you want included in that cache segment. For example, to
cache a document inside a user message while the question stays uncached:
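    large_document = "..."  # placeholder: the long, stable document text
    question = "..."        # placeholder: changes on every request

    messages = [
        {
            "role": "user",
            "content": [
                # Stable content first, ending at the cache breakpoint.
                {"type": "text", "text": large_document,
                 "cache_control": {"type": "ephemeral"}},  # cache the document
                # Variable content goes after the breakpoint.
                {"type": "text", "text": "Answer this question: " + question},
            ],
        }
    ]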
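A request with more than one breakpoint might look like the following sketch, which caches the system instructions and the document as separate segments. The helpers `cached_text`, `build_request`, and `count_breakpoints` are hypothetical names introduced for illustration; the counter looks only at system and message blocks, so it would need extending if tool definitions also carried breakpoints.

```python
MAX_BREAKPOINTS = 4  # per-request limit stated above


def cached_text(text):
    """Build a text block that ends a cache segment."""
    return {"type": "text", "text": text,
            "cache_control": {"type": "ephemeral"}}


def build_request(instructions, document, question):
    """Return keyword arguments suitable for client.messages.create()."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "system": [cached_text(instructions)],  # breakpoint 1
        "messages": [
            {
                "role": "user",
                "content": [
                    cached_text(document),      # breakpoint 2
                    {"type": "text",
                     "text": "Answer this question: " + question},
                ],
            }
        ],
    }


def count_breakpoints(request):
    """Count cache_control markers in system and message content blocks."""
    blocks = list(request.get("system", []))
    for message in request["messages"]:
        content = message["content"]
        if isinstance(content, list):  # plain-string content has no blocks
            blocks.extend(content)
    return sum(1 for block in blocks if "cache_control" in block)


request = build_request("standing instructions", "long document", "What is X?")
used = count_breakpoints(request)
if used > MAX_BREAKPOINTS:
    raise ValueError(f"{used} breakpoints exceeds the limit of {MAX_BREAKPOINTS}")
print(f"breakpoints used: {used}")
```

The dictionary can then be passed as `client.messages.create(**request)`. Splitting the segments this way means a new document invalidates only the second segment, while the instructions can still be read from the cache.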
8.4 When Caching Helps
- Large system prompts. A 10,000-token standing instruction
  block cached across 100 API calls within the TTL reduces that component’s
  cost by about 89% (one write at 1.25×, then 99 reads at 0.1×).
- Document analysis. Load a large document once with a
  cache breakpoint, then ask many questions in the same session.
- Agent loops. The tool definitions and conversation
  history up to a stable point can be cached; only the latest tool result needs
  full processing.
- Not helpful for one-shot requests. If each request has
  a different prefix, there are no cache hits and you pay the write premium
  (1.25×) for nothing.
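The trade-off in this list comes down to simple arithmetic, sketched below. The function `relative_cost` is a hypothetical helper; it assumes every call after the first arrives before the cache expires, and it covers only the cached prefix, not the rest of the request.

```python
def relative_cost(n_calls, write=1.25, read=0.10):
    """Average cost of the cached prefix per call, relative to no caching.

    Assumes one cache write followed by (n_calls - 1) cache reads.
    """
    if n_calls < 1:
        raise ValueError("n_calls must be at least 1")
    return (write + read * (n_calls - 1)) / n_calls


for n in (1, 2, 10, 100):
    print(f"{n:>3} calls: {relative_cost(n):.4f}x the uncached cost")
```

Under these assumptions a single call costs more than not caching at all, while caching already pays for itself on the second call.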
8.5 References
| Resource | Description |
|---|---|
| Prompt Caching Guide | Full documentation including supported models and pricing details. |
| Next: Tool Use | Tool definitions, the tool-use loop, and multi-tool patterns. |