2. Tokens & Context
The fundamental unit of communication with an LLM is not a word or a character
— it is a token. Understanding tokens lets you reason about cost,
predict context limits, and write prompts that stay within bounds.
2.1 What Is a Token?
A token is a chunk of text as defined by a tokenizer — a vocabulary of
roughly 100,000 sub-word pieces trained alongside the model. Common English words
are usually one token. Rare words, long compound words, and non-English text are
split into multiple tokens. Whitespace and punctuation each contribute tokens too.
Rough rules of thumb (English text):
- 1 token ≈ 4 characters
- 100 tokens ≈ 75 words
- 1,000 tokens ≈ 750 words ≈ a 1–2 page document
Code tends to tokenize less efficiently than prose because identifiers,
operators, and indentation all consume tokens. A 200-line Python file might be
400–800 tokens depending on identifier length and comment density.
2.2 The Context Window
The context window is the maximum number of tokens the model can attend to in a
single inference call. It includes both the input (prompt, conversation history,
tool results) and the output (the completion). Exceeding the window causes the
API to return an error.
Context window sizes (approximate, as of mid-2026):
- Claude 3.5 Haiku — 200K tokens input, 8K output
- Claude Sonnet 4.x — 200K tokens input, 64K output
- Claude Opus 4.x — 200K tokens input, 32K output
- GPT-4o — 128K tokens input/output combined
- Gemini 1.5 Pro — 1M tokens input
A 200K token window can hold roughly 150,000 words — about 500 pages of
prose, or a medium-sized codebase. Still, it is finite. Long conversations
accumulate history; multi-document RAG pipelines can fill the window quickly.
2.3 Token Pricing
API calls are billed per token, with input and output priced separately.
Output tokens typically cost 3–5× more than input tokens because
generating each token requires a full forward pass through the model, while
reading input tokens is cheaper with KV-cache reuse.
The usage object in every API response reports exact token counts:
response = client.messages.create(...)
print(response.usage.input_tokens)
print(response.usage.output_tokens)
print(response.usage.cache_read_input_tokens) # if caching enabled
Log these in development. Unexpected token counts are often the first sign of
a runaway conversation loop or an unexpectedly large context block.
2.4 Prompt Design Implications
Token awareness shapes how you write prompts:
-
Be specific, not verbose. A focused 50-token instruction
beats a padded 200-token one. The model does not reward length.
-
Trim conversation history. In multi-turn loops, old turns
still consume tokens. Summarise or drop turns that are no longer relevant.
-
Place important content early. Attention is distributed
across the window, but the first and last few hundred tokens tend to receive
more weight (the “primacy/recency” effect). Put system instructions
and key constraints first.
-
Reserve headroom for output. If
max_tokens
is set to 8K, the model stops generating at 8K even if the answer is incomplete.
Set it high enough for the expected response, not the maximum possible.
2.5 References
| Resource | Description |
|
Claude Models
|
Current context window and token limits for all Claude models. |
|
Token Counting
|
Anthropic’s API for counting tokens before sending a request. |
| Next: Prompting |
How to structure messages, roles, and system prompts effectively. |