AI Story: Tokens & Context

2. Tokens & Context

The fundamental unit of communication with an LLM is not a word or a character — it is a token. Understanding tokens lets you reason about cost, predict context limits, and write prompts that stay within bounds.

2.1 What Is a Token?

A token is a chunk of text as defined by a tokenizer — a vocabulary of roughly 100,000 sub-word pieces trained alongside the model. Common English words are usually one token. Rare words, long compound words, and non-English text are split into multiple tokens. Whitespace and punctuation each contribute tokens too.

Rough rules of thumb (English text):

1 token ≈ 4 characters
100 tokens ≈ 75 words
1,000 tokens ≈ 750 words ≈ a 1–2 page document

Code tends to tokenize less efficiently than prose because identifiers, operators, and indentation all consume tokens. A 200-line Python file might be 400–800 tokens depending on identifier length and comment density.

2.2 The Context Window

The context window is the maximum number of tokens the model can attend to in a single inference call. It includes both the input (prompt, conversation history, tool results) and the output (the completion). Exceeding the window causes the API to return an error.

Context window sizes (approximate, as of mid-2026):

Claude 3.5 Haiku — 200K tokens input, 8K output
Claude Sonnet 4.x — 200K tokens input, 64K output
Claude Opus 4.x — 200K tokens input, 32K output
GPT-4o — 128K tokens input/output combined
Gemini 1.5 Pro — 1M tokens input

A 200K token window can hold roughly 150,000 words — about 500 pages of prose, or a medium-sized codebase. Still, it is finite. Long conversations accumulate history; multi-document RAG pipelines can fill the window quickly.

2.3 Token Pricing

API calls are billed per token, with input and output priced separately. Output tokens typically cost 3–5× more than input tokens because generating each token requires a full forward pass through the model, while reading input tokens is cheaper with KV-cache reuse.

The usage object in every API response reports exact token counts:

response = client.messages.create(...)
print(response.usage.input_tokens)
print(response.usage.output_tokens)
print(response.usage.cache_read_input_tokens)   # if caching enabled

Log these in development. Unexpected token counts are often the first sign of a runaway conversation loop or an unexpectedly large context block.

2.4 Prompt Design Implications

Token awareness shapes how you write prompts:

Be specific, not verbose. A focused 50-token instruction beats a padded 200-token one. The model does not reward length.
Trim conversation history. In multi-turn loops, old turns still consume tokens. Summarise or drop turns that are no longer relevant.
Place important content early. Attention is distributed across the window, but the first and last few hundred tokens tend to receive more weight (the “primacy/recency” effect). Put system instructions and key constraints first.
Reserve headroom for output. If max_tokens is set to 8K, the model stops generating at 8K even if the answer is incomplete. Set it high enough for the expected response, not the maximum possible.

2.5 References

Resource	Description
Claude Models	Current context window and token limits for all Claude models.
Token Counting	Anthropic’s API for counting tokens before sending a request.
Next: Prompting	How to structure messages, roles, and system prompts effectively.