AI Story

AI Story: AI Concepts

taxonomy, training, inference, what LLMs actually do

1.  AI Concepts

Before writing any API code it helps to have the vocabulary straight. Terms like “machine learning”, “large language model”, and “generative AI” are used loosely in the industry. This chapter pins down what each term means in the context of the tools covered in this story.

1.1  Taxonomy

The terms nest inside each other:
  • Artificial Intelligence (AI) — systems that perform tasks normally associated with human cognition: perception, reasoning, language, decision-making.
  • Machine Learning (ML) — a subset of AI where the system learns from data rather than being explicitly programmed. The programmer designs the training loop; the parameters are learned.
  • Deep Learning — ML using neural networks with many layers. Layers progressively extract higher-level features from raw input.
  • Large Language Model (LLM) — a deep learning model trained on text at massive scale. The “large” refers to parameter count (billions to trillions) and training data (trillions of tokens).
  • Generative AI — AI that produces new content (text, images, code, audio) rather than classifying or predicting from fixed categories. LLMs are the dominant form for text and code.

1.2  Training vs. Inference

A neural network has two operational phases, training and inference, and it matters which one you are thinking about when debugging or designing. Fine-tuning, the third item below, is a further round of training.
  • Training — the model sees billions of examples and adjusts billions of parameters to minimise prediction error. This is done once (or periodically) by the model provider. It is compute-intensive, takes weeks or months, and is not something an API user controls.
  • Inference — the trained model processes a new input and produces an output. This is what happens every time you call the API: the model runs one forward pass per generated token, so a single call usually involves many passes. A multi-turn conversation is a sequence of such calls.
  • Fine-tuning — additional training on a smaller, task-specific dataset after the base model is trained. Some providers expose this as an API feature; others (Anthropic, as of 2026) do not.
As an API user you are almost always in the inference phase. Unless you use a provider’s fine-tuning feature, you cannot change the model’s parameters, only the input (the prompt) you give it.
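The split between the two phases can be illustrated with a deliberately tiny model. This is a sketch under stated assumptions, not how any provider trains an LLM: the one-parameter model, the data, and the function names are all hypothetical. It only shows that training changes parameters while inference uses them unchanged.

```python
# Toy illustration only: a one-parameter "model" y = w * x.
# Real LLMs have billions of parameters; the names and data here are hypothetical.

def predict(w: float, x: float) -> float:
    """Inference: apply fixed parameters to a new input."""
    return w * x


def train(examples: list[tuple[float, float]], steps: int = 200, lr: float = 0.01) -> float:
    """Training: repeatedly adjust w to reduce the mean squared prediction error."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (predict(w, x) - y) * x for x, y in examples) / len(examples)
        w -= lr * grad
    return w


# Hypothetical training data generated by the rule y = 2x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = train(examples)          # done once, by whoever owns the model
answer = predict(w, 10.0)    # done on every call; w is not modified
print(round(w, 3), round(answer, 3))
```

Calling `predict` any number of times leaves `w` untouched, which mirrors the position of an API user: the only lever is the input.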

1.3  How LLMs Work (the short version)

All major LLMs are built on the transformer architecture (introduced by Google researchers in 2017). The key operation is attention: every token can attend to (draw information from) other tokens in the context window. In the decoder-style models used for text generation, each token attends to itself and the tokens before it. This is what makes LLMs sensitive to word order, long-range dependencies, and prompt structure in ways that earlier sequence models were not.
At inference time the model takes all the tokens in the prompt, runs them through many attention layers, and outputs a probability distribution over the vocabulary for the next token. It samples from that distribution, appends the token to the sequence, and repeats until a stop condition is reached (for example an end-of-sequence token or a maximum output length). The entire output is generated one token at a time, left to right.
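The generation loop described above can be sketched in a few lines. This is a minimal illustration, assuming a five-word vocabulary and a made-up scoring rule (`toy_logits`) that stands in for the transformer; none of these names come from a real library. The loop itself (score, softmax, sample, append, check the stop condition) follows the steps in the text.

```python
import math
import random

# Hypothetical five-token vocabulary; "<eos>" marks end of sequence.
VOCAB = ["the", "cat", "sat", "down", "<eos>"]


def toy_logits(tokens: list[str]) -> list[float]:
    """Stand-in for the transformer: one score per vocabulary entry.

    Made-up rule: favour the word that follows the last token in VOCAB.
    """
    last = VOCAB.index(tokens[-1])
    return [3.0 if i == (last + 1) % len(VOCAB) else 0.0 for i in range(len(VOCAB))]


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt: list[str], max_new_tokens: int = 10, seed: int = 0) -> list[str]:
    """Autoregressive loop: sample one token at a time, left to right."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):          # stop condition 1: length limit
        probs = softmax(toy_logits(tokens))
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if next_token == "<eos>":            # stop condition 2: end-of-sequence
            break
        tokens.append(next_token)
    return tokens


print(generate(["the"]))
```

Because the next token is sampled rather than always taken as the most likely one, different seeds can give different continuations of the same prompt.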
What “understanding” means (and doesn’t mean)
LLMs do not have beliefs, goals, or world-models in the way humans do. They have very good statistical models of text. When an LLM produces a correct explanation of a concept, it is because text about that concept in its training data correlates with that explanation, not because it reasoned from first principles. This matters for reliability: LLMs are confident and fluent even when wrong. Validating outputs is always the programmer’s responsibility.
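One basic form of validation is checking that a reply has the shape your code expects before using it. The sketch below assumes the model was asked to answer in JSON; the function name, the schema, and the sample replies are hypothetical. It checks structure only and cannot tell whether the values are factually correct.

```python
import json


def parse_reply(reply_text: str, required: dict[str, type]) -> dict:
    """Parse a model reply as JSON and check required keys and types.

    Raises ValueError if the reply cannot be used.
    """
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"wrong type for key: {key}")
    return data


# Hypothetical schema and replies, for illustration only.
schema = {"title": str, "word_count": int}
good_reply = '{"title": "Release notes", "word_count": 120}'
bad_reply = "Sure! The title is Release notes and it has 120 words."

print(parse_reply(good_reply, schema))
try:
    parse_reply(bad_reply, schema)
except ValueError as err:
    print("rejected:", err)
```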

1.4  Knowledge Cutoff and Grounding

An LLM’s knowledge is frozen at its training cutoff date. It does not browse the web or read your files unless given tools to do so (Chapter 9). For tasks that require current information or private data, you must supply that information in the prompt or via tool results. Supplying facts in the prompt is called grounding; when those facts are fetched automatically by a search or retrieval step before the call, the pattern is known as retrieval-augmented generation (RAG). The model can reason over whatever text you place in its context window, but it cannot fetch that text itself.
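A grounded prompt can be assembled with ordinary string handling. The sketch below is an assumption-laden illustration: the documents are invented, the word-overlap `score` is a crude stand-in for a real retrieval method, and no API call is made. It shows only the shape of the pattern: pick relevant text, then place it in the prompt alongside the question.

```python
def score(question: str, passage: str) -> int:
    """Count distinct question words that also appear in the passage."""
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words)


def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages with the highest word overlap (naive retrieval)."""
    return sorted(passages, key=lambda p: score(question, p), reverse=True)[:k]


def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Place the retrieved text in the prompt so the model can reason over it."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical private documents the model has never seen.
DOCS = [
    "The staging server is rebuilt every night at 02:00 UTC.",
    "Expense reports are due on the fifth day of each month.",
    "The office kitchen is cleaned on Fridays.",
]

question = "When are expense reports due?"
top = retrieve(question, DOCS, k=1)
prompt = build_grounded_prompt(question, top)
print(prompt)
```

The resulting string is what would be sent as the user message; the retrieval step runs in your code, not in the model.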

1.5  References

Resource                     Description
Attention Is All You Need    The 2017 paper introducing the transformer architecture.
Intro to Claude              Anthropic’s introduction to the Claude model family.
Next: Tokens & Context       How the input is broken into tokens and what the context window means.