Code Story

arena, AI tools, agents, spec-driven development

0.0  Prologue

Code Track is a place to experiment — with language features, AI tools, multi-language project comparisons, and code-generation utilities. This story is a narrative guide through those experiments, ordered from the most hands-on (building a scratch arena with real measurement tools) through progressively more capable AI workflows: chat bots → CLI agents → API calls → tool-using agents → agentic pipelines → reusable skill libraries → spec-driven development.
Why study AI-assisted code development?
  1. AI tools shift the bottleneck from writing code to reviewing and guiding it. Understanding that shift makes you a more effective user of the tools.
  2. Knowing how LLMs work — message roles, context windows, tool use, prompt caching — lets you use them reliably instead of hoping they produce what you want.
  3. Agents with tool use can read your files, run your tests, and iterate without manual steps between each action.
  4. Spec-driven development inverts the usual workflow: you write what the code must do before the code exists, and the spec becomes the AI’s contract.
  5. Comparing the same project in Rust, C++, C#, and Python builds intuition about what language features actually cost in code size, complexity, and runtime performance.
The story uses two recurring example projects — TextFinder (search a directory tree for regex matches) and PageValidator (check HTML structural correctness) — alongside the AI/ folder demos to keep discussion grounded in real, runnable code.

0.1  Getting Started

Install these tools in order — each chapter builds on what came before.
  1. VS Code with language extensions:
    rust-analyzer (Rust), clangd (C++), C# Dev Kit (C#), Pylance (Python), Error Lens and GitLens (all languages).
  2. At least one language toolchain:
    Rust, a C++ compiler (MSVC, GCC, or Clang), .NET SDK for C#, or Python 3.
  3. Git and a GitHub account.
  4. An Anthropic API key (required for chapters 4–8):
    Create one at console.anthropic.com and set it as the environment variable ANTHROPIC_API_KEY.
  5. Python package: anthropic
    pip install anthropic

0.2  Story Content

The story is ordered so each chapter provides vocabulary and tools used in the next. Start at the beginning or jump to any chapter — each is written to be readable on its own.
  0. Prologue
    Motivation, getting started, chapter index, and references.
  1. Experimenting with Code
    Building a scratch arena with build chains, metrics tools, performance timers, and visualizers.
  2. Chat Bots
    Using Claude, ChatGPT, and Gemini through the browser to analyze, generate, and document code.
  3. Code AI CLI
    Claude Code and Gemini CLI as terminal-based coding partners: reading files, making changes, multi-step tasks, hooks.
  4. LLM API
    Calling Claude directly from Python: the messages API, structured output, streaming, and prompt caching.
  5. Agent AI
    Tool use, the agentic loop, a working file-reading agent, error handling, and safety constraints.
  6. Agentic AI
    Multi-step autonomous workflows: analyze → plan → generate → test, chaining agent calls, human-in-the-loop checkpoints.
  7. Skills AI
    Extending agents with a reusable tool library: anatomy of a skill, a code_metrics example, composing multiple tools in one session.
  8. Spec-Driven Development
    Using Constitution.md, Structure.md, and Spec.md to drive AI implementation; the full workflow from spec to code to validation.

0.3  References

Resource Description
Anthropic Docs Full API reference, model guides, and prompt engineering tips.
Claude Code Anthropic’s CLI tool — install, keyboard reference, and docs.
CodeBites Introduction Track page that maps the full CodeBites page sequence.
AI Links Curated links to AI tools, documentation, and research.

1.  Experimenting with Code

Before using AI to help write code you need a place to run and measure code quickly. This chapter sets up a scratch arena — a lightweight, disposable workspace where experiments are cheap, failure is expected, and anything worth keeping gets promoted to a real project.
Why isolate experiments?
  1. Real projects carry pressure: tests must pass, CI must be green, code review is watching. A scratch arena has none of that friction.
  2. Separating “trying something” from “building something” keeps both cleaner — the real project stays stable while experiments fail fast.
  3. A version-controlled scratch space lets you revisit a dead end without losing it, and compare two approaches side by side.
  4. Measuring code from the start — lines, complexity, timing — builds the habit of treating metrics as feedback, not bureaucracy.

1.1  Arena Layout

A flat directory at the root of your workspace gives each language its own folder. The metrics/ and notes/ folders accumulate output and observations across sessions.
sandbox/
  rust/        ← cargo workspace, or cargo new per experiment
  cpp/         ← single-file programs; no CMake needed for scratch
  csharp/      ← dotnet new console -o scratch; reuse between runs
  python/      ← flat scripts; one .venv at this level
  metrics/     ← tokei and code_metrics output, saved per session
  notes/       ← scratchpad.md, one entry per session
Quick project init per language:
## Rust
cargo new scratch && cd scratch
cargo run

## C++  (no project file needed for single-file experiments)
g++ -std=c++23 -Wall -o out main.cpp && ./out

## C#
dotnet new console -o scratch && cd scratch
dotnet run

## Python  (Windows activation shown; on macOS/Linux run: source .venv/bin/activate)
python -m venv .venv
.venv\Scripts\activate
python script.py
python -i script.py     # run then drop into REPL with all names live

1.2  Code Metrics

Metrics tools answer “how big?” and “how complex?” before and after a change. Run them at the start of a session to establish a baseline, and after each significant change to see what moved.
code_metrics.py (in track): Reports line counts, function counts, and blank/comment ratios per file. Good for tracking growth of a scratch project over sessions.
tokei: Cross-language line counter; breaks down by language and file type. Run at the sandbox root to see the whole arena at a glance.
cargo install tokei
scc (Sloc Cloc and Code): Like tokei but adds estimated complexity and cost columns. Useful when comparing equivalent programs across languages. scc is written in Go, so it installs with the Go toolchain rather than cargo.
go install github.com/boyter/scc/v3@latest
radon (Python): Cyclomatic complexity and maintainability index per function.
pip install radon
radon cc -s script.py
cargo clippy (Rust): Catches non-obvious mistakes and style issues that the compiler misses. Treat warnings as a quality signal, not a failure condition.
cppcheck (C++): Static analysis for undefined behavior, memory issues, and style.
cppcheck --enable=all main.cpp
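To make the per-file numbers described above concrete, here is a minimal sketch of a line counter. It is not the track's code_metrics.py; the names count_lines and count_file are hypothetical, and treating any line that starts with # as a comment is a simplifying assumption.

```python
from pathlib import Path


def count_lines(source: str) -> dict:
    """Return total, blank, comment, and code line counts for Python source.

    Assumption: a comment line is any line whose first non-space character
    is '#'. Docstrings and trailing comments are not counted as comments.
    """
    lines = source.splitlines()
    blank = sum(1 for ln in lines if not ln.strip())
    comment = sum(1 for ln in lines if ln.strip().startswith("#"))
    return {
        "total": len(lines),
        "blank": blank,
        "comment": comment,
        "code": len(lines) - blank - comment,
    }


def count_file(path: str) -> dict:
    """Convenience wrapper: read a file and count its lines."""
    return count_lines(Path(path).read_text(encoding="utf-8"))


sample = "# demo\n\nx = 1\nprint(x)\n"
print(count_lines(sample))
```

Saving the returned dict under metrics/ at the start and end of a session gives a simple before/after comparison.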

1.3  Performance and Timing

hyperfine: Cross-language wall-clock benchmarker. Runs a command repeatedly, optionally after warmup runs (--warmup N) to prime caches, and reports mean ± stddev with outlier detection. Use it to compare the same algorithm across languages directly.
cargo install hyperfine
hyperfine './out input.txt' 'python script.py input.txt'
tf_timer.py / pa_timer.py (in track): Purpose-built timing wrappers for TextFinder and PageValidator. Use them as a model for any experiment that needs repeated-run averaging.
Python cProfile (built-in): Function-level profiler; no install needed. Pairs with snakeviz (see section 1.4) for visual output.
python -m cProfile -s cumtime script.py
cargo flamegraph (Rust): Generates a flame graph SVG showing where CPU time actually goes. Requires perf (Linux) or DTrace (macOS); works in WSL on Windows.
cargo install flamegraph
BenchmarkDotNet (C#): Add as a NuGet package; annotate methods with [Benchmark]. Produces statistically rigorous tables with warmup and GC stats.
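For in-process experiments, repeated-run averaging can be sketched in a few lines of Python. This is an illustrative stand-in, not the track's tf_timer.py or pa_timer.py; the function name time_runs and its parameters are assumptions made for this example.

```python
import statistics
import time


def time_runs(func, runs: int = 5, warmup: int = 1) -> dict:
    """Call func repeatedly and return mean and stddev of wall-clock time."""
    for _ in range(warmup):
        func()  # warmup calls prime caches; their timings are discarded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": statistics.mean(samples),
        # stdev needs at least two samples
        "stdev_s": statistics.stdev(samples) if runs > 1 else 0.0,
    }


result = time_runs(lambda: sum(range(100_000)), runs=5)
print(f"{result['mean_s']:.6f} s +/- {result['stdev_s']:.6f} s over {result['runs']} runs")
```

For comparing whole programs across languages, hyperfine remains the better fit because it measures process startup as well.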

1.4  Visualizers

CodeWebifier (in track): Converts source files to syntax-highlighted HTML for display in the site. Run it on any experiment worth keeping to publish it.
snakeviz (Python): Browser-based viewer for cProfile output, with interactive icicle and sunburst charts.
pip install snakeviz
python -m cProfile -o out.prof script.py
snakeviz out.prof           # opens browser with an interactive call-time chart
graphviz / dot: Renders dependency graphs, state machines, and call graphs as SVG. Several tools can emit .dot format (doxygen directly; cargo through a plugin such as cargo-depgraph).
winget install graphviz
cargo doc --open (Rust): Not just for publishing. Generated doc pages are the fastest way to browse the public API of any crate you add to an experiment.
VS Code extensions worth installing for the arena:
  • Error Lens — inline error messages next to the offending line
  • GitLens — last-edit blame inline; useful for tracking what changed in scratch
  • CodeMetrics (kisstkondoros) — per-function complexity score in the editor gutter
  • Rust Analyzer — essential for Rust; includes inlay type hints
  • clangd — C++ language server with inline diagnostics and completions

1.5  A Typical Session Flow

A short ritual at the start and end of each session keeps the arena useful:
  1. Create or reset the scratch project for today’s language.
  2. Write the simplest version that compiles and runs (15–20 lines max).
  3. Run metrics: tokei and code_metrics.py for a baseline snapshot.
  4. Iterate: change one thing, re-run, compare metrics and timing.
  5. If the result is worth keeping, run CodeWebifier and move it to a named folder; otherwise delete and move on.
  6. Add one line to notes/scratchpad.md: what you tried, what you learned.

1.6  References

Resource Description
hyperfine Cross-platform benchmarking tool for command-line programs.
tokei Fast, accurate code line counter with language breakdown.
scc Line counter with complexity and estimated cost columns.
radon Python complexity metrics: cyclomatic complexity, maintainability index.
cargo flamegraph Flame graph profiler for Rust programs.
BenchmarkDotNet Rigorous microbenchmark framework for .NET.
snakeviz Browser-based viewer for Python cProfile output.

2.  Chat Bots

A chat bot session is the simplest form of AI-assisted development: open a browser, describe a problem in plain language, and get a response. No API key, no install, no code required. This chapter covers what chat bots are good for, how to prompt them effectively for code tasks, and when to move to a more capable tool.
Why start with chat bots?
  1. Zero setup — the fastest path from a question to an answer.
  2. Good for understanding an unfamiliar API, pattern, or error message before writing any code.
  3. Prompting for a chat bot and prompting for the API use the same vocabulary — roles, context, constraints — so skills transfer directly.
  4. The limitations of a chat session (no file access, fixed context window, no tool use) make the step up to a CLI agent feel motivated rather than arbitrary.

2.1  Prompting for Code Analysis

Give the bot the code, then ask a specific question. Vague prompts get vague answers. Effective patterns:
  • “What does this function do? Focus on the return value.”
  • “What would happen if I passed an empty slice to this function?”
  • “Explain the ownership rules that apply to this block.”
  • “What is the time complexity of this algorithm and why?”

2.2  Prompting for Code Generation

State the function signature, the inputs, the expected output, and any constraints. The more precise the spec in the prompt, the less revision the output needs.
  • Always specify the language and version (e.g., “Python 3.12”, “C++23”, “Rust 2021 edition”).
  • Include one concrete example: input → expected output.
  • State what the code must NOT do (allocate, panic, use unsafe, etc.).
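The checklist above can be turned into a small prompt template. The following is a minimal sketch; the function name build_codegen_prompt, its parameters, and the example Rust signature are all hypothetical and chosen only for illustration.

```python
def build_codegen_prompt(language, signature, example_input, example_output, must_not=()):
    """Assemble a code-generation prompt from language, signature, one example,
    and an optional list of things the code must not do."""
    lines = [
        f"Language: {language}",
        f"Implement this function: {signature}",
        f"Example: {example_input} -> {example_output}",
    ]
    if must_not:
        lines.append("The code must NOT: " + "; ".join(must_not))
    return "\n".join(lines)


prompt = build_codegen_prompt(
    language="Rust 2021 edition",
    signature="fn count_matches(text: &str, pattern: &str) -> usize",
    example_input='count_matches("abcabc", "bc")',
    example_output="2",
    must_not=["panic", "use unsafe"],
)
print(prompt)
```

Keeping the pieces as separate arguments makes it obvious when one of them (usually the concrete example) has been left out.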

2.3  Prompting for Documentation

Chat bots produce good first-draft documentation when given the code and the audience.
  • “Write a one-paragraph description of this module for a README.”
  • “Write a doc comment for each public function in this file.”
  • “Summarize this diff in three sentences for a commit message.”

2.4  Limitations and When to Move On

Know when to switch tools:
  • Context window fills up — the bot forgets earlier parts of a long conversation.
  • No file access — you must paste code manually; large codebases are impractical.
  • No tool use — it can suggest a command but cannot run it or read the output.
  • Hallucination risk — always run generated code before trusting it.
When you hit these limits, move to a CLI agent (Chapter 3) or the API (Chapter 4).

2.5  References

Resource Description
Claude Anthropic’s chat interface. Best for long-context code tasks.
ChatGPT OpenAI’s chat interface. Large model family with code interpreter.
Gemini Google’s chat interface. Strong at multi-modal and search-grounded tasks.
CodeBites: Chat Bots Track page with example sessions and prompt templates.

3.  Code AI CLI

A Code AI CLI runs in your terminal alongside your code. It can read your files, write changes, run commands, and navigate your repository — all without leaving the shell. This chapter covers Claude Code and Gemini CLI as terminal-based coding partners.
Why the CLI over the browser?
  1. The CLI has direct access to your filesystem — no pasting required.
  2. It can run shell commands and read their output, closing the loop between code and behavior.
  3. Multi-step tasks (read file, analyze, edit, run tests) happen in a single session without context loss.
  4. Hooks let you automate recurring actions: run linters before each edit, show a summary when the session ends.

3.1  Starting a Session

Run claude in the root of your project. On the first run in a new repository, use /init to generate a CLAUDE.md file describing the project structure. Subsequent sessions load that file automatically, giving the model context without you re-explaining it each time.

3.2  Making Changes

Describe what you want in plain language. The CLI reads relevant files, proposes a diff, and waits for your approval. You see every change before it lands. Use /diff to review pending edits at any point in the session.

3.3  Multi-Step Tasks

Compound tasks work well because the CLI maintains context across steps. Example session sequence:
  1. Read src/main.rs and summarize the data flow.
  2. Identify the three functions with the highest cyclomatic complexity.
  3. Refactor the most complex function into two smaller ones.
  4. Run cargo test and fix any failures.
Each step uses output from the previous one. The context window is the limit — for very large refactors, split into smaller tasks.

3.4  Hooks and Automation

Claude Code hooks run shell commands automatically on events such as tool calls, session start, and session end. Configure them in .claude/settings.json under the hooks key. Useful hooks:
  • Run cargo clippy after every file edit
  • Display a summary of changed files when the session closes
  • Block writes outside an allowed directory whitelist

3.5  References

Resource Description
Claude Code Anthropic’s CLI tool — install, docs, and keyboard reference.
Gemini CLI Google’s terminal AI tool, open source.
CodeBites: Code AI Track page with recorded CLI sessions and technique notes.

4.  LLM API

Calling Claude directly from code gives you full programmatic control: you choose the model, the system prompt, the context, the output format, and whether to stream. This chapter covers the Anthropic Python SDK from a minimal completion through structured output, streaming, and prompt caching.
Why write code that calls the API?
  1. You can embed AI calls inside larger programs: scripts, batch processors, analysis pipelines.
  2. You control the system prompt precisely — the AI’s persona, constraints, and output format are not left to the chat interface defaults.
  3. Structured output (JSON) lets you parse and act on AI responses programmatically.
  4. Prompt caching reduces cost and latency when the same large context is reused across many calls.

4.1  A Minimal Completion

import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Explain Rust ownership in three sentences."
    }]
)
print(response.content[0].text)

4.2  Structured Output

Put the JSON schema in the system prompt and ask for JSON only. The model usually complies, but the reply is not guaranteed to parse, so validate it (for example with pydantic) before acting on it.
import json

# code_text: the source code to analyze, loaded by the caller
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system='Respond only with valid JSON matching {"summary": str, "complexity": int}.',
    messages=[{"role": "user", "content": code_text}]
)
result = json.loads(response.content[0].text)   # raises json.JSONDecodeError on malformed output
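Validation of the reply can be sketched with the standard library alone. This is a stand-in for the pydantic approach mentioned above, not a replacement for it; the names CodeSummary and parse_summary are hypothetical, and the expected fields are taken from the system prompt in this section.

```python
import json
from dataclasses import dataclass


@dataclass
class CodeSummary:
    summary: str
    complexity: int


def parse_summary(raw: str) -> CodeSummary:
    """Parse model output and check it against the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"response is not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    summary = data.get("summary")
    complexity = data.get("complexity")
    if not isinstance(summary, str):
        raise ValueError("'summary' must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(complexity, int) or isinstance(complexity, bool):
        raise ValueError("'complexity' must be an integer")
    return CodeSummary(summary=summary, complexity=complexity)


print(parse_summary('{"summary": "Reads a file.", "complexity": 2}'))
```

Raising a single exception type for every failure keeps the caller's retry logic simple: catch ValueError, then re-prompt or skip the file.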

4.3  Streaming

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

4.4  Prompt Caching

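Mark large, reused context blocks with cache_control to avoid re-processing them on every call. Cache reads are billed at roughly a tenth of the normal input-token price and return faster; the first call, which writes the cache, costs somewhat more than an uncached call. Useful when every call in a session loads the same large file.
# large_context: the reused text (for example a whole source file), loaded by the caller
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": large_context,
                "cache_control": {"type": "ephemeral"}   # cache everything up to and including this block
            },
            {"type": "text", "text": "Summarize the above."}
        ]
    }]
)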

4.5  References

Resource Description
Anthropic API Docs Messages API reference, models, rate limits, and SDK guides.
CodeBites: LLM API Track page with extended API examples and patterns.

5.  Agent AI

An agent is a program that calls an LLM in a loop, giving the model access to tools (functions it can invoke), and continuing until the model decides it has finished. This chapter covers the tool-use pattern, the agentic loop, a working file-reading agent, and safety constraints.
Why agents instead of single calls?
  1. A single call cannot react to its own output. An agent can read a file, see what’s in it, and decide what to read next.
  2. Tool use turns the model into an orchestrator: it plans, delegates to tools, and synthesizes results — all without your intervention.
  3. Agents can retry failed steps, ask clarifying questions, and handle unexpected inputs gracefully.

5.1  Defining a Tool

tools = [{
    "name": "read_file",
    "description": "Read a source file and return its contents as a string.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Relative path to the file"}
        },
        "required": ["path"]
    }
}]

5.2  The Tool Loop

import anthropic, pathlib

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Summarize the file main.py."}]

while True:
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=2048,
        tools=tools, messages=messages    # tools: the list defined in 5.1
    )
    if resp.stop_reason != "tool_use":
        # end_turn, max_tokens, and so on: print any text blocks and stop
        print("".join(b.text for b in resp.content if b.type == "text"))
        break
    # Answer every tool_use block in ONE user message, and record the
    # assistant turn once, so the roles keep alternating.
    results = []
    for block in resp.content:
        if block.type == "tool_use":
            text = pathlib.Path(block.input["path"]).read_text()
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": text
            })
    messages += [
        {"role": "assistant", "content": resp.content},
        {"role": "user", "content": results}
    ]

5.3  Safety Constraints

An unconstrained agent can do more than intended. Useful guards:
  • Cap iterations: for _ in range(MAX_STEPS) — raise an error if exceeded
  • Whitelist allowed paths: reject tool calls outside the sandbox directory
  • Separate read tools from write tools and require explicit user confirmation before any write tool runs
  • Log every tool call and its result to a file for post-session review
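Two of the guards above, the iteration cap and the path check, can be sketched as small helpers. The names resolve_in_sandbox and run_capped and the default of 10 steps are assumptions for this example, not part of any SDK.

```python
from pathlib import Path

MAX_STEPS = 10  # assumed cap; tune per task


def resolve_in_sandbox(sandbox: Path, requested: str) -> Path:
    """Return the resolved path, or raise if it escapes the sandbox."""
    root = sandbox.resolve()
    # Resolving first collapses ".." segments and follows symlinks,
    # so the containment check sees the real destination.
    target = (root / requested).resolve()
    if target != root and root not in target.parents:
        raise PermissionError(f"{requested!r} is outside {root}")
    return target


def run_capped(step, max_steps: int = MAX_STEPS) -> int:
    """Call step() until it returns True; raise if the cap is exceeded.

    Returns the number of steps taken.
    """
    for count in range(1, max_steps + 1):
        if step():
            return count
    raise RuntimeError(f"agent did not finish within {max_steps} steps")
```

In a tool loop, resolve_in_sandbox would wrap every path the model supplies before the file is opened, and one model round trip would be the step passed to run_capped.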

5.4  References

Resource Description
Tool Use Docs Anthropic’s guide to defining and using tools with the Claude API.
CodeBites: Agent AI Track page with agent demos and design notes.

6.  Agentic AI

An agentic workflow chains multiple agent calls together, with the output of one becoming the input of the next. Each call can have a different focus: analyze, plan, generate, test. This chapter covers how to structure multi-step workflows, how to pass state between calls, and where to add human-in-the-loop checkpoints.
When does autonomy help?
  1. When the task has multiple clearly-ordered steps and each step’s output is the next step’s input.
  2. When you want repeatability — the same workflow produces the same kind of output regardless of who runs it.
  3. When the individual steps are too small to justify a full CLI session but too many to do manually each time.

6.1  Analyze → Plan → Generate → Test

A four-call workflow for producing a new module:
  1. Analyze — Read existing code; produce a JSON summary of types, functions, and dependencies.
  2. Plan — Feed the summary to a second call; ask for a numbered implementation plan as JSON.
  3. Generate — Feed the plan to a third call; ask for source code, one file at a time.
  4. Test — Run the generated code, capture stdout/stderr, feed errors back into a fourth call for a fix.
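The four calls above can be sketched as plain functions chained together. Here call_model is a hypothetical stand-in for any function that takes a prompt string and returns the model's text reply; the prompt wording, the function names, and the assumption that the generated code is Python are all illustrative.

```python
import json
import subprocess
import sys
from typing import Callable

# Hypothetical: any function mapping a prompt string to the model's reply text.
ModelFn = Callable[[str], str]


def analyze(source: str, call_model: ModelFn) -> dict:
    reply = call_model("Summarize types, functions, and dependencies as JSON:\n" + source)
    return json.loads(reply)


def plan(summary: dict, call_model: ModelFn) -> list:
    reply = call_model("Return a JSON list of implementation steps for:\n" + json.dumps(summary))
    return json.loads(reply)


def generate(steps: list, call_model: ModelFn) -> str:
    return call_model("Write source code for these steps:\n" + json.dumps(steps))


def check_and_fix(code: str, call_model: ModelFn) -> str:
    """Run the generated code; on failure, ask for one fix using stderr."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=30)
    if proc.returncode == 0:
        return code
    return call_model("Fix this code.\nError:\n" + proc.stderr + "\nCode:\n" + code)


def run_pipeline(source: str, call_model: ModelFn) -> str:
    summary = analyze(source, call_model)
    steps = plan(summary, call_model)
    code = generate(steps, call_model)
    return check_and_fix(code, call_model)
```

Passing call_model in as a parameter keeps each step testable with a canned stub, without network access. Running generated code should only happen inside a sandbox such as the arena from Chapter 1.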

6.2  Passing State Between Calls

Use plain Python data structures to carry results forward. A dict or dataclass per step is enough for most workflows. For long-running pipelines, serialize to a JSON file so a failed step can be retried without re-running the earlier ones.
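A minimal sketch of that checkpointing idea follows. The helper names load_state and run_step are hypothetical, and each step's result is assumed to be JSON-serializable.

```python
import json
from pathlib import Path


def load_state(state_path: Path) -> dict:
    """Load saved step results, or start empty on the first run."""
    if state_path.exists():
        return json.loads(state_path.read_text(encoding="utf-8"))
    return {}


def run_step(name: str, func, state: dict, state_path: Path) -> dict:
    """Run func(state) unless a result for this step is already saved.

    The whole state dict is rewritten after each new step, so a crash in a
    later step leaves the earlier results on disk.
    """
    if name not in state:
        state[name] = func(state)
        state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state
```

On a rerun, load_state restores the earlier results and run_step skips every step whose name is already present, so only the failed step and those after it execute again.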

6.3  Human-in-the-Loop Checkpoints

Pause after high-risk steps for user confirmation. A simple input("Continue? [y/N] ") is enough for a personal script. For production workflows, write the plan to a file and require an explicit approval file before the generate step runs.
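Both checkpoint styles can be sketched briefly. The function names and the file names plan.md and plan.approved are illustrative assumptions; the ask parameter exists only so the prompt can be replaced in tests.

```python
from pathlib import Path


def confirm(prompt: str = "Continue? [y/N] ", ask=input) -> bool:
    """Interactive checkpoint: only an explicit 'y' or 'yes' continues."""
    return ask(prompt).strip().lower() in ("y", "yes")


def require_approval(plan_text: str, workdir: Path) -> bool:
    """File-based checkpoint: write the plan, then look for an approval file.

    Returns True only if plan.approved already exists next to plan.md.
    """
    (workdir / "plan.md").write_text(plan_text, encoding="utf-8")
    return (workdir / "plan.approved").exists()
```

Defaulting to "no" on anything other than an explicit yes matches the [y/N] convention: an empty answer or a typo stops the workflow rather than letting it proceed.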

6.4  References

Resource Description
CodeBites: Agentic AI Track page with multi-step workflow examples.

7.  Skills AI

A skill is a reusable, named tool definition. Building a skill library means you write a tool once and use it across many agents and sessions. This chapter covers the anatomy of a skill, a code_metrics example, and composing multiple skills in a single agent call.
Why build a skill library?
  1. Copy-pasting tool definitions into every script creates maintenance debt. A shared library means a fix reaches every agent at once.
  2. Named, well-described skills are self-documenting — the model reads the description and knows what the tool does without explanation in the prompt.
  3. A library of composable skills lets you assemble new agents quickly from existing parts.

7.1  Anatomy of a Skill

def skill_code_metrics(path: str) -> dict:
    """Count lines, functions, and blank lines in a Python source file.
    Returns {"lines": int, "functions": int, "blanks": int}.
    """
    import ast, pathlib
    src = pathlib.Path(path).read_text()
    tree = ast.parse(src)
    functions = sum(1 for n in ast.walk(tree)
                    if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)))
    lines = len(src.splitlines())   # count("\n") would miss a last line with no trailing newline
    blanks = sum(1 for ln in src.splitlines() if not ln.strip())
    return {"lines": lines, "functions": functions, "blanks": blanks}

SKILL_CODE_METRICS = {
    "name": "code_metrics",
    "description": skill_code_metrics.__doc__,
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"]
    }
}

7.2  Composing Skills

Pass a list of skill definitions to a single agent call. The model chooses which tools to invoke and in what order. Keep each skill focused on one action — composition happens at the agent level, not inside a skill.
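# SKILL_READ_FILE and SKILL_LIST_DIR are assumed to be defined the same way as SKILL_CODE_METRICS in 7.1.
# The call below only starts the task; tool calls in the reply still need a loop like the one in 5.2.
all_skills = [SKILL_READ_FILE, SKILL_CODE_METRICS, SKILL_LIST_DIR]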

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    tools=all_skills,
    messages=[{
        "role": "user",
        "content": "Report metrics for every .py file in src/."
    }]
)
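When the model replies with tool calls, something has to route each tool name to the matching Python function. One way to sketch that is a dispatch table; the handler functions and the names SKILL_HANDLERS and dispatch are hypothetical, and the keys are assumed to match the "name" fields of the skill definitions.

```python
import os


def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def list_dir(path: str) -> list:
    return sorted(os.listdir(path))


# Map each tool name (as declared in its skill definition) to its function.
SKILL_HANDLERS = {
    "read_file": read_file,
    "list_dir": list_dir,
}


def dispatch(name: str, tool_input: dict):
    """Run the handler for one tool call; reject names not in the table."""
    handler = SKILL_HANDLERS.get(name)
    if handler is None:
        raise ValueError(f"unknown tool: {name}")
    # tool_input keys are expected to match the handler's parameter names
    return handler(**tool_input)
```

Inside a tool loop, dispatch(block.name, block.input) would replace a hard-coded call to a single tool, and adding a skill becomes one new entry in the table.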

7.3  References

Resource Description
CodeBites: Skills AI Track page with skill library design and examples.
Anthropic Tool Use Docs Reference for tool definition schema and multi-tool sessions.

8.  Spec-Driven Development

Spec-driven development reverses the usual order: you write what the software must do before you write any code, then hand those documents to an AI to drive implementation. This chapter covers three spec files — Constitution.md, Structure.md, and Spec.md — and the workflow that uses them.
Why specs before prompts?
  1. A prompt without a spec is ambiguous. A spec makes the constraints explicit and auditable before the AI writes a line of code.
  2. The spec becomes the acceptance criterion: generated code that violates the spec is wrong by definition, regardless of whether it compiles.
  3. Separating “what must be true” (Constitution) from “how it is organized” (Structure) from “what each piece does” (Spec) keeps each document small and focused.
  4. Specs are reusable — the same Constitution can govern implementations in multiple languages.

8.1  Constitution.md

States values and hard constraints that apply to all generated code. Written in plain imperative sentences; one page maximum. Examples:
  • Never use unwrap() in library code.
  • All public functions must have a doc comment.
  • No external dependencies beyond the standard library.
  • Functions must not exceed 30 lines.

8.2  Structure.md

Defines the package layout, file names, module boundaries, and dependency rules. The AI reads this before generating any file so it knows where each piece belongs.
## Package Layout
src/
  lib.rs       -- public API re-exports only
  config.rs    -- configuration types; no I/O
  scanner.rs   -- directory traversal; depends on config only
  matcher.rs   -- regex matching; depends on config only
  reporter.rs  -- output formatting; depends on scanner and matcher
  main.rs      -- entry point; depends on all others

## Rules
- scanner.rs must not import from matcher.rs
- reporter.rs must not perform I/O beyond writing to a provided Writer

8.3  Spec.md

Documents each public function or type: signature, preconditions, postconditions, and one example. Written before implementation; updated when requirements change.
## fn scan(root: &Path, config: &Config) -> Result<Vec<Match>>
- root must exist and be a directory; returns Err otherwise
- traverses root recursively, following symlinks if config.follow_symlinks
- returns all Match records where config.pattern matches file content
- excludes files whose extension is not in config.extensions
- example: scan(Path::new("src"), &cfg) -> Ok(vec![Match{path: ..., line: 3}])

8.4  The Workflow

  1. Write Constitution.md — values and constraints, no code yet.
  2. Write Structure.md — package layout and dependency rules.
  3. Write Spec.md — signatures and contracts for each public item.
  4. Start a CLI session or API call; load all three files as context.
  5. Ask the AI to implement one file at a time, citing the spec for each function.
  6. Run tests after each file; feed failures back into the session.
  7. After all files pass, ask the AI to verify the implementation against Constitution.md and report any violations.
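Step 4 of the workflow, loading the three files as context, can be sketched as a function that builds one system prompt. The name build_system_prompt and the heading format used to separate the files are assumptions for this example.

```python
from pathlib import Path

SPEC_FILES = ("Constitution.md", "Structure.md", "Spec.md")


def build_system_prompt(spec_dir: Path) -> str:
    """Concatenate the three spec files, in order, into one system prompt."""
    sections = []
    for name in SPEC_FILES:
        path = spec_dir / name
        if not path.exists():
            # Fail early: a missing spec file would silently weaken the contract.
            raise FileNotFoundError(f"missing spec file: {path}")
        body = path.read_text(encoding="utf-8").strip()
        sections.append(f"# {name}\n\n{body}")
    return "\n\n".join(sections)
```

The returned string could be passed as the system argument of an API call. Because it stays identical across the per-file requests in step 5, it is also a natural candidate for prompt caching.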

8.5  Epilogue — Connecting to the SWDev Track

Spec-driven development is the Code Track’s answer to the SWDev track’s design chapter: Constitution maps to architectural constraints, Structure maps to package design, and Spec maps to specification. The difference is the AI is now the implementer. For a deeper treatment of the design concepts behind these documents, see the SWDev Story: Software Design chapter.

8.6  References

Resource Description
CodeBites: Spec-Driven Development Track page with full workflow examples and template files.
SWDev Story: Software Design The design chapter that underpins spec-driven development.
Anthropic Docs Full API reference, model guides, and prompt engineering tips.