AI Story

AI Story: Reliability

rate limits, retry logic, token budgets, output validation, cost monitoring

11.  Reliability

A working prototype that calls the API once is not a reliable application. Production AI systems must handle transient failures, respect rate limits, stay within token budgets, validate unpredictable model output, and keep costs observable. This chapter covers the operational layer that makes an AI application trustworthy enough to ship.

11.1  Error Types

Anthropic’s Python SDK raises typed exceptions:
  • anthropic.RateLimitError (429): too many requests per minute or too many tokens per minute; retry after waiting.
  • anthropic.InternalServerError (5xx): server-side transient error; safe to retry. It subclasses anthropic.APIStatusError, the base class for every error that carries an HTTP status code (including the 4xx errors in this list).
  • anthropic.APIConnectionError: network failure with no HTTP response; retry with backoff.
  • anthropic.BadRequestError (400): invalid request (malformed messages, invalid model ID); do not retry without fixing the request.
  • anthropic.AuthenticationError (401): invalid or missing API key; do not retry.
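
The retry guidance in this list reduces to a small decision on the HTTP status code. The following is a minimal sketch, not part of the SDK: should_retry is a hypothetical helper name, and None is an assumed stand-in for a connection failure that never produced a response.

```python
from typing import Optional

def should_retry(status_code: Optional[int]) -> bool:
    """Hypothetical helper: map an HTTP status to the retry guidance above.

    None stands for a connection failure with no HTTP response.
    """
    if status_code is None:
        return True    # network failure: retry with backoff
    if status_code == 429:
        return True    # rate limited: retry after waiting
    if status_code >= 500:
        return True    # server-side transient error
    return False       # 400, 401, and other 4xx: fix the request instead
```

When catching anthropic.APIStatusError, the status is available as e.status_code, so the helper can be called as should_retry(e.status_code).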

11.2  Retry with Exponential Backoff

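The SDK has built-in retry logic (tunable through the client's max_retries option), but for agentic loops you often need more control. Note that a wrapper's retries multiply with the SDK's own unless you lower the client setting. A simple backoff wrapper:
import time
import anthropic

def call_with_retry(client, max_retries=5, **kwargs):
    """Call client.messages.create, retrying transient failures with backoff."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except (anthropic.RateLimitError, anthropic.APIConnectionError):
            # 429s and network failures are transient; give up only on the last attempt.
            if attempt == max_retries - 1:
                raise
        except anthropic.APIStatusError as e:
            # Of the remaining status errors, only 5xx is worth retrying.
            if e.status_code < 500 or attempt == max_retries - 1:
                raise
        # Reached only when a retryable error was swallowed above.
        time.sleep(delay)
        delay *= 2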
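
A fixed doubling schedule can cause many clients to retry at the same instant. One common refinement is to cap the delay and add random jitter. The sketch below is an illustration under stated assumptions: backoff_delay is a hypothetical helper, and retry_after is assumed to be a number of seconds the caller has already parsed from a retry-after response header, if the response carried one.

```python
import random
from typing import Optional

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  retry_after: Optional[float] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper: exponential growth, capped, with "full jitter".
    """
    if retry_after is not None:
        # A server-provided hint takes priority over our own schedule.
        return max(0.0, retry_after)
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so clients do not retry in lockstep.
    return random.uniform(0.0, ceiling)
```

In a retry wrapper, time.sleep(backoff_delay(attempt)) would take the place of the fixed delay and its doubling step.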

11.3  Token Budget Management

Track cumulative token usage across an agent session to avoid surprise costs and to detect runaway loops early:
class TokenBudget:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def record(self, usage):
        self.used += usage.input_tokens + usage.output_tokens

    def check(self):
        if self.used >= self.max_tokens:
            raise RuntimeError(
                f"Token budget exhausted: {self.used}/{self.max_tokens}"
            )

budget = TokenBudget(max_tokens=200_000)

response = client.messages.create(...)
budget.record(response.usage)
budget.check()   # raises if over budget
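
To see why a session budget catches runaway loops early, it helps to simulate how usage accumulates. The sketch below is a simulation only, with illustrative numbers: it assumes every turn resends the full conversation history as input, so per-turn cost grows linearly and cumulative usage grows roughly quadratically. The function name and parameters are hypothetical.

```python
def turns_within_budget(budget, new_input_per_turn, output_per_turn):
    """Count the turns that complete before cumulative usage would reach the budget.

    Simulation assumption: each turn sends the whole history plus the new
    input, and the history then grows by that input and the model's output.
    Returns (turns_completed, tokens_used).
    """
    if new_input_per_turn + output_per_turn <= 0:
        raise ValueError("each turn must consume at least one token")
    used = 0
    history = 0
    turns = 0
    while True:
        input_tokens = history + new_input_per_turn
        cost = input_tokens + output_per_turn
        if used + cost >= budget:
            # The next turn would exhaust the budget, so stop here.
            return turns, used
        used += cost
        history = input_tokens + output_per_turn
        turns += 1
```

With a 200,000-token budget, 1,000 new input tokens and 500 output tokens per turn, the simulated session stops after 15 turns, far fewer than the 133 turns a flat 1,500-tokens-per-turn estimate would suggest.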

11.4  Output Validation

Never pass model output directly to security-sensitive code paths. The model can produce plausible-looking but wrong, incomplete, or malicious content.
  • Parse structured output with Pydantic (Chapter 6) and reject invalid shapes.
  • Check stop_reason == "end_turn" before trusting a response as complete.
  • For generated code: run it in a subprocess with a timeout and resource limits; do not exec() it directly.
  • For file path arguments from model output: validate against an allowlist of directories before passing to filesystem APIs.
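
The file path check in the last item can be sketched with the standard library alone. This is a minimal illustration assuming Python 3.9 or later; resolve_allowed_path and allowed_dirs are hypothetical names, and relative paths are resolved against the current working directory.

```python
from pathlib import Path

def resolve_allowed_path(candidate: str, allowed_dirs: list[str]) -> Path:
    """Return the resolved path if it lies inside an allowed directory.

    Raises ValueError otherwise. resolve() collapses ".." segments and
    follows symlinks, so the check applies to the real target.
    """
    resolved = Path(candidate).resolve()
    for base in allowed_dirs:
        if resolved.is_relative_to(Path(base).resolve()):
            return resolved
    raise ValueError(f"path outside allowed directories: {candidate!r}")
```

Comparing resolved paths rather than raw strings matters: a prefix test on the raw string would accept "/data/../etc/passwd" for an allowlist of "/data".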

11.5  Cost Monitoring

Build token accounting into your logging from day one:
import logging

logger = logging.getLogger("ai_app")

def log_usage(model, usage, task_name=""):
    # Cache fields can be absent or None; "or 0" keeps the %d format valid.
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    logger.info(
        "api_call model=%s task=%s "
        "in=%d out=%d cache_write=%d cache_read=%d",
        model, task_name,
        usage.input_tokens,
        usage.output_tokens,
        cache_write,
        cache_read,
    )
Aggregate logs in a dashboard or spreadsheet. Unexpected spikes in input tokens usually mean a conversation history is growing unchecked. Unexpected spikes in output tokens often mean max_tokens is too low: responses are cut off (stop_reason == "max_tokens"), the application retries, and every truncated attempt is still billed.
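
Token counts become easier to act on once they are converted to an estimated cost per task. The sketch below is illustrative only: the prices are placeholders rather than real rates, and estimate_cost and cost_by_task are hypothetical helpers that ignore cache pricing.

```python
from collections import defaultdict

# PLACEHOLDER prices in USD per million tokens; substitute your model's real rates.
PRICES_PER_MTOK = {"input": 3.00, "output": 15.00}

def estimate_cost(input_tokens, output_tokens, prices=PRICES_PER_MTOK):
    """Estimated USD cost of one call, ignoring cache reads and writes."""
    return (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000

def cost_by_task(records):
    """Sum estimated cost per task.

    records: iterable of (task_name, input_tokens, output_tokens) tuples,
    for example parsed back out of the api_call log lines.
    """
    totals = defaultdict(float)
    for task, tokens_in, tokens_out in records:
        totals[task] += estimate_cost(tokens_in, tokens_out)
    return dict(totals)
```

Grouping by task rather than by model makes it easier to see which feature of the application is responsible for a spike.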

11.6  References

Resource               Description
API Errors Reference   Complete list of error codes and recommended handling strategies.
Rate Limits            Current rate limit tiers by model and usage tier.
Usage Dashboard        Anthropic Console page for monitoring token and cost usage.