Basics Story

Chapter #2 – CPUs

fetch-decode-execute, cores, caches, RAM, memory hierarchy, I/O

2.0  Prologue

A computer is a machine that fetches instructions from memory, decodes them, and executes them. Understanding the physical hardware helps explain why software behaves the way it does — why some operations are fast and others are slow, why memory layout matters, and why device drivers exist.

2.1  The Central Processing Unit (CPU)

The CPU is the brain of the computer. It executes a continuous fetch–decode–execute cycle: read the next instruction from memory, determine what it does, carry it out, advance the instruction pointer. Key CPU properties:
  • Clock speed (GHz): billions of clock cycles per second. A simple instruction takes one to a few cycles, and pipelined, superscalar cores can complete several instructions per cycle.
  • Core count: the number of independent execution units on the same chip. A 12-core CPU can execute 12 instruction streams simultaneously.
  • Hyper-threading / SMT: each physical core presents (typically) two logical cores to the OS. The two hardware threads share the core’s execution units, so one thread can keep the core busy while the other waits on memory.
  • Out-of-order execution: the CPU reorders instructions internally to avoid stalls caused by slow memory reads.
  • Speculative execution: the CPU guesses the outcome of a branch and begins executing the predicted path before the branch is resolved.
  • SIMD (Single Instruction, Multiple Data): one instruction applies to a vector of values (e.g., add 8 floats at once with AVX).
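The fetch–decode–execute cycle described at the start of this section can be sketched as a toy interpreter. This is only an illustration: the four opcodes (LOAD, ADD, JNZ, HALT), the single accumulator, and the `run` function are invented for the example and do not correspond to any real instruction set.

```python
# Toy machine: "memory" is a list of (opcode, operand) tuples.
# The opcodes and the single-accumulator design are hypothetical.

def run(program):
    acc = 0      # accumulator register
    ip = 0       # instruction pointer
    executed = 0 # number of instructions executed
    while True:
        opcode, operand = program[ip]  # fetch the next instruction
        ip += 1                        # advance the instruction pointer
        executed += 1
        # decode the opcode and execute it
        if opcode == "LOAD":
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "JNZ":          # jump if the accumulator is non-zero
            if acc != 0:
                ip = operand
        elif opcode == "HALT":
            return acc, executed
        else:
            raise ValueError(f"unknown opcode: {opcode}")

# Count down from 3 to 0 using a backward branch.
program = [
    ("LOAD", 3),
    ("ADD", -1),
    ("JNZ", 1),
    ("HALT", 0),
]
result, executed = run(program)
print(result, executed)
```

A real CPU overlaps these steps in a pipeline and, as the list above notes, may reorder or speculate past the branch; the loop here runs them strictly one at a time.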
CPU cache hierarchy:
  Level       Typical size    Latency      Scope
  Registers   bytes           <0.5 ns      per core
  L1 cache    32–128 KB       ~1 ns        per core
  L2 cache    256 KB–1 MB     ~4 ns        per core
  L3 cache    4–64 MB         ~20 ns       shared across cores
  RAM         8–128 GB        ~60–100 ns   whole system
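Code implication: an access served from L1 or L2 cache (~1–4 ns) is roughly 15–100× faster than one that must go to RAM (~60–100 ns). Iterating over a contiguous array is cache-friendly, because each cache line fetched from RAM serves several neighboring elements; chasing pointers through a linked list is not, because each node may sit on a different cache line.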
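The difference between the two access patterns can be illustrated with a toy cache model. The sketch below assumes a direct-mapped cache with 64-byte lines and 32 KB capacity (real L1 caches are set-associative), 8-byte elements, and linked-list nodes placed at pseudo-random addresses in a 64 MB region. It counts misses rather than measuring time, and every name in it is hypothetical.

```python
import random

LINE_SIZE = 64    # bytes per cache line (assumed)
NUM_LINES = 512   # 512 lines * 64 B = 32 KB (assumed cache size)

def miss_count(addresses):
    """Count misses in a toy direct-mapped cache for a sequence of byte addresses."""
    slots = [None] * NUM_LINES      # which memory block each cache slot holds
    misses = 0
    for addr in addresses:
        block = addr // LINE_SIZE   # 64-byte block containing this address
        slot = block % NUM_LINES    # direct-mapped: one possible slot per block
        if slots[slot] != block:
            misses += 1
            slots[slot] = block     # fill the line, evicting the old contents
    return misses

N = 100_000
ELEM = 8  # bytes per element (assumed)

# Contiguous array: element i lives at byte address i * ELEM.
array_addrs = [i * ELEM for i in range(N)]

# Stand-in for a linked list: N nodes at pseudo-random 8-byte-aligned addresses.
rng = random.Random(0)
list_addrs = [rng.randrange(0, 64 * 1024 * 1024, ELEM) for _ in range(N)]

array_misses = miss_count(array_addrs)
list_misses = miss_count(list_addrs)
print(array_misses, list_misses)
```

In this model the array misses once per 64-byte line, i.e. once every eight elements, while almost every scattered node access misses.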

2.2  Main Memory (RAM)

RAM (Random Access Memory) is byte-addressable, volatile (contents lost at power-off), and much slower than cache. Typical modern systems have 8–128 GB of DRAM (Dynamic RAM). Characteristics:
  • Access latency: ~60–100 ns, roughly 180–400 clock cycles on a 3–4 GHz CPU.
  • Bandwidth: DDR5 provides roughly 38–51 GB/s of peak bandwidth per 64-bit memory channel (DDR5-4800 to DDR5-6400); a dual-channel system doubles that to ~77–102 GB/s.
  • NUMA (Non-Uniform Memory Access): on multi-socket servers, RAM attached to socket 0 is faster for socket 0’s cores than for socket 1’s cores. NUMA-unaware code can suffer a 2–4× penalty.
Every variable your program declares ultimately lives in RAM when it is not in a register or cache. The OS virtual memory system decides which pages are resident in RAM and which are paged out to disk.
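Latency in clock cycles and peak memory bandwidth both follow from simple arithmetic. The sketch below assumes a 3–4 GHz clock, a 64-bit-wide memory channel, and DDR5-4800 (4800 mega-transfers per second); the function names are made up for illustration, and the result is a theoretical peak, not a sustained rate.

```python
def latency_in_cycles(latency_ns, clock_ghz):
    # At 1 GHz one clock cycle lasts 1 ns, so cycles = ns * GHz.
    return latency_ns * clock_ghz

def peak_bandwidth_gb_s(mega_transfers_per_s, bus_width_bits=64):
    # Each transfer moves bus_width_bits / 8 bytes.
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

fast_case = latency_in_cycles(60, 3.0)    # 60 ns at an assumed 3 GHz
slow_case = latency_in_cycles(100, 4.0)   # 100 ns at an assumed 4 GHz
ddr5_4800 = peak_bandwidth_gb_s(4800)     # one 64-bit DDR5-4800 channel

print(f"RAM access: {fast_case:.0f} to {slow_case:.0f} cycles")
print(f"DDR5-4800 peak: {ddr5_4800:.1f} GB/s per channel")
```

A core stalled on a single RAM access therefore forgoes hundreds of cycles of potential work, which is why the out-of-order and SMT techniques from section 2.1 exist.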

2.3  Storage

Storage devices hold persistent data. Technology and access patterns differ widely:
  Technology            Capacity    Random read latency   Sequential bandwidth
  HDD (spinning disk)   1–20 TB     ~5 ms                 ~150 MB/s
  SATA SSD              0.5–4 TB    ~0.1 ms               ~550 MB/s
  NVMe SSD (PCIe 4)     0.5–8 TB    ~0.02 ms              ~7 GB/s
  RAM (reference)       8–128 GB    ~0.0001 ms            ~50 GB/s
Applications access storage through the OS file system API (open, read, write, close). The OS translates these to block-device operations, applies buffering, and manages the page cache to accelerate repeated access to the same data.
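The open/read/write/close sequence can be sketched with Python's `os` module, which exposes thin wrappers over the corresponding OS calls. The temporary file and its contents are arbitrary choices for the example; whether the read is actually served from the page cache depends on the OS and cannot be observed from this code.

```python
import os
import tempfile

# Create a scratch file; mkstemp returns an already-open descriptor and a path.
fd, path = tempfile.mkstemp()
os.close(fd)

payload = b"hello, block device\n"

# open -> write -> close
fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
written = os.write(fd, payload)   # data lands in OS buffers first
os.fsync(fd)                      # ask the OS to flush those buffers to the device
os.close(fd)

# open -> read -> close
fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 4096)          # read up to 4096 bytes
os.close(fd)

os.remove(path)
print(written, data == payload)
```

Without the `os.fsync` call the write would still succeed from the program's point of view; the OS would simply write the buffered data to the device at a later time of its choosing.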

2.4  The Memory Hierarchy

Every level of the hierarchy exists because building memory that is simultaneously fast, large, and cheap is impossible: the designer can have at most two of those three properties at once. The hierarchy, from fastest/smallest to slowest/largest:
  Level       Latency          Typical size    Scope
  Registers   <0.5 ns          bytes           per core
  L1 cache    ~1 ns            32–128 KB       per core
  L2 cache    ~4 ns            256 KB–1 MB     per core
  L3 cache    ~20 ns           4–64 MB         shared
  RAM         ~100 ns          8–128 GB        whole system
  NVMe SSD    ~20,000 ns       TB              persistent
  HDD         ~5,000,000 ns    TB              persistent
The CPU hardware and OS transparently manage much of the hierarchy through caching and demand paging. But programmer awareness of data locality (keeping frequently accessed data together in memory) can produce 5–20× performance differences.
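One way to see where a locality-driven speedup of that size can come from is to average the per-level latencies, weighted by how often each level serves an access. The latencies below are the chapter's approximate figures; the two hit distributions are invented purely for illustration, and the model ignores overlap between accesses, prefetching, and paging.

```python
# Approximate latencies from the hierarchy above, in nanoseconds.
LATENCY_NS = {"L1": 1, "L2": 4, "L3": 20, "RAM": 100}

def average_access_ns(served_by):
    """Weighted average latency; served_by maps level -> fraction of accesses."""
    assert abs(sum(served_by.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(LATENCY_NS[level] * frac for level, frac in served_by.items())

# Hypothetical distributions: where accesses are served for two programs.
good_locality = {"L1": 0.95, "L2": 0.04, "L3": 0.009, "RAM": 0.001}
poor_locality = {"L1": 0.50, "L2": 0.20, "L3": 0.15, "RAM": 0.15}

good_ns = average_access_ns(good_locality)
poor_ns = average_access_ns(poor_locality)
ratio = poor_ns / good_ns

print(f"good locality: {good_ns:.2f} ns per access")
print(f"poor locality: {poor_ns:.2f} ns per access")
print(f"slowdown: {ratio:.1f}x")
```

Even though the poorly behaved program still hits L1 half the time, the 15% of accesses that reach RAM dominate its average cost.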

2.5  I/O Devices and System Buses

The CPU connects to the rest of the system over buses:
  • PCIe (Peripheral Component Interconnect Express): the primary high-speed bus, connecting GPUs, NVMe SSDs, and network cards. PCIe 5.0 provides ~128 GB/s bidirectional bandwidth for a ×16 slot.
  • DDR memory bus: connects CPU to RAM. Separate from PCIe.
  • USB (Universal Serial Bus): connects keyboards, mice, external storage, and peripherals. USB 3.2 Gen 2 reaches ~10 Gb/s.
  • SATA: older interface for HDDs and SATA SSDs (~6 Gb/s).
Device drivers are kernel-mode software that translate the OS’s abstract I/O requests (read block N from device X) into the device-specific commands understood by the hardware.
GPU: a massively parallel co-processor with thousands of simpler cores. Excellent for data-parallel workloads (graphics, matrix multiplication, neural network training). Communicates with the CPU over PCIe; data transfers across that bus are expensive relative to GPU-local memory bandwidth (~1–3 TB/s on high-end cards).
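To put the bus figures on a common scale, the sketch below estimates how long it takes to move 1 GB over each link. It is a back-of-the-envelope calculation under stated assumptions: raw line rates are converted from Gb/s to GB/s by dividing by 8, encoding and protocol overhead are ignored, PCIe 5.0 ×16 is taken as half of the ~128 GB/s bidirectional figure per direction, and GPU-local memory uses the low end (1 TB/s) of the range above.

```python
# Approximate one-way rates in GB/s (assumptions noted in the comments).
LINKS_GB_PER_S = {
    "SATA (6 Gb/s raw)": 6 / 8,
    "USB 3.2 Gen 2 (10 Gb/s raw)": 10 / 8,
    "PCIe 5.0 x16, one direction": 128 / 2,   # half of the bidirectional figure
    "GPU-local memory (1 TB/s)": 1000,        # low end of the 1-3 TB/s range
}

def transfer_ms(size_gb, link_gb_per_s):
    """Ideal time in milliseconds to move size_gb over a link, ignoring overhead."""
    return size_gb / link_gb_per_s * 1000

for name, rate in LINKS_GB_PER_S.items():
    print(f"{name}: {transfer_ms(1.0, rate):.1f} ms per GB")

pcie_ms = transfer_ms(1.0, LINKS_GB_PER_S["PCIe 5.0 x16, one direction"])
gpu_ms = transfer_ms(1.0, LINKS_GB_PER_S["GPU-local memory (1 TB/s)"])
print(f"PCIe is about {pcie_ms / gpu_ms:.0f}x slower than GPU-local memory")
```

Under these assumptions, data already resident in GPU memory can be read an order of magnitude faster than it can be brought across PCIe, which is why GPU programs try to keep working sets on the device between kernels.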

2.6  Epilogue

Hardware determines the fundamental cost model of computation. Cache misses, page faults, disk seeks, and PCIe transfers each carry specific latency penalties. The next chapter examines GPUs — the massively parallel co-processors that transformed graphics, AI, and scientific computing.

2.7  References

CPU Cache — Wikipedia
What Every Programmer Should Know About Memory — Drepper
Memory Hierarchy — Wikipedia
PCIe — Wikipedia