BasicsStory_CPUs.html
copyright © James Fawcett
Revised: 05/15/2026
2.0 Prologue
A computer is a machine that fetches instructions from memory, decodes them, and
executes them. Understanding the physical hardware helps explain why software behaves
the way it does — why some operations are fast and others are slow, why memory
layout matters, and why device drivers exist.
2.1 The Central Processing Unit (CPU)
The CPU is the brain of the computer. It runs a continuous fetch–decode–execute
cycle: read the next instruction from memory, determine what it does, carry it out, and
advance the instruction pointer.
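The cycle can be sketched as a toy interpreter. This is only an illustration: the four opcodes (LOAD, ADD, JNZ, HALT) and the single accumulator are hypothetical, not a real instruction set. The instruction pointer is advanced right after the fetch so that a jump can overwrite it, which is how most real CPUs order these steps.

```python
# Toy machine with a hypothetical 4-instruction ISA, for illustration only.
def run(program):
    """Execute a list of (opcode, operand) tuples and return the accumulator."""
    acc = 0   # the single data register
    ip = 0    # instruction pointer: index of the next instruction
    while True:
        opcode, operand = program[ip]   # fetch
        ip += 1                         # advance the instruction pointer
        if opcode == "LOAD":            # decode, then execute
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "JNZ":           # jump to operand if acc is non-zero
            if acc != 0:
                ip = operand
        elif opcode == "HALT":
            return acc
        else:
            raise ValueError(f"unknown opcode: {opcode}")

# Count down from 3 to 0, then halt.
countdown = [
    ("LOAD", 3),
    ("ADD", -1),
    ("JNZ", 1),
    ("HALT", 0),
]
result = run(countdown)
print(result)  # 0
```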
Key CPU properties:
- Clock speed (GHz): how many clock ticks occur per second. A simple instruction
often completes in one to several ticks, and modern cores can complete more than one
instruction per tick.
- Core count: the number of independent processor cores on the same chip.
A 12-core CPU can execute 12 instruction streams simultaneously.
- Hyper-threading / SMT: each physical core presents two logical cores
to the OS. The two threads share the core's execution units, so one can make progress
while the other waits on memory.
- Out-of-order execution: the CPU reorders instructions internally
to avoid stalls caused by slow memory reads.
- Speculative execution: the CPU guesses the outcome of a branch and
begins executing the predicted path before the branch is resolved.
- SIMD (Single Instruction, Multiple Data): one instruction applies
to a vector of values (e.g., add 8 floats at once with AVX).
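As a rough illustration of how these properties multiply, the sketch below estimates an upper bound on float additions per second. The 3.5 GHz clock and the one-vector-add-per-tick rate are assumptions chosen for the example; real cores differ, and memory stalls usually keep code far below this bound.

```python
def peak_simd_ops_per_second(cores, clock_ghz, simd_lanes, ops_per_cycle=1):
    """Idealized upper bound: every core issues one full-width SIMD op per tick."""
    return cores * clock_ghz * 1e9 * simd_lanes * ops_per_cycle

# 12 cores (from the text), 8 float lanes (AVX), assumed 3.5 GHz clock.
peak = peak_simd_ops_per_second(cores=12, clock_ghz=3.5, simd_lanes=8)
print(f"{peak / 1e9:.0f} billion float adds per second")  # 336 billion
```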
CPU cache hierarchy:
| Level | Typical size | Latency | Scope |
|---|---|---|---|
| Registers | bytes | <0.5 ns | per core |
| L1 cache | 32–128 KB | ~1 ns | per core |
| L2 cache | 256 KB–1 MB | ~4 ns | per core |
| L3 cache | 4–64 MB | ~20 ns | shared across cores |
| RAM | 8–128 GB | ~60–100 ns | whole system |
Code implication: an access served from L1 or L2 cache completes roughly 15–100× faster
than one that must go to RAM (1–4 ns versus 60–100 ns in the table above). Iterating over
a contiguous array is cache-friendly; chasing pointers through a linked list is not.
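To see how hit rates turn into average cost, here is a minimal model that weights each latency from the table by the fraction of accesses served at that level. The hit rates are made-up values for illustration, and 80 ns is taken as an assumed midpoint for RAM.

```python
# Latencies from the table above; 80 ns is an assumed midpoint for RAM.
LATENCY_NS = {"L1": 1.0, "L2": 4.0, "L3": 20.0, "RAM": 80.0}

def average_access_ns(hit_rates):
    """hit_rates[level] = fraction of accesses reaching that level that hit there."""
    remaining = 1.0   # fraction of accesses not yet served
    total = 0.0
    for level in ("L1", "L2", "L3"):
        served = remaining * hit_rates[level]
        total += served * LATENCY_NS[level]
        remaining -= served
    total += remaining * LATENCY_NS["RAM"]   # everything else goes to RAM
    return total

# Hypothetical hit rates: a contiguous scan versus scattered pointer chasing.
friendly = average_access_ns({"L1": 0.95, "L2": 0.80, "L3": 0.80})
hostile = average_access_ns({"L1": 0.20, "L2": 0.20, "L3": 0.20})
print(f"cache-friendly: {friendly:.2f} ns, cache-hostile: {hostile:.2f} ns")
```

With these assumed rates the two averages differ by about 30×, which falls inside the range quoted above.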
2.2 Main Memory (RAM)
RAM (Random Access Memory) is byte-addressable, volatile (contents lost at power-off),
and much slower than cache. Typical modern systems have 8–128 GB of DRAM
(Dynamic RAM).
Characteristics:
- Access latency: ~60–100 ns, roughly 180–400 clock cycles
on a 3–4 GHz CPU.
- Bandwidth: DDR5 provides roughly 38–51 GB/s per 64-bit DIMM channel
(DDR5-4800 to DDR5-6400), so a dual-channel system reaches roughly 77–102 GB/s in aggregate.
- NUMA (Non-Uniform Memory Access): on multi-socket servers, RAM
attached to socket 0 is faster for socket 0’s cores than for socket 1’s
cores. NUMA-unaware code can suffer a 2–4× penalty.
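The latency and bandwidth figures follow from simple arithmetic, sketched below. The function names are illustrative, and the 64-bit (8-byte) channel width is an assumption that matches a standard DIMM.

```python
def latency_in_cycles(latency_ns, clock_ghz):
    """A clock of N GHz ticks N times per nanosecond."""
    return latency_ns * clock_ghz

def channel_bandwidth_gb_s(mega_transfers_per_s, bus_bytes=8):
    """Peak bandwidth = transfers per second x bytes per transfer."""
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9

print(latency_in_cycles(60, 3.0))    # 180.0 cycles
print(latency_in_cycles(100, 4.0))   # 400.0 cycles
print(channel_bandwidth_gb_s(4800))  # DDR5-4800: about 38.4 GB/s
print(channel_bandwidth_gb_s(6400))  # DDR5-6400: about 51.2 GB/s
```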
Every variable your program declares ultimately lives in RAM when it is not in a register
or cache. The OS virtual memory system decides which pages are resident in RAM
and which are paged out to disk.
2.3 Storage
Storage devices hold persistent data. Technology and access patterns differ widely:
| Technology | Capacity | Random Read Latency | Sequential Bandwidth |
|---|---|---|---|
| HDD (spinning disk) | 1–20 TB | ~5 ms | ~150 MB/s |
| SATA SSD | 0.5–4 TB | ~0.1 ms | ~550 MB/s |
| NVMe SSD (PCIe 4) | 0.5–8 TB | ~0.02 ms | ~7 GB/s |
| RAM (reference) | 8–128 GB | ~0.0001 ms | ~50 GB/s |
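A rough cost model shows why access pattern matters as much as the device. The sketch below charges each request the random-read latency from the table and then streams the data at the sequential bandwidth. It ignores queueing, request overlap, and caching, so treat the results as order-of-magnitude estimates only.

```python
# (random read latency in seconds, sequential bandwidth in bytes/second),
# taken from the table above.
DEVICES = {
    "HDD": (5e-3, 150e6),
    "SATA SSD": (0.1e-3, 550e6),
    "NVMe SSD": (0.02e-3, 7e9),
    "RAM": (0.0001e-3, 50e9),
}

def read_time_s(device, total_bytes, requests):
    """Simplified model: every request pays the full latency, one at a time."""
    latency_s, bandwidth = DEVICES[device]
    return requests * latency_s + total_bytes / bandwidth

one_gb = 10**9
for name in DEVICES:
    sequential = read_time_s(name, one_gb, requests=1)
    scattered = read_time_s(name, one_gb, requests=one_gb // 4096)  # 4 KB reads
    print(f"{name:9s} sequential {sequential:9.3f} s   4 KB random {scattered:9.3f} s")
```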
Applications access storage through the OS file system API (open, read, write, close).
The OS translates these to block-device operations, applies buffering, and manages
the page cache to accelerate repeated access to the same data.
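The four calls map directly onto Python's `os` module, which wraps the OS file API. A minimal sketch follows; the file name `demo.txt` and its contents are arbitrary choices for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")   # arbitrary file name for the demo

    # open + write + close: the OS buffers the bytes in the page cache.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    written = os.write(fd, b"hello, storage\n")
    os.close(fd)

    # open + read + close: this read is likely served from the page cache.
    fd = os.open(path, os.O_RDONLY)
    data = os.read(fd, 4096)               # read up to 4096 bytes
    os.close(fd)

print(written, data)
```

Higher-level interfaces such as Python's built-in `open()` add their own user-space buffering on top of these calls.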
2.4 The Memory Hierarchy
Every level of the hierarchy exists because no single memory technology is fast, large,
and cheap at once: a designer can have at most two of those three properties.
The hierarchy, from fastest/smallest to slowest/largest:
Registers   <0.5 ns         bytes          per core
L1 cache    ~1 ns           32–128 KB      per core
L2 cache    ~4 ns           256 KB–1 MB    per core
L3 cache    ~20 ns          4–64 MB        shared
RAM         ~100 ns         8–128 GB       whole system
NVMe SSD    ~20,000 ns      TB             persistent
HDD         ~5,000,000 ns   TB             persistent
The CPU hardware and OS transparently manage much of the hierarchy through caching
and demand paging. But programmer awareness of data locality
— keeping frequently accessed data together in memory — can produce
5–20× performance differences.
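One way to make locality concrete is to count how often a traversal moves to a different cache line. The sketch below assumes 64-byte lines, 8-byte elements, and a row-major array; it models only line changes between consecutive accesses, not a real cache.

```python
LINE_BYTES = 64   # assumed cache line size
ELEM_BYTES = 8    # assumed element size (e.g., a double)

def line_changes(rows, cols, row_major_order):
    """Count accesses that land on a different cache line than the previous one."""
    if row_major_order:
        order = ((r, c) for r in range(rows) for c in range(cols))
    else:
        order = ((r, c) for c in range(cols) for r in range(rows))
    changes = 0
    prev_line = None
    for r, c in order:
        address = (r * cols + c) * ELEM_BYTES   # row-major memory layout
        line = address // LINE_BYTES
        if line != prev_line:
            changes += 1
            prev_line = line
    return changes

by_rows = line_changes(256, 256, row_major_order=True)
by_cols = line_changes(256, 256, row_major_order=False)
print(by_rows, by_cols)  # 8192 65536
```

Walking the same data column by column touches a new line on every access, eight times as many line changes as the row-by-row walk, because eight 8-byte elements share each 64-byte line.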
2.5 I/O Devices and System Buses
The CPU connects to the rest of the system over buses:
- PCIe (Peripheral Component Interconnect Express): the primary
high-speed bus, connecting GPUs, NVMe SSDs, and network cards. A PCIe 5.0 ×16 slot
provides ~64 GB/s in each direction (~128 GB/s combined).
- DDR memory bus: connects CPU to RAM. Separate from PCIe.
- USB (Universal Serial Bus): connects keyboards, mice, external
storage, and peripherals. USB 3.2 Gen 2 reaches ~10 Gb/s.
- SATA: older interface for HDDs and SATA SSDs (~6 Gb/s).
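The quoted figures are raw line rates; usable payload is lower once line encoding is subtracted. The sketch below converts line rate to payload bandwidth using the encodings these standards specify (128b/130b for PCIe 5.0, 8b/10b for SATA, 128b/132b for USB 3.2 Gen 2). Protocol overhead above the encoding layer is ignored, so real throughput is lower still.

```python
def payload_gb_per_s(line_rate_gbit, lanes=1, encoding=1.0):
    """Per-direction payload bandwidth in GB/s after line-encoding overhead."""
    return line_rate_gbit * lanes * encoding / 8   # 8 bits per byte

pcie5_x16 = payload_gb_per_s(32, lanes=16, encoding=128 / 130)
sata3 = payload_gb_per_s(6, encoding=8 / 10)
usb32_gen2 = payload_gb_per_s(10, encoding=128 / 132)

print(f"PCIe 5.0 x16:   {pcie5_x16:.1f} GB/s per direction")
print(f"SATA (6 Gb/s):  {sata3:.2f} GB/s")
print(f"USB 3.2 Gen 2:  {usb32_gen2:.2f} GB/s")
```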
Device drivers are kernel-mode software that translate the OS’s
abstract I/O requests (read block N from device X) into the device-specific commands
understood by the hardware.
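The translation role of a driver can be sketched in user space as an abstract interface with a device-specific implementation behind it. Everything here is hypothetical: `BlockDevice`, `RamDisk`, and the 512-byte block size are names and values invented for the example. A real driver runs in the kernel and issues commands to hardware rather than slicing a byte array.

```python
from abc import ABC, abstractmethod

BLOCK_SIZE = 512   # assumed block size in bytes

class BlockDevice(ABC):
    """The abstract request the OS makes: read or write block N."""
    @abstractmethod
    def read_block(self, n: int) -> bytes: ...
    @abstractmethod
    def write_block(self, n: int, data: bytes) -> None: ...

class RamDisk(BlockDevice):
    """Stand-in 'driver' that maps block numbers onto an in-memory buffer."""
    def __init__(self, num_blocks: int):
        self._num_blocks = num_blocks
        self._store = bytearray(num_blocks * BLOCK_SIZE)

    def _offset(self, n: int) -> int:
        if not 0 <= n < self._num_blocks:
            raise IndexError(f"block {n} out of range")
        return n * BLOCK_SIZE   # device-specific address translation

    def read_block(self, n: int) -> bytes:
        start = self._offset(n)
        return bytes(self._store[start:start + BLOCK_SIZE])

    def write_block(self, n: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        start = self._offset(n)
        self._store[start:start + BLOCK_SIZE] = data

disk = RamDisk(num_blocks=4)
disk.write_block(2, b"x" * BLOCK_SIZE)
print(disk.read_block(2)[:4])  # b'xxxx'
```

Code written against `BlockDevice` does not change when a different device sits behind it, which is the point of the driver layer.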
GPU: a massively parallel co-processor with thousands of simpler
cores. Excellent for data-parallel workloads (graphics, matrix multiplication, neural
network training). Communicates with the CPU over PCIe; data transfers across that bus
are expensive relative to GPU-local memory bandwidth (~1–3 TB/s on high-end cards).
2.6 Epilogue
Hardware determines the fundamental cost model of computation. Cache misses,
page faults, disk seeks, and PCIe transfers each carry specific latency penalties.
The next chapter examines GPUs — the massively parallel co-processors that
transformed graphics, AI, and scientific computing.
2.7 References
CPU Cache — Wikipedia
What Every Programmer Should Know About Memory — Drepper
Memory Hierarchy — Wikipedia
PCIe — Wikipedia