Basics Story

Chapter #4 – Software & Operating Systems

OS role, kernel, processes, threads, virtual memory, file systems

4.0  Prologue

An operating system is the layer between hardware and user applications. It provides the illusion of exclusive hardware access, manages shared resources safely, and enforces isolation between processes. Without an OS, every program would need to directly manage the display, disk, keyboard, and network — and could crash any other running program.

4.1  What the OS Provides

  • Process management: creates, schedules, and terminates processes; enforces isolation between them.
  • Memory management: gives each process a private virtual address space; maps virtual to physical pages.
  • File system: persistent named storage, directory hierarchy, permissions.
  • I/O management: device drivers, buffering, and a uniform read/write API regardless of device type.
  • Networking: TCP/IP stack, socket API.
  • Security: user accounts, access permissions, process isolation.
  • System calls: the controlled gateway that lets user-mode code request kernel services.
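The "uniform read/write API" point can be illustrated with a minimal sketch, assuming a POSIX-like system and a payload small enough to fit in the pipe buffer. The helper names roundtrip_via_pipe and roundtrip_via_file are made up for this example; the os functions they use are thin wrappers over the corresponding system calls.

```python
import os
import tempfile


def roundtrip_via_pipe(payload: bytes) -> bytes:
    """Send bytes through a pipe using raw file descriptors."""
    read_fd, write_fd = os.pipe()              # pipe(2): returns two descriptors
    try:
        os.write(write_fd, payload)            # write(2)
        os.close(write_fd)                     # close the write end
        write_fd = -1
        return os.read(read_fd, len(payload))  # read(2)
    finally:
        if write_fd != -1:
            os.close(write_fd)
        os.close(read_fd)


def roundtrip_via_file(payload: bytes) -> bytes:
    """Same write/read calls, but the descriptor refers to a disk file."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)                  # identical call, different device
        os.lseek(fd, 0, os.SEEK_SET)           # rewind before reading back
        return os.read(fd, len(payload))
    finally:
        os.close(fd)
        os.unlink(path)
```

Both helpers call the same os.write and os.read on an integer descriptor; only the way the descriptor was obtained differs.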

4.2  Kernel Space vs. User Space

Modern CPUs provide (at least) two protection rings:
  • Kernel mode (ring 0): unrestricted hardware access. OS kernel code runs here. A bug here can crash or corrupt the entire system.
  • User mode (ring 3): restricted. Applications run here. Hardware access is mediated through system calls, which transfer control to the kernel through a software interrupt or a dedicated instruction (such as syscall on x86-64).
A system call (e.g., read(), write(), open() on Linux) crosses from user space to kernel space, performs the requested operation, then returns. The crossing costs ~100–500 ns, which is why applications use buffered I/O — accumulating data and flushing in large chunks rather than calling write() for every byte.
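The effect of buffering can be sketched without touching a real device. In the example below, CountingRaw is a hypothetical stand-in for a file descriptor: each call to its write() models one system call. Python's io.BufferedWriter then plays the role of the user-space buffer. The 4096-byte buffer size is an assumption chosen for illustration.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical raw sink: every write() here models one system call."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.data += bytes(b)
        return len(b)


def write_unbuffered(n):
    """Write n single bytes directly: one modelled system call per byte."""
    raw = CountingRaw()
    for _ in range(n):
        raw.write(b"x")
    return raw.calls


def write_buffered(n, buffer_size=4096):
    """Write n single bytes through a user-space buffer, then flush once."""
    raw = CountingRaw()
    buffered = io.BufferedWriter(raw, buffer_size=buffer_size)
    for _ in range(n):
        buffered.write(b"x")   # copied into the buffer, usually no raw write
    buffered.flush()           # push whatever is left to the raw sink
    return raw.calls
```

Writing 10,000 bytes unbuffered produces 10,000 modelled calls, while the buffered version needs only a handful, one per filled buffer plus the final flush.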

4.3  Processes

A process is a running instance of a program. Each process has:
  • An isolated virtual address space (code, heap, stack, mapped files)
  • One or more threads of execution
  • File descriptors (open files, sockets, pipes)
  • A PID (process identifier) assigned by the OS
  • Environment variables and command-line arguments
Isolation is enforced by the hardware MMU: one process cannot read or write another process’s memory without explicit sharing (shared memory or IPC). A crash in one process does not take down others. Process creation:
  • POSIX (Linux/macOS): fork() clones the calling process; the child typically calls exec() to replace its image with a new program.
  • Windows: CreateProcess() creates a new process from an executable path, optionally inheriting handles.
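The POSIX fork-then-exec pattern can be sketched as follows. This assumes a POSIX system (os.fork is not available on Windows); spawn_and_wait is a name invented for this example, and the exit code 127 for a failed exec is a shell convention adopted here, not something the OS requires.

```python
import os
import sys


def spawn_and_wait(argv):
    """Run argv in a child process via fork + exec and return its exit code."""
    pid = os.fork()                      # clone the calling process
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)     # does not return on success
        except OSError:
            pass
        os._exit(127)                    # reached only if exec failed
    # Parent: block until the child terminates, then decode its status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)          # child was killed by a signal
```

For instance, spawn_and_wait([sys.executable, "-c", "raise SystemExit(7)"]) starts a second Python interpreter and reports the exit code that interpreter chose.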

4.4  Threads

A thread is a unit of execution within a process. All threads in a process share:
  • The virtual address space (heap, globals, code)
  • File descriptors
Each thread has its own:
  • Stack (typically 1–8 MB)
  • CPU registers (including the instruction pointer)
  • Thread-local storage (TLS)
Context switching between threads in the same process is faster than between processes (no address-space switch, no TLB flush) but still costs ~1–5 µs. Shared address space enables communication without copying but also introduces the possibility of data races (see Chapter 6). The OS scheduler decides which thread runs on which core and for how long. Most schedulers use time slices of 1–10 ms on desktop systems.
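A minimal sketch of threads sharing one address space: every worker below updates the same heap object, and a lock serialises the read-modify-write so that no update is lost. The function name and the default counts are illustrative assumptions.

```python
import threading


def count_with_threads(num_threads=4, increments=10_000):
    """Have several threads increment one shared counter under a lock."""
    counter = {"value": 0}       # lives on the heap, visible to every thread
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            with lock:           # only one thread at a time may update
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                 # wait for every worker to finish
    return counter["value"]
```

No data is copied between the threads; the lock is what keeps the shared update well-defined.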

4.5  Virtual Memory

The OS gives each process the illusion of a private, contiguous address space. On a 64-bit system the virtual address space is typically 48 bits (256 TB), though physical RAM may be only 16 GB. The CPU’s Memory Management Unit (MMU) translates virtual to physical addresses using page tables maintained by the OS. Key concepts:
  • Page: the granularity of virtual-physical mapping, typically 4 KB.
  • Page fault: accessing a virtual page that has no physical mapping. The OS handler either allocates a physical frame (minor fault) or reads the page from disk/swap (major fault; roughly 1–10 ms on a hard disk, considerably less on an SSD).
  • Demand paging: pages are loaded only when first accessed, not when the process starts.
  • Copy-on-write (CoW): after fork(), parent and child share pages marked read-only; a write by either side triggers a copy — only then does each get its own page.
  • Swap: infrequently used pages are written to disk and reclaimed. Swap access is orders of magnitude slower than RAM (roughly 1,000× to 100,000×, depending on the storage device).
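The page and page-fault concepts above can be made concrete with a toy model. This is a sketch under strong simplifying assumptions: the page table is a plain dictionary rather than the multi-level structure real hardware walks, and the names split, translate, and access are invented for this example.

```python
PAGE_SIZE = 4096     # 4 KB pages, as in the text
OFFSET_BITS = 12     # log2(4096): low bits select a byte within the page


class PageFault(Exception):
    """Raised when a virtual page has no physical frame mapped."""


def split(vaddr):
    """Split a virtual address into (virtual page number, offset)."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_SIZE - 1)


def translate(page_table, vaddr):
    """Map a virtual address to a physical one, or raise PageFault."""
    vpn, offset = split(vaddr)
    if vpn not in page_table:
        raise PageFault(vpn)
    return (page_table[vpn] << OFFSET_BITS) | offset


def access(page_table, free_frames, vaddr):
    """Demand paging: map a frame on first touch, then retry the access."""
    try:
        return translate(page_table, vaddr)
    except PageFault as fault:
        # Toy "minor fault" handler: take any free frame and install the mapping.
        # An empty free list raises IndexError, standing in for out-of-memory.
        page_table[fault.args[0]] = free_frames.pop()
        return translate(page_table, vaddr)
```

With an empty table and one free frame numbered 7, accessing 0x1234 faults once, maps virtual page 1 to frame 7, and yields physical address 0x7234; later accesses to the same page translate without a fault.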
Typical virtual address space layout (Linux x86-64), from high addresses to low:
  • Kernel space: not accessible from user mode
  • Stack: grows downward
  • Memory maps: mmap regions, shared libraries
  • Heap: grows upward via brk/mmap
  • BSS: uninitialised globals
  • Data: initialised globals
  • Text (code): read-only program instructions
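Copy-on-write itself is invisible to user code, but its consequence (each process ends up with a private copy of anything it writes) can be observed. The sketch below assumes a POSIX system; child_write_is_private is a name invented for this example, and a pipe carries the child's view back to the parent.

```python
import os


def child_write_is_private():
    """Fork, let the child modify a buffer, and compare both views."""
    data = bytearray(b"parent")
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write lands in the child's own copy of the page.
        try:
            os.close(read_fd)
            data[:] = b"child!"
            os.write(write_fd, bytes(data))
            os.close(write_fd)
        finally:
            os._exit(0)
    # Parent: read what the child saw, then look at our own buffer.
    os.close(write_fd)
    child_view = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return bytes(data), child_view
```

The parent's buffer still holds its original contents after the child has overwritten its own copy, which is the isolation that per-process address spaces provide.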

4.6  File Systems

A file system organises persistent data into named files and directories. The kernel’s Virtual File System (VFS) layer provides a uniform API to applications regardless of the underlying implementation. Common file systems:
  • ext4 (Linux): journaled, reliable, default on most Linux distributions.
  • NTFS (Windows): supports large files, access-control lists, compression, and journaling.
  • APFS (macOS/iOS): copy-on-write, snapshots, strong SSD optimisation.
  • FAT32 / exFAT: simple, portable; used on SD cards and USB drives for cross-platform compatibility.
  • ZFS / Btrfs: advanced file systems with snapshots, checksums, and pooled storage.
Key abstractions: files (byte streams with a name and metadata), directories (hierarchical namespaces), inodes (per-file metadata record: size, permissions, timestamps, block pointers), and the path hierarchy rooted at / (Linux/macOS) or a drive letter (Windows).
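The per-file metadata record can be inspected from user space. The sketch below uses os.stat, which wraps the stat system call; describe and the dictionary keys it returns are names chosen for this example.

```python
import os
import stat


def describe(path):
    """Return a few fields of the metadata record for a path."""
    info = os.stat(path)
    return {
        "inode": info.st_ino,                            # identifier within the file system
        "size_bytes": info.st_size,
        "is_regular_file": stat.S_ISREG(info.st_mode),
        "permissions": stat.filemode(info.st_mode),      # e.g. "-rw-r--r--"
        "hard_links": info.st_nlink,
        "modified": info.st_mtime,                       # seconds since the epoch
    }
```

The same call works on directories; only the type bits in st_mode differ, which is why is_regular_file is false for them.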

4.7  Major Platforms

Platform | Kernel | Primary API | Main use
Windows | NT kernel | Win32 / Win64, WinRT | Desktop, enterprise
Linux | Monolithic + modules | POSIX / glibc | Servers, embedded, Android
macOS | XNU (hybrid) | POSIX + Cocoa / Swift APIs | Apple desktops, iOS base
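Linux dominates servers and cloud; Windows dominates enterprise desktops and gaming; macOS is the platform of choice for iOS/macOS development. Linux and macOS expose a POSIX-style C API natively; the native Windows API is Win32, with POSIX-like environments available through layers such as WSL. System-level code that keeps platform-specific calls behind a thin abstraction is therefore portable with modest effort.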

4.8  CPU vs. GPU Programming Models

CPUs and GPUs solve different problems. A CPU is a latency engine — it runs a small number of independent instruction streams as fast as possible. A GPU is a throughput engine — it runs hundreds of thousands of lightweight threads concurrently, sacrificing per-thread speed for aggregate data throughput. Understanding this distinction determines which hardware to target and how to structure code for each.

CPU programming model

A modern desktop CPU has 4–32 large, deeply pipelined cores; server parts reach 128. Each core has out-of-order execution, branch prediction, and large private caches (L1/L2 up to several MB). A single core runs an independent instruction stream (a thread) at 2–5 GHz and, being superscalar, can complete several simple instructions per cycle when the code allows.
Threading model (MIMD): each thread follows its own instruction sequence with its own data (Multiple Instruction, Multiple Data). Threads are created by the OS and scheduled independently. The programmer explicitly launches threads and coordinates them with mutexes, condition variables, or higher-level abstractions (thread pools, async/await, channels).
Memory model: all CPU threads share a single coherent address space. The hardware cache-coherency protocol (MESI or similar) ensures that a write by one core is eventually visible to all others. This makes sharing data simple but requires synchronization primitives to prevent data races.
Suited for:
  • Complex, irregular control flow (deep call graphs, recursive algorithms)
  • Latency-sensitive work (web request handling, UI, real-time systems)
  • Tasks with many dependencies or frequent branching
  • I/O-bound workloads (network, file, database)
  • Any problem where parallelism is limited (<64 independent tasks)
APIs and languages: OS threads (pthreads on POSIX, Win32 threads), standard library threading (std::thread in C++/Rust, Thread/Task in C#), async runtimes (Tokio, asyncio, .NET Task Parallel Library), and OpenMP for data-parallel loops on CPU.
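As an illustration of the higher-level abstractions mentioned above, here is a minimal thread-pool sketch for I/O-bound work. fake_request is a hypothetical stand-in that sleeps instead of performing real network or disk I/O; the delay and worker count are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(request_id, delay=0.05):
    """Stand-in for a blocking call such as a network or database request."""
    time.sleep(delay)            # the thread is idle here, so others can run
    return request_id * 2


def handle_all(request_ids, workers=8):
    """Run the requests on a pool of OS threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_request, request_ids))
```

Because each task spends its time waiting rather than computing, the pool overlaps the waits instead of paying for them one after another.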

GPU programming model

A modern GPU has thousands of small shader cores grouped into streaming multiprocessors (NVIDIA) or compute units (AMD). Each core is simpler than a CPU core (smaller caches, no out-of-order execution, no branch prediction), but there are so many that aggregate throughput dwarfs a CPU for the right workloads. A current high-end GPU can exceed 100 TFLOPS of 16-bit floating-point math.
Threading model (SIMT): GPU threads are grouped into warps (NVIDIA, 32 threads) or wavefronts (AMD, 32 or 64 threads depending on the architecture). All threads in a warp execute the same instruction simultaneously: Single Instruction, Multiple Threads. Warps are grouped into thread blocks (or workgroups); blocks are dispatched across the GPU’s multiprocessors. The full collection of blocks for one kernel launch is called a grid.
Warp divergence: when threads in a warp take different branches (if/else), the hardware executes both paths one after the other, masking off the threads that did not take the path currently running. This can cut throughput in half or worse. Writing GPU-friendly code means minimizing control-flow divergence within a warp.
Memory hierarchy (discrete GPU):
  • Registers: per-thread, fastest, limited count (∼256 per thread).
  • Shared memory / L1 cache: per-block, programmer-managed scratchpad, very fast (∼20 ns). Used to stage data for reuse within a block.
  • L2 cache: shared across the GPU, larger but slower.
  • Global DRAM (VRAM / HBM): hundreds of GB/s bandwidth, but ∼200–600 ns latency. Most kernel time is spent waiting for VRAM if access patterns are not coalesced.
  • CPU RAM: reached via PCIe; ∼16–64 GB/s — an order of magnitude slower than VRAM. Minimising PCIe transfers is critical.
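A rough worked example shows why the PCIe hop matters. The figures below are assumptions picked from within the ranges quoted above (32 GB/s for PCIe, 500 GB/s for VRAM, a 1 GB buffer), and the model ignores latency and protocol overhead entirely.

```python
def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Milliseconds needed to move num_bytes at the given bandwidth (1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


BUFFER_BYTES = 1e9                          # assumed 1 GB working set

pcie_ms = transfer_ms(BUFFER_BYTES, 32)     # assumed PCIe bandwidth
vram_ms = transfer_ms(BUFFER_BYTES, 500)    # assumed VRAM bandwidth
```

Under these assumptions, copying the buffer across PCIe takes about 31 ms, while streaming it once from VRAM takes about 2 ms, so a kernel that touches the data only once can spend far longer waiting for the copy than computing.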
Coalesced memory access: when adjacent threads in a warp read or write adjacent memory addresses, the hardware combines the requests into one wide transaction. Scattered access (each thread reading a random address) can issue one transaction per thread, up to 32× as many for a 32-thread warp. Memory layout is therefore a first-class design concern for GPU code.
Suited for:
  • Dense linear algebra: matrix multiplication, convolutions (deep learning)
  • Image and video processing: each pixel or block processed identically
  • Physics simulation: particle systems, fluid dynamics, finite-element meshes
  • Signal processing: FFTs, filtering on large arrays
  • Any problem expressible as “apply the same operation to millions of data items”
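Warp divergence and coalescing, both described earlier in this section, can be modelled in a few lines of ordinary Python. This is a counting model only, not GPU code. The 128-byte transaction width is an assumption (real hardware varies), and all function names are invented for this sketch.

```python
WARP_SIZE = 32
SEGMENT_BYTES = 128      # assumed width of one memory transaction


def warp_addresses(base, stride_elems, elem_bytes=4):
    """Byte address touched by each lane when lane i reads element i * stride."""
    return [base + lane * stride_elems * elem_bytes for lane in range(WARP_SIZE)]


def transactions(addresses):
    """Count the distinct aligned segments the warp's addresses fall into."""
    return len({addr // SEGMENT_BYTES for addr in addresses})


def divergent_cost(takes_if, if_cost, else_cost):
    """Cycles for one warp to run an if/else.

    Each branch that at least one lane takes is executed once for the
    whole warp, with the other lanes masked off.
    """
    cost = 0
    if any(takes_if):
        cost += if_cost
    if not all(takes_if):
        cost += else_cost
    return cost
```

With 4-byte elements, a stride of 1 keeps all 32 lanes inside one 128-byte segment, while a stride of 32 elements puts every lane in its own segment. Likewise, a warp whose lanes all take the same branch pays for one path, and a mixed warp pays for both.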
APIs and frameworks:
  • CUDA (NVIDIA): the most mature and widely used GPU compute API. Extends C/C++ with kernel launch syntax (kernel<<<blocks, threads>>>(args)) and provides cuBLAS, cuDNN, and other libraries.
  • ROCm / HIP (AMD): CUDA-like API portable between AMD and NVIDIA hardware.
  • OpenCL: open standard, portable across CPUs, GPUs, and FPGAs; more verbose than CUDA.
  • Metal Shading Language (Apple): used for both graphics and GPU compute on Apple Silicon; benefits from unified CPU/GPU memory.
  • WebGPU: browser-accessible GPU compute and graphics; uses WGSL shading language.
  • PyTorch / TensorFlow / JAX: high-level ML frameworks that generate GPU kernels automatically; most practitioners never write raw CUDA.

Side-by-side comparison

Aspect | CPU | GPU
Core count | 4–128 large cores | Thousands of small cores
Execution model | MIMD: independent threads | SIMT: warps of 32 threads in lockstep
Latency per thread | Low (∼1–4 ns/op) | Higher; hidden by warp switching
Peak throughput | Moderate (∼1–5 TFLOPS FP32) | Very high (60–150+ TFLOPS FP32)
Control flow | Complex branching is efficient | Divergent branches serialize; keep uniform
Memory model | Coherent shared memory; caches automatic | Separate VRAM; shared memory is explicit scratchpad
Data movement | None (data lives in system RAM) | CPU↔GPU transfers over PCIe (costly)
Programming model | Threads, async/await, locks | Kernel launches, warps, shared memory management
Best for | Irregular, latency-sensitive, I/O-bound work | Regular, data-parallel, compute-bound work

Hybrid workloads

Most GPU-accelerated applications are hybrid: the CPU handles orchestration, I/O, and irregular logic while the GPU runs compute kernels on large data sets. In a typical deep learning training loop, the CPU handles data loading, optimizer bookkeeping, and logging and launches the kernels, while the forward and backward passes execute as a sequence of GPU kernels. Keeping the GPU occupied (avoiding long CPU stalls between kernel launches) is the primary performance concern.
Unified memory gives the CPU and GPU a single address space. On Apple Silicon and AMD APUs the two share the same physical memory, so no PCIe transfer is needed. CUDA managed memory on a discrete GPU instead migrates pages between system RAM and VRAM automatically, which removes the explicit copies from the code but not the transfers themselves. Either way, programming becomes simpler and finer-grained CPU/GPU cooperation becomes practical, at the cost of shared bandwidth or migration overhead.

4.9  Epilogue

The operating system builds a safe, managed environment on top of the hardware described in Chapter 1. Applications live entirely in this managed environment, speaking to the hardware only through system calls. The next chapter covers the networking infrastructure that connects systems to the wider world.

4.10  References

Operating Systems: Three Easy Pieces (free online book)
Linux Kernel Documentation
Virtual Memory — Wikipedia
Process — Wikipedia