Basics Story

Chapter #3 – GPUs

architecture, memory, CPU comparison, programming models, use cases

3.0  Prologue

A GPU (Graphics Processing Unit) was designed to render frames — computing the color of millions of pixels simultaneously. That same ability to run thousands of small computations in parallel turned out to be exactly what machine learning and scientific simulation needed. Today GPUs drive graphics, AI training, physics simulation, and data-center workloads alike.

3.1  GPU Architecture

A CPU optimizes for low latency on a small number of independent instruction streams — it uses large caches, out-of-order execution, and branch prediction to finish each task as fast as possible. A GPU takes the opposite trade: it packs thousands of simpler cores onto one chip and hides memory latency by switching among many threads rather than by caching aggressively. Key structural units:
  • Shader cores / CUDA cores / stream processors: the basic arithmetic unit. A high-end GPU may have 10,000–18,000 of them.
  • Streaming Multiprocessor (SM) / Compute Unit (CU): a cluster of 64–128 cores that share instruction fetch, a register file, and shared (fast scratchpad) memory. An NVIDIA H100 has 132 SMs.
  • Warp / wavefront: the smallest schedulable group of threads — typically 32 threads (NVIDIA) or 64 threads (AMD). All threads in a warp execute the same instruction simultaneously (SIMT model).
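  • Tensor cores / Matrix cores: dedicated units for small-tile matrix multiply-accumulate (4×4 tiles in NVIDIA’s first generation) in reduced precisions such as FP16, BF16, and INT8. Introduced for deep learning; an H100 (SXM) delivers roughly 990 TFLOPS of dense FP16 tensor throughput, or ~1,979 TFLOPS with structured sparsity. The often-quoted ~3,958 TFLOPS figure applies to FP8 with sparsity.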
SIMT execution model: All threads in a warp execute the same instruction at the same time. When threads diverge at an if/else, the warp executes both branches one after the other with the non-participating lanes masked off; the masked lanes do no useful work but still occupy the issue slot. Minimizing branch divergence within a warp is therefore one of the primary concerns when writing GPU code, alongside memory-access efficiency.
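The cost of divergence can be illustrated with a small Python model. This is only a sketch, not a description of how real hardware schedules instructions: the function name warp_issue_slots and the per-branch instruction counts are hypothetical, and the model simply assumes that a warp steps through every branch body that at least one of its lanes needs.

```python
WARP_SIZE = 32  # NVIDIA warp width quoted in the text


def warp_issue_slots(takes_if, if_cost, else_cost):
    """Count the issue slots one warp spends on a single if/else.

    takes_if  -- one boolean per lane (True = the lane takes the `if` side)
    if_cost   -- instructions in the `if` body (illustrative number)
    else_cost -- instructions in the `else` body (illustrative number)
    """
    assert len(takes_if) == WARP_SIZE
    slots = 0
    if any(takes_if):
        # At least one lane needs the `if` body: the whole warp steps
        # through it while the other lanes are masked off.
        slots += if_cost
    if not all(takes_if):
        # At least one lane needs the `else` body: same again.
        slots += else_cost
    return slots


uniform = [True] * WARP_SIZE                               # every lane agrees
divergent = [lane % 2 == 0 for lane in range(WARP_SIZE)]   # lanes alternate

uniform_slots = warp_issue_slots(uniform, if_cost=10, else_cost=10)
divergent_slots = warp_issue_slots(divergent, if_cost=10, else_cost=10)
print("uniform warp:", uniform_slots, "slots")
print("divergent warp:", divergent_slots, "slots")
```

Under these assumptions the divergent warp pays for both branch bodies, while the uniform warp pays for only one. Arranging data so that neighbouring threads take the same branch is the usual way to move from the second case to the first.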

3.2  GPU Memory

GPU memory (VRAM) is physically on the graphics card, separate from system RAM. High-bandwidth memory (HBM) stacks DRAM dies directly on the GPU package, achieving bandwidths that CPU DDR cannot match:
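| Memory type | Bandwidth | Capacity (typical) | Notes |
|---|---|---|---|
| GDDR6X (consumer) | ~1 TB/s | 12–24 GB | NVIDIA RTX 4090 (24 GB) |
| HBM3 (data center) | ~3.35–5.3 TB/s | 80–192 GB | NVIDIA H100 (80 GB, ~3.35 TB/s), AMD MI300X (192 GB, ~5.3 TB/s) |
| CPU DDR5 (reference) | ~50–80 GB/s (dual-channel desktop) | up to ~6 TB (server) | system RAM; multi-channel servers reach several hundred GB/s |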
Memory spaces visible to GPU code:
  • Global memory: the main VRAM pool; all threads can read and write it. Highest latency (~500 clock cycles), highest capacity.
  • Shared memory / LDS: fast scratchpad per SM (~96 KB–228 KB). Used for inter-thread cooperation within a thread block. ~20× faster than global memory.
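  • Registers: private to each thread; the fastest storage. When a kernel needs more registers than are available, the compiler spills values to per-thread local memory, which physically resides in global memory, so spills are expensive.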
  • Constant / texture caches: read-only, broadcast-optimized paths for uniform data.
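Why the shared-memory scratchpad matters can be seen by counting global-memory loads for a square matrix multiply. The sketch below is a back-of-the-envelope model with hypothetical function names: it ignores hardware caches and assumes a tiled kernel in which each thread block stages square tiles of both inputs in shared memory and reuses them for a whole tile of outputs.

```python
def naive_global_loads(n):
    """Each of the n*n outputs reads one row and one column: 2*n loads each."""
    return 2 * n ** 3


def tiled_global_loads(n, tile):
    """Global loads when tile x tile blocks are staged in shared memory."""
    assert n % tile == 0, "sketch assumes the tile size divides n"
    blocks = (n // tile) ** 2          # one thread block per output tile
    steps = n // tile                  # tile pairs each block walks through
    loads_per_step = 2 * tile * tile   # one tile of A plus one tile of B
    return blocks * steps * loads_per_step


n, tile = 1024, 32
naive = naive_global_loads(n)
tiled = tiled_global_loads(n, tile)
print(f"naive: {naive:,} loads, tiled: {tiled:,} loads, ratio: {naive // tiled}x")
```

In this model the global-memory traffic drops by a factor equal to the tile width, which is why tiling into shared memory is a standard optimization for matrix kernels.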
Data must be transferred from CPU RAM to VRAM before the GPU can process it. The PCIe bus caps that transfer at ~32–64 GB/s (Gen4 and Gen5 x16, per direction), roughly 50–100× slower than the ~3.35 TB/s of HBM3. Minimizing host–device transfers is as important as minimizing global-memory accesses within the kernel.
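The gap between the host link and on-package memory is easy to quantify. The following sketch uses the bandwidth figures quoted in this section; the 4 GB batch size and the function name transfer_ms are assumptions chosen only for illustration, and real transfers rarely sustain the full theoretical bandwidth.

```python
def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Milliseconds to move num_bytes at a sustained bandwidth (1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


batch_bytes = 4e9  # hypothetical 4 GB batch of input data

pcie_gen4_ms = transfer_ms(batch_bytes, 32)    # PCIe Gen4 x16, lower bound in the text
pcie_gen5_ms = transfer_ms(batch_bytes, 64)    # PCIe Gen5 x16, upper bound in the text
hbm3_ms = transfer_ms(batch_bytes, 3350)       # one pass over the same data in HBM3

print(f"PCIe Gen4: {pcie_gen4_ms:.1f} ms")
print(f"PCIe Gen5: {pcie_gen5_ms:.1f} ms")
print(f"HBM3:      {hbm3_ms:.2f} ms")
print(f"slowdown:  {pcie_gen5_ms / hbm3_ms:.0f}x to {pcie_gen4_ms / hbm3_ms:.0f}x")
```

With these numbers, copying the batch across PCIe takes tens of milliseconds while the GPU can stream the same data from its own memory in about one, so a kernel that touches each byte only once is dominated by the copy.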

3.3  CPU vs. GPU Comparison

| Property | CPU (e.g., Intel Core Ultra 9) | GPU (e.g., NVIDIA H100) |
|---|---|---|
| Core count | 8–24 powerful cores | ~16,896 CUDA cores |
| Clock speed | 3–5 GHz | ~1.7–2.0 GHz boost |
| FP32 throughput | ~1–3 TFLOPS | ~67 TFLOPS |
| Memory bandwidth | ~50–80 GB/s | ~3.35 TB/s |
| Cache per core | large (MB per core) | small (KB per SM) |
| Branch prediction | deep, speculative | minimal; divergence is costly |
| Best for | serial logic, OS, I/O, mixed workloads | data-parallel math (ML, graphics, sim) |
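The throughput and bandwidth rows can be related with two short formulas. The sketch below rests on assumptions that are not stated in the table: each CUDA core is taken to retire one fused multiply-add (counted as 2 FLOPs) per cycle, the 1.98 GHz clock is chosen because it reproduces the ~67 TFLOPS figure, and the min(peak, bandwidth × intensity) bound is the simple roofline model rather than a measured result.

```python
def peak_tflops(cores, clock_ghz, flops_per_core_per_cycle=2):
    """Theoretical peak: cores x clock x FLOPs per core per cycle."""
    return cores * clock_ghz * 1e9 * flops_per_core_per_cycle / 1e12


def attainable_tflops(peak, bandwidth_tb_per_s, flops_per_byte):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak, bandwidth_tb_per_s * flops_per_byte)


h100_peak = peak_tflops(16_896, 1.98)       # close to the ~67 TFLOPS in the table
ridge = 67 / 3.35                           # FLOPs per byte needed to be compute-bound

# FP32 vector add: 1 FLOP per 12 bytes moved (two 4-byte reads, one 4-byte write).
vector_add = attainable_tflops(67, 3.35, 1 / 12)
# A kernel that performs 100 FLOPs per byte moved (illustrative value).
dense_math = attainable_tflops(67, 3.35, 100)

print(f"peak: {h100_peak:.1f} TFLOPS, ridge point: {ridge:.0f} FLOPs/byte")
print(f"vector add: {vector_add:.2f} TFLOPS, dense math: {dense_math:.0f} TFLOPS")
```

Under this model a kernel needs on the order of 20 FLOPs per byte of memory traffic before the arithmetic units, rather than memory bandwidth, become the limit.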
Neither wins universally. A GPU requires the CPU to manage I/O, orchestrate kernel launches, and handle control flow. In practice, high-performance systems pipeline work: the CPU prepares the next batch of data while the GPU processes the current one.
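The benefit of overlapping CPU preparation with GPU compute can be estimated with a two-stage pipeline model. This is a sketch under simplifying assumptions: every batch takes the same time in each stage, buffering between the stages is free, and the function names and timings are hypothetical.

```python
def serial_ms(batches, prep_ms, compute_ms):
    """CPU prepares a batch, then waits while the GPU processes it."""
    return batches * (prep_ms + compute_ms)


def pipelined_ms(batches, prep_ms, compute_ms):
    """CPU prepares batch i+1 while the GPU processes batch i."""
    if batches == 0:
        return 0
    # Fill the pipeline, then advance at the pace of the slower stage,
    # then drain the last batch through the GPU.
    return prep_ms + (batches - 1) * max(prep_ms, compute_ms) + compute_ms


batches, prep, compute = 100, 4, 10   # illustrative timings in milliseconds
print("serial:   ", serial_ms(batches, prep, compute), "ms")
print("pipelined:", pipelined_ms(batches, prep, compute), "ms")
```

With these numbers the pipelined schedule hides almost all of the preparation time behind GPU compute. If preparation were the slower stage, the GPU would sit idle between batches and the CPU side would be the thing to optimize.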

3.4  Programming Models

  • CUDA (NVIDIA): the dominant GPU compute platform. Extends C++ with kernel launch syntax (kernel<<<blocks, threads>>>(args)). NVIDIA-only; available since 2007. Most AI frameworks (PyTorch, TensorFlow) target CUDA by default.
  • ROCm / HIP (AMD): AMD’s open-source compute stack. HIP kernels are syntactically close to CUDA; a translation tool (hipify) automates much of the porting. Targets AMD Instinct and Radeon GPUs.
  • OpenCL: an open standard that targets GPUs, CPUs, and FPGAs from any vendor. More portable but lower-level than CUDA; largely displaced by CUDA in the AI space.
  • Metal (Apple): Apple’s GPU API for macOS and iOS, covering both graphics and compute on Apple Silicon.
  • WebGPU / WGSL: a W3C standard exposing GPU compute to web browsers. The shading language WGSL compiles to native GPU instructions. Suitable for in-browser ML inference on the user’s GPU.
Higher-level libraries such as cuDNN, cuBLAS, and CUTLASS (NVIDIA) and rocBLAS (AMD) provide optimized implementations of matrix operations, convolutions, and attention, so application code rarely writes raw CUDA kernels.
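The launch syntax kernel<<<blocks, threads>>>(args) implies an indexing scheme that can be mimicked in plain Python. The sketch below is a sequential stand-in, not CUDA: the helper names blocks_needed, launch, and vector_add are hypothetical, and the loops run one "thread" at a time where a GPU would run them in parallel.

```python
def blocks_needed(n, threads_per_block):
    """Round up so that blocks * threads_per_block covers all n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def launch(kernel, blocks, threads_per_block, *args):
    """Sequential stand-in for kernel<<<blocks, threads_per_block>>>(args)."""
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)


def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    # Global index, the analogue of blockIdx.x * blockDim.x + threadIdx.x.
    i = block_idx * block_dim + thread_idx
    # Guard: the last block usually has threads past the end of the data.
    if i < len(out):
        out[i] = a[i] + b[i]


n = 1000
a = list(range(n))
b = [2 * x for x in range(n)]
out = [0] * n

threads = 256
blocks = blocks_needed(n, threads)
launch(vector_add, blocks, threads, a, b, out)
print(blocks, "blocks of", threads, "threads cover", n, "elements")
```

The rounding-up step and the bounds check go together: the grid is sized to cover every element, so some threads in the final block have nothing to do and must skip the write.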

3.5  Use Cases

  • 3D graphics and gaming: rasterization, ray tracing, shading. The original GPU workload; still dominates consumer GPU sales.
  • AI training: large language models, image classifiers, and diffusion models require billions of matrix multiplications. A single training run for a large model may use thousands of H100s for weeks.
  • AI inference: serving a trained model in production. Smaller GPUs or dedicated NPUs (Apple Neural Engine, Qualcomm Hexagon) handle latency-sensitive inference.
  • Scientific simulation: fluid dynamics (CFD), molecular dynamics, finite-element analysis, weather modeling.
  • Cryptocurrency mining: proof-of-work hashing is embarrassingly parallel. GPU supply shortages in 2020–2022 were driven in large part by mining demand, alongside pandemic-era supply constraints.
  • Video encoding/decoding: fixed-function hardware blocks on modern GPUs (NVIDIA NVENC, Intel Quick Sync, AMD VCE/VCN) offload video codec work from the CPU.

3.6  GPU Industry by Region

The GPU value chain has three distinct phases: design (architecting the chip — transistor layout, instruction set, memory subsystem), manufacturing (physically fabricating the die in a semiconductor foundry), and application (deploying GPUs in data centers, workstations, and consumer devices). Each phase has a very different geographic profile. Regions below match those used in Chapter 7 (Data Centers). Consumption percentages are approximate shares of global GPU revenue as of 2025 and shift rapidly with AI-driven demand.
For each region, the entries below cover design, manufacturing, and approximate consumption share:
United States
  • Design: Dominant globally. NVIDIA (Santa Clara) designs all GeForce and data-center GPU lines including the H100 and Blackwell series. AMD (Santa Clara) designs the Radeon and Instinct lines. Qualcomm designs Adreno mobile GPUs. Apple designs the GPU cores inside Apple Silicon (M and A series). Intel designs the Arc discrete GPU and integrated Xe graphics.
  • Manufacturing: US-designed GPUs are almost entirely fabricated abroad (see Taiwan and South Korea under Rest of world). TSMC’s Arizona facility began N4P production in 2024 and will handle a growing slice of Apple and potentially NVIDIA wafers. Intel Foundry Services operates domestic fabs but holds no significant GPU production share.
  • Consumption (~38–42%): The largest single market. Hyperscale AI clusters (Microsoft / OpenAI, Google, Meta, Amazon) account for the majority of data-center GPU spend. Consumer gaming and professional workstation markets add a substantial base.
Canada and Mexico
  • Design: Canada hosts notable AI chip startups; Tenstorrent (founded in Toronto) designs RISC-V-based AI accelerators that overlap GPU workloads. No volume GPU brand originates here.
  • Manufacturing: No leading-edge GPU fabrication. Some older-node semiconductor packaging and assembly in Mexico for the broader electronics supply chain.
  • Consumption (~2–3%): Cloud deployments, enterprise AI pilots, and a significant gaming market. Canadian universities and government labs operate modest HPC GPU clusters.
Central and South America
  • Design: No significant GPU design activity. University research groups in Brazil and Chile contribute to GPU computing research but not to chip architecture.
  • Manufacturing: No relevant fabrication. Brazil has a domestic electronics manufacturing sector (Manaus free-trade zone) focused on assembly of imported components.
  • Consumption (~1%): Gaming dominates; enterprise and cloud AI adoption is early-stage. Brazil (São Paulo) is the largest sub-regional market.
European Union and UK
  • Design: ARM Holdings (Cambridge, UK) licenses Mali and Immortalis GPU IP cores, which power the GPUs inside billions of Android devices. Imagination Technologies (Hertfordshire, UK) licenses PowerVR GPU IP used in embedded and automotive chips and, historically, in Apple’s early SoCs. EU-based chip startups focus mostly on AI inference accelerators rather than discrete GPUs.
  • Manufacturing: No leading-edge GPU fabrication. GlobalFoundries Dresden operates a mature-node fab (22 nm and older) serving automotive and IoT markets. The EU Chips Act (2023) targets 20% of global semiconductor production by 2030, but leading-edge GPU fabrication remains at least a decade away.
  • Consumption (~13–15%): Large cloud providers (AWS, Azure, GCP European regions) account for the bulk. HPC GPU clusters at research and supercomputing centres (CERN, Jülich, CINECA) contribute significantly. Gaming is a large consumer market. GDPR and data-residency requirements drive regional GPU infrastructure independent of US clusters.
Russia
  • Design: No competitive GPU design. MCST produces the Elbrus line (primarily a CPU architecture, with integrated graphics on some parts) for domestic defence use; performance is well below contemporary NVIDIA or AMD products. Sanctions since 2022 have cut off access to Western GPU tooling and EDA software, freezing development further.
  • Manufacturing: No relevant fabrication. Domestic fabs (Mikron, Angstrem) are limited to 90–250 nm nodes, several generations behind GPU requirements.
  • Consumption (~1%): Severely constrained by export controls and sanctions blocking imports of NVIDIA and AMD products. Domestic cloud and AI projects rely on pre-sanction stockpiles and grey-market supply chains.
China
  • Design: The most active emerging GPU design ecosystem outside the US. Key players: Moore Threads (Beijing), with the MTT S80/S4000 discrete GPUs; Biren Technology (Shanghai), with the BR100 data-center accelerator; Cambricon, with AI inference chips; and Huawei, with the Ascend 910B GPU-class accelerator, manufactured at SMIC 7 nm. All remain 1–3 generations behind NVIDIA H100 in training performance and software ecosystem maturity.
  • Manufacturing: SMIC (Shanghai) operates the most advanced domestic logic foundry, reaching an effective ~7 nm node (N+2) for select products. US export controls restrict SMIC from receiving EUV lithography equipment, capping future node advancement. CXMT and YMTC produce DRAM and NAND; HBM production (required for data-center GPUs) remains nascent domestically. TSMC and Samsung are prohibited from supplying China with leading-edge GPU production.
  • Consumption (~18–22%): Historically the second-largest market, driven by Baidu, Alibaba, Tencent, ByteDance, and a large gaming industry. US export controls introduced in 2022–2023 banned H100 and subsequently A800/H800 exports, redirecting demand to domestic alternatives and creating a significant supply gap for training-grade hardware.
Southeast Asia
  • Design: Singapore hosts R&D centres for NVIDIA, AMD, and Qualcomm but originates no major GPU architecture. No volume GPU design elsewhere in the region.
  • Manufacturing: No leading-edge GPU fabrication. Micron and GlobalFoundries operate memory and mature-logic fabs in Singapore. Advanced packaging and outsourced assembly and test (OSAT) capacity relevant to GPU chiplets is beginning to expand in Malaysia (Penang) through Intel, Infineon, and others.
  • Consumption (~4–5%): Growing cloud infrastructure (AWS, Azure, GCP Singapore regions), gaming, and AI adoption across the region. Singapore is the primary data-center hub.
India and Australia
  • Design: India has large design centres for NVIDIA, AMD, Qualcomm, and Intel in Hyderabad and Bangalore, contributing to GPU architecture and verification, but as engineering offices of US companies rather than independent GPU brands. No Australian GPU design of note.
  • Manufacturing: India’s semiconductor ambitions are materialising slowly; Tata Electronics and PSMC announced a 28 nm fab (Gujarat, ~2026) aimed at automotive and mobile chips, not GPU-class leading-edge nodes. No relevant Australian fabrication.
  • Consumption (~4–5%): India is growing rapidly; Jio, Infosys AI, and government HPC initiatives are driving GPU procurement. Australia serves as a Pacific-region cloud hub (AWS Sydney, Azure Melbourne) with strict data-sovereignty requirements.
Rest of world
  • Design: Taiwan: MediaTek designs mobile SoCs that integrate licensed ARM Mali and Immortalis GPU cores; TSMC itself does not design GPUs. South Korea: Samsung designs the Xclipse GPU (AMD RDNA2 IP licensed) for Exynos SoCs. Japan: Sony collaborates with AMD on the PlayStation GPU architecture. The Middle East, Japan, and South Korea have emerging or niche GPU design activity but no globally competitive discrete GPU brands.
  • Manufacturing: Taiwan (TSMC) is the single most critical node in global GPU manufacturing: NVIDIA, AMD, and Apple GPU dies are fabricated at TSMC (N4P, N3 nodes), primarily at its Tainan fabs. TSMC holds roughly 90% of the world’s leading-edge GPU foundry capacity. South Korea: Samsung Foundry fabricated NVIDIA’s consumer RTX 30-series GPUs on its 8 nm process and produces Exynos SoCs; SK Hynix and Samsung produce the large majority of the HBM2e and HBM3 memory stacked on data-center GPUs, with Micron (US) as the third supplier. Japan: TSMC’s Kumamoto fab (JASM, opened 2024) serves mature-node demand; Rapidus targets 2 nm by 2027 but not GPU production.
  • Consumption (~8–10%): Japan operates large national HPC GPU clusters (ABCI, Fugaku follow-on). South Korea has a significant gaming and enterprise AI market. The Middle East (UAE, Saudi Arabia) is the fastest-growing sub-region, with multi-billion-dollar AI data center investments targeting thousands of H100-class GPUs. Taiwan consumes GPUs domestically for gaming and industrial AI but exports most of what it manufactures.
Key structural observations:
  • GPU design is highly concentrated — the US (NVIDIA, AMD, Qualcomm, Apple, Intel) and UK (ARM) account for essentially all commercially significant GPU architectures. China is the only region making a determined effort to close this gap.
  • GPU manufacturing is even more concentrated: TSMC in Taiwan fabricates the vast majority of leading-edge GPU dies. A disruption to TSMC (earthquake, energy shortage, or geopolitical crisis) could choke off global GPU supply within weeks. This dependency is the central concern of US, EU, and Japanese semiconductor policy.
  • GPU consumption is more distributed, mirroring data-center investment patterns but skewed toward the US and China. Export controls targeting China have bifurcated the market and are accelerating Chinese domestic GPU development.
  • HBM (High-Bandwidth Memory) is a separate but equally critical dependency: SK Hynix and Samsung (both South Korean) supply the large majority of the HBM used in data-center GPUs worldwide, with Micron (US) as the only other volume supplier. A shortage in HBM production constrains GPU availability as directly as a wafer shortage at TSMC.

3.7  Epilogue

GPUs redefine the performance ceiling for data-parallel workloads. Their value comes from matching the problem structure: uniform operations over large datasets with minimal branching and high arithmetic intensity. When that match holds, a single GPU replaces dozens of CPU cores; when it does not, the CPU remains the right tool.
