| Aspect | CPU | GPU |
| --- | --- | --- |
| Core count | 4–128 large cores | Thousands of small cores |
| Execution model | MIMD: independent threads | SIMT: warps of 32 threads in lockstep |
| Latency per thread | Low (∼1–4 ns/op) | Higher; hidden by warp switching |
| Peak throughput | Moderate (∼1–5 TFLOPS FP32) | Very high (60–150+ TFLOPS FP32) |
| Control flow | Complex branching is efficient | Divergent branches serialize; keep uniform |
| Memory model | Coherent shared memory; caches automatic | Separate VRAM; shared memory is explicit scratchpad |
| Data movement | None (data lives in system RAM) | CPU↔GPU transfers over PCIe (costly) |
| Programming model | Threads, async/await, locks | Kernel launches, warps, shared memory management |
| Best for | Irregular, latency-sensitive, I/O-bound work | Regular, data-parallel, compute-bound work |
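
To make the "Control flow" row concrete, here is a minimal Python sketch of why divergent branches cost throughput under SIMT. It is a toy model, not a description of any particular GPU: it assumes a warp executes each distinct branch path in its own pass with the other threads masked off, and that every path takes the same time. The names `warp_passes` and `lane_utilization` are hypothetical helpers introduced for illustration.

```python
WARP_SIZE = 32  # warp width taken from the table; treated as fixed in this model


def warp_passes(branch_ids):
    """Count the serialized passes one warp needs at a single branch point.

    branch_ids[i] labels the branch path taken by thread i. In this
    simplified model each distinct path costs one pass, during which
    threads on other paths are masked off and sit idle.
    """
    if len(branch_ids) != WARP_SIZE:
        raise ValueError(f"expected {WARP_SIZE} threads, got {len(branch_ids)}")
    return len(set(branch_ids))


def lane_utilization(branch_ids):
    """Fraction of lane-slots doing useful work across all passes.

    Each thread is active in exactly one pass, so the useful slots always
    total WARP_SIZE, while the occupied slots total passes * WARP_SIZE.
    """
    passes = warp_passes(branch_ids)
    return WARP_SIZE / (passes * WARP_SIZE)


uniform = [0] * WARP_SIZE                      # every thread takes the same path
even_odd = [i % 2 for i in range(WARP_SIZE)]   # two-way split inside the warp
four_way = [i % 4 for i in range(WARP_SIZE)]   # four-way split inside the warp

for name, ids in [("uniform", uniform), ("even/odd", even_odd), ("four-way", four_way)]:
    print(f"{name:9s} passes={warp_passes(ids)} utilization={lane_utilization(ids):.2f}")
```

Under these assumptions a uniform warp needs one pass at full utilization, a two-way split needs two passes at half utilization, and a four-way split needs four passes at a quarter. Real hardware is more nuanced (paths differ in length, and compilers can sometimes replace short branches with predication), but the trend is the reason the table advises keeping control flow uniform within a warp.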