Basics Threads

A thread is the unit of execution that the operating system schedules on a processor core. Every process starts with one thread — the primary thread — and may create additional threads to perform work concurrently. All threads within a process share its address space, open file handles, and other resources, but each thread has its own stack, register set, and program counter. This shared-memory model enables efficient communication between threads but introduces the possibility of data races and requires explicit synchronization.

1. Thread Basics

Threads vs. processes. Creating a new process duplicates the parent's page tables, file descriptor table, and kernel bookkeeping — an expensive operation. Creating a thread within an existing process is much cheaper: it allocates a new stack (typically 1–8 MB reserved, a few pages committed) and a kernel thread object, but shares everything else. This is why servers favor thread pools or async I/O over spawning a new process per request.

The thread stack. Each thread gets its own call stack. Local variables, function parameters, and return addresses for that thread's call chain all live there. Stack size is fixed at thread creation; overflowing it causes a stack overflow exception or a segmentation fault. Heap memory, global variables, and static data are shared among all threads in the process.
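
A minimal Rust sketch of the stack and heap distinction, assuming a 64 KiB stack is enough for the worker (the platform may round the request up). The helper name sum_on_small_stack is hypothetical; std::thread::Builder is the standard way to request a stack size at thread creation.

```rust
use std::thread;

// Hypothetical helper: sums `values` on a worker thread with a small stack.
fn sum_on_small_stack(values: Vec<u64>) -> u64 {
    let handle = thread::Builder::new()
        .name("small-stack-worker".to_string())
        // Requested stack size in bytes, fixed when the thread is created.
        .stack_size(64 * 1024)
        .spawn(move || {
            // `local_total` lives on the worker's own stack.
            // The Vec's elements live on the heap, which all threads share.
            let local_total: u64 = values.iter().sum();
            local_total
        })
        .expect("failed to spawn thread");
    handle.join().expect("worker panicked")
}

fn main() {
    println!("{}", sum_on_small_stack(vec![1, 2, 3])); // prints 6
}
```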

Hardware threads and logical processors. A physical core may expose two hardware threads via simultaneous multithreading (SMT, marketed as Hyper-Threading on Intel). The OS scheduler sees each hardware thread as a logical processor. Software threads are multiplexed onto logical processors by the scheduler; a machine with 8 logical processors can run 8 threads truly in parallel at any instant.
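
As a rough illustration, Rust's standard library can report how many logical processors are available to the process. The helper name logical_processors and the fallback to 1 on error are assumptions made for this sketch; the reported value is an estimate and may reflect affinity or container limits rather than the physical machine.

```rust
use std::thread;

// Hypothetical helper: number of logical processors usable by this process.
// Falls back to 1 if the query fails (an assumption for this sketch).
fn logical_processors() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

fn main() {
    println!("logical processors: {}", logical_processors());
}
```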

2. Thread Lifecycle

A thread moves through a sequence of states from creation to termination:

New: the thread object has been created but the OS thread has not yet started executing (e.g., a newly constructed std::thread before its entry function runs).
Runnable: the thread is ready to execute and waiting for the scheduler to assign it a processor. It is in the run queue.
Running: the thread is actively executing on a logical processor. The scheduler may preempt it at any time — saving its register state and placing it back in the run queue — to let another thread run.
Blocked / Waiting: the thread is not runnable because it is waiting for an event: an I/O completion, a mutex lock, a condition variable signal, a sleep timer, or a join on another thread. The scheduler does not waste CPU cycles on a blocked thread.
Terminated: the thread's entry function has returned or the thread was explicitly killed. Resources (stack, kernel object) are reclaimed when the thread is joined or, if it was detached, when it terminates.

Join and detach. A thread should be either joined or detached. Joining blocks the calling thread until the target terminates and reclaims its resources. Detaching lets the thread run independently; its resources are reclaimed automatically on termination, but the caller cannot observe its result. In C++, destroying a joinable std::thread without joining or detaching calls std::terminate; std::jthread joins in its destructor instead. In Rust, dropping a JoinHandle does not panic; it detaches the thread.
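
A short Rust sketch of both operations. The function name square_in_thread is hypothetical. In Rust, join() returns the value produced by the thread's closure, and dropping the JoinHandle detaches the thread.

```rust
use std::thread;
use std::time::Duration;

// Hypothetical helper: computes x * x on another thread and joins it.
fn square_in_thread(x: u64) -> u64 {
    let handle = thread::spawn(move || x * x);
    // join() blocks until the thread ends and yields its return value.
    handle.join().expect("worker panicked")
}

fn main() {
    println!("{}", square_in_thread(7)); // prints 49

    // Detach: dropping the handle lets the thread run on its own.
    let detached = thread::spawn(|| {
        thread::sleep(Duration::from_millis(10));
    });
    drop(detached);
    // Note: when main returns, the process exits even if a detached
    // thread is still running, so its result can never be observed here.
}
```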

3. Synchronization

Shared mutable state is the root cause of most threading bugs. Two threads reading the same memory location concurrently is safe; concurrent access where at least one thread writes, without synchronization or atomic operations, is a data race. A data race is undefined behavior in C++, and safe Rust rejects code that could produce one at compile time. Synchronization primitives serialize access so that only one thread at a time operates on shared data.
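
A minimal sketch of serialized access in Rust, assuming a shared counter is the data being protected. The function name parallel_count is hypothetical; Arc provides shared ownership and Mutex provides mutual exclusion.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical helper: `threads` workers each add 1 to a shared counter
// `increments` times. The mutex makes each increment exclusive.
fn parallel_count(threads: usize, increments: usize) -> usize {
    let counter = Arc::new(Mutex::new(0usize));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..increments {
                    // The guard is released at the end of this statement.
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

fn main() {
    println!("{}", parallel_count(4, 1000)); // prints 4000
}
```

Without the Mutex, the closure would need to mutate a plain integer through a shared reference, which safe Rust refuses to compile.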

Common primitives:

Primitive	Purpose
Mutex	Mutual exclusion lock: only one thread may hold it at a time. Other threads block on lock() until the holder calls unlock().
Recursive mutex	Like a mutex but the same thread may acquire it multiple times without deadlocking; must release it the same number of times.
Read/write mutex	Allows many concurrent readers or exactly one writer. Improves throughput when reads heavily outnumber writes.
Semaphore	A counter that allows up to N threads to proceed simultaneously. Used to limit concurrency (e.g., a pool of N database connections).
Condition variable	Lets a thread sleep until a predicate becomes true. Always used with a mutex: wait() atomically releases the mutex and sleeps, then reacquires the mutex before returning. Because wakeups can be spurious, the predicate is rechecked in a loop.
Spinlock	Busy-waits in a loop rather than yielding to the scheduler. Efficient only when the wait is expected to be very short (microseconds); wastes CPU on longer waits.
Atomic operation	Indivisible load, store, or read-modify-write on a single object, typically one machine word. Usually lock-free on mainstream hardware; used for flags, counters, and reference counts.
Barrier / latch	Blocks a group of threads until all have reached the barrier, then releases them together. Useful for phased parallel algorithms.
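
To make the condition variable row concrete, here is a small Rust sketch in which one thread waits for a boolean flag set by another. The function name wait_for_signal is hypothetical. The wait sits in a loop that rechecks the predicate, which guards against wakeups that arrive before the flag is set.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Hypothetical helper: blocks until a worker thread sets a shared flag.
fn wait_for_signal() -> bool {
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let pair_for_worker = Arc::clone(&pair);

    let worker = thread::spawn(move || {
        let (lock, cvar) = &*pair_for_worker;
        // Change the predicate while holding the mutex, then notify.
        *lock.lock().unwrap() = true;
        cvar.notify_one();
    });

    let (lock, cvar) = &*pair;
    let mut ready = lock.lock().unwrap();
    while !*ready {
        // wait() releases the mutex while sleeping and reacquires it
        // before returning.
        ready = cvar.wait(ready).unwrap();
    }
    let observed = *ready;
    drop(ready);
    worker.join().unwrap();
    observed
}

fn main() {
    println!("{}", wait_for_signal()); // prints true
}
```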

Deadlock. Deadlock occurs when two or more threads each hold a resource the other needs and none can proceed. The four necessary conditions are:

Mutual exclusion: at least one resource is non-shareable.
Hold and wait: a thread holds one resource while waiting for another.
No preemption: resources are released only voluntarily.
Circular wait: a cycle exists in the resource-dependency graph.

Breaking any one condition prevents deadlock. The most practical strategies are consistent lock ordering (always acquire mutexes in the same global order) and timed lock attempts with backoff.
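
Consistent lock ordering can be sketched as follows, assuming each lockable resource has a unique id that defines the global order. Account, transfer, and run_opposing_transfers are hypothetical names for this illustration. Two threads transfer in opposite directions, yet both always lock the lower id first, so no cycle can form.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical resource with a unique id used as the global lock order.
struct Account {
    id: u32,
    balance: Mutex<i64>,
}

fn transfer(from: &Account, to: &Account, amount: i64) {
    assert_ne!(from.id, to.id, "cannot transfer to the same account");
    // Always lock the account with the lower id first.
    let (first, second) = if from.id < to.id { (from, to) } else { (to, from) };
    let mut first_guard = first.balance.lock().unwrap();
    let mut second_guard = second.balance.lock().unwrap();
    // Map the guards back to the logical source and destination.
    let (from_balance, to_balance) = if from.id < to.id {
        (&mut *first_guard, &mut *second_guard)
    } else {
        (&mut *second_guard, &mut *first_guard)
    };
    *from_balance -= amount;
    *to_balance += amount;
}

fn run_opposing_transfers(rounds: usize) -> (i64, i64) {
    let a = Arc::new(Account { id: 1, balance: Mutex::new(1_000) });
    let b = Arc::new(Account { id: 2, balance: Mutex::new(1_000) });
    let (a1, b1) = (Arc::clone(&a), Arc::clone(&b));
    let (a2, b2) = (Arc::clone(&a), Arc::clone(&b));
    let t1 = thread::spawn(move || {
        for _ in 0..rounds {
            transfer(&a1, &b1, 1);
        }
    });
    let t2 = thread::spawn(move || {
        for _ in 0..rounds {
            transfer(&b2, &a2, 1);
        }
    });
    t1.join().unwrap();
    t2.join().unwrap();
    let balance_a = *a.balance.lock().unwrap();
    let balance_b = *b.balance.lock().unwrap();
    (balance_a, balance_b)
}

fn main() {
    println!("{:?}", run_opposing_transfers(1_000)); // prints (1000, 1000)
}
```

If each thread instead locked its own source account first, the two threads could each hold one mutex and wait forever for the other.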

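Memory ordering. Modern CPUs and compilers reorder instructions for performance. Atomic operations carry a memory order parameter (seq_cst, acquire, release, relaxed) that constrains reordering: relaxed guarantees only atomicity, a release store paired with an acquire load that reads it makes the writer's earlier writes visible to the reader, and seq_cst additionally places all such operations in one global order. Incorrect memory ordering produces subtle, hardware-dependent bugs that appear only on multi-core systems under specific timing conditions, and often only on weakly ordered architectures such as ARM.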
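
A small sketch of release/acquire publication in Rust, where the orders are spelled Ordering::Release, Ordering::Acquire, and Ordering::Relaxed. The function name publish_and_read and the value 42 are illustrative. The producer writes data and then sets a flag with Release; once the consumer's Acquire load sees the flag, the earlier data write is guaranteed to be visible.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical helper: publishes a value through a release/acquire flag.
fn publish_and_read() -> u64 {
    let data = Arc::new(AtomicU64::new(0));
    let ready = Arc::new(AtomicBool::new(false));
    let (data_w, ready_w) = (Arc::clone(&data), Arc::clone(&ready));

    let producer = thread::spawn(move || {
        // Plain payload write; Relaxed is enough because the Release
        // store below orders it before the flag becomes visible.
        data_w.store(42, Ordering::Relaxed);
        ready_w.store(true, Ordering::Release);
    });

    // Acquire pairs with the Release store above.
    while !ready.load(Ordering::Acquire) {
        thread::yield_now();
    }
    let value = data.load(Ordering::Relaxed);
    producer.join().unwrap();
    value
}

fn main() {
    println!("{}", publish_and_read()); // prints 42
}
```

If both flag operations used Relaxed, the language would no longer guarantee that the consumer sees 42 after seeing the flag.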

4. Thread Pools

Creating and destroying OS threads for every unit of work is expensive: each thread requires stack allocation, a kernel object, and scheduler registration. A thread pool pre-creates a fixed number of worker threads that pull tasks from a shared work queue, amortizing creation cost over many tasks.

Work queue. Tasks (closures, function pointers, or futures) are enqueued by producers. Worker threads dequeue and execute them. The queue is protected by a mutex or is a lock-free structure; a condition variable wakes idle workers when work arrives.
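
The following is a minimal sketch of a pool built around a shared work queue, not the interface of any repository mentioned in this document. MiniPool, execute, shutdown, and sum_squares_on_pool are hypothetical names. As a simplification, it uses a std::sync::mpsc channel behind a Mutex in place of an explicit queue plus condition variable; the channel's blocking recv() plays the role of waking idle workers.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

// Hypothetical minimal pool: workers pull boxed closures from one queue.
struct MiniPool {
    sender: Option<mpsc::Sender<Job>>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl MiniPool {
    fn new(size: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..size)
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // Hold the lock only while dequeuing, not while running.
                    let job = receiver.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // queue closed: worker exits
                    }
                })
            })
            .collect();
        MiniPool { sender: Some(sender), workers }
    }

    fn execute<F: FnOnce() + Send + 'static>(&self, f: F) {
        let sender = self.sender.as_ref().expect("pool is shut down");
        sender.send(Box::new(f)).expect("all workers have exited");
    }

    // Closes the queue, lets workers drain it, then joins them.
    fn shutdown(mut self) {
        drop(self.sender.take());
        for worker in self.workers.drain(..) {
            worker.join().unwrap();
        }
    }
}

fn sum_squares_on_pool(n: u64, workers: usize) -> u64 {
    let pool = MiniPool::new(workers);
    let total = Arc::new(Mutex::new(0u64));
    for i in 1..=n {
        let total = Arc::clone(&total);
        pool.execute(move || {
            *total.lock().unwrap() += i * i;
        });
    }
    pool.shutdown();
    let result = *total.lock().unwrap();
    result
}

fn main() {
    println!("{}", sum_squares_on_pool(10, 4)); // prints 385
}
```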

Work stealing. Each worker maintains its own local deque of tasks. The owner pushes and pops at one end; when a worker's deque is empty, it steals tasks from the opposite end of another worker's deque. This reduces contention on a central queue and improves cache locality. Rust's tokio and the .NET ThreadPool both use work-stealing schedulers.
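
The task-selection rule can be sketched in a few lines. This is a simplified illustration with a mutex-protected VecDeque per worker, not how production schedulers implement their lock-free deques. WorkerQueue and next_task are hypothetical names, and the choice that owners use the back while thieves take from the front is a convention assumed here.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Hypothetical per-worker queue; tasks are plain numbers for brevity.
struct WorkerQueue {
    tasks: Mutex<VecDeque<u32>>,
}

impl WorkerQueue {
    fn new() -> Self {
        WorkerQueue { tasks: Mutex::new(VecDeque::new()) }
    }
    // The owner pushes and pops at the back (most recent task first).
    fn push(&self, task: u32) {
        self.tasks.lock().unwrap().push_back(task);
    }
    fn pop_local(&self) -> Option<u32> {
        self.tasks.lock().unwrap().pop_back()
    }
    // Thieves take the oldest task from the front.
    fn steal(&self) -> Option<u32> {
        self.tasks.lock().unwrap().pop_front()
    }
}

// Prefer local work; if none, try to steal from every other worker.
fn next_task(me: usize, queues: &[WorkerQueue]) -> Option<u32> {
    queues[me].pop_local().or_else(|| {
        queues
            .iter()
            .enumerate()
            .filter(|(i, _)| *i != me)
            .find_map(|(_, q)| q.steal())
    })
}

fn main() {
    let queues = [WorkerQueue::new(), WorkerQueue::new()];
    for task in [1, 2, 3] {
        queues[0].push(task);
    }
    println!("{:?}", next_task(1, &queues)); // worker 1 steals: Some(1)
    println!("{:?}", next_task(0, &queues)); // worker 0 pops local: Some(3)
}
```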

Sizing the pool. CPU-bound work typically uses one thread per logical processor. I/O-bound work can use more threads because most are blocked waiting at any given time; the optimal count depends on the I/O latency and throughput requirements. Oversizing the pool wastes memory (each thread has a stack) and increases scheduler overhead.
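
One common rule of thumb, which is not taken from this document, estimates the pool size as logical processors times (1 + wait time / compute time) per task. The sketch below encodes that heuristic; suggested_pool_size and its parameters are hypothetical, and the result is only a starting point for measurement.

```rust
// Hypothetical heuristic: threads ~= processors * (1 + wait / compute).
// `wait_ms` is time a task spends blocked; `compute_ms` is time on CPU.
fn suggested_pool_size(logical_processors: usize, wait_ms: f64, compute_ms: f64) -> usize {
    assert!(compute_ms > 0.0, "compute time must be positive");
    let ratio = wait_ms / compute_ms;
    let estimate = (logical_processors as f64) * (1.0 + ratio);
    // Never suggest fewer than one thread.
    estimate.round().max(1.0) as usize
}

fn main() {
    // CPU-bound: no waiting, so one thread per logical processor.
    println!("{}", suggested_pool_size(8, 0.0, 1.0)); // prints 8
    // I/O-bound: 90 ms blocked per 10 ms of computation.
    println!("{}", suggested_pool_size(8, 90.0, 10.0)); // prints 80
}
```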

5. Language Support

Language	Thread type	Key synchronization and safety
Rust	std::thread::spawn; returns a JoinHandle<T>	Mutex<T>, RwLock<T>, Condvar, Arc<T> for shared ownership; Send and Sync traits enforce data-race freedom at compile time: moving a non-Send value to another thread, or sharing a non-Sync value between threads, is a compile error
C++	std::thread (C++11), std::jthread (C++20, auto-joins)	std::mutex, std::shared_mutex, std::condition_variable, std::atomic<T>; no compile-time data-race prevention — correctness is the programmer's responsibility
C#	System.Threading.Thread; ThreadPool; Task (preferred)	lock statement (Monitor), Mutex, SemaphoreSlim, ReaderWriterLockSlim, Interlocked for atomics; no compile-time race detection
Python	threading.Thread	threading.Lock, RLock, Condition, Semaphore; the GIL serializes bytecode execution in CPython, preventing true parallel CPU work on multiple threads — use multiprocessing or concurrent.futures.ProcessPoolExecutor for CPU-bound parallelism

For Rust, you will find more details with examples in ../Rust/RustBites_Threads.html. Details with examples for C++, C#, and Python are planned.

6. Repository Support

Language	Repositories	Interface
Rust	RustThreadPool, RustBlockingQueue	ThreadPool<M>: new(nt, f), post_message(), get_message(), wait(), shut_down(). BlockingQueue<T>: new(), en_q(), de_q(), len()
C++	ThreadPool, CppBlockingQueue	ThreadPool<W,N>: N threads dequeue and execute callable workitems W; Task wraps a static pool instance for fire-and-forget use. BlockingQueue<T>: enQ(), deQ() (blocks on empty); std::mutex + std::condition_variable internals
C#	CsBlockingQueue	BlockingQueue<T>: blocks dequeuer when empty; Monitor (condition variable + lock) internals; moveable, not copyable. ThreadPool: System.Threading.ThreadPool (built-in)
Python	none yet

7. Consequences

Threads enable a program to use multiple processor cores and to overlap I/O latency with computation, but they introduce failure modes that do not exist in single-threaded code:

Data races produce non-deterministic results that change with timing, compiler optimization level, and hardware architecture. They are among the hardest bugs to reproduce and diagnose.
Deadlock freezes the program silently; no exception is thrown. Detection requires thread-aware debuggers or watchdog timeouts.
Priority inversion occurs when a high-priority thread is blocked on a mutex held by a low-priority thread that is itself preempted. Real-time systems address this with priority inheritance protocols.
False sharing degrades performance when two threads write to different variables that happen to occupy the same cache line, causing the cache line to bounce between cores. Padding structs to cache-line boundaries eliminates this.
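
The false-sharing remedy can be sketched with an alignment attribute, assuming a 64-byte cache line (common on x86-64; some processors use 128 bytes). PaddedCounter and count_in_parallel are hypothetical names. Each counter is aligned to its own cache line, so two threads incrementing different counters do not contend for the same line.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Assumed cache-line size of 64 bytes; alignment also pads the size to 64.
#[repr(align(64))]
struct PaddedCounter {
    value: AtomicU64,
}

// Hypothetical helper: two threads each increment their own counter.
fn count_in_parallel(iterations: u64) -> (u64, u64) {
    let counters = [
        PaddedCounter { value: AtomicU64::new(0) },
        PaddedCounter { value: AtomicU64::new(0) },
    ];
    // Scoped threads may borrow `counters`; all are joined at scope end.
    thread::scope(|s| {
        for counter in &counters {
            s.spawn(move || {
                for _ in 0..iterations {
                    counter.value.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    (
        counters[0].value.load(Ordering::Relaxed),
        counters[1].value.load(Ordering::Relaxed),
    )
}

fn main() {
    println!("{}", std::mem::size_of::<PaddedCounter>()); // prints 64
    println!("{:?}", count_in_parallel(1_000)); // prints (1000, 1000)
}
```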

Where possible, prefer designs that minimize shared mutable state: immutable data needs no synchronization, and message passing (channels, queues) confines mutation to one owner at a time. Rust enforces this structurally; in other languages it is a discipline.
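
A brief sketch of the message-passing style using Rust's standard channel. The function name sum_via_channel is hypothetical. Producers send values rather than mutating shared state, and only the receiving thread owns the running total.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical helper: each producer sends 1..=items_per_producer;
// the single consumer owns the total, so no lock is needed.
fn sum_via_channel(producers: u64, items_per_producer: u64) -> u64 {
    let (tx, rx) = mpsc::channel::<u64>();
    let handles: Vec<_> = (0..producers)
        .map(|_| {
            let tx = tx.clone();
            thread::spawn(move || {
                for i in 1..=items_per_producer {
                    tx.send(i).unwrap();
                }
            })
        })
        .collect();
    // Drop the original sender so the channel closes once all
    // producer clones are gone and the iterator below can finish.
    drop(tx);
    let total: u64 = rx.iter().sum();
    for handle in handles {
        handle.join().unwrap();
    }
    total
}

fn main() {
    println!("{}", sum_via_channel(3, 10)); // prints 165
}
```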

Basic Bites: Threads

Creation, lifecycle, synchronization, thread pools