How Do GPUs Work?
Graphics Processing Units (GPUs) are highly parallel processors optimized for throughput rather than latency. Unlike CPUs, which are designed for diverse sequential tasks with a few complex cores, GPUs contain thousands of simpler, lightweight cores that excel at executing the same instruction across many data elements. For example, a modern NVIDIA A100 GPU has 6,912 CUDA cores, compared with the 8–64 cores of a typical CPU. This massive parallelism makes GPUs indispensable not just for rendering graphics but also for scientific simulations, cryptography, and modern artificial intelligence training.
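To make "the same instruction across many data elements" concrete, here is a minimal CUDA sketch: a million threads all run the identical kernel body, each on its own array element. The array size and launch shape are illustrative choices, not figures from the text above.

```cuda
// Minimal SIMT sketch: every thread executes the same kernel on a different element.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n) c[i] = a[i] + b[i];                  // one element per thread
}

int main() {
    const int n = 1 << 20;                          // ~1M elements
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));       // unified memory for brevity
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;       // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);                    // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```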
What Is a GPU?
A GPU is a specialized processor originally built to accelerate image rendering. It handles operations like matrix multiplications, geometric transformations, and pixel shading at massive scale. Unlike a CPU with large caches and branch predictors optimized for logic-heavy control flow, a GPU dedicates most of its silicon to arithmetic units and memory bandwidth. The result is raw compute performance measured in tens of teraflops (trillions of floating-point operations per second). For example, NVIDIA’s H100 delivers over 60 TFLOPS of FP32 performance and more than 1,000 TFLOPS (1 petaflop) in specialized FP8 Tensor Core operations.
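Where the teraflop figures come from can be seen with a back-of-the-envelope calculation. The core count and boost clock below are published H100 SXM figures assumed here for illustration; the formula itself (cores × clock × 2 FLOPs per fused multiply-add) is generic.

```cuda
// Rough peak-FP32 estimate for a data-center GPU (host-only code).
#include <cstdio>

int main() {
    const double fp32_cores     = 16896;    // FP32 CUDA cores (H100 SXM, assumed)
    const double boost_clock_hz = 1.98e9;   // ~1.98 GHz boost clock (assumed)
    const double flops_per_fma  = 2;        // one fused multiply-add = 2 FLOPs

    double tflops = fp32_cores * boost_clock_hz * flops_per_fma / 1e12;
    printf("Peak FP32 ~= %.1f TFLOPS\n", tflops);   // ~66.9, i.e. "over 60 TFLOPS"
    return 0;
}
```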
Architecture of a GPU
Streaming Multiprocessors and Cores
GPUs are divided into Streaming Multiprocessors (SMs), each containing many cores. NVIDIA GPUs feature CUDA cores for general arithmetic, Tensor Cores for matrix math, and texture units for sampling. An A100 GPU contains 108 SMs, each capable of running thousands of threads simultaneously. Threads are scheduled in groups of 32 called warps. If one warp stalls waiting for memory, the scheduler instantly switches to another warp, ensuring near-constant utilization.
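These properties are not abstract: the CUDA runtime exposes them directly, so a short query program reports the SM count, warp size, and per-SM limits of whatever GPU it runs on. The printed values naturally differ by device.

```cuda
// Query the hardware characteristics described above via the CUDA runtime API.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                               // device 0
    printf("Name:             %s\n",  prop.name);
    printf("SMs:              %d\n",  prop.multiProcessorCount);     // 108 on A100
    printf("Warp size:        %d\n",  prop.warpSize);                // 32
    printf("Max threads / SM: %d\n",  prop.maxThreadsPerMultiProcessor);
    printf("Shared mem / SM:  %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    return 0;
}
```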
Memory Hierarchy
GPUs emphasize bandwidth and parallel access rather than deep caches:
- Registers: Per-thread storage, operating in the nanosecond range.
- Shared Memory / L1 Cache: Roughly 128–256 KB of combined shared memory and L1 per SM in recent architectures (192 KB on the A100), with latency of just a few cycles.
- Global Memory (VRAM): Often 16–80 GB of GDDR6X or HBM2e/HBM3, with bandwidth of roughly 2–3 TB/s on high-end parts. Latency is hundreds of cycles, so kernels must be designed to hide it with parallelism.
- L2 Cache: Several MB shared across the GPU, reducing global memory traffic.
Efficient GPU programming revolves around memory coalescing, minimizing divergence within warps, and maximizing occupancy.
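As a sketch of what coalescing means in practice (the kernel names and stride value are illustrative), compare a copy where the 32 threads of a warp touch 32 adjacent floats with one where a large stride scatters the warp across memory:

```cuda
// Coalesced access: consecutive threads read consecutive addresses, so the
// hardware merges a warp's loads into a few wide memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];               // warp reads 32 adjacent floats
}

// Strided access: neighbouring threads land far apart, so most of each
// memory transaction is wasted and effective bandwidth drops sharply.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;                // scatter the warp across memory
    if (i < n) out[j] = in[j];
}
```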
How Do GPUs Process Data?
Parallel Execution Model
Tasks are expressed as kernels that launch thousands of threads. For example, when multiplying two 1,024×1,024 matrices, a GPU might spawn over a million threads, each computing one element of the output. A naive single-threaded CPU implementation might take on the order of a second; a modern GPU finishes the same computation in milliseconds.
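A naive CUDA kernel for exactly this "one thread per output element" scheme might look like the sketch below (row-major square matrices assumed; a tuned kernel would additionally tile the inputs through shared memory):

```cuda
// One thread computes one element of C = A * B for N x N row-major matrices.
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];  // dot product of row and column
        C[row * N + col] = sum;
    }
}

// Launch: 16x16 thread blocks on a 64x64 grid cover all 1,048,576 outputs.
// dim3 block(16, 16);
// dim3 grid((1024 + 15) / 16, (1024 + 15) / 16);
// matMul<<<grid, block>>>(dA, dB, dC, 1024);
```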
Shader Programs and Compute Shaders
While traditional shaders (vertex, fragment, geometry) focus on rendering, compute shaders and CUDA kernels allow developers to run arbitrary programs on GPU cores. This makes GPUs flexible accelerators rather than graphics-only devices.
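As one example of a workload with no graphics analogue, here is a hedged sketch (kernel name and launch assumptions are mine) of a sum reduction that combines values held in registers with warp shuffle intrinsics:

```cuda
// Block-wide sum reduction using warp shuffles; launch with a block size
// that is a multiple of 32 and an output initialized to zero.
__global__ void blockSum(const float *in, float *out, int n) {
    float val = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        val += in[i];                                  // grid-stride accumulation

    // Reduce within each 32-thread warp, register to register.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);

    if ((threadIdx.x & 31) == 0)                       // lane 0 of each warp
        atomicAdd(out, val);                           // combine partial sums
}
```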
Rendering Pipeline
For graphics, the pipeline includes:
- Vertex Processing – Transforming 3D coordinates (millions of vertices per second).
- Rasterization – Converting primitives into pixels (billions of pixels per second).
- Fragment Processing – Calculating color, lighting, and texture for each pixel.
- Output Merger – Writing the final image to the display buffer.
A GPU like the RTX 4090 can push over 80 billion pixels per second through this pipeline.
GPUs in AI and Machine Learning
Parallel Matrix Math
Training deep neural networks requires enormous numbers of matrix multiplications and convolutions. For instance, training GPT-3 (175 billion parameters) required an estimated 3.14×10^23 FLOPs, consistent with the common rule of thumb of roughly 6 FLOPs per parameter per training token over its ~300 billion training tokens. GPUs handle this with SIMD execution and dedicated Tensor Cores.
The H100 GPU’s Tensor Cores deliver roughly 1,000 TFLOPS of dense FP16/BF16 throughput and nearly 2,000 TFLOPS in FP8, making them the workhorse of large-scale AI training.
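Tensor Cores can be programmed directly through CUDA's WMMA API, where one warp cooperatively computes a small matrix tile. The sketch below shows a single 16×16×16 FP16 multiply with FP32 accumulation; in practice frameworks reach Tensor Cores through libraries such as cuBLAS and cuDNN rather than hand-written kernels like this.

```cuda
// One warp computes a 16x16x16 tile of D = A*B + C on Tensor Cores.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmmaTile(const half *A, const half *B, float *C) {
    // Fragments live in registers and are owned collectively by the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(aFrag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);     // matrix multiply-accumulate
    wmma::store_matrix_sync(C, cFrag, 16, wmma::mem_row_major);
}

// Launch with exactly one warp per tile, e.g. wmmaTile<<<1, 32>>>(dA, dB, dC);
```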
Scalability
AI workloads are often distributed across many GPUs. NVIDIA’s NVLink interconnect provides up to 900 GB/s of aggregate GPU-to-GPU bandwidth per GPU, while clusters of thousands of GPUs power today’s largest AI models. For example, training GPT-4 reportedly used tens of thousands of GPUs across multiple datacenters.
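Within a single node, CUDA exposes this connectivity through peer-to-peer APIs: once peer access is enabled, one GPU's memory can be copied directly to another's, bypassing host memory when an NVLink path exists. The device indices and buffer size below are illustrative.

```cuda
// Sketch of a direct GPU 0 -> GPU 1 copy using CUDA peer-to-peer access.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);      // can GPU 0 reach GPU 1?
    if (!canAccess) { printf("No P2P path between GPU 0 and GPU 1\n"); return 0; }

    const size_t bytes = 256u << 20;                // 256 MB test buffer
    float *buf0, *buf1;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaDeviceEnablePeerAccess(1, 0);               // let GPU 0 address GPU 1
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);
    cudaDeviceEnablePeerAccess(0, 0);               // and vice versa

    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);        // device-to-device copy
    cudaDeviceSynchronize();

    cudaFree(buf1); cudaSetDevice(0); cudaFree(buf0);
    return 0;
}
```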
Memory Considerations
A single large language model can require hundreds of gigabytes of memory. GPUs with 80 GB of HBM2e allow partial storage of these models, while multi-GPU setups and techniques like model parallelism and gradient checkpointing distribute memory needs across clusters.
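The arithmetic behind this is simple, as the rough sketch below shows for a hypothetical 175-billion-parameter model: FP16 weights alone need about 350 GB, and the common ~16-bytes-per-parameter estimate for mixed-precision Adam training (weights, gradients, master copies, and optimizer moments) pushes that into the terabytes, which is why such models are sharded across many 80 GB GPUs.

```cuda
// Back-of-the-envelope model memory estimate (host-only code; figures illustrative).
#include <cstdio>

int main() {
    const double params     = 175e9;   // parameters of a hypothetical large model
    const double bytes_fp16 = 2;       // bytes per FP16 weight

    double weights_gb = params * bytes_fp16 / 1e9;   // ~350 GB of weights alone
    double train_gb   = params * 16 / 1e9;           // ~16 bytes/param rule for Adam

    printf("FP16 weights:        ~%.0f GB\n", weights_gb);
    printf("Training footprint:  ~%.0f GB (very rough)\n", train_gb);
    return 0;
}
```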
The Advantages of GPUs
- Throughput-Oriented Design: Thousands of cores deliver tens to hundreds of teraflops of compute.
- Specialized Units: Tensor Cores for deep learning and RT cores for ray tracing accelerate workloads beyond general CUDA cores.
- High Memory Bandwidth: Over 2 TB/s in high-end GPUs, compared to a CPU’s ~100 GB/s.
- Scalability: Multi-GPU systems connected by NVLink or PCIe form the backbone of today’s AI supercomputers.
GPUs achieve their performance by trading general-purpose flexibility for massive parallelism. Their architecture—streaming multiprocessors, warp scheduling, high-bandwidth memory, and specialized compute units—enables both real-time rendering and petaflop-scale AI training.