Key Compute Units in NVIDIA GPUs | Generated by AI

Question: Provide a comprehensive introduction to compute units in NVIDIA GPUs: Tensor Cores, CUDA cores, load/store units, and warp schedulers.

Answer:

NVIDIA GPUs are designed for massive parallel processing, particularly in graphics, scientific computing, and AI workloads. The core building block is the Streaming Multiprocessor (SM), which contains various compute units responsible for executing instructions. These units work together under the SIMT (Single Instruction, Multiple Thread) model, where threads are grouped into warps of 32 threads that execute the same instruction simultaneously.
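To make the warp grouping concrete, here is a minimal CUDA sketch (the kernel name warp_info and the launch configuration are illustrative) in which one lane per warp reports which warp it belongs to:

```cuda
#include <cstdio>

// Each thread computes which warp and lane it belongs to, illustrating how
// the SIMT model groups threads into warps of 32.
__global__ void warp_info()
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    int warp = threadIdx.x / warpSize;                // warp index within the block
    int lane = threadIdx.x % warpSize;                // lane (0-31) within the warp
    if (lane == 0)                                    // one printout per warp
        printf("block %d, warp %d starts at global thread %d\n",
               blockIdx.x, warp, tid);
}

int main()
{
    warp_info<<<2, 128>>>();   // 2 blocks x 128 threads = 4 warps per block
    cudaDeviceSynchronize();
    return 0;
}
```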

CUDA Cores

CUDA cores (also called Streaming Processors or SPs) are the general-purpose arithmetic units in the SM. They handle scalar operations such as FP32 floating-point arithmetic (add, multiply, fused multiply-add), INT32 integer arithmetic, and logic and comparison operations.

Each CUDA core executes one operation for one thread per clock cycle; a warp's 32 threads are spread across the available cores, so a warp instruction may take more than one cycle to issue when fewer than 32 cores serve it. Modern SMs (e.g., in Ampere or later architectures) typically have 64–128 CUDA cores per SM, divided into FP32 and INT32 paths. They are versatile and used for most non-specialized computations, such as general math in games, simulations, or non-matrix AI operations.
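The SM count and architecture generation of a particular GPU can be queried at runtime. A minimal sketch using the CUDA runtime API (device 0 is assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query how many SMs the device has and its compute capability, which
// together determine the total number of CUDA cores on the chip.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("%s: %d SMs, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}
```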

CUDA cores follow the SIMT model: all 32 threads in a warp execute the same instruction, but on different data. If threads diverge (e.g., via branches), inactive threads are masked off, reducing efficiency.
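Divergence is easy to provoke deliberately. In this illustrative sketch, half of each warp takes one branch and half takes the other, so the two paths execute one after the other with the inactive lanes masked:

```cuda
// Lanes 0-15 and lanes 16-31 of each warp take different branches, so the
// warp runs both paths serially with the inactive half masked off.
__global__ void divergent(float *out)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;

    if (lane < 16)
        out[i] = i * 2.0f;   // first half of the warp
    else
        out[i] = i * 0.5f;   // second half waits, then executes
}
```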

Tensor Cores

Introduced in the Volta architecture (2017) and improved in subsequent generations (Turing, Ampere, Hopper, Blackwell), Tensor Cores are specialized accelerators for matrix multiply-accumulate (MMA) operations, which are fundamental to deep learning (e.g., neural network training and inference).

Key features include mixed-precision inputs (FP16, BF16, TF32, INT8, and FP8 on recent generations), accumulation typically performed in FP32 for accuracy, and operation on small matrix tiles per warp rather than on individual scalars.

A single Tensor Core can deliver hundreds to thousands of operations per cycle, far exceeding CUDA cores for matrix workloads. They are programmed via warp-level primitives (e.g., WMMA or MMA instructions in CUDA). Tensor Cores dominate AI performance but are limited to specific operations; general tasks fall back to CUDA cores.
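The WMMA path mentioned above looks roughly like the following sketch: one warp multiplies a 16x16 FP16 tile of A by a 16x16 FP16 tile of B and accumulates into a 16x16 FP32 tile of C (pointers and leading dimensions are assumed to describe row-major matrices already resident on the GPU):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp performs a 16x16x16 matrix multiply-accumulate on Tensor Cores.
__global__ void wmma_tile(const half *A, const half *B, float *C,
                          int lda, int ldb, int ldc)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);              // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, lda);         // warp cooperatively loads the A tile
    wmma::load_matrix_sync(b_frag, B, ldb);         // and the B tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // C = A * B + C on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);
}
```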

Load/Store Units (LD/ST)

Load/Store units handle memory operations: loading data from memory into registers and storing results back. They are essential for accessing global memory, shared memory, constant memory, etc.

Each SM partition contains several LD/ST units that compute addresses and issue memory requests for a warp's load and store instructions; global memory traffic passes through the L1 and L2 caches, while shared memory accesses are serviced on-chip with much lower latency.

Efficient use requires coalesced accesses (threads accessing contiguous memory) to maximize bandwidth and avoid serialization. LD/ST units are a common bottleneck in memory-bound kernels.
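The difference between a coalesced and a strided pattern is visible in a pair of simple copy kernels (illustrative; the strided version issues many more memory transactions per warp):

```cuda
// Coalesced: consecutive lanes read consecutive elements, so the LD/ST units
// can service the whole warp with a few wide memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: lanes are `stride` elements apart, scattering the warp's accesses
// across many transactions and wasting bandwidth.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i * stride] = in[i * stride];
}
```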

Warp Schedulers

Warp schedulers manage the execution of warps within an SM. They are critical for hiding latency and maximizing throughput.

Each scheduler tracks the warps resident in its partition, picks an eligible warp every cycle (one whose next instruction has no unmet dependencies), and issues that instruction to an available execution unit. When a warp stalls on a long-latency operation such as a global memory load, the scheduler simply switches to another ready warp, so the latency is hidden rather than eliminated.

Scheduling policies (e.g., round-robin or priority-based) ensure fairness and utilization. High warp occupancy (many resident warps per SM) improves performance by providing more options for scheduling.
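Occupancy can be estimated directly with the CUDA occupancy API. A minimal sketch (my_kernel is a placeholder kernel) that asks how many blocks, and therefore warps, can be resident per SM at a given block size:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data)
{
    // placeholder kernel body
}

int main()
{
    int blockSize = 256;
    int maxBlocksPerSM = 0;
    // Ask the runtime how many blocks of my_kernel fit on one SM at this
    // block size (0 bytes of dynamic shared memory assumed).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, my_kernel,
                                                  blockSize, 0);
    printf("Up to %d blocks of %d threads per SM -> %d resident warps\n",
           maxBlocksPerSM, blockSize, maxBlocksPerSM * blockSize / 32);
    return 0;
}
```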

How They Interact in an SM

A typical modern SM (e.g., Ampere) is partitioned into 4 processing blocks, each with its own warp scheduler and dispatch unit, a set of FP32 and INT32 CUDA cores, a Tensor Core, several load/store units, special function units (SFUs), and a slice of the register file.

Instructions from selected warps are dispatched to the relevant units. This design allows concurrent execution of different instruction types and efficient resource use.

In summary, CUDA cores provide general compute, Tensor Cores accelerate AI-specific matrix ops, LD/ST units manage memory traffic, and warp schedulers orchestrate everything to hide latencies and achieve high parallelism. Understanding these units helps optimize CUDA kernels for specific workloads.
