Memory Bottleneck in Large Language Models

Generated by AI

2026.06

Question: Why is memory (and unified memory) so critical for LLM training and inference?

Answer:

The fundamental reason is a mismatch between compute throughput and memory bandwidth. Modern GPUs/accelerators can do far more arithmetic per second than their memory can feed, so the bottleneck in LLM inference, and often in training, is memory movement and capacity, not math.

The Core Problem

A GPU like the RTX 4070 or MI300X can perform tens to thousands of teraFLOPS, but its memory bandwidth is far smaller by comparison:

Device	Peak Compute	Memory BW	Compute:BW Ratio (FLOP/byte)
RTX 4070	~29 TFLOPS (FP32)	504 GB/s	~58
MI300X	~2,600 TFLOPS (FP8, dense)	5.3 TB/s	~490

To keep the GPU busy, you need to move data in and out of memory fast enough to feed the compute units.

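The ratio column is the number of FLOPs the chip can execute in the time it takes to load one byte. A kernel that does fewer operations per byte than this ratio leaves compute units idle while they wait on memory. Token-by-token decoding is the extreme case: each fp16 weight (2 bytes) is read once and used for a single multiply-add (2 FLOPs), or about 1 FLOP per byte, far below what either device needs. HBM (High Bandwidth Memory) exists to narrow this gap: HBM3 on MI300X delivers 5.3 TB/s versus roughly 0.4 to 0.5 TB/s for GDDR6-class consumer cards, about 10x more bandwidth.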
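The compute-to-bandwidth argument can be made concrete with a small roofline-style calculation. This is a minimal sketch: the device figures are round, hypothetical numbers rather than vendor specifications, and the function names are invented for illustration.

```python
def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOPs a kernel must perform per byte moved to keep compute saturated."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth)


def gemv_intensity(bytes_per_weight: float) -> float:
    """Matrix-vector product: each weight is loaded once for one multiply-add (2 FLOPs)."""
    return 2.0 / bytes_per_weight


# Hypothetical device, round numbers chosen for illustration only.
PEAK_FLOPS = 30e12   # 30 TFLOPS
BANDWIDTH = 500e9    # 500 GB/s

ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)
decode = gemv_intensity(bytes_per_weight=2.0)  # fp16 weights
bound = attainable_flops(decode, PEAK_FLOPS, BANDWIDTH)

print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"decode intensity: {decode:.1f} FLOP/byte")
print(f"attainable: {bound / PEAK_FLOPS:.1%} of peak compute")
```

Any workload whose intensity sits below the ridge point is limited by bandwidth, which is why faster memory helps it more than faster math units would.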

Training: Activation Memory Explodes

During forward pass, you must keep all intermediate activations in memory for the backward pass. For a transformer:

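Activation memory (one hidden-state tensor per layer) = batch_size × seq_len × hidden_dim × num_layers × bytes_per_element

Example (13B model, bsz=1, seq=4096, hidden_dim=5120, 40 layers):
= 1 × 4096 × 5120 × 40 × 2 bytes (fp16)
≈ 1.7 GB for one saved tensor per layer, summed over all 40 layers
Each layer actually saves on the order of tens of tensors this size (attention and MLP intermediates), so the real total reaches tens of GB.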

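This term is linear in sequence length. Unfused attention additionally materializes a seq_len × seq_len score matrix per head, which grows quadratically. Longer context = more memory needed. This is why:

Flash Attention avoids materializing the full attention score matrix by computing attention in one fused, tiled kernel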
Gradient checkpointing trades compute for memory (recompute activations, don’t store them)
You need HBM or large VRAM just to fit standard training
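The activation formula above is easy to evaluate directly. The sketch below does so and, for comparison, sizes the score matrix that unfused attention would store for one layer. The head count of 40 is an assumption about the 13B configuration, and both function names are made up for this example.

```python
def hidden_state_bytes(batch: int, seq_len: int, hidden_dim: int,
                       num_layers: int, bytes_per_element: int = 2) -> int:
    """One hidden-state tensor per layer, summed over all layers."""
    return batch * seq_len * hidden_dim * num_layers * bytes_per_element


def attention_score_bytes(batch: int, num_heads: int, seq_len: int,
                          bytes_per_element: int = 2) -> int:
    """One layer's seq_len x seq_len score matrices when attention is not fused."""
    return batch * num_heads * seq_len * seq_len * bytes_per_element


GB = 1e9  # decimal gigabytes

# 13B-style shape; num_heads=40 is an assumed value.
hidden = hidden_state_bytes(batch=1, seq_len=4096, hidden_dim=5120, num_layers=40)
scores = attention_score_bytes(batch=1, num_heads=40, seq_len=4096)

print(f"hidden states, all layers: {hidden / GB:.2f} GB")
print(f"attention scores, one layer: {scores / GB:.2f} GB")

# Doubling the context doubles the first term but quadruples the second.
print(attention_score_bytes(1, 40, 8192) // scores)
```

The quadratic term is the one that fused attention kernels avoid storing, while gradient checkpointing reduces the linear term by recomputing it during the backward pass.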

Inference: KV Cache Dominates

During inference, you cache key/value vectors to avoid recomputing attention:

KV cache size = batch_size × seq_len × hidden_dim × 2 × num_layers × bytes

For a 13B model, bsz=32, seq=4096:
= 32 × 4096 × 5120 × 2 × 40 × 2 bytes (fp16)
≈ 107 GB

This scales linearly with both sequence length and batch size, so it grows with their product. If you want to serve 128 concurrent users with 8K context, the cache alone needs hundreds of GB. This is why:

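Quantization reduces KV cache size relative to fp16: by 50% with int8/fp8, by 75% with 4-bit formats
MQA (Multi-Query Attention) shares one key/value head across all query heads, shrinking the KV cache by a factor of num_heads
Decoding is memory-bound, not compute-bound: every generated token has to read the weights and the whole KV cache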
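A small calculator makes the effect of these options easy to compare. This is a sketch under stated assumptions: it writes hidden_dim as kv_heads × head_dim so that MQA can be expressed by setting kv_heads to 1, and the values head_dim=128 and 40 heads are assumed for the 13B configuration rather than taken from the text.

```python
def kv_cache_bytes(batch: int, seq_len: int, num_layers: int,
                   kv_heads: int, head_dim: int,
                   bytes_per_element: float = 2.0) -> float:
    """Keys and values (the factor 2) for every token, layer, and KV head."""
    return 2 * batch * seq_len * num_layers * kv_heads * head_dim * bytes_per_element


GB = 1e9  # decimal gigabytes

# Shared shape for all three variants (assumed 13B-style configuration).
shape = dict(batch=32, seq_len=4096, num_layers=40, head_dim=128)

baseline = kv_cache_bytes(kv_heads=40, **shape)                       # fp16, full MHA
int8 = kv_cache_bytes(kv_heads=40, bytes_per_element=1.0, **shape)    # 8-bit cache
mqa = kv_cache_bytes(kv_heads=1, **shape)                             # one shared KV head

for name, size in [("fp16 MHA", baseline), ("int8 MHA", int8), ("fp16 MQA", mqa)]:
    print(f"{name}: {size / GB:.1f} GB")
```

With kv_heads × head_dim equal to hidden_dim, the function reduces to the formula in the text, so the baseline line reproduces the worked example.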

Why Unified Memory Matters

NVIDIA’s Unified Memory (CUDA managed memory, built on Unified Virtual Addressing, UVA) gives the CPU and GPU a single virtual address space, with pages migrated automatically between them. This sounds great but has a critical flaw:

Paging is slow. If activations spill to system RAM, you’re moving data over PCIe (about 32 GB/s per direction on a PCIe 4.0 x16 link) instead of HBM (5.3 TB/s). That is more than 150x less bandwidth.

However, unified memory is useful for:

Gradient accumulation workflows where you steer data placement explicitly (for example with prefetch hints) rather than waiting on page faults
CPU-GPU heterogeneous compute (rare in LLM training)
Dynamic memory management where you don’t know sizes ahead of time

For pure LLM training, you’re better off with explicit memory management (allocate on GPU, keep it there) than relying on unified memory’s automatic paging.
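To put a number on the paging penalty described above, the sketch below estimates the effective bandwidth when a fraction of the bytes a kernel touches must come over the slower link. It assumes the two kinds of access are serialized with no overlap or prefetching, which is a pessimistic simplification. The bandwidth constants are approximate and the function name is hypothetical.

```python
def effective_bandwidth(spill_fraction: float, fast_bw: float, slow_bw: float) -> float:
    """Average bytes/s when spill_fraction of the traffic uses the slow link."""
    if not 0.0 <= spill_fraction <= 1.0:
        raise ValueError("spill_fraction must be between 0 and 1")
    # Time to move one byte, weighted by where that byte lives.
    seconds_per_byte = spill_fraction / slow_bw + (1.0 - spill_fraction) / fast_bw
    return 1.0 / seconds_per_byte


HBM_BW = 5.3e12   # bytes/s, on-package memory
PCIE_BW = 32e9    # bytes/s, approximate PCIe 4.0 x16 rate per direction

for fraction in (0.0, 0.01, 0.05, 0.25):
    bw = effective_bandwidth(fraction, HBM_BW, PCIE_BW)
    print(f"{fraction:>4.0%} spilled -> {bw / 1e12:.2f} TB/s ({bw / HBM_BW:.0%} of HBM)")
```

Under these assumptions even a 1% spill costs more than half of the available bandwidth, which is the practical reason to keep the working set resident on the device.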

Practical Advice for Your Situation

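MI300X (192 GB HBM3): You can train larger models in full precision and fit larger batches because of the huge bandwidth and capacity. This is why it’s so strong for 760M GPT-2 training: at about 16 bytes per parameter for weights, gradients, and Adam state, the model needs roughly 12 GB, which leaves most of the 192 GB as headroom for activations and larger batches.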

RTX 4070 (12 GB VRAM): You’re constrained. For serious work, you need:

Mixed precision (fp16/bf16; tf32 speeds up matmuls but does not shrink stored tensors)
Gradient checkpointing
Smaller batch sizes or sequence lengths
Quantization (QLoRA for fine-tuning)
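A rough budget shows why the smaller card needs these techniques. The sketch below uses a common accounting for mixed-precision Adam training, 16 bytes per parameter, and ignores activations entirely. The byte breakdown is an assumption about the training setup, and capacities are taken as decimal GB.

```python
# Assumed per-parameter cost of mixed-precision training with Adam.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "Adam first moment": 4,
    "Adam second moment": 4,
}


def model_state_bytes(num_params: int) -> int:
    """Weights, gradients, and optimizer state; activations are not included."""
    return num_params * sum(BYTES_PER_PARAM.values())


def headroom_bytes(num_params: int, capacity_bytes: int) -> int:
    """Memory left for activations; negative means the state alone does not fit."""
    return capacity_bytes - model_state_bytes(num_params)


GB = 10**9
params = 760_000_000  # the 760M GPT-2 size mentioned in the text

for name, capacity in [("12 GB card", 12 * GB), ("192 GB card", 192 * GB)]:
    left = headroom_bytes(params, capacity)
    print(f"{name}: {left / GB:+.2f} GB left after model state")
```

A negative result means something has to give before activations are even counted: fewer optimizer bytes, quantized base weights as in QLoRA, or a smaller model.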

Optimization priorities:

Minimize data movement (fused kernels, Flash Attention)
Use the fastest memory available (HBM > GDDR6 > system RAM)
Right data types (fp8/int8 KV cache for inference, fp16 for training)
Batch size tuning (maximize GPU utilization without exceeding memory)
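For inference, the batch-size priority can be estimated before launching anything. This is a minimal sketch, assuming fp16 weights for a 13B model, the KV cache formula from earlier in the text, and a 10% reserve for workspace and fragmentation. The reserve value and all function names are illustrative assumptions.

```python
def kv_bytes_per_token(num_layers: int, hidden_dim: int, bytes_per_element: int = 2) -> int:
    """Keys and values (the factor 2) stored for one token across all layers."""
    return 2 * num_layers * hidden_dim * bytes_per_element


def max_batch_size(capacity_bytes: int, weight_bytes: int, per_token_bytes: int,
                   seq_len: int, reserve_fraction: float = 0.1) -> int:
    """Largest number of full-length sequences whose KV cache fits next to the weights."""
    usable = int(capacity_bytes * (1.0 - reserve_fraction)) - weight_bytes
    per_sequence = per_token_bytes * seq_len
    return max(0, usable // per_sequence)  # 0 when the weights alone do not fit


GB = 10**9
per_token = kv_bytes_per_token(num_layers=40, hidden_dim=5120)
weights = 13 * 10**9 * 2  # 13B parameters in fp16

for name, capacity in [("192 GB device", 192 * GB), ("12 GB device", 12 * GB)]:
    batch = max_batch_size(capacity, weights, per_token, seq_len=4096)
    print(f"{name}: up to {batch} sequences of 4096 tokens")
```

The estimate is an upper bound on concurrency, not a throughput prediction. Actual limits depend on the serving engine's allocator and on how many sequences reach full length at once.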

The unified memory rabbit hole: modern inference engines (vLLM, TensorRT-LLM) manage GPU memory explicitly instead of relying on automatic paging. They know the access patterns in advance, so they preallocate the KV cache and hand it out in blocks rather than waiting for page faults.
