Memory Bottleneck in Large Language Models | Generated by AI

2026.06

Question: Why is memory (and unified memory) so critical for LLM training and inference?

Answer:

The fundamental reason is a mismatch between compute throughput and memory bandwidth. Modern GPUs and accelerators can execute far more arithmetic per second than they can feed from memory, so the bottleneck in LLM inference, and in many parts of training, is usually data movement rather than math.

The Core Problem

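A GPU like the RTX 4070 or MI300X delivers tens to thousands of TFLOPS of peak compute, but memory bandwidth lags far behind (approximate vendor peak figures):

| Device   | Peak Compute               | Memory BW | Compute:BW Ratio (FLOP/byte) |
|----------|----------------------------|-----------|------------------------------|
| RTX 4070 | ~29 TFLOPS (FP32)          | 504 GB/s  | ~58                          |
| MI300X   | ~2,600 TFLOPS (FP8, dense) | 5.3 TB/s  | ~490                         |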

To keep the GPU busy, you need to move data in and out of memory fast enough to feed the compute units.

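The compute:bandwidth ratio is the number of floating-point operations the chip can execute in the time it takes to load one byte from memory. A kernel that performs fewer operations per byte than this is memory-bound: the compute units sit idle waiting for data. The matrix-vector products in token-by-token decoding do only about 1 FLOP per byte of fp16 weights, far below the ratios in the table above. HBM (High Bandwidth Memory) exists to narrow this gap: HBM3 on MI300X delivers 5.3 TB/s, roughly 10x the ~400-500 GB/s of a GDDR6-class consumer card.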
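The balance argument above can be sketched in a few lines of Python. This is a simplified roofline-style estimate, not a benchmark. The function names are made up for illustration, the 100 TFLOPS / 1 TB/s device is a round-number placeholder rather than a real GPU, and the decoding bound assumes every fp16 weight is read from memory once per generated token.

```python
def machine_balance(peak_flops, mem_bw_bytes_per_s):
    """FLOPs the chip can execute in the time it takes to load one byte."""
    return peak_flops / mem_bw_bytes_per_s


def attainable_flops(peak_flops, mem_bw_bytes_per_s, flops_per_byte):
    """Roofline bound: capped by the compute roof or by bandwidth * intensity."""
    return min(peak_flops, mem_bw_bytes_per_s * flops_per_byte)


def decode_tokens_per_s_bound(mem_bw_bytes_per_s, n_params, bytes_per_param=2):
    """Upper bound on batch-1 decoding if all weights are read once per token."""
    return mem_bw_bytes_per_s / (n_params * bytes_per_param)


# Placeholder device (an assumption, not a real part): 100 TFLOPS, 1 TB/s.
PEAK, BW = 100e12, 1e12
balance = machine_balance(PEAK, BW)

# A matrix-vector product on fp16 weights does about 2 FLOPs per 2-byte
# weight, i.e. roughly 1 FLOP per byte, far below the balance point.
gemv = attainable_flops(PEAK, BW, flops_per_byte=1.0)

print(f"balance: {balance:.0f} FLOP/byte")
print(f"GEMV attainable: {gemv / PEAK:.1%} of peak")
# 13B-parameter fp16 model on 5.3 TB/s of bandwidth (the MI300X figure).
print(f"decode bound: {decode_tokens_per_s_bound(5.3e12, 13e9):.0f} tokens/s")
```

Under these assumptions the arithmetic units are almost irrelevant for single-stream decoding: the bound depends only on bandwidth and model size, which is why batching (more FLOPs per byte loaded) matters so much for throughput.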

Training: Activation Memory Explodes

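During the forward pass, you must keep intermediate activations in memory for the backward pass. As a lower bound for a transformer, counting one hidden-state tensor per layer:

Activation memory ≥ batch_size × seq_len × hidden_dim × num_layers × bytes_per_element

Example (13B model, bsz=1, seq=4096, hidden_dim=5120, 40 layers):
= 1 × 4096 × 5120 × 40 × 2 bytes (fp16)
≈ 1.7 GB for one saved tensor per layer, summed over all 40 layers

Each layer actually saves many such tensors (attention inputs and outputs, MLP intermediates, normalization inputs), so without activation checkpointing the real footprint is typically more than ten times larger, in the tens of GB.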
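As a quick check, the formula can be evaluated directly. The sketch below is illustrative only: `saved_tensor_bytes` and its `tensors_per_layer` knob are hypothetical names, and the value 16 is an assumed stand-in for "a layer saves more than one tensor", not a measured figure.

```python
def saved_tensor_bytes(batch_size, seq_len, hidden_dim, num_layers,
                       bytes_per_elem=2, tensors_per_layer=1):
    """Bytes held for the backward pass.

    tensors_per_layer is a hypothetical knob: 1 reproduces the formula
    above (one hidden-state tensor per layer); real layers save more.
    """
    per_tensor = batch_size * seq_len * hidden_dim * bytes_per_elem
    return per_tensor * tensors_per_layer * num_layers


GB = 1e9  # decimal gigabytes, matching the units used in the text

base = saved_tensor_bytes(1, 4096, 5120, 40)
many = saved_tensor_bytes(1, 4096, 5120, 40, tensors_per_layer=16)
longer = saved_tensor_bytes(1, 8192, 5120, 40)

print(f"one tensor per layer: {base / GB:.2f} GB")
print(f"16 tensors per layer: {many / GB:.1f} GB")
print(f"seq_len doubled:      {longer / GB:.2f} GB")
```

Note that `num_layers` is already a factor in the product, so the result covers the whole model; doubling `seq_len` doubles it.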

This term grows linearly with sequence length, so longer context means proportionally more memory.

Inference: KV Cache Dominates

During inference, you cache key/value vectors to avoid recomputing attention:

KV cache size = batch_size × seq_len × hidden_dim × 2 × num_layers × bytes_per_element

For 13B model, bsz=32, seq=4096:
= 32 × 4096 × 5120 × 2 × 40 × 2 bytes (fp16)
≈ 107 GB (about 0.8 MB per token per sequence)
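Evaluating the KV cache formula directly is a useful sanity check. The sketch below assumes plain multi-head attention, where K and V are each `hidden_dim` wide in every layer (models using grouped-query attention store less); `kv_cache_bytes` is an illustrative name, not a library function.

```python
def kv_cache_bytes(batch_size, seq_len, hidden_dim, num_layers, bytes_per_elem=2):
    """K and V (the factor 2), each hidden_dim wide, per layer and per token."""
    return batch_size * seq_len * hidden_dim * 2 * num_layers * bytes_per_elem


GB = 1e9  # decimal gigabytes

per_token = kv_cache_bytes(1, 1, 5120, 40)                          # one token, one sequence
fp16_total = kv_cache_bytes(32, 4096, 5120, 40)                     # the example above
int8_total = kv_cache_bytes(32, 4096, 5120, 40, bytes_per_elem=1)   # quantized cache

print(f"per token:       {per_token / 1e6:.2f} MB")
print(f"fp16, bsz=32:    {fp16_total / GB:.1f} GB")
print(f"int8, bsz=32:    {int8_total / GB:.1f} GB")
```

Every factor enters the product once, so halving `bytes_per_elem` (an int8 or fp8 cache) halves the footprint, as does halving the batch size or the context length.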

This scales linearly with both sequence length and batch size, so the total grows with their product. Serving 128 concurrent users with 8K context would need roughly 860 GB of fp16 KV cache for this 13B configuration, far more than a single GPU holds.

Why Unified Memory Matters

NVIDIA’s Unified Memory (CUDA managed memory, built on Unified Virtual Addressing, UVA) lets the CPU and GPU share a single virtual address space, with pages migrated automatically between them. This sounds great but has a critical flaw:

Paging is slow. If activations spill to system RAM, you’re moving data over PCIe (about 32 GB/s per direction on PCIe 4.0 x16) instead of HBM (5.3 TB/s). That is more than 150x less bandwidth.
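To make the cost of a spill concrete, the sketch below compares how long it takes to move 1 GB over each link. The PCIe numbers are approximate theoretical per-direction peaks for x16 links; real transfers, and especially on-demand page migration, will be slower. The names are illustrative.

```python
# Approximate peak bandwidths in bytes per second.
LINKS_BYTES_PER_S = {
    "HBM3 (MI300X)": 5.3e12,
    "PCIe 4.0 x16": 31.5e9,
    "PCIe 3.0 x16": 15.75e9,
}


def transfer_seconds(num_bytes, bytes_per_s):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return num_bytes / bytes_per_s


def slowdown_vs(fast, slow):
    """How many times less bandwidth the slow link offers."""
    return LINKS_BYTES_PER_S[fast] / LINKS_BYTES_PER_S[slow]


spill_bytes = 1e9  # 1 GB of activations that no longer fit on the device
for name, bw in LINKS_BYTES_PER_S.items():
    print(f"{name:15s} {transfer_seconds(spill_bytes, bw) * 1e3:8.2f} ms per GB")

ratio = slowdown_vs("HBM3 (MI300X)", "PCIe 4.0 x16")
print(f"HBM3 vs PCIe 4.0 x16: about {ratio:.0f}x")
```

A gap of this size means that even a small fraction of accesses going over PCIe can dominate the step time.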

Unified memory still has its uses, but for pure LLM training you’re better off with explicit memory management (allocate on the GPU and keep data there) than relying on automatic paging.

Practical Advice for Your Situation

MI300X (192 GB HBM3): The large capacity lets you train larger models in full precision and fit larger batches, and the high bandwidth keeps the compute units fed. This is why it’s so strong for 760M GPT-2 training: you have plenty of headroom.
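To put "headroom" in rough numbers, the sketch below uses a common accounting for mixed-precision training with Adam, about 16 bytes of persistent state per parameter. That the run uses Adam is an assumption, not something stated above; the estimate excludes activations and framework overhead, treats GB as 10^9 bytes, and includes a 12 GB card for contrast.

```python
# Persistent training state per parameter (mixed precision + Adam, assumed).
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def training_state_bytes(n_params):
    """Weights, gradients and optimizer state; activations not included."""
    return n_params * sum(BYTES_PER_PARAM.values())


def headroom_bytes(device_bytes, n_params):
    """Memory left for activations and workspace (negative means it does not fit)."""
    return device_bytes - training_state_bytes(n_params)


GB = 10**9
N_PARAMS = 760_000_000  # the 760M GPT-2 configuration mentioned above

print(f"training state: {training_state_bytes(N_PARAMS) / GB:.2f} GB")
for name, capacity in [("192 GB device", 192 * GB), ("12 GB device", 12 * GB)]:
    print(f"{name}: {headroom_bytes(capacity, N_PARAMS) / GB:+.2f} GB left")
```

On the 192 GB device almost all of the memory remains available for activations and larger batches, while on a 12 GB card the same state alone leaves essentially nothing.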

RTX 4070 (12 GB VRAM): You’re constrained by capacity, so serious work depends on the priorities below.

Optimization priorities:

  1. Minimize data movement (fused kernels, Flash Attention)
  2. Use the fastest memory available (HBM > GDDR6 > system RAM)
  3. Use the right data types (fp8/int8 KV cache for inference, fp16/bf16 for training)
  4. Batch size tuning (maximize GPU utilization without exceeding memory)
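Priorities 3 and 4 interact: the KV cache data type sets how many sequences fit in a batch. A minimal sizing sketch follows, assuming a 13B fp16 model (26 GB of weights), about 0.8 MB of fp16 KV cache per token, and a hypothetical `reserve_frac` of 10% held back for activations and workspace. All names are illustrative.

```python
def max_concurrent_sequences(device_bytes, weight_bytes, seq_len,
                             kv_bytes_per_token, reserve_frac=0.1):
    """Largest batch whose full-length KV cache fits next to the weights.

    reserve_frac is a hypothetical safety margin, not a measured value.
    """
    usable = device_bytes * (1 - reserve_frac) - weight_bytes
    if usable <= 0:
        return 0  # the weights alone do not fit
    return int(usable // (seq_len * kv_bytes_per_token))


WEIGHTS_13B_FP16 = 26e9      # 13e9 parameters * 2 bytes
KV_FP16_PER_TOKEN = 819_200  # 5120 * 2 * 40 * 2 bytes
KV_INT8_PER_TOKEN = 409_600  # same layout, 1 byte per element

fp16_batch = max_concurrent_sequences(192e9, WEIGHTS_13B_FP16, 4096, KV_FP16_PER_TOKEN)
int8_batch = max_concurrent_sequences(192e9, WEIGHTS_13B_FP16, 4096, KV_INT8_PER_TOKEN)
small_card = max_concurrent_sequences(12e9, WEIGHTS_13B_FP16, 4096, KV_FP16_PER_TOKEN)

print(f"192 GB, fp16 KV cache: {fp16_batch} sequences")
print(f"192 GB, int8 KV cache: {int8_batch} sequences")
print(f"12 GB,  fp16 KV cache: {small_card} sequences")
```

This sizes for the worst case where every sequence reaches the full context length; engines that allocate KV memory on demand can admit more sequences when most of them are shorter.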

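The unified memory rabbit hole: modern inference engines (vLLM, TensorRT) manage GPU memory explicitly instead of relying on automatic paging. They know the memory access patterns in advance, so they can preallocate, reuse, and prefetch deliberately; vLLM, for example, allocates the KV cache in fixed-size blocks (PagedAttention).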
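One way an engine can manage KV memory explicitly is to hand it out in fixed-size blocks, which is the idea behind vLLM's PagedAttention. The class below is a toy sketch of that bookkeeping only: it tracks block ids, holds no tensors, and is not vLLM's actual API. `BlockKVAllocator` and its methods are hypothetical names.

```python
class BlockKVAllocator:
    """Toy block-based KV cache bookkeeping (block ids only, no tensors)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size          # tokens per block
        self.free = list(range(num_blocks))   # ids of unused blocks
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, taking a new block when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # first token, or last block is full
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):                           # 17 tokens need two 16-token blocks
    alloc.append_token("a")
print(len(alloc.tables["a"]), len(alloc.free))
alloc.release("a")
print(len(alloc.free))
```

Because a sequence only holds the blocks it has actually filled, memory is not reserved up front for the maximum context length, and a finished request frees its blocks immediately for others.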

