Gradient Accumulation for Large Batches | Generated by AI


Andrej Karpathy’s Explanation of Gradient Accumulation Steps

In his YouTube video “Let’s reproduce GPT-2 (124M)” (a detailed walkthrough of implementing and training a GPT-2-class model from the ground up), Andrej Karpathy discusses gradient accumulation around the 2:51:40 mark. He uses it primarily as a practical workaround for training large language models on hardware with limited GPU memory, while still simulating the large effective batch sizes (e.g., roughly 0.5 million tokens) described in the original GPT-2 and GPT-3 papers. Here’s a breakdown of his key points:

Why Use Gradient Accumulation?

The batch sizes prescribed in the GPT-2/GPT-3 papers are far too large to fit in a single GPU’s memory in one forward/backward pass. Gradient accumulation is, in essence, a “serial simulation” of that parallelism: you process the data in smaller chunks and accumulate the gradients as if they came from one big batch, trading extra wall-clock time for memory and enabling faithful large-batch training without needing a data center.

How It Works (in His Implementation)

Karpathy implements it explicitly in PyTorch:

  1. Set a small micro-batch size (e.g., 4-8 sequences) that fits in GPU memory.
  2. Run forward and backward passes for accumulation_steps iterations (e.g., 32), letting the gradients sum element-wise in each parameter’s .grad buffer without updating the weights.
  3. Scale each micro-batch loss by 1/accumulation_steps before calling backward, so the accumulated gradient equals the mean over the full effective batch rather than the sum of per-micro-batch means.
  4. After all accumulation steps, perform a single optimizer step with the accumulated gradients, then zero them before the next cycle.

This makes the effective batch size (in tokens) micro_batch_size × sequence_length × accumulation_steps — e.g., 16 × 1024 × 32 = 524,288 ≈ 0.5M tokens, matching the figure above. He combines it with tricks like mixed-precision training, FlashAttention, and torch.compile for efficiency.
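
For concreteness, here is a minimal, self-contained PyTorch sketch of the pattern described above. A toy linear model and random data stand in for the GPT model and data loader; names like micro_batch_size and accumulation_steps are illustrative, not Karpathy’s exact variable names.

```python
import torch
import torch.nn as nn

# Sketch of one gradient-accumulation cycle (toy model, random data).
torch.manual_seed(0)
model = nn.Linear(16, 4)                      # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()               # returns the mean loss over the micro-batch

micro_batch_size = 8                          # what fits in GPU memory
accumulation_steps = 32                       # effective batch = 8 * 32 = 256 samples

optimizer.zero_grad(set_to_none=True)
for micro_step in range(accumulation_steps):
    x = torch.randn(micro_batch_size, 16)     # stand-in for a real data loader
    y = torch.randint(0, 4, (micro_batch_size,))
    loss = loss_fn(model(x), y)
    # Divide by accumulation_steps so the summed gradients equal the mean over
    # the full effective batch, not the sum of per-micro-batch means.
    (loss / accumulation_steps).backward()    # gradients accumulate in param.grad
optimizer.step()                              # one weight update for the whole effective batch
```

The key detail is the loss scaling inside the loop: backward() always adds into .grad, so without the 1/accumulation_steps factor the update would be accumulation_steps times too large.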

Bonus: Implicit Accumulation via Weight Tying

Karpathy also connects gradient accumulation to weight tying: in GPT-2 style, the token embedding matrix wte is shared with the output (language-model) head. During backprop, gradients from both the input embeddings and the output logits “naturally accumulate” into the same tensor, saving roughly 30% of the parameters (the 50257 × 768 embedding matrix accounts for about 38.6M of the 124M weights) and adding an inductive bias that input and output representations should align. Quote: “In the backward pass we’ll get gradient contributions from both branches… these gradients will add up on the wte tensor.”
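
A minimal sketch of what that sharing looks like in PyTorch (the dimensions follow GPT-2’s 124M config; the names wte and lm_head mirror common GPT-2 implementations and are used here for illustration):

```python
import torch.nn as nn

vocab_size, n_embd = 50257, 768               # GPT-2 (124M) dimensions
wte = nn.Embedding(vocab_size, n_embd)        # token embedding table
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # output projection to logits
lm_head.weight = wte.weight                   # tie: both modules share one parameter tensor
# In the backward pass, the gradient from the embedding lookup and the gradient
# from the output projection both accumulate into this single shared tensor.
assert lm_head.weight.data_ptr() == wte.weight.data_ptr()
```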

Overall, these techniques let you train GPT-2-scale models on a single GPU while closely following the large-batch recipe from the papers. For the full context, watch from ~2:50:00 onward in the video.


