GPU Memory Issue and Training Options


Health check done. Summary:

Good signs ✓

Problem ✗ — CUDA out-of-memory on the backward pass: llama-server (PID 2202) is holding most of the GPU's memory, leaving too little headroom for training.

The original 11,500-iter checkpoint clearly trained when the GPU was free. To train again, you have two options:

Option A — free the GPU first (recommended, matches original training config):

kill 2202    # stop llama-server
nvidia-smi   # confirm GPU is empty
python3 train.py config/train_fineweb.py    # retrain with the original config
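
If the PID has changed since the health check, don't rely on 2202 — you can list whichever processes currently hold GPU memory with a standard nvidia-smi query:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv   # list processes holding GPU memory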

Option B — keep llama-server, train with smaller per-step batch (same effective batch size via more grad-accum):

python3 train.py config/train_fineweb.py --batch_size=2 --gradient_accumulation_steps=256 --compile=False
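
These flags keep the effective batch size constant: 2 sequences per step × 256 accumulation steps = 512 sequences per optimizer step, the same product the original config must have used. If you go this route, it's worth watching memory headroom while both processes share the GPU (a minimal monitor using standard nvidia-smi flags; Ctrl+C to stop):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5   # print memory usage every 5 s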

In both cases, watch for iter 0: the loss should start around 10.9 and then drop. You can Ctrl+C at any time; the trained checkpoint can be restored from the backup with:

mkdir -p out-fineweb && cp out-fineweb-backup/ckpt.pt out-fineweb/ckpt.pt    # restore the 11,500-iter checkpoint
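
After restoring, a quick sanity check confirms the checkpoint loads and is the 11,500-iter one. This sketch assumes a nanoGPT-style checkpoint dict with an 'iter_num' key; adjust the key if your train.py saves a different layout:

python3 -c "import torch; ckpt = torch.load('out-fineweb/ckpt.pt', map_location='cpu'); print('iter_num:', ckpt.get('iter_num'))"   # expect 11500 (assumes nanoGPT-style keys)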
