MI300X NanoChat Training Complete | Generated by AI
Everything is up and running. Here’s the summary:
============================================================
NANOCHAT 760M TRAINING - MI300X SETUP COMPLETE
============================================================
WHAT WAS DONE:
- Cloned nanochat (Karpathy’s LLM training harness)
- Set up Python venv with ROCm PyTorch 2.9.1+rocm6.4
- Downloaded 31 ClimbMix-400B dataset shards (~25B tokens)
- Trained BPE tokenizer (32,768 vocab)
- Verified model builds and trains on MI300X
- Launched full training run in background
MODEL CONFIG:
  Architecture: depth=24, dim=1536, heads=12, head_dim=128
  Parameters:   760M (transformer) + 604M (value embeddings) = 1.38B total
  Dataset:      ClimbMix-400B (successor to FineWeb)
  Training:     29,000 steps, batch=524,288 tokens/step
  Tokens:       ~15.2B (Chinchilla-optimal: 20x params)
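The token budget in the config can be checked with two lines of arithmetic. A minimal sketch using only the figures quoted above; the variable names are illustrative and not taken from nanochat:

```python
# Sanity-check the MODEL CONFIG figures (names here are illustrative).
dim = 1536
heads = 12
transformer_params = 760e6   # transformer parameters, as reported
tokens_per_param = 20        # Chinchilla rule of thumb quoted in the summary
steps = 29_000
batch_tokens = 524_288       # tokens per optimizer step (2**19)

head_dim = dim // heads
chinchilla_tokens = transformer_params * tokens_per_param
planned_tokens = steps * batch_tokens

print(f"head_dim:          {head_dim}")
print(f"Chinchilla target: {chinchilla_tokens / 1e9:.1f}B tokens")
print(f"Planned run:       {planned_tokens / 1e9:.1f}B tokens")
```

Both token counts round to 15.2B. The 20x multiplier is applied here to the 760M transformer parameters only, not to the value embeddings.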
PERFORMANCE (MI300X):
  Throughput: ~68,000 tokens/sec
  Step time:  ~7.7 seconds
  MFU:        ~27.5% (SDPA fallback, no FA3 on AMD)
  Peak VRAM:  ~105 GB / 192 GB
  Est. total: ~62 hours
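The throughput and total-time estimates follow from the step time and the batch size. A small sketch of that arithmetic, assuming the step time stays near 7.7 seconds for the whole run (evaluation and checkpoint overhead are ignored):

```python
# Derive throughput and wall-clock estimate from the reported step time.
batch_tokens = 524_288   # tokens per step, from MODEL CONFIG
step_time_s = 7.7        # reported seconds per step
steps = 29_000           # planned training steps

tokens_per_sec = batch_tokens / step_time_s
total_hours = steps * step_time_s / 3600

print(f"Throughput: {tokens_per_sec:,.0f} tokens/sec")
print(f"Est. total: {total_hours:.1f} hours")
```

If the step time drifts during training, rerunning this with the latest value from the log gives an updated estimate.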
FILES CREATED:
  /root/nanochat/run_mi300x_d24.sh            # Full pipeline (pretrain + SFT + eval)
  /root/nanochat/run_mi300x_d24_pretrain.sh   # Pretrain only
  /root/nanochat/run_mi300x_d24.log           # Training log (live)
BACKGROUND PROCESS:
  Session: proc_c4c8f9ccdff7
  PID:     272714
  Status:  Running (loss dropping: 10.4 → 9.6 in first 8 steps)
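To confirm later that the background process is still alive, one option is to probe the PID with signal 0. This is a generic POSIX sketch, not part of nanochat; `is_running` is a hypothetical helper name:

```python
import os


def is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs an existence check; nothing is delivered
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but is owned by another user
    return True


if __name__ == "__main__":
    # PID taken from the BACKGROUND PROCESS section above.
    print(is_running(272714))
```

A True result only means that some process holds that PID. After a reboot or a long uptime the number may have been reused, so the training log remains the more reliable signal.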
MONITORING:
  tail -f /root/nanochat/run_mi300x_d24.log
CHECKPOINTS:
  ~/.cache/nanochat/base_checkpoints/d24/ (saved every 5000 steps)
RESUME (if crash):
  ./run_mi300x_d24_pretrain.sh --resume-from-step=
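The step to resume from is whichever checkpoint was saved last. A hedged sketch for finding it, assuming checkpoint files are named like `model_005000.pt` (the filename pattern is an assumption and should be checked against the actual directory listing); `latest_checkpoint_step` is a hypothetical helper:

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint_step(ckpt_dir: Path) -> Optional[int]:
    """Return the highest step found in model_<step>.pt filenames, or None."""
    if not ckpt_dir.is_dir():
        return None
    steps = []
    for path in ckpt_dir.iterdir():
        # ASSUMPTION: checkpoints are named model_<zero-padded step>.pt
        match = re.fullmatch(r"model_(\d+)\.pt", path.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps, default=None)


if __name__ == "__main__":
    ckpt_dir = Path.home() / ".cache" / "nanochat" / "base_checkpoints" / "d24"
    print(latest_checkpoint_step(ckpt_dir))
```

With checkpoints saved every 5000 steps, a crash loses at most 5000 steps of progress, which is roughly 10.7 hours at 7.7 seconds per step.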
AFTER TRAINING:
  python -m scripts.chat_cli -p 'Why is the sky blue?'
  python -m scripts.chat_web