GPT-2 760M Training Code Sources | Generated by AI
Question: Where to get the training code to train a GPT-2 760M scale model (like what I did on AMD MI300X)?
Answer:
BigCode is about code LLMs (StarCoder family), not GPT-2 pretraining. For GPT-2 760M scale training, here are the canonical codebases ranked by relevance to your setup:
1. nanoGPT — Karpathy’s reference implementation (start here)
git clone https://github.com/karpathy/nanoGPT
This is the gold standard for GPT-2 scale pretraining from scratch. You’ve already used it on MI300X. Key files:
- model.py: GPT-2 architecture (CausalSelfAttention, MLP, Block, GPT)
- train.py: full DDP training loop with gradient accumulation, cosine LR decay, and periodic eval
- config/train_gpt2.py: training config for the 124M model; larger sizes are set by overriding n_layer, n_head, and n_embd
nanoGPT does not ship a ready-made training config at the 760M scale, so for that size set:
n_layer = 36
n_head = 20
n_embd = 1280
That’s GPT-2 Large territory (~774M params), the closest standard GPT-2 shape to 760M. GPT-2 XL is the 1558M model.
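As a rough check on that shape, the parameter count can be derived from the layer sizes. The sketch below assumes the standard GPT-2 layout: tied input/output embeddings, biases on every linear layer and LayerNorm, a 50,257-token vocabulary, and 1,024 positions. The function name `gpt2_param_count` is made up for illustration and is not part of nanoGPT.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, block_size=1024):
    """Count parameters of a GPT-2 style model with tied embeddings."""
    d = n_embd
    embeddings = vocab_size * d + block_size * d   # token + position tables
    attention = (d * 3 * d + 3 * d) + (d * d + d)  # fused QKV + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)    # expand to 4d, project back to d
    layer_norms = 2 * (2 * d)                      # two LayerNorms per block
    per_block = attention + mlp + layer_norms
    final_norm = 2 * d                             # LayerNorm before the output head
    return embeddings + n_layer * per_block + final_norm

print(gpt2_param_count(n_layer=36, n_embd=1280))  # 774030080
print(gpt2_param_count(n_layer=12, n_embd=768))   # 124439808
```

Note that n_head does not appear: it changes how n_embd is split across heads, not how many weights there are. If the training script pads the vocabulary for efficiency, the real count will be slightly higher than this estimate.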
2. Megatron-LM — production-grade, multi-GPU/multi-node
git clone https://github.com/NVIDIA/Megatron-LM
For MI300X you’d use AMD’s ROCm fork:
git clone https://github.com/ROCm/Megatron-LM
Supports tensor parallelism + pipeline parallelism. More complex but scales to 100B+. Overkill for 760M but worth knowing.
3. LitGPT — Lightning AI’s clean multi-model trainer
git clone https://github.com/Lightning-AI/litgpt
Covers model families such as Llama, Mistral, and Phi with a unified training interface; check its supported-model list for a GPT-2 style config before relying on it. Cleaner than nanoGPT for production use, with LoRA/QLoRA built in.
4. BigCode’s training repo — if you want code-specific pretraining
git clone https://github.com/bigcode-project/bigcode-dataset # data pipeline
git clone https://github.com/bigcode-project/starcoder # training scripts
StarCoder was pretrained with a Megatron-LM fork using a FIM (Fill-in-the-Middle) objective. The Stack v2, the dataset behind StarCoder2, is your data source.
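To make the FIM objective concrete, here is a minimal sketch of the prefix-suffix-middle (PSM) rearrangement. It works on characters for readability, whereas real pipelines typically operate on token ids and apply the transform to only a fraction of documents. The sentinel strings match the ones StarCoder's tokenizer uses; the function names `to_fim_psm` and `from_fim_psm` are hypothetical helpers, not BigCode code.

```python
import random

# Sentinel strings used by StarCoder's tokenizer for fill-in-the-middle.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_fim_psm(document, rng):
    """Rearrange a document into prefix-suffix-middle (PSM) order."""
    # Two random cut points split the text into prefix / middle / suffix.
    a, b = sorted(rng.randint(0, len(document)) for _ in range(2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    # The model sees prefix and suffix first, then learns to generate the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

def from_fim_psm(sample):
    """Invert to_fim_psm, recovering the original document."""
    rest = sample[len(FIM_PREFIX):]
    prefix, rest = rest.split(FIM_SUFFIX, 1)
    suffix, middle = rest.split(FIM_MIDDLE, 1)
    return prefix + middle + suffix

rng = random.Random(0)
source = "def add(a, b):\n    return a + b\n"
sample = to_fim_psm(source, rng)
print(repr(sample))
assert from_fim_psm(sample) == source
```

Training on the rearranged string with the usual next-token loss is what lets a left-to-right model fill in code between an existing prefix and suffix.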
Quickstart for 760M on MI300X (nanoGPT path)
git clone https://github.com/karpathy/nanoGPT && cd nanoGPT
# Prepare OpenWebText (nanoGPT ships this script; FineWeb needs your own prepare script)
python data/openwebtext/prepare.py
# config/train_gpt2_760m.py
cat > config/train_gpt2_760m.py << 'EOF'
wandb_log = True
wandb_project = 'gpt2-760m'
dataset = 'openwebtext'
# ~774M params (GPT-2 Large shape, the closest standard size to 760M)
n_layer = 36
n_head = 20
n_embd = 1280
block_size = 1024
dropout = 0.0
# training
batch_size = 12
gradient_accumulation_steps = 40 # effective batch = 12 * 40 = 480 sequences * 1024 tokens = 491,520 tokens per iteration
max_iters = 600000
lr_decay_iters = 600000
learning_rate = 1e-4
min_lr = 1e-5
warmup_iters = 2000
EOF
# Launch with ROCm/HIP on MI300X
torchrun --nproc_per_node=1 train.py config/train_gpt2_760m.py
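It helps to see what those hyperparameters imply before launching. The sketch below computes the token budget for a single process and evaluates a linear-warmup, cosine-decay learning-rate schedule of the kind nanoGPT uses. The warmup formula here is an assumption and may differ slightly from the version of train.py you have checked out.

```python
import math

# Values copied from config/train_gpt2_760m.py above.
batch_size = 12
gradient_accumulation_steps = 40
block_size = 1024
max_iters = 600_000
lr_decay_iters = 600_000
learning_rate = 1e-4
min_lr = 1e-5
warmup_iters = 2000

def tokens_per_iter():
    """Tokens consumed by one optimizer step on a single process."""
    return batch_size * gradient_accumulation_steps * block_size

def get_lr(it):
    """Linear warmup followed by cosine decay down to min_lr."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        return min_lr
    progress = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)

print(tokens_per_iter())              # 491520
print(tokens_per_iter() * max_iters)  # 294912000000
for it in (0, warmup_iters, max_iters // 2, max_iters):
    print(it, get_lr(it))
```

At roughly 295B tokens, the full 600,000 iterations cover far more data than OpenWebText contains, so the run would repeat the dataset many times. Lowering max_iters and lr_decay_iters together keeps the cosine schedule aligned with a shorter run.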
Multi-GPU is not needed at this scale, since a single MI300X has 192GB of HBM3:
# Single GPU: ~774M fp32 params are ~3GB of weights, ~12GB with gradients and AdamW states (activations extra)
# MI300X has 192GB so you have massive headroom for bigger batch sizes
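A back-of-the-envelope estimate of the memory that does not depend on batch size can be sketched as follows. It assumes fp32 weights, fp32 gradients, and two fp32 AdamW moment buffers. Activations and framework overhead come on top, so treat the result as a lower bound. The helper name `training_state_gb` is hypothetical.

```python
def training_state_gb(n_params, bytes_per_value=4, adam_buffers=2):
    """Lower-bound memory for weights, gradients, and AdamW moments, in GB."""
    copies = 1 + 1 + adam_buffers  # weights + gradients + (exp_avg, exp_avg_sq)
    return n_params * bytes_per_value * copies / 1e9

n_params = 774_030_080  # 36 layers, n_embd=1280, GPT-2 vocabulary
weights_gb = n_params * 4 / 1e9

print(f"fp32 weights:   {weights_gb:.1f} GB")                   # 3.1 GB
print(f"training state: {training_state_gb(n_params):.1f} GB")  # 12.4 GB
print(f"share of 192 GB: {training_state_gb(n_params) / 192:.1%}")
```

The remaining memory goes to activations, which scale with batch_size and block_size, so the practical way to use the headroom is to raise batch_size and lower gradient_accumulation_steps by the same factor.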
Data options (better than OpenWebText)
| Dataset | Size | HF path |
|---|---|---|
| FineWeb | 15T tokens | HuggingFaceFW/fineweb |
| FineWeb-Edu | 1.3T edu tokens | HuggingFaceFW/fineweb-edu |
| OpenWebText | ~9B tokens | local via nanoGPT prepare script |
| The Stack v2 | 67.5TB code (full), 32.1TB deduplicated | bigcode/the-stack-v2-dedup |
FineWeb-Edu is a strong choice for a GPT-2 style language model if you want good benchmark scores.
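To train nanoGPT on one of these datasets instead of OpenWebText, the tokens have to end up in the same on-disk layout that its OpenWebText prepare script produces: flat train.bin and val.bin files of 16-bit token ids, which train.py then memory-maps. The sketch below shows only that final step, using the standard library and placeholder token ids. In practice the ids would come from the GPT-2 tokenizer, with an end-of-text token (id 50256) appended to each document. The helper names are hypothetical.

```python
import os
import tempfile
from array import array

def write_token_bin(token_id_lists, path):
    """Concatenate tokenized documents into one flat uint16 file."""
    buf = array("H")  # unsigned 16-bit; GPT-2 ids fit because the max id is 50256
    assert buf.itemsize == 2, "expected 2-byte unsigned shorts on this platform"
    for ids in token_id_lists:
        buf.extend(ids)  # raises OverflowError if an id does not fit in 16 bits
    with open(path, "wb") as f:
        buf.tofile(f)
    return len(buf)

def read_token_bin(path):
    """Read the file back as a list of ints, for sanity checks."""
    buf = array("H")
    with open(path, "rb") as f:
        buf.frombytes(f.read())
    return list(buf)

# Placeholder documents; each ends with GPT-2's end-of-text id.
docs = [[15496, 995, 50256], [40, 588, 50256]]
out_path = os.path.join(tempfile.mkdtemp(), "train.bin")
n_tokens = write_token_bin(docs, out_path)
print(n_tokens, os.path.getsize(out_path))  # 6 12
```

For a dataset the size of FineWeb-Edu you would stream documents and write in chunks rather than hold everything in one buffer, but the file format stays the same.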
Summary
| Repo | Best for |
|---|---|
| karpathy/nanoGPT | Learning, clean 760M training, single node |
| ROCm/Megatron-LM | Production, multi-node, tensor parallelism on MI300X |
| Lightning-AI/litgpt | Clean multi-model, LoRA, less boilerplate |
| bigcode-project/starcoder | Code-specific pretraining with FIM objective |
Since you’re already familiar with nanoGPT from your MI300X run, the natural next step is either scaling data (FineWeb instead of OpenWebText) or switching to Megatron for tensor parallelism experiments.