GPT-2 760M Training Code Sources | Generated by AI

Home 2026.06

Question: Where to get the training code to train a GPT-2 760M scale model (like what I did on AMD MI300X)?

Answer:

BigCode is about code LLMs (StarCoder family), not GPT-2 pretraining. For GPT-2 760M scale training, here are the canonical codebases ranked by relevance to your setup:


1. nanoGPT — Karpathy’s reference implementation (start here)

git clone https://github.com/karpathy/nanoGPT

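This is the gold standard for GPT-2 scale pretraining from scratch. You’ve already used it on MI300X. The key files are train.py (training loop), model.py (model definition), and the configs under config/.

nanoGPT does not ship a 760M config (config/train_gpt2.py targets the 124M model), so for 760M copy that config and set: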

n_layer = 36
n_head = 20
n_embd = 1280

That’s the GPT-2 large shape (~774M params), the closest standard GPT-2 size to 760M. GPT-2 XL is the bigger 48-layer, ~1.5B model.
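The parameter count for this shape can be checked by hand. The sketch below counts the weights of a GPT-2 style decoder, assuming biases everywhere, tied input/output embeddings, GPT-2's 50257-token vocabulary, and a 1024-token context. The function name is hypothetical. n_head does not appear because it does not change the number of parameters.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, block_size=1024):
    """Count parameters of a GPT-2 style decoder with biases and tied embeddings."""
    token_emb = vocab_size * n_embd   # reused as the output head (weight tying)
    pos_emb = block_size * n_embd
    # Attention: fused QKV projection plus output projection, each with a bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back, each with a bias
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)    # two LayerNorms per block, weight + bias
    per_block = attn + mlp + layer_norms
    final_ln = 2 * n_embd
    return token_emb + pos_emb + n_layer * per_block + final_ln


large = gpt2_param_count(n_layer=36, n_embd=1280)
xl = gpt2_param_count(n_layer=48, n_embd=1600)
print(f"36 layers x 1280: {large:,}")
print(f"48 layers x 1600: {xl:,}")
```

The 36-layer, 1280-wide shape comes to about 774M parameters and the 48-layer, 1600-wide shape to about 1.56B. A from-scratch run with a padded vocabulary would be slightly larger.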


2. Megatron-LM — production-grade, multi-GPU/multi-node

git clone https://github.com/NVIDIA/Megatron-LM

For MI300X you’d use AMD’s ROCm fork:

git clone https://github.com/ROCm/Megatron-LM

Supports tensor parallelism + pipeline parallelism. More complex but scales to 100B+. Overkill for 760M but worth knowing.
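To make "tensor parallelism" concrete, here is a toy illustration of the column-parallel idea: a weight matrix is split column-wise across devices, each device multiplies the same activations by its own shard, and the partial outputs are concatenated. This is plain Python with hypothetical helper names and two simulated "devices". It is not Megatron-LM's API.

```python
def matmul(x, w):
    """Plain matrix product for lists of lists: (m x k) @ (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split a weight matrix column-wise into `parts` equal shards."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


x = [[1.0, 2.0], [3.0, 4.0]]              # activations, replicated on every device
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]                # full weight matrix: 2 x 4

shards = split_columns(w, parts=2)        # each simulated device holds a 2 x 2 shard
partials = [matmul(x, shard) for shard in shards]   # computed independently

# Concatenate along the column axis (an all-gather in a real system)
gathered = [sum((p[r] for p in partials), []) for r in range(len(x))]

print(gathered == matmul(x, w))
```

Each device stores only a fraction of the weights, which is why this matters at 100B+ scale and adds little for a model that already fits on one GPU.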


3. LitGPT — Lightning AI’s clean multi-model trainer

git clone https://github.com/Lightning-AI/litgpt

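Covers Llama, Mistral, Phi, and other decoder-only model families with a unified training interface (check its supported-model list before assuming a ready-made GPT-2 config). Cleaner than nanoGPT for production use, with LoRA/QLoRA built in.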


4. BigCode’s training repo — if you want code-specific pretraining

git clone https://github.com/bigcode-project/bigcode-dataset  # data pipeline
git clone https://github.com/bigcode-project/starcoder        # training scripts

StarCoder was pretrained with a Megatron-LM fork using a FIM (Fill-in-the-Middle) objective. The Stack v2 is the data source for the StarCoder2 generation.
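A minimal sketch of what a FIM transformation does to one training document: cut it at two random points and reorder it as prefix, suffix, middle, separated by sentinel tokens, so the model learns to generate the middle given both sides. This version works on characters and transforms every sample. Real pipelines typically work on token ids and apply FIM to only a fraction of samples. The function name is hypothetical.

```python
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def to_fim(text, rng):
    """Rearrange a document into prefix-suffix-middle (PSM) order at two random cuts."""
    a, b = sorted(rng.randint(0, len(text)) for _ in range(2))
    prefix, middle, suffix = text[:a], text[a:b], text[b:]
    sample = f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
    return sample, (prefix, middle, suffix)


rng = random.Random(0)   # seeded so the cut points are reproducible
doc = "def add(a, b):\n    return a + b\n"
sample, (prefix, middle, suffix) = to_fim(doc, rng)
print(sample)
```

The training loss is still ordinary next-token prediction. Only the ordering of the text changes.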


Quickstart for 760M on MI300X (nanoGPT path)

git clone https://github.com/karpathy/nanoGPT && cd nanoGPT

# Prepare OpenWebText (this script does not cover FineWeb, which needs its own prepare step)
python data/openwebtext/prepare.py

# config/train_gpt2_760m.py
cat > config/train_gpt2_760m.py << 'EOF'
wandb_log = True
wandb_project = 'gpt2-760m'
dataset = 'openwebtext'

# 760M-class model (GPT-2 large shape, ~774M params)
n_layer = 36
n_head = 20
n_embd = 1280
block_size = 1024
dropout = 0.0

# training
batch_size = 12
gradient_accumulation_steps = 40  # effective batch = 12 * 40 = 480 sequences * 1024 tokens = 491,520 tokens
max_iters = 600000
lr_decay_iters = 600000
learning_rate = 1e-4
min_lr = 1e-5
warmup_iters = 2000
EOF

# Launch with ROCm/HIP on MI300X
torchrun --nproc_per_node=1 train.py config/train_gpt2_760m.py
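The learning_rate, min_lr, warmup_iters, and lr_decay_iters values in the config describe a linear warmup followed by a cosine decay. Here is a minimal sketch of that schedule shape, modeled on the one in nanoGPT's train.py. Treat it as an illustration and check the repository for the exact current form.

```python
import math

# Values copied from the config above
learning_rate = 1e-4
min_lr = 1e-5
warmup_iters = 2000
lr_decay_iters = 600000


def get_lr(it):
    """Linear warmup to learning_rate, then cosine decay down to min_lr."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # falls from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)


for it in (0, 1000, 2000, 301000, 600000):
    print(it, f"{get_lr(it):.2e}")
```

Setting lr_decay_iters equal to max_iters, as the config does, means the rate reaches min_lr exactly at the end of training.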

Multi-GPU is optional here: MI300X has 192GB of HBM3, so this fits easily on 1 GPU:

# Single GPU: 760M-class weights are ~3GB in fp32; weights + gradients + AdamW states are ~12GB before activations
# MI300X has 192GB, so you have massive headroom for bigger batch sizes
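A back-of-the-envelope estimate of the persistent training state makes the headroom concrete. The sketch assumes fp32 parameters, fp32 gradients, and two fp32 AdamW moment buffers (16 bytes per parameter), and it uses the GPT-2 large parameter count. Activations, temporary buffers, and framework overhead are excluded and depend on batch size.

```python
N_PARAMS = 774_030_080  # assumed: GPT-2 large shape with a 50257-token vocab

# Assumed bytes per parameter for fp32 weights trained with AdamW
BYTES_PER_PARAM = {
    "weights": 4,
    "gradients": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9


breakdown = {name: to_gb(N_PARAMS * b) for name, b in BYTES_PER_PARAM.items()}
total_gb = sum(breakdown.values())

for name, size in breakdown.items():
    print(f"{name:>20}: {size:5.2f} GB")
print(f"{'total':>20}: {total_gb:5.2f} GB")
```

Under these assumptions the fixed state is a small fraction of 192GB, so most of the memory is available for activations and larger micro-batches.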

Data options (better than OpenWebText)

Dataset        Size                                  HF path
FineWeb        15T tokens                            HuggingFaceFW/fineweb
FineWeb-Edu    1.3T tokens (educational subset)      HuggingFaceFW/fineweb-edu
OpenWebText    ~9B tokens                            local via nanoGPT prepare script
The Stack v2   32.1TB code (dedup; 67.5TB full)      bigcode/the-stack-v2-dedup

FineWeb-Edu is a strong default for a GPT-2 style language model if you want strong benchmark results.
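Dataset size matters because of how many tokens the quickstart config consumes. The arithmetic below uses the batch settings from that config and the dataset sizes quoted in the table, and assumes a single training process, so the accumulation steps are not split across ranks.

```python
# Settings from the quickstart config
batch_size = 12
gradient_accumulation_steps = 40
block_size = 1024
max_iters = 600_000

tokens_per_iter = batch_size * gradient_accumulation_steps * block_size
total_tokens = tokens_per_iter * max_iters

# Approximate dataset sizes in tokens, as quoted in the table
dataset_tokens = {
    "OpenWebText": 9e9,
    "FineWeb-Edu": 1.3e12,
    "FineWeb": 15e12,
}

print(f"tokens per iteration: {tokens_per_iter:,}")
print(f"total tokens trained: {total_tokens / 1e9:.1f}B")
for name, size in dataset_tokens.items():
    print(f"{name}: {total_tokens / size:.2f} passes over the data")
```

The full schedule is roughly 295B tokens: about 33 passes over OpenWebText, but less than a quarter of one pass over FineWeb-Edu. That is the practical argument for a larger corpus at this iteration count.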


Summary

Repo                        Best for
karpathy/nanoGPT            Learning, clean 760M training, single node
ROCm/Megatron-LM            Production, multi-node, tensor parallelism on MI300X
Lightning-AI/litgpt         Clean multi-model, LoRA, less boilerplate
bigcode-project/starcoder   Code-specific pretraining with FIM objective

Since you’re already familiar with nanoGPT from your MI300X run, the natural next step is either scaling data (FineWeb instead of OpenWebText) or switching to Megatron for tensor parallelism experiments.
