Training GPT-3 6.7B on MI300X

Generated by AI

2026.06

Question: Assuming nanoGPT is capable, how do I train GPT-3 6.7B end-to-end on MI300X using nanoGPT?

Answer:

Full Step-by-Step

Step 1: Setup Volume

# SSH into fresh droplet
ssh root@<new-ip>

# Mount your persistent volume
lsblk  # find your volume, e.g. /dev/sda
mkdir -p /mnt/data
mount /dev/sda /mnt/data

ls /mnt/data
# raw/  tokens/  checkpoints/

Step 2: Setup Environment

# ROCm is pre-installed on AMD GPU droplet
rocm-smi  # verify MI300X visible

pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.1
pip install numpy tiktoken datasets tqdm wandb

# Clone nanoGPT
cd /mnt/data
git clone https://github.com/karpathy/nanoGPT
cd nanoGPT

Step 3: Prepare Data (The Pile or OpenWebText)

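# OpenWebText (~40GB, GPT-2 style): smaller, a good first run
python data/openwebtext/prepare.py

# prepare.py writes next to itself, in data/openwebtext/:
#   train.bin  (~17GB, ~9B tokens)
#   val.bin    (~8.5MB)
# move them to where train.py looks once Patch 2 (Step 4) is applied:
mkdir -p /mnt/data/tokens/openwebtext
mv data/openwebtext/*.bin /mnt/data/tokens/openwebtext/

For GPT-3 scale you need far more data (OpenWebText's ~9B tokens is about 3% of the 300B-token budget used below), so use The Pile: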

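# data/pile/prepare.py (not part of nanoGPT; write this yourself)
import os
from datasets import load_dataset
import numpy as np
import tiktoken

enc = tiktoken.get_encoding('gpt2')

def tokenize(example):
    ids = enc.encode_ordinary(example['text'])  # encode without special tokens
    ids.append(enc.eot_token)  # end-of-text delimiter between documents
    return {'ids': ids, 'len': len(ids)}

# check that this Hub dataset ID is still available before relying on it
dataset = load_dataset('EleutherAI/pile', split='train', streaming=True)

# write to binary, in the directory train.py reads for dataset = 'pile'
out_dir = '/mnt/data/tokens/pile'
os.makedirs(out_dir, exist_ok=True)
# uint16 is enough because the GPT-2 vocab (50,257) fits below 65,536.
# 400B slots x 2 bytes = 800GB, so the volume must be at least that large;
# truncate the file to the number of tokens actually written when done.
arr = np.memmap(os.path.join(out_dir, 'train.bin'), dtype=np.uint16, mode='w+', shape=(400_000_000_000,))
# stream and fill...
# train.py also expects a val.bin in the same directory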
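
The "stream and fill" step can also be done without preallocating a fixed-size memmap. Below is a minimal sketch, assuming you simply append each tokenized document to the file as raw uint16 values. The helper names `append_tokens` and `count_tokens` and the demo token ids are hypothetical; in the real script the ids would come from `tokenize()` above. It uses only the standard library; on a little-endian host the resulting file has the same layout that `np.memmap(..., dtype=np.uint16)` reads.

```python
import os
import tempfile
from array import array

def append_tokens(path, token_ids):
    """Append token ids to `path` as raw native-endian uint16 values."""
    # 'H' is an unsigned 16-bit integer; ids above 65535 raise OverflowError
    buf = array('H', token_ids)
    with open(path, 'ab') as f:
        buf.tofile(f)
    return len(buf)

def count_tokens(path):
    """Number of uint16 tokens currently stored in `path`."""
    return os.path.getsize(path) // array('H').itemsize

# Demo with made-up ids standing in for tokenizer output (hypothetical data).
demo_path = os.path.join(tempfile.mkdtemp(), 'train.bin')
written = 0
for doc_ids in ([15496, 995, 50256], [40, 588, 50256]):
    written += append_tokens(demo_path, doc_ids)

print(f"wrote {written} tokens, file holds {count_tokens(demo_path)}")
```

Appending keeps the file exactly as long as the data, so no zero padding ends up in the training set, at the cost of not knowing the final size up front.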

Step 4: Patch nanoGPT for 6.7B

Two critical patches:

Patch 1 — Gradient Checkpointing:

# model.py — in GPT.forward()
from torch.utils.checkpoint import checkpoint

# replace this:
for block in self.transformer.h:
    x = block(x)

# with this:
for block in self.transformer.h:
    x = checkpoint(block, x, use_reentrant=False)

Patch 2 — Fix config path for data:

# in train.py, replace the hardcoded line
#   data_dir = os.path.join('data', dataset)
# with:
data_dir = os.path.join('/mnt/data/tokens', dataset)

Step 5: Write the 6.7B Config

# config/train_gpt3_6b.py

# model
n_layer = 32
n_head  = 32
n_embd  = 4096
block_size = 2048
bias = False
dropout = 0.0

# data
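dataset = 'pile'  # with Patch 2, train.py reads /mnt/data/tokens/pile/train.bin and val.bin
# do not set data_dir here: train.py assigns it after loading this file, so the value would be overwritten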

# training
batch_size = 4
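gradient_accumulation_steps = 160  # effective batch = 4 x 160 = 640 sequences (~1.3M tokens)
max_iters = 229_000  # ~300B tokens at 640 x 2048 tokens per iteration
warmup_iters = 2_000
lr_decay_iters = 229_000  # decay over the whole run, same as max_iters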

# optimizer
learning_rate = 1e-4
min_lr = 1e-5
beta1 = 0.9
beta2 = 0.95
weight_decay = 0.1
grad_clip = 1.0

# precision — critical for MI300X
dtype = 'bfloat16'

# checkpointing
out_dir = '/mnt/data/checkpoints'
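eval_interval = 1000  # nanoGPT evaluates and saves a checkpoint on this interval
always_save_checkpoint = True  # save at every eval, not only when val loss improves
keep_last_n = 3  # not a nanoGPT option; has no effect until you add the rotation patch in Step 6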

# logging
wandb_log = True
wandb_project = 'gpt3-6b'
wandb_run_name = 'mi300x-run1'

# system
device = 'cuda'  # ROCm builds of PyTorch expose the GPU as 'cuda'
compile = False  # torch.compile support on ROCm is patchy; get a run working first, then try True
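
As a sanity check on the config above, here is a rough sketch of the arithmetic behind the model size and the token budget. It assumes a GPT-2 style block (about 12 × n_embd² weights per layer, ignoring biases and LayerNorm) and nanoGPT's default padded vocabulary of 50,304; `gpt_param_count` is a hypothetical helper, not a nanoGPT function.

```python
def gpt_param_count(n_layer, n_embd, vocab_size, block_size):
    """Approximate parameter count for a GPT-2 style decoder with tied embeddings."""
    per_block = 12 * n_embd * n_embd                  # attention (4*d^2) + MLP (8*d^2)
    embeddings = (vocab_size + block_size) * n_embd   # token + position embeddings
    return n_layer * per_block + embeddings

n_params = gpt_param_count(n_layer=32, n_embd=4096, vocab_size=50304, block_size=2048)

# batch_size * gradient_accumulation_steps * block_size
tokens_per_iter = 4 * 160 * 2048
iters_for_300b = 300e9 / tokens_per_iter

print(f"params: {n_params / 1e9:.2f}B")
print(f"tokens per iteration: {tokens_per_iter:,}")
print(f"iterations for 300B tokens: {iters_for_300b:,.0f}")
```

Under these assumptions the layer and width settings land at roughly 6.66B parameters, and a 300B-token run needs roughly 229,000 iterations at this batch size.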

Step 6: Patch Checkpoint Rotation (keep last N)

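nanoGPT saves a single ckpt.pt and overwrites it each time. Patch it to keep the last 3:

# in train.py, add next to the other imports:
import glob
import re

# then find where it saves the checkpoint and replace:
torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))

# with:
ckpt_path = os.path.join(out_dir, f'ckpt_{iter_num}.pt')
torch.save(checkpoint, ckpt_path)

# rotate: delete old checkpoints beyond the last keep_last_n (3 in the Step 5 config).
# Sort by iteration number: a plain sorted() is lexicographic and puts
# ckpt_10000.pt before ckpt_9000.pt, which would delete the newest file.
# Each checkpoint holds model and optimizer state (roughly 80GB here), so size the volume for it.
ckpts = sorted(glob.glob(os.path.join(out_dir, 'ckpt_*.pt')),
               key=lambda p: int(re.search(r'ckpt_(\d+)\.pt$', p).group(1)))
for old in ckpts[:-keep_last_n]:
    os.remove(old)
    print(f"deleted old checkpoint: {old}")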
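
The rotation logic is easy to get wrong once iteration numbers gain a digit, so here is a standalone sketch you can run outside of training. It assumes the `ckpt_<iter>.pt` naming from the patch above; `rotate_checkpoints` and `CKPT_PATTERN` are hypothetical names, and the demo uses empty placeholder files rather than real checkpoints.

```python
import os
import re
import tempfile

CKPT_PATTERN = re.compile(r'^ckpt_(\d+)\.pt$')

def rotate_checkpoints(out_dir, keep_last_n=3):
    """Delete all but the newest `keep_last_n` ckpt_<iter>.pt files; return deleted names."""
    if keep_last_n < 1:
        raise ValueError("keep_last_n must be at least 1")
    found = []
    for name in os.listdir(out_dir):
        match = CKPT_PATTERN.match(name)
        if match:
            found.append((int(match.group(1)), name))
    found.sort()  # numeric order by iteration, oldest first
    stale = [name for _, name in found[:-keep_last_n]]
    for name in stale:
        os.remove(os.path.join(out_dir, name))
    return stale

# Demo: four empty placeholder files, crossing the 9000 -> 10000 digit boundary.
demo_dir = tempfile.mkdtemp()
for step in (8000, 9000, 10000, 11000):
    open(os.path.join(demo_dir, f'ckpt_{step}.pt'), 'wb').close()

deleted = rotate_checkpoints(demo_dir, keep_last_n=3)
remaining = sorted(os.listdir(demo_dir))
print("deleted:", deleted)
print("remaining:", remaining)
```

Parsing the iteration number instead of sorting file names is the design choice that matters here; sorting by modification time would also work but breaks if files are copied between volumes.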

Step 7: Launch Training

cd /mnt/data/nanoGPT
mkdir -p /mnt/data/logs  # tee fails if the log directory does not exist

# single GPU (MI300X has 192GB, enough for 6.7B + grad checkpointing)
python train.py config/train_gpt3_6b.py

# or, so the run survives an SSH disconnect, launch it inside tmux instead
tmux new -s train
python train.py config/train_gpt3_6b.py 2>&1 | tee /mnt/data/logs/train.log
# Ctrl+B D to detach
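
The claim that 192GB is enough can be checked with a back-of-the-envelope budget. This is a sketch under the assumption that weights and gradients stay in fp32 (nanoGPT uses autocast for bf16 compute rather than casting the parameters) and that AdamW keeps two fp32 moment buffers per parameter; `training_memory_gb` is a hypothetical helper and activations are not modeled.

```python
def training_memory_gb(n_params, bytes_weights=4, bytes_grads=4, bytes_adam=8):
    """Static training memory in GB: weights + gradients + AdamW moments."""
    return n_params * (bytes_weights + bytes_grads + bytes_adam) / 1e9

static_gb = training_memory_gb(6.7e9)
headroom_gb = 192 - static_gb  # what is left for activations and workspace

print(f"static state: {static_gb:.1f} GB")
print(f"headroom on a 192 GB card: {headroom_gb:.1f} GB")
```

The static state alone takes more than half the card, which is why the gradient checkpointing patch and the small micro-batch of 4 matter: the remaining headroom has to hold all activations.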

Step 8: Monitor

# GPU utilization
watch -n 1 rocm-smi

# training log
tail -f /mnt/data/logs/train.log

# wandb — open in browser
# https://wandb.ai/your-project/gpt3-6b

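Expected output (illustrative log format, not measured values):

iter 0:    loss 11.07, time 110000ms, mfu -100.00%
iter 100:  loss 7.43,  time 105000ms, mfu 40.00%
iter 1000: loss 4.21,  time 105000ms, mfu 40.00%

Each iteration covers 640 × 2048 ≈ 1.3M tokens (about 5 × 10^16 FLOPs), so step times are on the order of minutes, not seconds; the times shown assume roughly 500 TFLOPS sustained. nanoGPT prints mfu -100.00% for the first few iterations, until its running average is initialized. Its estimate_mfu() in model.py divides by a hardcoded A100 peak of 312 TFLOPS, so change flops_promised to the MI300X peak or the printed mfu will be inflated; the 40% shown assumes that change.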
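
To interpret the mfu column, it helps to see how the number is derived. The sketch below follows the same formula nanoGPT's `estimate_mfu()` uses (6N + 12·L·H·Q·T FLOPs per token), but takes the peak as a parameter instead of hardcoding the A100 value. The 100-second step time and the 1.3e15 peak for MI300X are assumptions for illustration, not measurements.

```python
def estimate_mfu(n_params, n_layer, n_head, n_embd, block_size,
                 seqs_per_iter, dt_seconds, peak_flops):
    """Model FLOPs utilization: achieved FLOPs per second divided by the hardware peak."""
    head_dim = n_embd // n_head
    # 6N for the dense weights plus the attention term, per token
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_iter = flops_per_token * block_size * seqs_per_iter
    return flops_per_iter / dt_seconds / peak_flops

common = dict(n_params=6.7e9, n_layer=32, n_head=32, n_embd=4096,
              block_size=2048, seqs_per_iter=640, dt_seconds=100.0)

mfu_mi300x = estimate_mfu(peak_flops=1.3e15, **common)  # assumed MI300X dense bf16 peak
mfu_a100 = estimate_mfu(peak_flops=312e12, **common)    # nanoGPT's hardcoded A100 peak

print(f"mfu vs MI300X peak: {mfu_mi300x:.1%}")
print(f"mfu vs A100 peak:   {mfu_a100:.1%}")
```

The same step time reads more than four times higher against the A100 peak, which is why an unpatched nanoGPT can report an mfu above 100% on this hardware.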

Step 9: End of Session — Safe Shutdown

# nanoGPT does not save on exit; it only writes a checkpoint every eval_interval iterations.
# In tmux, wait for the next checkpoint save to finish, then press Ctrl+C.
# Any progress after the last save is lost.

# unmount volume
umount /mnt/data

# detach via doctl from your MacBook
doctl compute volume-action detach <volume-id> <droplet-id>

# destroy droplet
doctl compute droplet delete <droplet-id> --force

Step 10: Resume Next Session

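# new droplet: attach the volume, mount it, and reinstall the Python deps from Step 2
mount /dev/sda /mnt/data
cd /mnt/data/nanoGPT

# nanoGPT does not auto-resume: it needs init_from=resume and loads out_dir/ckpt.pt.
# The rotation patch in Step 6 writes ckpt_<iter>.pt instead, so link the newest one first
# (ls -v sorts the iteration numbers numerically)
ln -sf "$(ls -v /mnt/data/checkpoints/ckpt_*.pt | tail -n 1)" /mnt/data/checkpoints/ckpt.pt
python train.py config/train_gpt3_6b.py --init_from=resume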

Expected Training Time (Rough)

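GPT-3 6.7B, 300B tokens (subset of Pile):

MI300X theoretical peak: ~1,300 TFLOPS dense bf16 (AMD spec sheet)
Optimistic MFU ~40%: ~520 TFLOPS effective

6.7B params × 6 FLOPs per param per token × 300B tokens
= 1.2 × 10^22 FLOPs total

1.2e22 / 520e12 = ~2.3 × 10^7 seconds
= ~270 days continuous

At $2/hr:

270 days × 24hr × $2 = ~$13,000

That's real money, so train on a subset first (10B tokens, ~9 days, ~$430) to validate that the loss is decreasing correctly before committing to a full run. The 6 FLOPs figure excludes the extra forward pass that gradient checkpointing adds, and 40% MFU is optimistic for an untuned single-GPU setup, so treat these numbers as a lower bound.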
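
The estimate is simple enough to script, which makes it easy to rerun with your own throughput once you have measured it. A minimal sketch, assuming the 6 × params × tokens rule of thumb, a peak of 1.3e15 FLOPS, 40% MFU, and $2/hr; `training_estimate` is a hypothetical helper and every input is an assumption you should replace with measured values.

```python
def training_estimate(n_params, n_tokens, peak_flops, mfu, usd_per_hour):
    """Return (total FLOPs, wall-clock days, cost in USD) for one training run."""
    total_flops = 6 * n_params * n_tokens      # forward + backward, no recompute
    seconds = total_flops / (peak_flops * mfu)
    days = seconds / 86_400
    cost = days * 24 * usd_per_hour
    return total_flops, days, cost

full_flops, full_days, full_cost = training_estimate(6.7e9, 300e9, 1.3e15, 0.40, 2.0)
sub_flops, sub_days, sub_cost = training_estimate(6.7e9, 10e9, 1.3e15, 0.40, 2.0)

print(f"full run:   {full_flops:.2e} FLOPs, {full_days:.0f} days, ${full_cost:,.0f}")
print(f"10B subset: {sub_flops:.2e} FLOPs, {sub_days:.1f} days, ${sub_cost:,.0f}")
```

After a few hundred iterations you can back out the real effective throughput from the logged step time and pass it in as `peak_flops * mfu` to get a grounded projection.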

Bottom Line Checklist

✅ Mount /mnt/data volume
✅ pip install deps (torch ROCm, tiktoken)
✅ prepare data → /mnt/data/tokens/
✅ patch gradient checkpointing in model.py
✅ patch checkpoint rotation in train.py
✅ write config/train_gpt3_6b.py
✅ tmux + launch
✅ monitor rocm-smi + wandb
✅ safe shutdown → detach volume → destroy droplet
✅ resume next session by attaching volume and launching with --init_from=resume
