Designing a Custom Midsize GPT-2 Model | Generated by AI

2026.05

Question: Can we design and train a custom GPT-2 model with a parameter count between the small (124M) and medium (350M) sizes, and is it feasible to train such a model on a single RTX 4070?


Answer:

Yes, absolutely. The GPT-2 architecture in nanoGPT is fully configurable — you are not restricted to the four official OpenAI sizes. You simply dial in n_layer, n_head, and n_embd to land at any parameter count you want between 124M and 350M.


How GPT-2 Sizing Works

The four standard configs in nanoGPT’s model.py are:

| Model | n_layer | n_head | n_embd | Params |
|---|---|---|---|---|
| gpt2 | 12 | 12 | 768 | 124M |
| gpt2-medium | 24 | 16 | 1024 | 350M |
| gpt2-large | 36 | 20 | 1280 | 774M |
| gpt2-xl | 48 | 25 | 1600 | 1558M |

These are just defaults. nanoGPT’s GPTConfig accepts arbitrary values, so you can design anything in between.
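For instance, a custom in-between config is just a matter of passing different values. The sketch below mirrors nanoGPT's `GPTConfig` dataclass locally so it runs standalone; in a real run you would `from model import GPTConfig, GPT` instead.

```python
from dataclasses import dataclass

# Minimal local mirror of nanoGPT's GPTConfig (model.py); field names
# match upstream, but this is a standalone sketch, not the real class.
@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True

# A hypothetical ~220M config between gpt2 and gpt2-medium.
# The only hard constraint: n_embd must divide evenly by n_head.
cfg = GPTConfig(n_layer=18, n_head=16, n_embd=896)
assert cfg.n_embd % cfg.n_head == 0, "head_dim must be an integer"
print(cfg.n_embd // cfg.n_head)  # head_dim = 56
```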


Example “Middle” Configs (~200M)

There is no single canonical ~200M config, but here are three well-behaved options. Two things to keep in mind: n_embd must be divisible by n_head (so head_dim is an integer), and GPU-friendly dimensions (multiples of 64 or 128) tend to run faster.

Option A — Wider, shallower (~200M)

n_layer = 16
n_head  = 16
n_embd  = 896   # 896 / 16 = 56 head_dim (fine)

Estimated params: ~200M

Option B — Balanced (~220M)

n_layer = 18
n_head  = 16
n_embd  = 896

Estimated params: ~220M

Option C — Closer to medium (~270M)

n_layer = 20
n_head  = 16
n_embd  = 960   # 960 / 16 = 60 head_dim

Estimated params: ~270M

You can use nanoGPT’s own transformer_sizing.ipynb notebook to get exact parameter counts for any config before committing to training.
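If you'd rather not open a notebook, the count also has a simple closed form. The function below assumes nanoGPT's exact layout (tied lm_head/wte, learned positional embeddings, biased linears and LayerNorms); it reproduces the official 124M figure and gives exact counts for the three options above.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, block_size=1024):
    """Exact parameter count for a GPT-2-style model as laid out in
    nanoGPT's model.py: lm_head tied to wte, learned position
    embeddings (wpe), and biases on all linears and LayerNorms."""
    # Per transformer block: attention (4e^2 + 4e) + MLP (8e^2 + 5e)
    # + two LayerNorms (4e) = 12e^2 + 13e.
    per_block = 12 * n_embd**2 + 13 * n_embd
    embeddings = (vocab_size + block_size) * n_embd  # wte + wpe
    final_ln = 2 * n_embd
    return n_layer * per_block + embeddings + final_ln

print(gpt2_param_count(12, 768))   # 124,439,808 — matches the official 124M
print(gpt2_param_count(16, 896))   # 200,276,608 — Option A
print(gpt2_param_count(18, 896))   # 219,567,488 — Option B
print(gpt2_param_count(20, 960))   # 270,665,280 — Option C
```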


VRAM Estimation on RTX 4070 (12 GB)

A rough rule for training in bfloat16:

| Component | Memory |
|---|---|
| Model weights | ~2 bytes × params |
| Gradients | ~2 bytes × params |
| AdamW optimizer states | ~8 bytes × params |
| Activations | batch-dependent, variable |

For a ~200M model with batch_size=1, block_size=1024, bfloat16, that works out to roughly 0.4 GB of weights, 0.4 GB of gradients, and 1.6 GB of AdamW states, plus activations — typically a few GB at this batch size — for a total in the neighborhood of 5–7 GB, leaving real headroom on a 12 GB card.

This is meaningfully more comfortable than trying to squeeze in the 350M model (which pushes 11–12 GB even at batch_size=1).
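The rule of thumb above can be turned into a quick calculator. The activation budget here is an assumption (it depends heavily on batch size, sequence length, flash attention, and checkpointing), so treat the output as a rough floor, not a guarantee.

```python
def train_vram_gb(params, activation_gb=2.0):
    """Rough bfloat16 training footprint: weights (2 B) + gradients (2 B)
    + AdamW states (8 B, two fp32 moments) per parameter, plus a
    hand-wavy activation budget (assumption, workload-dependent)."""
    GB = 1024**3
    weights = 2 * params / GB
    grads   = 2 * params / GB
    adamw   = 8 * params / GB
    return weights + grads + adamw + activation_gb

print(round(train_vram_gb(200e6), 1))  # ~4.2 GB for a 200M model
```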


Example Training Config

# config/train_gpt2_200m_4070.py
n_layer = 18
n_head  = 16
n_embd  = 896
block_size = 1024
vocab_size = 50257

batch_size                  = 2    # micro-batch; try 4 if it fits
gradient_accumulation_steps = 240  # keeps ~0.5M effective tokens/step
dtype   = 'bfloat16'
compile = True  # torch.compile gives ~20–30% speedup on Ada

# Optional but recommended:
# gradient checkpointing (patch Block.forward) to trade ~25% speed for ~2–3x activation memory savings
# 8-bit AdamW via bitsandbytes saves ~1–2 GB on optimizer states
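As a sanity check on the gradient-accumulation comment in the config above, the effective tokens per optimizer step is just the product of the three batch-shape values:

```python
# Values from the config above.
batch_size = 2                     # micro-batch per forward pass
gradient_accumulation_steps = 240  # micro-batches per optimizer step
block_size = 1024                  # sequence length in tokens

tokens_per_step = batch_size * gradient_accumulation_steps * block_size
print(tokens_per_step)  # 491520 — i.e. ~0.5M tokens per optimizer step
```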

Training Time Estimate

| Model | 4070 (12 GB) solo | Expected val loss |
|---|---|---|
| 124M (small) | ~1–2 weeks | ~3.0–3.1 |
| ~200M (custom) | ~3–5 weeks | ~2.92–3.0 |
| 350M (medium) | 1–3 months (aggressive tricks) | ~2.85–2.95 |

A ~200M custom model is the practical “sweet spot” for a single RTX 4070: it trains from scratch in a matter of weeks (not months), fits in VRAM without aggressive surgery, and gives noticeably better loss than the 124M model.
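One way to sanity-check these wall-clock figures is the standard FLOPs ≈ 6·N·D approximation. Both inputs below are assumptions for illustration: a ~30B-token run and ~15 TFLOP/s of *achieved* bf16 throughput on a 4070 (achieved throughput varies a lot with utilization).

```python
def train_days(params, tokens, flops_per_sec):
    """Estimated wall-clock training time using the common
    total-FLOPs approximation: ~6 * params * tokens."""
    total_flops = 6 * params * tokens
    return total_flops / flops_per_sec / 86_400  # seconds per day

# Assumed: ~200M params, 30B training tokens, ~15 TFLOP/s achieved.
print(round(train_days(200e6, 30e9, 15e12), 1))  # ≈ 27.8 days, i.e. ~4 weeks
```

That lands inside the ~3–5 week row above; halving throughput or doubling the token budget pushes it toward the upper end.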


Practical Recommendation

Go with Option B (n_layer=18, n_head=16, n_embd=896, ~220M) and the config above: it fits in 12 GB with headroom at a micro-batch of 2, trains in weeks rather than months, and gives a clear validation-loss improvement over the 124M baseline.


