The 80/10/10 Data Split Explained | Generated by AI
Question: Can you explain the 80/10/10 train/validation/test split used in nanoGPT and nanochat, and why this separation matters?
Answer:
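Yes. An 8/1/1 split (80% train, 10% val, 10% test) is a common convention. Note that nanoGPT's stock data-prep scripts write only train.bin and val.bin (a 90/10 split for the Shakespeare datasets), so a separate test set is something you add yourself. Here's the principle and an implementation sketch: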
Why Three Sets?
Train set (80%): Learn the parameters. Gradient updates only happen here.
Validation set (10%): Tune hyperparameters and detect overfitting. You evaluate on val during training (every N steps) to watch loss diverge between train and val — that’s your overfitting signal.
Test set (10%): Final evaluation. You touch this once at the end, after all hyperparameter tuning is locked. This gives an unbiased estimate of generalization performance.
The critical thing: val and test must not overlap with train, and test must stay out of every tuning decision. If you evaluate on test while you are still picking hyperparameters, you leak information and your final number is no longer an unbiased estimate.
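One way to make the separation concrete is to compute the split boundaries once and reuse them everywhere. The sketch below assumes a single contiguous token array of length n; `split_bounds` is a hypothetical helper for illustration, not something nanoGPT or nanochat provides.

```python
def split_bounds(n, train_frac=0.8, val_frac=0.1):
    """Return half-open (start, end) index ranges for train, val, and test.

    Hypothetical helper: test receives whatever remains after train and val.
    """
    if not (train_frac > 0 and val_frac > 0 and train_frac + val_frac < 1):
        raise ValueError("fractions must be positive and leave room for test")
    train_end = int(train_frac * n)
    val_end = int((train_frac + val_frac) * n)
    return {
        "train": (0, train_end),
        "val": (train_end, val_end),
        "test": (val_end, n),
    }


bounds = split_bounds(1000)
# Adjacent half-open ranges share an endpoint but no index,
# so no token position belongs to more than one split.
print(bounds)
```

Because each range ends exactly where the next begins, the three splits are disjoint and together cover the whole array.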
Implementation (nanoGPT Style)
# Typical three-way split, nanoGPT style
import numpy as np
import torch

context_length = 256  # assumed value; nanoGPT calls this block_size

data = np.memmap('data.bin', dtype=np.uint16, mode='r')
n = len(data)
train_data = data[:int(0.8*n)]           # 80% for training
val_data = data[int(0.8*n):int(0.9*n)]   # 10% for validation
test_data = data[int(0.9*n):]            # 10% for test
splits = {'train': train_data, 'val': val_data, 'test': test_data}

def get_batch(split, batch_size=32):
    """Sample a random batch of (input, target) windows from the named split."""
    d = splits[split]  # 'test' reads test_data, never val_data
    ix = torch.randint(len(d) - context_length, (batch_size,))
    x = torch.stack([torch.from_numpy(d[i:i+context_length].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(d[i+1:i+1+context_length].astype(np.int64)) for i in ix])
    return x, y
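Random sampling in get_batch suits periodic validation checks. For the one-time test run, a deterministic pass over non-overlapping windows gives a number you can reproduce exactly. The following is a sketch under that assumption; `eval_window_starts` is a hypothetical helper, and nothing in the text above says nanoGPT evaluates this way.

```python
def eval_window_starts(n_tokens, context_length):
    """Start offsets of non-overlapping evaluation windows.

    Each window x = data[i:i+context_length] needs a shifted target
    y = data[i+1:i+1+context_length], so i must stay below
    n_tokens - context_length.
    """
    return range(0, n_tokens - context_length, context_length)


# Example: 10 tokens with a context of 3 gives windows starting at 0, 3, 6.
starts = list(eval_window_starts(10, 3))
print(starts)
```

Averaging the loss over these windows visits each input position at most once, so repeated runs on the same checkpoint agree.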
The Training Loop Pattern
for step in range(max_iters):
    # Sample from train, compute loss, backprop
    x, y = get_batch('train')
    logits, loss = model(x, y)
    optimizer.zero_grad(set_to_none=True)  # clear gradients from the previous step
    loss.backward()
    optimizer.step()

    # Every eval_interval steps, check validation loss (no gradients)
    if step % eval_interval == 0:
        model.eval()
        with torch.no_grad():
            val_loss = estimate_loss('val')  # forward passes only, on the val set
        model.train()
        print(f"iter {step}: train_loss {loss.item():.4f}, val_loss {val_loss:.4f}")
The validation loss tells you if you’re overfitting:
- If val_loss >> train_loss: the model has memorized training data and is not generalizing
- If they track together: good generalization, keep going
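The loop above calls estimate_loss without defining it. A single batch gives a noisy loss, so a common approach is to average over several batches. Here is a framework-agnostic sketch with a slightly different signature from the call above: `loss_on_batch` is a hypothetical callable that runs one forward pass and returns a scalar loss. In PyTorch it would typically wrap the model call under torch.no_grad() with the model in eval mode.

```python
def estimate_loss(loss_on_batch, get_batch, split, eval_iters=200):
    """Average the loss over eval_iters batches drawn from one split."""
    total = 0.0
    for _ in range(eval_iters):
        x, y = get_batch(split)
        total += float(loss_on_batch(x, y))
    return total / eval_iters


# Stand-in callables so the sketch runs without a model (illustrative only).
fake_batches = iter([(1.0, 2.0), (3.0, 4.0)])
avg = estimate_loss(
    loss_on_batch=lambda x, y: y - x,          # pretend "loss"
    get_batch=lambda split: next(fake_batches),
    split="val",
    eval_iters=2,
)
print(avg)
```

Calling it for both 'train' and 'val' with the same eval_iters makes the two numbers directly comparable.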
For Language Models Specifically
The split philosophy shifts slightly at scale:
- For small models (nanoGPT scale): 8/1/1 is fine. Val and test are comparable.
- For large models: Validation becomes more important than test:
  - You evaluate on val many times during training, and those results drive tuning decisions (never gradient updates)
  - Test is only evaluated once at the end (not used for tuning)
  - Some teams use 90/5/5 or even larger train ratios, because more training data usually helps
- For production models: You might use a held-out test distribution (different domain/time period) to catch distribution shift that val didn’t catch.
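A time-based hold-out can be sketched as a cutoff on document timestamps. This assumes each document carries a date, which the text above does not specify; `split_by_cutoff` and the sample documents are hypothetical.

```python
from datetime import date


def split_by_cutoff(docs, cutoff):
    """Split (timestamp, text) pairs into in-distribution and held-out sets."""
    in_dist = [d for d in docs if d[0] < cutoff]
    held_out = [d for d in docs if d[0] >= cutoff]
    return in_dist, held_out


# Illustrative documents only.
docs = [
    (date(2022, 3, 1), "older doc A"),
    (date(2022, 9, 15), "older doc B"),
    (date(2023, 2, 10), "newer doc C"),
]
in_dist, held_out = split_by_cutoff(docs, cutoff=date(2023, 1, 1))
print(len(in_dist), len(held_out))
```

The usual train/val/test split would then be carved out of in_dist, with held_out serving as an extra test set that probes for shift over time.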
Practical Consideration for Large Datasets
If you’re training on billions of tokens (like you’re exploring with MI300X), the 10% val/test can be gigabytes of data. You can be more aggressive:
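# For multi-billion token datasets, 5% each for val and test is acceptable
# Both values are cumulative boundaries (fractions of the dataset length n)
train_split = 0.90  # train: data[:0.90*n]
val_split = 0.95    # val: data[0.90*n:0.95*n], test: the final 5%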
The absolute size of val/test matters more than the percentage: you need enough tokens to get stable loss estimates, and past that point extra validation data yields diminishing returns. Treat the ~1GB figure as a rough rule of thumb, not a hard threshold.
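One way to judge whether a validation set is "enough" is the standard error of the mean loss, which shrinks roughly as one over the square root of the number of evaluated batches. This sketch assumes the batches are sampled independently; `loss_estimate` is a hypothetical helper and the loss values are made up.

```python
import math
import statistics


def loss_estimate(batch_losses):
    """Return (mean, standard error) of per-batch validation losses."""
    mean = statistics.fmean(batch_losses)
    stderr = statistics.stdev(batch_losses) / math.sqrt(len(batch_losses))
    return mean, stderr


# Illustrative per-batch losses.
mean, stderr = loss_estimate([1.0, 2.0, 3.0, 4.0])
print(f"val loss {mean:.3f} +/- {stderr:.3f}")
```

Under that independence assumption, quadrupling the number of batches halves the standard error, whatever fraction of the dataset those batches represent.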
Bottom line: Keep val and test strictly separated. Use val loss during training to detect overfitting and tune LR/dropout/etc. Touch test once, after everything is locked. This is non-negotiable if you want your results to be credible.