Optimized nanoGPT Config for RTX 4070 | Generated by AI

Here’s an adjusted nanoGPT config optimized for your RTX 4070 (12 GB VRAM) when training on just this single Parquet file (~3.1 GB, from CC-MAIN-2013-20/000_00000.parquet).

Key Adjustments and Why

out_dir = 'out-fineweb-single-parquet'
eval_interval = 500       # Evaluate more often on small data
eval_iters = 200
log_interval = 50         # Log more frequently
always_save_checkpoint = True

wandb_log = True          # Optional
wandb_project = 'fineweb'
wandb_run_name = '125M-single-parquet-4070'

dataset = 'fineweb'       # Assumes you adapted prepare.py for your single file (sketch below the config)
gradient_accumulation_steps = 32     # Effective batch size: 16 * 32 = 512 sequences
batch_size = 16
block_size = 1024                    # Standard GPT-2 context length

# Model (~125M parameters, GPT-2 small size) – fits in 12 GB VRAM
n_layer = 12
n_head = 12
n_embd = 768                         # GPT-2 small width; must be divisible by n_head
dropout = 0.0                        # Add 0.1 if overfitting
learning_rate = 5e-4                 # Slightly lower for smaller data
max_iters = 6000                     # ~3B tokens seen (adjust up to 10000 if loss keeps dropping)
warmup_iters = 500                   # Shorter warmup
lr_decay_iters = 6000
min_lr = 5e-5
beta2 = 0.99

# Extras for speed/stability
compile = True            # PyTorch compile for 20–30% faster training
bias = False              # Like LLaMA/Mistral
weight_decay = 0.1
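
To run it, save the block above as e.g. config/train_fineweb_single.py and launch with python train.py config/train_fineweb_single.py (nanoGPT's configurator picks up these overrides). The dataset line assumes a prepare.py adapted to a single Parquet shard; below is a minimal sketch of that adaptation, not nanoGPT's stock script. It assumes pandas, pyarrow, and tiktoken are installed, that the shard's text column is named "text", and that the local path is a placeholder you'd replace:

import os
import numpy as np
import pandas as pd
import tiktoken

parquet_path = "000_00000.parquet"          # placeholder path to the downloaded shard
out_dir = os.path.join("data", "fineweb")   # matches dataset = 'fineweb' above
os.makedirs(out_dir, exist_ok=True)

enc = tiktoken.get_encoding("gpt2")
df = pd.read_parquet(parquet_path, columns=["text"])

n_docs = len(df)
n_val = max(1, n_docs // 1000)              # hold out ~0.1% of documents for validation
splits = {"val": df["text"].iloc[:n_val], "train": df["text"].iloc[n_val:]}

for split, texts in splits.items():
    path = os.path.join(out_dir, f"{split}.bin")
    n_tokens = 0
    with open(path, "wb") as f:
        for text in texts:
            ids = enc.encode_ordinary(text)
            ids.append(enc.eot_token)                 # document separator, as in nanoGPT's prepare.py
            np.array(ids, dtype=np.uint16).tofile(f)  # GPT-2 ids (< 50257) fit in uint16
            n_tokens += len(ids)
    print(f"{split}: {n_tokens:,} tokens -> {path}")

The printed train-token count is the number to feed into the max_iters budgeting under Training Tips below.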

Smaller Model Option (If You Want Faster Training or Less Memory)

If the above peaks near 11 GB of VRAM, try this smaller ~30M-parameter config, which uses considerably less memory:

n_layer = 6
n_head = 6
n_embd = 384
learning_rate = 6e-4      # Higher for smaller model
max_iters = 8000          # Compensate with more iters (~4B tokens seen)

Still solid quality, but noticeably weaker than the 125M model.
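
If you want to double-check those sizes, a quick back-of-the-envelope count (assuming weight-tied embeddings, bias = False as above, the GPT-2 vocab padded to 50304, block_size = 1024, and ignoring the small LayerNorm terms) looks like this:

def gpt_params(n_layer, n_embd, vocab_size=50304, block_size=1024):
    per_layer = 12 * n_embd ** 2                     # attention (~4*n_embd^2) + MLP (~8*n_embd^2)
    embeddings = (vocab_size + block_size) * n_embd  # token + position tables; output head is tied
    return n_layer * per_layer + embeddings

print(f"{gpt_params(12, 768) / 1e6:.0f}M")  # ~124M: the main config
print(f"{gpt_params(6, 384) / 1e6:.0f}M")   # ~30M: this smaller option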

Training Tips

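One concrete tip: before raising max_iters, sanity-check the token budget. Tokens seen per iteration follow directly from the config, so a few lines tell you how many passes over the shard a given max_iters implies (the 1.0e9 figure is a hypothetical stand-in for whatever your prepare.py actually reports):

batch_size, block_size, grad_accum = 16, 1024, 32
tokens_per_iter = batch_size * block_size * grad_accum   # 16 * 1024 * 32 = 524,288

for max_iters in (6000, 8000, 10000):
    print(f"{max_iters} iters -> {tokens_per_iter * max_iters / 1e9:.2f}B tokens")

train_tokens = 1.0e9   # hypothetical; substitute the count printed by prepare.py
print(f"6000 iters ~ {tokens_per_iter * 6000 / train_tokens:.1f} epochs over the shard")
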
All together, this should give you a decent small GPT on this subset. If you share loss curves or the exact token count from prepare.py, I can refine the config further! 🚀

