FineWeb Download and GPT-3 Training Guide | Generated by AI


Question: Regarding the script plan_and_download_fineweb_gpt3.py — is downloading ~100B tokens from FineWeb good or bad, and is that expected? Also, how do you use the downloaded FineWeb data to train a GPT-3 style model?


Answer:

Part 1: Is 100B Tokens Good, Bad, or Expected?

It’s a reasonable and intentional subset — not the full dataset.

FineWeb is built from 96 snapshots of Common Crawl, spanning from 2013 to early 2024, and comprises 15 trillion GPT-2 tokens of predominantly English text. So 100B tokens is only about 0.67% of the full FineWeb dataset.

That said, 100B tokens is a perfectly legitimate target for a GPT-3 scale experiment: the original GPT-3 175B was trained on roughly 300B tokens, and the smaller GPT-3 variants (125M to 1.3B parameters) are compute-optimal at well under 100B tokens by Chinchilla-style scaling (roughly 20 training tokens per parameter).

In short: 100B tokens is a deliberate, practical subset — good for a research experiment or smaller GPT-3 reproduction. It’s expected behavior for a planning/downloading script that targets a manageable slice of the dataset.
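
To make the budgeting concrete, here is a small back-of-the-envelope sketch. The 15T and 300B figures come from the FineWeb and GPT-3 papers; the 20-tokens-per-parameter rule of thumb is Chinchilla-style and only approximate:

# Back-of-the-envelope token budgeting (approximate figures).
fineweb_total = 15e12        # ~15T GPT-2 tokens in full FineWeb
download_target = 100e9      # ~100B tokens targeted by the download script

print(f"Share of FineWeb: {download_target / fineweb_total:.2%}")   # ~0.67%

# Chinchilla-style rule of thumb: ~20 training tokens per parameter.
for name, params in [("GPT-3 Small", 125e6), ("GPT-3 XL", 1.3e9), ("GPT-3 175B", 175e9)]:
    optimal = 20 * params
    verdict = "fits" if optimal <= download_target else "exceeds"
    print(f"{name}: ~{optimal / 1e9:.0f}B tokens compute-optimal ({verdict} the 100B budget)")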


Part 2: How to Train a GPT-3 Style Model on FineWeb Data

Here is the full pipeline:

Step 1 — Download & Tokenize

Your script likely downloads FineWeb shards from Hugging Face. After downloading, tokenize with GPT-2’s BPE tokenizer (tiktoken):

import tiktoken

# GPT-2 BPE tokenizer, the same vocabulary FineWeb's 15T-token count is measured in
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode(text)   # 'text' is one document from a FineWeb shard

Save shards as binary .bin files (uint16 numpy arrays) — this is the standard nanoGPT format.
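
A minimal sketch of the tokenize-and-shard step is below. It assumes the data is streamed with the Hugging Face datasets library (the "sample-10BT" config is just an example) and that your training loader expects raw uint16 token dumps; the exact shard format, such as whether a small header precedes the tokens as in build-nanogpt, depends on the loader you pair it with.

import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token  # <|endoftext|> id, used as a document separator

ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

SHARD_SIZE = 100_000_000          # tokens per .bin shard
buf, shard_idx = [], 0
for doc in ds:
    buf.append(eot)                                # separate documents
    buf.extend(enc.encode_ordinary(doc["text"]))   # GPT-2 BPE tokens
    if len(buf) >= SHARD_SIZE:
        arr = np.array(buf[:SHARD_SIZE], dtype=np.uint16)   # GPT-2 vocab fits in uint16
        arr.tofile(f"fineweb_train_{shard_idx:06d}.bin")
        buf = buf[SHARD_SIZE:]
        shard_idx += 1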

Step 2 — Set Up Training Framework

Andrej Karpathy’s build-nanogpt repo starts from an empty file and works toward a reproduction of GPT-2 (124M). If you have more patience or money, the code can also reproduce the GPT-3 models.

Clone it:

git clone https://github.com/karpathy/build-nanogpt
cd build-nanogpt
pip install torch numpy tiktoken

Step 3 — Configure GPT-3 Hyperparameters

GPT-3 models use these key hyperparameters, which vary by model size:

Model          Layers   Heads   d_model   Params
GPT-3 Small    12       12      768       125M
GPT-3 Medium   24       16      1024      350M
GPT-3 XL       24       24      2048      1.3B
GPT-3 175B     96       96      12288     175B
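
You can sanity-check the parameter counts in the table with the standard transformer estimate: roughly 12 · n_layer · d_model² for the attention and MLP blocks, plus the token embedding. A rough sketch that ignores biases and layer norms:

def approx_params(n_layer, d_model, vocab_size=50257):
    # ~12 * d_model^2 weights per block (attention + MLP), plus token embeddings.
    return 12 * n_layer * d_model**2 + vocab_size * d_model

for name, n_layer, d_model in [("Small", 12, 768), ("Medium", 24, 1024),
                               ("XL", 24, 2048), ("175B", 96, 12288)]:
    print(f"GPT-3 {name}: ~{approx_params(n_layer, d_model) / 1e6:,.0f}M params")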

In your config file:

# config/train_gpt3_small.py
n_layer = 12
n_head = 12
n_embd = 768
block_size = 2048       # GPT-3 uses a 2048-token context
batch_size = 512        # 512 seqs x 2048 tokens ~= 1M tokens/step; use gradient accumulation if this exceeds memory
learning_rate = 6e-4    # max LR for GPT-3 Small in the GPT-3 paper
max_iters = 300000      # ~315B tokens at 1M tokens/step; ~95k iters covers a 100B-token budget once
dataset = 'fineweb'
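
GPT-3 also pairs the max learning rate with a schedule: linear warmup followed by cosine decay down to 10% of the maximum. A minimal sketch of that schedule is below; warmup_iters and the decay horizon are illustrative values, not taken from the script (the paper warms up over the first 375M tokens and decays over 260B tokens).

import math

max_lr = 6e-4
min_lr = max_lr * 0.1     # decay to 10% of max, as in the GPT-3 paper
warmup_iters = 2000       # illustrative value
decay_iters = 300000      # illustrative value

def get_lr(it):
    if it < warmup_iters:                       # 1) linear warmup
        return max_lr * (it + 1) / warmup_iters
    if it > decay_iters:                        # 2) past the horizon, hold at min_lr
        return min_lr
    ratio = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # cosine from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)   # 3) cosine decay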

Step 4 — Launch Training

Single GPU:

python train.py config/train_gpt3_small.py

Multi-GPU (recommended for 100B tokens):

torchrun --standalone --nproc_per_node=8 train.py config/train_gpt3_small.py

To reproduce GPT-2 (124M) you’ll want at least an 8x A100 40GB node. For GPT-3 scale, you’ll need more — typically multiple nodes.
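
When you scale out, the batch size that matters is tokens per optimizer step. A quick sketch of the usual gradient-accumulation arithmetic: the 0.5M-token batch is GPT-3 Small's setting from the paper, while the micro-batch size of 4 is an illustrative value that depends on your GPU memory.

total_batch_tokens = 524288   # ~0.5M tokens per step, GPT-3 Small's batch size
block_size = 2048
micro_batch = 4               # sequences per GPU per forward/backward pass
n_gpus = 8

tokens_per_micro_step = n_gpus * micro_batch * block_size          # 65,536
grad_accum_steps = total_batch_tokens // tokens_per_micro_step     # 8
print(f"gradient accumulation steps: {grad_accum_steps}")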

Step 5 — Alternative: Use Karpathy’s nanochat (2025)

Karpathy's nanochat project is a full-stack, open-source LLM implementation in roughly 8,000 lines of code. Beyond pretraining, its pipeline covers supervised fine-tuning (SFT), optional reinforcement learning (RL), and inference/serving, and it pretrains on high-quality data such as FineWeb-EDU.

A ~$300 run (12 hours) creates a model better than GPT-2. A ~$1000 run (~42 hours) produces a model with noticeably better reasoning and coherence.

Step 6 — Evaluate

Standard benchmarks for GPT-3 reproductions include HellaSwag, LAMBADA, PIQA, ARC (Easy/Challenge), and WinoGrande; build-nanogpt, for example, tracks HellaSwag accuracy during training.

Use lighteval or lm-evaluation-harness to run these automatically.
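
A sketch of running a couple of these benchmarks with lm-evaluation-harness's Python API. This assumes your checkpoint has been exported to a Hugging Face-compatible directory (the ./out/gpt3-small-hf path is hypothetical, and build-nanogpt does not produce such an export out of the box):

import lm_eval

# Evaluate a Hugging Face-format checkpoint on a few standard tasks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./out/gpt3-small-hf",
    tasks=["hellaswag", "lambada_openai", "piqa"],
    batch_size=8,
)
print(results["results"])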


Summary

Aspect                          Detail
100B tokens vs full FineWeb     ~0.67% of the 15T total; a practical research subset
Is it expected?                 Yes, typical for a planning script targeting a manageable chunk
Best training framework         Karpathy's build-nanogpt or nanochat
Recommended starting size       GPT-3 Small (125M); needs ~8x A100s, ~$10–$100
Full GPT-3 (175B)               Hundreds of A100s and weeks of training; not practical for individuals
