Datasets for LLM Pretraining | Generated by AI

2026.07

Question: Besides FineWeb, what other text datasets can I use for LLM pretraining?

Answer:

Here’s the practical landscape, organized by what role each dataset plays in a training mix.

General web-text (FineWeb alternatives/peers)

Curated / multi-source

Code

Math (highest-leverage per token for reasoning)

Multilingual

Practical recipe for your scale

For a GPT-2 760M / nanoGPT-class run (~10–100B tokens, MI300X), a strong mix:

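# Chinchilla-ish mix for a ~1B model, streaming from the Hugging Face Hub
from datasets import load_dataset, interleave_datasets

def text_stream(path, name=None, column="text", **kwargs):
    # Stream one source and keep a single "text" column so all schemas line up.
    ds = load_dataset(path, name, split="train", streaming=True, **kwargs)
    if column != "text":
        ds = ds.rename_column(column, "text")
    return ds.select_columns(["text"])

# probabilities choose the source of each example, so they weight documents, not tokens.
mix = interleave_datasets([
    text_stream("HuggingFaceFW/fineweb-edu", "sample-100BT"),
    text_stream("mlfoundations/dclm-baseline-1.0"),
    # starcoderdata is gated and keeps source code in "content" (check the dataset card).
    text_stream("bigcode/starcoderdata", column="content", data_dir="python"),
    text_stream("HuggingFaceTB/finemath", "finemath-4plus"),
], probabilities=[0.55, 0.25, 0.10, 0.10], seed=42)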
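To see what those weights mean in tokens, here is a minimal sketch that splits a Chinchilla-style budget (assuming the common rule of thumb of roughly 20 tokens per parameter) across the four sources, and converts target token shares into per-document sampling probabilities. The helper names `token_budget` and `doc_probabilities` are hypothetical, and the average document lengths are made-up placeholders; measure them with your own tokenizer before relying on the numbers.

```python
def token_budget(n_params, token_shares, tokens_per_param=20):
    """Split a Chinchilla-style total token budget across sources."""
    total = n_params * tokens_per_param
    return {name: round(total * share) for name, share in token_shares.items()}


def doc_probabilities(token_shares, avg_tokens_per_doc):
    """Convert target token shares into per-document sampling probabilities.

    Sources with longer documents need a lower per-document probability
    to end up with the same share of tokens.
    """
    raw = {name: token_shares[name] / avg_tokens_per_doc[name] for name in token_shares}
    norm = sum(raw.values())
    return {name: value / norm for name, value in raw.items()}


# Target token shares, matching the mix weights in the recipe.
TOKEN_SHARES = {
    "fineweb-edu": 0.55,
    "dclm-baseline": 0.25,
    "starcoderdata-python": 0.10,
    "finemath-4plus": 0.10,
}

# Hypothetical placeholders, NOT measured values.
AVG_TOKENS_PER_DOC = {
    "fineweb-edu": 1000,
    "dclm-baseline": 1200,
    "starcoderdata-python": 2000,
    "finemath-4plus": 1500,
}

budget = token_budget(760_000_000, TOKEN_SHARES)
probs = doc_probabilities(TOKEN_SHARES, AVG_TOKENS_PER_DOC)

for name in TOKEN_SHARES:
    print(f"{name:22s} {budget[name] / 1e9:6.2f}B tokens   p(doc) = {probs[name]:.3f}")
```

At 760M parameters the total comes to 15.2B tokens, which sits at the low end of the 10–100B range above. If the per-document probabilities differ much from the token shares, pass the former to `interleave_datasets`.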

Two moves with more leverage than swapping datasets:

  1. Ablate DCLM vs FineWeb-Edu yourself at 760M / 10B tokens — results differ by benchmark, and running the ablation teaches you more than any leaderboard. Karpathy’s nanochat uses FineWeb-Edu shards precisely because it’s the reliable default, not because it always wins.
  2. Study the filtering classifiers, not just their outputs. FineWeb-Edu’s classifier is a lightweight regression head over frozen text embeddings, trained on Llama-3-70B educational-value annotations. You can retrain that pipeline for a custom domain (e.g., an agent/tooling-focused corpus) in a weekend on your 4070.
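The shape of such a filtering pipeline can be sketched in a few lines. This is a toy illustration under stated assumptions, not the FineWeb-Edu implementation: the 8-dimensional "embeddings" and the 0–5 style "annotations" are synthetic stand-ins for real embedding-model outputs and LLM-assigned scores, the regressor is plain gradient descent in pure Python, and the names `train_linear_regressor`, `score`, and the `THRESHOLD` value are hypothetical.

```python
import random


def train_linear_regressor(X, y, lr=0.05, epochs=200, l2=1e-3):
    """Fit y ~ w.x + b with batch gradient descent and a small L2 penalty on w."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - target
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * (gw / n + l2 * wi) for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def score(w, b, x):
    """Predicted quality score for one embedding vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Synthetic stand-ins: random "embeddings" and noisy linear "annotations".
rng = random.Random(0)
true_w = [rng.uniform(-0.5, 0.5) for _ in range(8)]
X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
y = [2.5 + sum(a * v for a, v in zip(true_w, x)) + rng.gauss(0, 0.1) for x in X]

w, b = train_linear_regressor(X, y)

# Keep only documents whose predicted score clears a cutoff (hypothetical value).
THRESHOLD = 3.0
kept = [i for i, x in enumerate(X) if score(w, b, x) >= THRESHOLD]
print(f"kept {len(kept)} of {len(X)} documents")
```

In a real run the rows of `X` would come from a frozen embedding model and `y` from an LLM prompted with your own rubric. The regressor stays this small, which is why the retraining step is cheap compared with annotation.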
