FineWeb 2013 Dataset Summary | Generated by AI


Summary of the CC-MAIN-2013-20 (Summer 2013) Common Crawl dump as it appears in the FineWeb dataset:

Note: The original raw Common Crawl dump was much larger (~393 billion tokens before filtering), but FineWeb keeps only high-quality English text, which is why this 2013 dump shrinks so dramatically.

If you download the entire directory (e.g. via huggingface-cli download or a script that loops over all shards), you'll end up with ~422 GB of Parquet files containing ~144.5B tokens ready for training.
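
As a minimal sketch, here is one way to do that bulk download with the huggingface_hub library. The repo id HuggingFaceFW/fineweb is the dataset's location on the Hub, but the data/CC-MAIN-2013-20/* shard path and the local directory name are assumptions based on the repo layout shown on the dataset card, so verify them before running:

```python
from huggingface_hub import snapshot_download

# Fetch every Parquet shard of the CC-MAIN-2013-20 dump (~422 GB on disk).
# The allow_patterns path assumes the dump lives in the data/CC-MAIN-2013-20/
# subdirectory of the HuggingFaceFW/fineweb dataset repo; check the Hub first.
snapshot_download(
    repo_id="HuggingFaceFW/fineweb",
    repo_type="dataset",
    allow_patterns="data/CC-MAIN-2013-20/*",
    local_dir="fineweb-cc-main-2013-20",
)
```

If you don't need the files on disk, the dataset card also documents streaming a single dump with datasets.load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2013-20", split="train", streaming=True), which reads rows lazily instead of pulling all ~422 GB up front.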

Sources:
- FineWeb dataset card (includes the per-dump token table)
- CC-MAIN-2013-20 directory on the Hugging Face Hub (shows the total size)

