FineWeb Dataset Size Details | Generated by AI
Question: Is FineWeb 44TB? Is that its maximum data size?
Answer:
Yes, 44TB is correct — but it refers to the disk space of the original FineWeb v1.0 release, not an absolute maximum.
Here’s a comprehensive breakdown:
🍷 FineWeb — Original Release (v1.0)
Hugging Face released FineWeb on May 31, 2024, as a large-scale dataset drawing from 96 CommonCrawl snapshots, encompassing 15 trillion tokens and occupying 44TB of disk space.
So yes — 44TB is the correct disk size figure for the original FineWeb dataset.
Has the Dataset Grown Since Then?
The dataset has been updated and expanded. According to the current Hugging Face dataset page, FineWeb now contains more than 18.5 trillion tokens (up from the original 15 trillion) of cleaned and deduplicated English web data from CommonCrawl.
This means the 44TB figure corresponds to the initial 15T token release, and the actual size has grown with subsequent updates.
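A quick back-of-envelope check, using only the figures stated above (44TB on disk for 15T tokens), illustrates why the updated release must be larger than 44TB. The ~54TB projection below is an extrapolation assuming a constant bytes-per-token ratio, not an official figure:

```python
# Back-of-envelope: storage per token implied by the original v1.0 release.
ORIGINAL_TOKENS = 15e12   # 15 trillion tokens (v1.0)
ORIGINAL_BYTES = 44e12    # 44 TB on disk (v1.0), using decimal TB = 1e12 bytes

bytes_per_token = ORIGINAL_BYTES / ORIGINAL_TOKENS
print(f"~{bytes_per_token:.2f} bytes per token")  # ~2.93 bytes per token

# Extrapolate to the updated 18.5T-token release. This assumes the same
# bytes-per-token ratio holds, which is only an approximation.
UPDATED_TOKENS = 18.5e12
projected_tb = UPDATED_TOKENS * bytes_per_token / 1e12
print(f"projected size: ~{projected_tb:.0f} TB")  # ~54 TB
```

The exact on-disk size depends on compression and file format, so the real figure may differ, but the direction of the estimate holds: more tokens means a footprint well beyond 44TB.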
FineWeb-Edu Subset
Alongside the main dataset, Hugging Face introduced FineWeb-Edu, a subset of 1.3 trillion tokens from FineWeb that was rated as highly educational by a custom classifier. This subset is much smaller than the full dataset.
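To put "much smaller" in perspective, the stated token counts imply FineWeb-Edu retains under a tenth of the original corpus; a rough ratio computed from the figures above:

```python
# Fraction of the original FineWeb corpus retained in FineWeb-Edu,
# based solely on the published token counts.
FINEWEB_TOKENS = 15e12   # original release, 15 trillion tokens
EDU_TOKENS = 1.3e12      # FineWeb-Edu subset, 1.3 trillion tokens

fraction = EDU_TOKENS / FINEWEB_TOKENS
print(f"FineWeb-Edu is ~{fraction:.1%} of the original corpus")  # ~8.7%
```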
Key Facts Summary
| Property | Value |
|---|---|
| Source | 96 CommonCrawl snapshots |
| Original token count | 15 trillion tokens |
| Updated token count | 18.5+ trillion tokens |
| Disk space (original) | 44TB |
| FineWeb-Edu subset | 1.3 trillion tokens |
| Language | English only |
| License | ODC-By 1.0 |
Conclusion: 44TB is the disk footprint of the original FineWeb v1.0, not a hard cap. The dataset has since grown beyond 18.5 trillion tokens, so the actual storage footprint is now larger than 44TB.