Common Crawl October 2025 Dataset Summary | Generated by AI

2025.11

This table is the official summary of the Common Crawl October 2025 crawl, the dataset labeled CC-MAIN-2025-43. Common Crawl is a non-profit project that crawls a large portion of the public web every month or two and releases the raw data for free, so that researchers, companies, and anyone else can use it, for example to train large language models such as GPT, LLaMA, and Grok.

Here’s what each part of the table means:

Basic facts

The different data types and what they contain

| Data type | What it is | Total uncompressed size | # of files | Compressed size |
| --- | --- | --- | --- | --- |
| WARC | The raw, full crawl data (complete HTTP responses: headers + HTML + embedded resources) | ~ hundreds of TiB | 100,000 | 97.73 TiB |
| WAT | Metadata extracted from the WARC files (e.g., outgoing links, language, content-length, etc.) in JSON format | | 100,000 | 18.39 TiB |
| WET | Only the extracted plain text (no HTML tags, no boilerplate, just cleaned text) | | 100,000 | 7.38 TiB |
| Robots.txt files | All robots.txt files that were fetched during the crawl | | 100,000 | 0.15 TiB |
| Non-200 responses | Responses that were not successful (404s, 500s, redirects, etc.) | | 100,000 | 3.07 TiB |
| URL index files | Index that lets you look up which WARC file contains a specific URL (old format) | | 302 | 0.20 TiB |
| Columnar URL index | Newer, faster columnar index (Parquet format) for big-data tools like Spark, DuckDB, etc. | | 900 | 0.23 TiB |
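
To put the compressed sizes in perspective, here is a small back-of-the-envelope calculation in Python. It uses only the numbers from the table; the variable names are made up for illustration.

```python
# Compressed sizes in TiB, copied from the table above.
compressed_tib = {
    "WARC": 97.73,
    "WAT": 18.39,
    "WET": 7.38,
    "robots.txt": 0.15,
    "non-200": 3.07,
    "URL index": 0.20,
    "columnar index": 0.23,
}

# Total download size if you wanted every data type.
total_tib = sum(compressed_tib.values())

# 1 TiB = 1024 GiB; the WARC data is spread over 100,000 files.
avg_warc_gib = compressed_tib["WARC"] * 1024 / 100_000

# How small the plain-text WET data is relative to the raw WARC data.
wet_share = compressed_tib["WET"] / compressed_tib["WARC"]

print(f"total compressed: {total_tib:.2f} TiB")
print(f"average WARC file: {avg_warc_gib:.2f} GiB")
print(f"WET / WARC: {wet_share:.1%}")
```

So the whole crawl is roughly 127 TiB compressed, a single WARC file averages about 1 GiB, and the WET text is under 8% of the WARC size, which is why text-only pipelines usually start from WET.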

Why the different formats exist

How the files are organized

Everything is split into 100 segments (roughly 1 TiB compressed each for WARC). Each segment contains many smaller .warc.gz (or .wat.gz, .wet.gz) files, about 1,000 per data type given the 100,000 files in total. The *.paths.gz files you see listed are just gzipped text files that contain the full list of file paths for each type, one path per line, relative to the bucket root.
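
As a rough sketch of how a *.paths.gz listing can be used, the Python snippet below reads a gzipped listing, counts files per segment, and expands each path into a download URL. The sample paths are made-up placeholders, and the `segments/<segment-id>/...` layout it relies on is an assumption about the path format rather than something stated in the table. `read_paths` and `segment_of` are hypothetical helper names.

```python
import gzip
import io
from collections import Counter

BASE_URL = "https://data.commoncrawl.org/"


def read_paths(fileobj):
    """Yield one relative path per non-empty line of a gzipped listing."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            path = line.strip()
            if path:
                yield path


def segment_of(path):
    """Return the path component right after 'segments/' (assumed layout)."""
    parts = path.split("/")
    return parts[parts.index("segments") + 1]


# Placeholder listing standing in for a real warc.paths.gz download.
sample = "\n".join([
    "crawl-data/CC-MAIN-2025-43/segments/SEGMENT-A/warc/file-00000.warc.gz",
    "crawl-data/CC-MAIN-2025-43/segments/SEGMENT-A/warc/file-00001.warc.gz",
    "crawl-data/CC-MAIN-2025-43/segments/SEGMENT-B/warc/file-00002.warc.gz",
]) + "\n"
buf = io.BytesIO(gzip.compress(sample.encode("utf-8")))

paths = list(read_paths(buf))
per_segment = Counter(segment_of(p) for p in paths)
urls = [BASE_URL + p for p in paths]

print(per_segment)
print(urls[0])
```

With a real listing you would pass the downloaded file (opened in binary mode) to `read_paths` instead of the in-memory buffer.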

You can access everything for free on AWS S3 at: s3://commoncrawl/crawl-data/CC-MAIN-2025-43/ (or via HTTPS: https://data.commoncrawl.org/)
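
You rarely need to download a whole ~1 GiB WARC file to get one page. A common pattern, sketched below in Python, is to take a record from the URL index and request only that byte range over HTTPS. The field names `filename`, `offset`, and `length` are assumed to match the JSON records returned by the URL index, the sample record is invented, and `range_request` is a hypothetical helper.

```python
import json
import urllib.request


def range_request(index_line, base_url="https://data.commoncrawl.org/"):
    """Build an HTTP request for just one record inside a WARC file."""
    record = json.loads(index_line)
    start = int(record["offset"])
    # HTTP byte ranges are inclusive on both ends.
    end = start + int(record["length"]) - 1
    return urllib.request.Request(
        base_url + record["filename"],
        headers={"Range": f"bytes={start}-{end}"},
    )


# Invented index record; real values come from an index lookup.
sample_line = json.dumps({
    "url": "https://example.com/",
    "filename": "crawl-data/CC-MAIN-2025-43/segments/SEGMENT-A/warc/file-00000.warc.gz",
    "offset": "1000",
    "length": "500",
})

req = range_request(sample_line)
print(req.full_url)
print(req.get_header("Range"))
```

Passing `req` to `urllib.request.urlopen` should then return only the requested bytes, which you would still need to gunzip and parse as a WARC record. The sketch stops before the network call so it can run offline.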

In short: This is one of the largest open web datasets in the world (2.61 billion pages from October 2025), freely available in different processed forms, and the table is just telling you how big each processed version is and where to find the file lists.



x-ai/grok-4-fast
