FineWeb Data Download and Preparation | Generated by AI


Downloading the Parquet File

The URL you provided points to a specific Parquet file from the FineWeb dataset (a cleaned subset of CommonCrawl data). This file (000_00000.parquet from the CC-MAIN-2013-20 dump) contains ~1 million rows of English web text; each row includes a text column that’s already preprocessed for language model training (text extracted from the HTML with Trafilatura, then deduplicated and quality-filtered).

To download it:
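A minimal sketch using hf_hub_download from the huggingface_hub library (the in-repo path data/CC-MAIN-2013-20/000_00000.parquet is assumed from the local path used later in this guide; verify it against the dataset card):

from huggingface_hub import hf_hub_download

# Fetch one Parquet shard from the FineWeb dataset repo on the Hugging Face Hub
local_path = hf_hub_download(
    repo_id="HuggingFaceFW/fineweb",
    repo_type="dataset",
    filename="data/CC-MAIN-2013-20/000_00000.parquet",  # assumed in-repo path
    local_dir="./fineweb_data",
)
print("Downloaded to:", local_path)

With local_dir set as above, the file lands at ./fineweb_data/data/CC-MAIN-2013-20/000_00000.parquet, matching the path used in the extraction code below.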

For the full dump (hundreds of files), use snapshot_download as shown in the dataset docs, but start with this single file for testing.
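For reference, a hedged snapshot_download sketch for a whole dump might look like this (the allow_patterns glob is an assumption based on the repo layout above):

from huggingface_hub import snapshot_download

# Pull every Parquet file for one dump (hundreds of files, tens of GB on disk)
snapshot_download(
    repo_id="HuggingFaceFW/fineweb",
    repo_type="dataset",
    allow_patterns="data/CC-MAIN-2013-20/*",  # assumed folder pattern for this dump
    local_dir="./fineweb_data",
)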

Extracting Text

FineWeb’s text column is plain text ready for training, so there is no HTML to parse. Use pandas or pyarrow to load it efficiently. Here’s how:

  1. Install Dependencies (if not already part of your NanoGPT setup): pip install pandas pyarrow datasets

  2. Load the Parquet File and Extract Text:

    import pandas as pd
    import re
    
    # Path to your downloaded file
    parquet_path = "./fineweb_data/data/CC-MAIN-2013-20/000_00000.parquet"
    
    # Load the Parquet file (efficient for large files)
    df = pd.read_parquet(parquet_path, columns=['text'])  # Only load the text column to save memory
    
    # Extract all text into a list (or iterate if memory-constrained)
    texts = df['text'].tolist()  # List of ~1M strings
    
    # Optional: Basic cleaning (FineWeb is already clean, but normalize whitespace)
    def clean_text(text):
        if pd.isna(text):  # Skip nulls (rare in FineWeb)
            return ''
        text = re.sub(r'\s+', ' ', text.strip())  # Collapse whitespace
        return text if len(text) > 10 else ''  # Filter very short texts
    
    cleaned_texts = [c for c in (clean_text(t) for t in texts) if c]  # Drop nulls, empties, and very short texts
    
    print(f"Extracted {len(cleaned_texts)} text samples")
    print("Sample:", cleaned_texts[0][:200] + "...")  # Preview first text
    
    • Memory Tip: This file has ~1M rows, each text ~1-5k characters. On a machine with 16GB RAM, it loads fine. For larger dumps, read in batches with pyarrow so only one chunk is in memory at a time:
      import pyarrow.parquet as pq
      
      # Iterate over record batches instead of materializing the whole table
      parquet_file = pq.ParquetFile(parquet_path)
      for batch in parquet_file.iter_batches(batch_size=10_000, columns=['text']):
          batch_texts = batch.to_pydict()['text']  # List of strings for this batch
          # Clean/process each batch here, then discard it before loading the next
      
    • Streaming Alternative (no full download needed): Use the datasets library to iterate without loading everything:
      from datasets import load_dataset
      
      # Load specific dump in streaming mode
      dataset = load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2013-20", split="train", streaming=True)
      
      texts = []
      for i, sample in enumerate(dataset):
          if i >= 100000:  # Limit for testing
              break
          texts.append(sample['text'])
      
      # Clean as above
      

Preparing for NanoGPT Training

NanoGPT expects a single text file (e.g., input.txt) with concatenated documents separated by a special token (like "<|endoftext|>"). Here’s how to format the extracted text:

# After extracting cleaned_texts from above
with open('input.txt', 'w', encoding='utf-8') as f:
    for text in cleaned_texts:
        f.write(text + "\n<|endoftext|>\n")  # GPT-2's end-of-text token as a document separator

print("Saved to input.txt – ready for NanoGPT!")

This single file yields roughly 0.5-1B tokens (at ~4 characters per token), plenty for fine-tuning or a small pretraining run. For more, download additional Parquet files and concatenate them as sketched below. If you hit memory issues or need help with the NanoGPT config, share details!
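A minimal concatenation sketch, assuming the extra shards were downloaded into ./fineweb_data/data/CC-MAIN-2013-20/ alongside the first one:

import glob
import pandas as pd

# Collect every downloaded shard (path is an assumption; adjust to your layout)
shard_paths = sorted(glob.glob("./fineweb_data/data/CC-MAIN-2013-20/*.parquet"))

with open('input.txt', 'w', encoding='utf-8') as f:
    for path in shard_paths:
        df = pd.read_parquet(path, columns=['text'])  # One shard at a time to limit memory
        for text in df['text']:
            text = ' '.join(str(text).split())  # Same whitespace normalization as clean_text
            if len(text) > 10:
                f.write(text + "\n<|endoftext|>\n")

print(f"Concatenated {len(shard_paths)} shards into input.txt")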

Reference: FineWeb Dataset Card

