Fixing Dataset Split Error


The error occurs because your local dataset file (/home/lzw/projects/nanoGPT/data/openwebtext_local/openwebtext.txt) contains only 1 sample (likely a single line, or the whole file read in as one record), but the script attempts to split it into train/test sets with test_size=0.1. The split allocates ceil(1 * 0.1) = 1 sample to the test set, leaving 0 for training, and datasets.train_test_split() raises an error rather than produce an empty train set.
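
A minimal reproduction, assuming the Hugging Face datasets library (the "text" column name is illustrative):

    from datasets import Dataset

    # A one-row dataset stands in for the single-sample openwebtext.txt
    dataset = Dataset.from_dict({"text": ["only one line of text"]})

    # Raises ValueError: the test set takes ceil(1 * 0.1) = 1 sample,
    # which would leave the train set empty
    split_dataset = dataset.train_test_split(test_size=0.1, seed=2357, shuffle=True)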

Quick Fixes

  1. Use a larger/more complete dataset:
    • The nanoGPT OpenWebText preparation expects a substantial corpus (millions of documents). Your local openwebtext.txt seems incomplete—possibly just a placeholder or failed download.
    • Download a proper OpenWebText sample or subset:
      • Clone the OpenWebText scraping toolkit (git clone https://github.com/jcpeterson/openwebtext.git) and use its download scripts to fetch the corpus; note the repo provides tooling, not the text itself.
      • Or use a pre-processed version from Hugging Face: install datasets if needed (pip install datasets), then modify prepare.py to load via load_dataset("openwebtext", split="train") instead of a local file (see the first sketch after this list).
      • For testing, create a dummy multi-line openwebtext.txt with at least 10 lines of sample text (e.g., repeat "Hello world." on separate lines) so the 90/10 split has something to work with (see the second sketch after this list).
  2. Modify the split in prepare.py (around line 50):
    • If you must use this tiny file for debugging, change the split parameters:
      # Original (fails when the dataset has only 1 sample):
      split_dataset = dataset.train_test_split(test_size=0.1, seed=2357, shuffle=True)
      
      # Option A: skip splitting entirely and train on everything.
      # Wrapping in DatasetDict keeps downstream calls like split_dataset.map(...) working:
      from datasets import DatasetDict
      split_dataset = DatasetDict({"train": dataset})
      
      # Option B: only split when there are enough samples to split:
      if len(dataset) <= 1:
          split_dataset = DatasetDict({"train": dataset})
      else:
          split_dataset = dataset.train_test_split(test_size=0.1, seed=2357, shuffle=True)
      
    • Re-run: python data/openwebtext_local/prepare.py. If your prepare.py later touches the test split (stock nanoGPT renames it with split_dataset["val"] = split_dataset.pop("test")), guard or remove that line too when there is no test set.
  3. Verify your dataset size:
    • Add a print statement before the split in prepare.py (line ~45):
      print(f"Dataset size: {len(dataset)}")
      
    • Run again to confirm; it should print 1. If the file is empty or corrupt, regenerate it, or check file permissions and encoding (UTF-8 expected).
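
As referenced in fix 1, here is a sketch of what the loading step in prepare.py could look like when switched to the Hugging Face hub copy of OpenWebText. The num_proc value and the test-to-val rename mirror nanoGPT's stock data/openwebtext/prepare.py; verify both against your local copy, and note the download is large (tens of GB):

    from datasets import load_dataset

    # Pull OpenWebText from the Hugging Face hub (cached under ~/.cache/huggingface)
    dataset = load_dataset("openwebtext", split="train", num_proc=8)

    # With millions of documents, a small fractional test_size yields non-empty splits
    split_dataset = dataset.train_test_split(test_size=0.0005, seed=2357, shuffle=True)
    split_dataset["val"] = split_dataset.pop("test")  # nanoGPT names its eval split "val"

And, for the dummy-file route, a tiny script that writes a placeholder openwebtext.txt able to survive the 90/10 split, assuming prepare.py treats each line as one sample, as the error suggests (the path comes from your error message; adjust as needed):

    # Write 100 short lines so train_test_split(test_size=0.1) gets 90 train / 10 test
    path = "/home/lzw/projects/nanoGPT/data/openwebtext_local/openwebtext.txt"
    with open(path, "w", encoding="utf-8") as f:
        for i in range(100):
            f.write(f"Hello world. This is placeholder sample {i}.\n")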

If you share more of prepare.py or the file contents, I can provide a patched version.

