Fixing Dataset Split Error
The error occurs because your local dataset file (/home/lzw/projects/nanoGPT/data/openwebtext_local/openwebtext.txt) contains only 1 sample (likely a single line, or nearly empty), while the script attempts to split it into train/test sets with test_size=0.1. With one sample, the test set rounds up to 1 sample and leaves 0 for training, which datasets.train_test_split() rejects rather than produce an empty train set.
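To see the failure in isolation, here is a minimal reproduction (assuming the Hugging Face `datasets` library, which nanoGPT's OpenWebText prep uses):

```python
from datasets import Dataset

# A 1-row dataset: 10% of 1 rounds up to 1 test sample, leaving 0 for train.
ds = Dataset.from_dict({"text": ["only one document"]})
ds.train_test_split(test_size=0.1, seed=2357, shuffle=True)
# Raises ValueError: the resulting train set will be empty.
```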
Quick Fixes
- Use a larger/more complete dataset:
  - The nanoGPT OpenWebText preparation expects a substantial corpus (millions of documents). Your local `openwebtext.txt` seems incomplete, possibly a placeholder or a failed download.
  - Download a proper OpenWebText sample or subset:
    - Clone the OpenWebText repo: `git clone https://github.com/jcpeterson/openwebtext.git` into `data/openwebtext_local/`.
    - Or use a pre-processed version from Hugging Face: install `datasets` if you haven't already (`pip install datasets`), then modify `prepare.py` to load via `load_dataset("openwebtext", split="train")` instead of a local file (a sketch follows this fix).
    - For testing, create a dummy multi-line file at `openwebtext.txt` with at least 10+ lines of sample text (e.g., repeat "Hello world." many times) to get past the split.
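  A minimal sketch of the Hugging Face route (the `num_proc` value here is only an illustrative choice):

  ```python
  from datasets import load_dataset

  # Download OpenWebText from the Hugging Face Hub instead of reading
  # a local text file; expect a large download (tens of GB).
  dataset = load_dataset("openwebtext", split="train", num_proc=8)
  ```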
- Modify the split in `prepare.py` (around line 50):
  - If you must use this tiny file for debugging, change the split parameters:

    ```python
    # Original (causes the error with a 1-sample dataset):
    split_dataset = dataset.train_test_split(test_size=0.1, seed=2357, shuffle=True)

    # Option A: skip splitting entirely (no test set; train on everything).
    # Adjust any later code that expects a 'test'/'val' split.
    split_dataset = {"train": dataset}

    # Option B: split only when there is enough data for a non-empty train set:
    if len(dataset) <= 1:
        split_dataset = {"train": dataset}
    else:
        split_dataset = dataset.train_test_split(test_size=0.1, seed=2357, shuffle=True)
    ```

  - Re-run: `python data/openwebtext_local/prepare.py`.
- Verify your dataset size:
  - Add a print statement before the split in `prepare.py` (line ~45): `print(f"Dataset size: {len(dataset)}")`
  - Run again to confirm; it should output `1`. If the file is empty or corrupt, regenerate it or check file permissions/encoding (UTF-8 expected). A standalone file check is sketched just after this list.
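If you want to inspect the raw file without touching `prepare.py`, a minimal standalone check (using the path from your error message; adjust as needed):

```python
# Count non-empty lines in the local corpus file to confirm how many
# samples the loader will actually see.
path = "/home/lzw/projects/nanoGPT/data/openwebtext_local/openwebtext.txt"
with open(path, encoding="utf-8") as f:
    non_empty = sum(1 for line in f if line.strip())
print(f"Non-empty lines: {non_empty}")
```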
Recommended Next Steps for nanoGPT
- For full training, follow nanoGPT's official data prep: see the nanoGPT README for downloading/preparing Shakespeare or OpenWebText properly. OpenWebText requires significant storage (~40GB uncompressed).
- If this is for local testing, switch to the built-in Shakespeare dataset (much smaller, with only one small download): edit `prepare.py` to use `load_dataset("tiny_shakespeare")`, or use nanoGPT's `data/shakespeare_char/prepare.py` (see the sketch below).
- Ensure dependencies: `pip install datasets torch` (plus the others in nanoGPT's `requirements.txt`).
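For the Shakespeare route, nanoGPT's `data/shakespeare_char/prepare.py` fetches a single ~1MB text file; a sketch of that download step (the URL is the one nanoGPT uses):

```python
import os
import requests

# Download the Tiny Shakespeare corpus if it isn't present yet.
input_file_path = os.path.join(os.path.dirname(__file__), "input.txt")
if not os.path.exists(input_file_path):
    data_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
    with open(input_file_path, "w", encoding="utf-8") as f:
        f.write(requests.get(data_url).text)
```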
If you share more of `prepare.py` or the file contents, I can provide a patched version.