Fixing Dataset Split Error
The error occurs because your local dataset file (/home/lzw/projects/nanoGPT/data/openwebtext_local/openwebtext.txt) contains only 1 sample (likely a single line, or nearly empty), while the script attempts to split it into train/test sets with test_size=0.1. With one sample, the test set rounds up to 1 sample and leaves 0 for training, which datasets.train_test_split() rejects rather than produce an empty train set.
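To see the failure in isolation, here is a minimal reproduction (assuming the Hugging Face `datasets` library, which nanoGPT's OpenWebText prep uses):

```python
from datasets import Dataset

# A 1-row dataset: 10% of 1 rounds up to 1 test sample, leaving 0 for train.
ds = Dataset.from_dict({"text": ["only one document"]})
ds.train_test_split(test_size=0.1, seed=2357, shuffle=True)
# Raises ValueError: the resulting train set will be empty.
```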
Quick Fixes
- Use a larger/more complete dataset:
  - The nanoGPT OpenWebText preparation expects a substantial corpus (millions of documents). Your local `openwebtext.txt` seems incomplete, possibly a placeholder or a failed download.
  - Download a proper OpenWebText sample or subset:
    - Clone the OpenWebText repo: `git clone https://github.com/jcpeterson/openwebtext.git` into `data/openwebtext_local/`.
    - Or use a pre-processed version from Hugging Face: install `datasets` if you haven't already (`pip install datasets`), then modify `prepare.py` to load via `load_dataset("openwebtext", split="train")` instead of a local file (a sketch follows this fix).
    - For testing, create a dummy multi-line file at `openwebtext.txt` with at least 10+ lines of sample text (e.g., repeat "Hello world." many times) to get past the split.
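  A minimal sketch of the Hugging Face route (the `num_proc` value here is only an illustrative choice):

  ```python
  from datasets import load_dataset

  # Download OpenWebText from the Hugging Face Hub instead of reading
  # a local text file; expect a large download (tens of GB).
  dataset = load_dataset("openwebtext", split="train", num_proc=8)
  ```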
- Modify the split in `prepare.py` (around line 50):
  - If you must use this tiny file for debugging, change the split parameters:

    ```python
    # Original (causes the error with a 1-sample dataset):
    split_dataset = dataset.train_test_split(test_size=0.1, seed=2357, shuffle=True)

    # Option A: skip splitting entirely (no test set; train on everything).
    # Adjust any later code that expects a 'test'/'val' split.
    split_dataset = {"train": dataset}

    # Option B: split only when there is enough data for a non-empty train set:
    if len(dataset) <= 1:
        split_dataset = {"train": dataset}
    else:
        split_dataset = dataset.train_test_split(test_size=0.1, seed=2357, shuffle=True)
    ```

  - Re-run: `python data/openwebtext_local/prepare.py`.
- Verify your dataset size:
  - Add a print statement before the split in `prepare.py` (line ~45): `print(f"Dataset size: {len(dataset)}")`
  - Run again to confirm; it should output `1`. If the file is empty or corrupt, regenerate it or check file permissions/encoding (UTF-8 expected). A standalone file check is sketched just after this list.
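If you want to inspect the raw file without touching `prepare.py`, a minimal standalone check (using the path from your error message; adjust as needed):

```python
# Count non-empty lines in the local corpus file to confirm how many
# samples the loader will actually see.
path = "/home/lzw/projects/nanoGPT/data/openwebtext_local/openwebtext.txt"
with open(path, encoding="utf-8") as f:
    non_empty = sum(1 for line in f if line.strip())
print(f"Non-empty lines: {non_empty}")
```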
Recommended Next Steps for nanoGPT
- For full training, follow nanoGPT's official data prep: see the nanoGPT README for downloading/preparing Shakespeare or OpenWebText properly. OpenWebText requires significant storage (~40GB uncompressed).
- If this is for local testing, switch to the built-in Shakespeare dataset (much smaller, with only one small download): edit `prepare.py` to use `load_dataset("tiny_shakespeare")`, or use nanoGPT's `data/shakespeare_char/prepare.py` (see the sketch below).
- Ensure dependencies: `pip install datasets torch` (plus the others in nanoGPT's `requirements.txt`).
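For the Shakespeare route, nanoGPT's `data/shakespeare_char/prepare.py` fetches a single ~1MB text file; a sketch of that download step (the URL is the one nanoGPT uses):

```python
import os
import requests

# Download the Tiny Shakespeare corpus if it isn't present yet.
input_file_path = os.path.join(os.path.dirname(__file__), "input.txt")
if not os.path.exists(input_file_path):
    data_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
    with open(input_file_path, "w", encoding="utf-8") as f:
        f.write(requests.get(data_url).text)
```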
If you share more of `prepare.py` or the file contents, I can provide a patched version.