Training Custom Datasets with nanoGPT | Generated by AI


Preparing and Training a Custom Text Dataset with nanoGPT

nanoGPT is designed for easy training on text data, including custom datasets. The process involves preparing your text into binary token files, configuring the training parameters, and running the training script. Below, I’ll outline the steps based on the repository’s examples (e.g., Shakespeare or OpenWebText). You’ll need Python 3.8+ and a GPU for efficient training (CPU works but is slower).

1. Install Dependencies

First, set up the environment:

pip install torch numpy transformers datasets tiktoken wandb tqdm

2. Prepare Your Custom Dataset

nanoGPT expects your data as two binary files (train.bin and val.bin) containing the tokenized text as integer IDs, placed in a folder named data/&lt;dataset&gt;/. Create data/my_dataset/ and write a short preparation script to process your raw text.
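
Here is a minimal prepare.py, modeled on the repo's data/shakespeare/prepare.py. The path data/my_dataset/input.txt is an assumption; point it at your own raw text file:

# data/my_dataset/prepare.py -- minimal sketch, modeled on the repo's
# data/shakespeare/prepare.py; assumes your raw text is in input.txt
import os
import numpy as np
import tiktoken

# read the raw text
input_path = os.path.join(os.path.dirname(__file__), 'input.txt')
with open(input_path, 'r', encoding='utf-8') as f:
    data = f.read()

# 90/10 train/val split
n = len(data)
train_data = data[:int(n * 0.9)]
val_data = data[int(n * 0.9):]

# tokenize with the GPT-2 BPE, which nanoGPT's train.py expects by default
enc = tiktoken.get_encoding('gpt2')
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens; val has {len(val_ids):,} tokens")

# write as uint16 binaries (the GPT-2 vocab of 50257 fits in 16 bits)
np.array(train_ids, dtype=np.uint16).tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
np.array(val_ids, dtype=np.uint16).tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))

Run it once with python data/my_dataset/prepare.py; train.py will then locate the .bin files via the dataset value in your config.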

3. Configure Training

Create a config file by copying an example (e.g., config/train_shakespeare_char.py) to config/train_my_dataset.py and editing it.

Key parameters to tweak:

# Example config snippet
out_dir = 'out-my_dataset'  # Output folder for checkpoints
dataset = 'my_dataset'      # Matches your data folder name
batch_size = 64             # Adjust based on GPU memory (e.g., 12 for small GPU)
block_size = 256            # Context length (tokens per example)
n_layer = 6                 # Transformer layers
n_head = 6                  # Attention heads
n_embd = 384                # Embedding dimension
max_iters = 5000            # Training steps
learning_rate = 6e-4        # Learning rate (train.py uses the name learning_rate, not lr)
dropout = 0.2               # Dropout rate
init_from = 'scratch'       # 'scratch' for new model; 'gpt2' to load pretrained
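
If you plan to finetune GPT-2 rather than train from scratch, a config along these lines works. This is a sketch modeled on the repo's config/finetune_shakespeare.py with illustrative values; your data must have been tokenized with the GPT-2 BPE (as in the prepare.py sketch above) so the vocabularies match:

# Finetuning sketch -- assumes GPT-2 BPE-tokenized train.bin/val.bin
out_dir = 'out-my_dataset-ft'
dataset = 'my_dataset'
init_from = 'gpt2'          # or 'gpt2-medium', 'gpt2-large', 'gpt2-xl'
batch_size = 8
block_size = 1024           # GPT-2's native context length
learning_rate = 3e-5        # much lower than training from scratch
max_iters = 1000
decay_lr = False            # a constant LR is fine for short finetunes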

4. Run Training

Train from scratch or finetune:

python train.py config/train_my_dataset.py
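
Any variable in the config file can also be overridden at the command line with --key=value flags, courtesy of nanoGPT's configurator.py. This is handy for quick experiments or for machines without a GPU, for example:

python train.py config/train_my_dataset.py --device=cpu --compile=False --batch_size=12

Checkpoints are written to out_dir as ckpt.pt as training progresses.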

5. Generate Samples (Test the Model)

After training, sample text:

python sample.py --out_dir=out-my_dataset --start="Your prompt here"

This loads the checkpoint (ckpt.pt) from out_dir and generates text continuing from your prompt.
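
sample.py exposes a few more useful knobs (these flags exist in the repo's sample.py; the values shown are just examples):

python sample.py --out_dir=out-my_dataset --start="Your prompt here" --num_samples=3 --max_new_tokens=200 --temperature=0.8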

Common Issues

- CUDA out of memory: lower batch_size or block_size, or raise gradient_accumulation_steps to keep the effective batch size.
- torch.compile errors: compilation requires PyTorch 2.0+; pass --compile=False to disable it.
- No GPU available: add --device=cpu --compile=False and expect much slower training.
- Validation loss rising while training loss falls: small datasets overfit quickly; reduce model size, increase dropout, or stop earlier.

For more details, check the repo’s README and data/config examples.

nanoGPT GitHub: https://github.com/karpathy/nanoGPT

