FineWeb-Edu nanochat integration steps | Generated by AI

2026.06

OK, now I have the full picture. Here’s what you need to do:

Your FineWeb-Edu parquet files at /mnt/data/zz/datasets/fineweb-edu/ (176 files, 374GB) already have a text column, which is exactly what nanochat needs. nanochat can’t reuse nanoGPT’s pre-tokenized .bin files, though, because it tokenizes the raw text itself with its own BPE tokenizer.

Step-by-step:

1. Link your data into nanochat’s expected location

The existing ~/.cache/nanochat/base_data_climbmix/ has only 9 shards (788MB). Replace it with your data:

# Back up existing data
mv ~/.cache/nanochat/base_data_climbmix ~/.cache/nanochat/base_data_climbmix.bak

# Symlink your FineWeb-Edu parquets
ln -s /mnt/data/zz/datasets/fineweb-edu ~/.cache/nanochat/base_data_climbmix
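
Before moving on, it may be worth confirming that the link resolves and that the parquet files are visible through it. This is only a sketch: count_parquets is a hypothetical helper, not part of nanochat, and it assumes the parquet files sit directly inside the linked directory rather than in subdirectories.

```shell
# Count the .parquet files visible through a directory or symlink.
count_parquets() {
  # -L follows the symlink; -maxdepth 1 ignores subdirectories
  find -L "$1" -maxdepth 1 -type f -name '*.parquet' | wc -l | tr -d ' '
}

link="$HOME/.cache/nanochat/base_data_climbmix"
if [ -e "$link" ]; then
  # Compare this against the 176 files you expect
  echo "parquet files visible: $(count_parquets "$link")"
else
  echo "missing or dangling link: $link"
fi
```

If the count is far from 176, check whether the files live in nested folders under /mnt/data/zz/datasets/fineweb-edu/.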

2. Retrain the tokenizer on your data

The existing tokenizer was trained on ClimbMix. You need one trained on FineWeb-Edu:

cd /mnt/data/nanochat
.venv/bin/python -m scripts.tok_train

This reads parquet files, trains a BPE tokenizer (vocab 32K by default), and saves to ~/.cache/nanochat/tokenizer/.
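
Because the new tokenizer is saved to the same directory, you may want to keep a copy of the ClimbMix tokenizer before running the command above. A minimal sketch, assuming the retrain writes into ~/.cache/nanochat/tokenizer/ in place; backup_copy is a hypothetical helper, not a nanochat command.

```shell
# Copy a directory to <dir>.bak, refusing to overwrite an earlier backup.
backup_copy() {
  src="$1"
  if [ ! -e "$src" ]; then
    echo "nothing to back up: $src"
    return 0
  fi
  if [ -e "$src.bak" ]; then
    echo "backup already exists: $src.bak" >&2
    return 1
  fi
  # -R copies the directory recursively; the original stays in place
  cp -R "$src" "$src.bak"
}

backup_copy "$HOME/.cache/nanochat/tokenizer"
```

To go back to the old tokenizer, remove the new directory and rename the .bak copy back.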

3. Train the model

Single GPU (your RTX 4070, 12GB VRAM):

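cd /mnt/data/nanochat
# nanochat counts --total-batch-size and --eval-tokens in tokens; both should be
# a multiple of device-batch-size x max-seq-len (4 x 512 = 2048 tokens here)
.venv/bin/python -m scripts.base_train \
  --depth=4 \
  --max-seq-len=512 \
  --device-batch-size=4 \
  --total-batch-size=2048 \
  --num-iterations=500 \
  --eval-every=100 \
  --eval-tokens=2048 \
  --core-metric-every=-1 \
  --sample-every=200 \
  --window-pattern=L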
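
The batch flags interact, so here is a quick way to check a combination before launching. This is a sketch under one assumption: that --total-batch-size is measured in tokens and has to be a whole multiple of device-batch-size times max-seq-len on a single GPU. The two function names are made up for illustration.

```shell
# Tokens processed in one forward/backward pass on one GPU.
micro_batch_tokens() {
  echo $(( $1 * $2 ))   # device-batch-size * max-seq-len
}

# Gradient-accumulation steps for a total batch given in tokens.
# Prints the step count, or fails if the total is not a whole multiple.
grad_accum_steps() {
  total="$1"
  micro="$(micro_batch_tokens "$2" "$3")"
  if [ $(( total % micro )) -ne 0 ]; then
    echo "total batch $total is not a multiple of $micro tokens" >&2
    return 1
  fi
  echo $(( total / micro ))
}

micro_batch_tokens 4 512       # prints 2048
grad_accum_steps 4096 4 512    # prints 2
```

With device-batch-size 4 and max-seq-len 512, a total of 2048 means one step with no accumulation and 4096 means two, while a total smaller than one micro-batch, such as 512, fails the check. If you run out of VRAM, lowering device-batch-size and keeping the total fixed just raises the number of accumulation steps.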

Key flags to adjust based on your VRAM:

Important notes:

