Technical Analysis of Nanochat Training Constraints | Generated by AI

2026.07

Question: Translate the provided technical analysis of nanochat training status and dataset constraints into English.

Answer:

Analysis: Current Status of /mnt/data/zz + nanochat Training

What you have done previously


Available datasets

| Dataset | Format | Size | Suitable for nanochat? |
|---|---|---|---|
| github-code (38 shards, 1126 total files) | Parquet, content column | 11GB on disk (320GB on HF) | ❌ Needs text column |
| github-code tokenized (140 .npy files) | nanoGPT uint16, GPT-2 BPE | 27GB (~14B tokens) | ❌ Incompatible format |
| sec-edgar (17 × 10-K files) | Parquet with text column | 10GB | ✅ Ready to use |
| spgispeech | Whisper speech dataset | 48GB | ❌ Not relevant |
| fineweb-edu raw | - | deleted | - |

Bottlenecks

  1. Disk space constraint

    • Only 84GB available
    • Not enough room to download a large FineWeb-Edu subset in full (~20–40GB or more needed)
  2. Format mismatch

    • nanochat requires a text column in Parquet
    • github-code uses content instead of text
  3. Data directory not configured

    • NANOCHAT_DATA_DIR is currently empty and must be set
  4. Pre-tokenized .npy files unusable

    • They store uint16 GPT-2 BPE token IDs in nanoGPT's format
    • nanochat uses its own tokenizer and expects raw text
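
The disk constraint in item 1 can be checked with simple arithmetic before anything is downloaded or converted. The following is a minimal sketch, not part of nanochat: the helper names and the 10GB safety reserve are assumptions, and the sizes passed in should come from the dataset table above.

```python
import shutil

GIB = 1024 ** 3


def free_gb(path):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / GIB


def fits_budget(planned_gb, available_gb, reserve_gb=10.0):
    """Return True if the planned datasets plus a safety reserve fit.

    `planned_gb` maps a dataset label to its expected size in GB.
    `reserve_gb` is an assumed margin for tokenizer output and checkpoints.
    """
    return sum(planned_gb.values()) + reserve_gb <= available_gb
```

For example, `fits_budget({"github-code": 11, "sec-edgar": 10}, available_gb=free_gb("/mnt/data/zz"))` checks the two local Parquet datasets against the space actually free on the data volume.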

Given the 84GB storage limit, the most practical strategy is:

1. Convert GitHub code dataset
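
Since the only stated incompatibility is the column name (content instead of text), the conversion can be a plain column rename. The sketch below assumes pyarrow is installed and that the shards sit directly in one directory; the function names are hypothetical and nothing here comes from nanochat itself.

```python
from pathlib import Path


def renamed_columns(names, old="content", new="text"):
    """Return the column names with `old` replaced by `new`."""
    if old not in names:
        raise KeyError(f"column {old!r} not found in {names}")
    return [new if n == old else n for n in names]


def convert_shard(src, dst, old="content", new="text"):
    """Rewrite one Parquet shard so that its `old` column is named `new`."""
    import pyarrow.parquet as pq  # third-party dependency, imported lazily

    table = pq.read_table(str(src))
    table = table.rename_columns(renamed_columns(table.column_names, old, new))
    pq.write_table(table, str(dst))


def convert_dir(src_dir, dst_dir):
    """Convert every *.parquet file in src_dir and write the result to dst_dir."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.parquet")):
        convert_shard(src, dst_dir / src.name)
```

This writes a second copy of each shard, so the converted data temporarily needs about as much disk space again as the originals. With limited space, converting and then deleting one shard at a time keeps the overhead small.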


2. Add SEC-EDGAR dataset
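
Because the SEC-EDGAR files already have a text column, they only need to appear in the data directory. One way to do that without spending another 10GB is to symlink them, sketched below. The function name and the filename prefix are assumptions, and whether nanochat follows symlinks or expects a particular shard naming scheme should be checked against your checkout.

```python
import os
from pathlib import Path


def link_shards(src_dir, data_dir, prefix):
    """Symlink every *.parquet file from src_dir into data_dir.

    Each link is named <prefix>_<original name> to avoid collisions with
    shards from other datasets. Returns the link paths in sorted order.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for src in sorted(Path(src_dir).glob("*.parquet")):
        dst = data_dir / f"{prefix}_{src.name}"
        # Skip links that already exist so the function can be re-run safely.
        if not (dst.is_symlink() or dst.exists()):
            os.symlink(src.resolve(), dst)
        links.append(dst)
    return links
```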


3. Add a small FineWeb-Edu subset
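
How small the subset should be follows from the space left after the other datasets. The sketch below turns a disk budget into a shard count and a list of filenames, then downloads them with huggingface_hub. The shard size, the `shard_{:05d}.parquet` naming pattern, and the repository ID are all assumptions that need to be replaced with the values of the FineWeb-Edu mirror you actually use.

```python
def shards_within_budget(budget_gb, shard_mb):
    """Return how many whole shards of about `shard_mb` MB fit in `budget_gb` GB."""
    if shard_mb <= 0:
        raise ValueError("shard_mb must be positive")
    return int(budget_gb * 1000 // shard_mb)


def shard_names(count, pattern="shard_{:05d}.parquet"):
    """Return the first `count` shard filenames; the pattern is an assumption."""
    return [pattern.format(i) for i in range(count)]


def download_subset(repo_id, names, local_dir):
    """Download the named files from a Hugging Face dataset repository."""
    from huggingface_hub import hf_hub_download  # third-party, imported lazily

    for name in names:
        hf_hub_download(
            repo_id=repo_id,
            filename=name,
            repo_type="dataset",
            local_dir=local_dir,
        )
```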

Total dataset size:


4. Configure data directory
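
The analysis above notes that NANOCHAT_DATA_DIR is empty. A minimal sketch of setting it from Python follows; the helper name is hypothetical, and the variable name is taken from this analysis rather than verified against nanochat's source.

```python
import os
from pathlib import Path

ENV_VAR = "NANOCHAT_DATA_DIR"  # name as given in the analysis, not verified


def configure_data_dir(path, environ=None):
    """Create `path` if needed and record it under ENV_VAR.

    `environ` defaults to os.environ; passing a plain dict is useful when
    building the environment for a subprocess.
    """
    environ = os.environ if environ is None else environ
    data_dir = Path(path).expanduser().resolve()
    data_dir.mkdir(parents=True, exist_ok=True)
    environ[ENV_VAR] = str(data_dir)
    return data_dir
```

Setting `os.environ` only affects the current Python process and the processes it starts. To make the setting available to commands launched from a shell, export the variable in that shell instead.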


5. Train tokenizer on mixed data
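
Training on mixed data means the tokenizer should see code, filings, and web text in controlled proportions rather than one dataset after another. The generator below is a sketch of that idea in plain Python: it draws documents from several iterators according to weights and stops at a character cap. The function name, the weights, and the cap are assumptions, and how the resulting stream is passed to nanochat's tokenizer training is not shown.

```python
import random


def mix_texts(sources, weights, max_chars, seed=0):
    """Yield documents from several sources in proportion to `weights`.

    `sources` maps a name to an iterable of strings, and `weights` maps the
    same names to positive numbers. Iteration stops once `max_chars`
    characters have been yielded or every source is exhausted.
    """
    rng = random.Random(seed)
    iters = {name: iter(docs) for name, docs in sources.items()}
    total = 0
    while iters and total < max_chars:
        names = list(iters)
        name = rng.choices(names, weights=[weights[n] for n in names])[0]
        try:
            doc = next(iters[name])
        except StopIteration:
            del iters[name]  # this source is used up; keep sampling the rest
            continue
        total += len(doc)
        yield doc
```

A call such as `mix_texts({"code": code_docs, "sec": sec_docs}, {"code": 1, "sec": 1}, max_chars=2_000_000_000)` would sample both sources evenly until the cap is reached.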


6. Train nanochat d12 model
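
Before launching training, it helps to estimate how many tokens a d12 model would consume, so the dataset mix can be sized against it. The sketch below uses a common rough approximation for transformer parameter counts and a tokens-per-parameter heuristic. The width of 768, the vocabulary of 65536, and the ratio of 20 are all assumptions for illustration, not values confirmed by this analysis or read from nanochat.

```python
def estimate_params(depth, width, vocab_size):
    """Roughly estimate a transformer's parameter count.

    Uses 12 * depth * width^2 for the attention and MLP blocks, plus two
    untied vocab_size x width matrices for the input and output embeddings.
    """
    return 12 * depth * width ** 2 + 2 * vocab_size * width


def tokens_needed(params, tokens_per_param=20):
    """Return the training tokens implied by a tokens-per-parameter ratio."""
    return params * tokens_per_param


if __name__ == "__main__":
    # Assumed d12 shape: depth 12, width 768, vocabulary 65536.
    params = estimate_params(depth=12, width=768, vocab_size=65536)
    print(f"params: {params:,}  tokens: {tokens_needed(params):,}")
```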


Alternative simpler option

Skip FineWeb entirely:

Total: ~25GB

Advantages:


Next step

You need to decide between:

Let me know which direction you want to go.

