Dataloader Status and Next Steps

Generated by AI

2026.06

All clear. Here’s the full picture:

DATA FORMAT CHECK
  Schema: text, id, dump, url, file_path, language, language_score, token_count, score, int_score
  142.6B tokens on disk (175 train shards + 1 val shard)
  Avg 1,037 tokens/doc, 1,000 docs/row_group, 786 row_groups/shard
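
As a sanity check on the figures above, the on-disk total can be rebuilt from the per-shard averages. This is a rough sketch: the constant and function names are made up for illustration, and it assumes every shard has the same row-group count and average document length, which real shards will only approximate.

```python
# Rough cross-check of the on-disk token total from the reported averages.
# All names here are illustrative; the numbers come from the report above.
TRAIN_SHARDS = 175
ROW_GROUPS_PER_SHARD = 786
DOCS_PER_ROW_GROUP = 1_000
AVG_TOKENS_PER_DOC = 1_037


def estimate_tokens(shards: int) -> int:
    """Estimate total tokens, assuming every shard matches the averages."""
    return shards * ROW_GROUPS_PER_SHARD * DOCS_PER_ROW_GROUP * AVG_TOKENS_PER_DOC


train_tokens = estimate_tokens(TRAIN_SHARDS)
print(f"{train_tokens / 1e9:.1f}B tokens")  # prints "142.6B tokens"
```

The estimate for the 175 train shards alone lands on 142.6B, which suggests the reported total covers the train shards and not the val shard.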

DATALOADER CHECK
  _document_batches() has `while True:` infinite loop (line 47)
  Epochs increment: epoch 1 -> 2 -> 3 -> ... (never stops)
  Supports resume via pq_idx/rg_idx/epoch state dict
  Will cycle through all 175 train shards repeatedly ✓
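
The looping and resume behaviour described above can be pictured with a small stand-in generator. This is not the actual `_document_batches()` code. It is a minimal sketch that borrows only the `pq_idx`/`rg_idx`/`epoch` key names from the state dict; the function name `cycle_row_groups` is hypothetical, and it assumes a resumed run re-reads the saved row group.

```python
from itertools import islice
from typing import Iterator, Optional


def cycle_row_groups(
    num_shards: int,
    row_groups_per_shard: int,
    resume: Optional[dict] = None,
) -> Iterator[dict]:
    """Yield shard/row-group positions forever, bumping the epoch on each wrap.

    Assumption: resuming starts at the saved position itself, so the
    saved row group is read again.
    """
    pq_idx = resume["pq_idx"] if resume else 0
    rg_idx = resume["rg_idx"] if resume else 0
    epoch = resume["epoch"] if resume else 1
    while True:  # never terminates: the caller decides when to stop
        while pq_idx < num_shards:
            while rg_idx < row_groups_per_shard:
                yield {"pq_idx": pq_idx, "rg_idx": rg_idx, "epoch": epoch}
                rg_idx += 1
            pq_idx += 1
            rg_idx = 0
        pq_idx = 0
        epoch += 1


# With 2 shards of 2 row groups each, the fifth position wraps around
# to pq_idx=0, rg_idx=0 with epoch=2.
for position in islice(cycle_row_groups(2, 2), 5):
    print(position)
```

Because the generator never returns, the training loop's step count is the only thing that bounds a run.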

COMPATIBILITY
  Current data: fineweb-edu parquet (data__CC-MAIN-*.parquet)
  Code expects: climbmix parquet (shard_XXXXX.parquet)
  BUT: list_parquet_files() just globs *.parquet - works with any naming ✓
  Both have 'text' column - dataloader reads that ✓
  Tokenizer was trained on this same data ✓
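
To illustrate why the naming mismatch is harmless, here is a sketch of a glob-based listing. It is a stand-in and not the real `list_parquet_files()`: the function name `list_parquet_sketch` and the example file names are hypothetical, and the files are empty placeholders because only the name matching is being shown.

```python
import tempfile
from pathlib import Path


def list_parquet_sketch(data_dir: str) -> list[str]:
    """Return every *.parquet file name in data_dir, sorted by name."""
    return sorted(p.name for p in Path(data_dir).glob("*.parquet"))


with tempfile.TemporaryDirectory() as tmp:
    # Made-up names in both styles, plus one file that is not parquet.
    for name in ("data__CC-MAIN-example.parquet", "shard_00000.parquet", "notes.txt"):
        (Path(tmp) / name).touch()
    found = list_parquet_sketch(tmp)

print(found)  # both parquet names are listed; notes.txt is skipped
```

A pattern that matches on the extension alone picks up either naming scheme, so the remaining requirement is the 'text' column, which both datasets have.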

Bottom line: no data download is needed. You have 142.6B tokens on disk and have used only 0.655B (about 0.46%). The dataloader loops indefinitely, so just increase the iteration count:

Current run:   10,000 steps = 0.655B tokens  (~11.5% of Chinchilla, 0.46% of data on disk)
Chinchilla:    87,000 steps = 5.7B tokens    (~28.5 hours on RTX 4070)
10B tokens:   153,000 steps                  (~50 hours)
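
The step counts and run times in the table follow from two ratios. The sketch below assumes 65,536 tokens per step, which is inferred from 0.655B tokens over 10,000 steps and is not read from the run config. It also assumes the reported 28.5 hours per 87,000 steps holds at any run length. All names are illustrative.

```python
import math

TOKENS_PER_STEP = 65_536  # assumption: inferred from 0.655B tokens / 10,000 steps
HOURS_PER_STEP = 28.5 / 87_000  # assumption: throughput stays constant

TOKENS_ON_DISK = 142_600_000_000
CHINCHILLA_TOKENS = 5_700_000_000


def steps_for(tokens: int) -> int:
    """Number of optimizer steps needed to consume `tokens` tokens."""
    return math.ceil(tokens / TOKENS_PER_STEP)


def hours_for(tokens: int) -> float:
    """Estimated wall-clock hours at the assumed per-step rate."""
    return steps_for(tokens) * HOURS_PER_STEP


used = 10_000 * TOKENS_PER_STEP
print(f"used so far: {used / 1e9:.3f}B tokens")
print(f"share of Chinchilla budget: {used / CHINCHILLA_TOKENS:.1%}")
print(f"share of data on disk: {used / TOKENS_ON_DISK:.2%}")
for label, target in (("Chinchilla", CHINCHILLA_TOKENS), ("10B tokens", 10_000_000_000)):
    print(f"{label}: {steps_for(target):,} steps, about {hours_for(target):.1f} h")
```

The two share figures differ by more than an order of magnitude, so it matters which denominator a percentage refers to: the current run is about 11.5% of the Chinchilla budget but under half a percent of the data on disk.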

Want me to create an updated run script with more iterations? What’s your target: Chinchilla-optimal (87k steps / ~28.5h) or something more aggressive?
