Dataloader Status and Next Steps | Generated by AI
All clear. Here’s the full picture:
DATA FORMAT CHECK
Schema: text, id, dump, url, file_path, language, language_score, token_count, score, int_score
~142.6B tokens on disk in the 175 train shards (plus 1 val shard)
Avg 1,037 tokens/doc, 1000 docs/row_group, 786 row_groups/shard
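As a sanity check, the token total can be roughly reproduced from the averages above. This is a back-of-the-envelope sketch: it assumes every train shard has exactly 786 row groups of 1,000 docs each, and the constant names are illustrative only.

```python
# Rough on-disk token estimate from the reported averages.
# Assumption: all train shards have the same shape (786 row groups x 1,000 docs).
TRAIN_SHARDS = 175
ROW_GROUPS_PER_SHARD = 786
DOCS_PER_ROW_GROUP = 1_000
AVG_TOKENS_PER_DOC = 1_037

docs_per_shard = ROW_GROUPS_PER_SHARD * DOCS_PER_ROW_GROUP
tokens_per_shard = docs_per_shard * AVG_TOKENS_PER_DOC
train_tokens = TRAIN_SHARDS * tokens_per_shard

print(f"{tokens_per_shard / 1e6:.0f}M tokens per shard")
print(f"{train_tokens / 1e9:.1f}B tokens across train shards")
```

Since the per-doc average is rounded, treat the result as an estimate rather than an exact count.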
DATALOADER CHECK
_document_batches() has a `while True:` infinite loop (line 47)
Epochs increment: epoch 1 -> 2 -> 3 -> ... (never stops)
Supports resume via pq_idx/rg_idx/epoch state dict
Will cycle through all 175 train shards repeatedly ✓
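The looping and resume behaviour described above can be sketched roughly as follows. This is not the actual _document_batches() code: plain nested lists stand in for the parquet shards and their row groups, and the exact shape of the state dict is an assumption based on the pq_idx/rg_idx/epoch keys mentioned above.

```python
from itertools import islice

def document_batches(shards, state=None):
    """Yield (row_group, state) forever, cycling over all shards.

    `shards` is a stand-in for the parquet files: a list of shards, each a
    list of row groups. `state` is a hypothetical resume dict.
    """
    if not any(len(shard) for shard in shards):
        raise ValueError("no row groups to read")
    pq_idx, rg_idx, epoch = 0, 0, 1
    if state is not None:
        pq_idx, rg_idx, epoch = state["pq_idx"], state["rg_idx"], state["epoch"]
    while True:  # never stops; the training loop decides when to quit
        while pq_idx < len(shards):
            while rg_idx < len(shards[pq_idx]):
                # The state records the position of the row group being yielded,
                # so resuming from it re-reads that one row group.
                yield shards[pq_idx][rg_idx], {
                    "pq_idx": pq_idx, "rg_idx": rg_idx, "epoch": epoch,
                }
                rg_idx += 1
            pq_idx += 1
            rg_idx = 0
        pq_idx = 0
        epoch += 1  # finished a full pass over all shards

# Toy data: two "shards" with 2 and 1 "row groups".
toy = [["a0", "a1"], ["b0"]]
for row_group, st in islice(document_batches(toy), 5):
    print(row_group, st)
```

In this sketch a resumed run repeats at most one row group. The real loader may instead store the position of the next row group to read.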
COMPATIBILITY
Current data: fineweb-edu parquet (data__CC-MAIN-*.parquet)
Code expects: climbmix parquet (shard_XXXXX.parquet)
BUT: list_parquet_files() just globs *.parquet - works with any naming ✓
Both have 'text' column - dataloader reads that ✓
Tokenizer was trained on this same data ✓
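To illustrate why the file naming does not matter, here is a minimal sketch of what a glob-based list_parquet_files() might look like. The real function's signature and sorting behaviour are assumptions, and the file names in the demo are made-up placeholders.

```python
import glob
import os
import tempfile

def list_parquet_files(data_dir: str) -> list[str]:
    """Return every *.parquet file in data_dir, sorted for a stable order.

    Hypothetical re-implementation; the real helper may differ.
    """
    return sorted(glob.glob(os.path.join(data_dir, "*.parquet")))

# Demo with placeholder names in both naming styles, plus one non-parquet file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("data__CC-MAIN-demo.parquet", "shard_00000.parquet", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in list_parquet_files(tmp)]

print(found)
```

Both naming styles are picked up and the non-parquet file is skipped, because the pattern only constrains the extension.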
Bottom line: no data download needed. You have 142.6B tokens and only used 0.655B. The dataloader will loop forever. Just increase iterations:
Current run: 10,000 steps = 0.655B tokens (0.46% of the data on disk, ~11.5% of Chinchilla)
Chinchilla: 87,000 steps = 5.7B tokens (~28.5 hours on RTX 4070)
10B tokens: 153,000 steps (~50 hours)
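The step counts above follow from simple arithmetic, sketched below. Two values are inferred rather than measured: 65,536 tokens per step (from 0.655B tokens over 10,000 steps) and about 1.18 s per step (from 87,000 steps in ~28.5 hours). The plan() helper is illustrative only.

```python
import math

# Assumption: inferred from 0.655B tokens / 10,000 steps.
TOKENS_PER_STEP = 65_536
# Assumption: inferred from the ~28.5 h estimate for 87,000 steps.
SECONDS_PER_STEP = 28.5 * 3600 / 87_000

def plan(target_tokens: float) -> tuple[int, float]:
    """Return (steps, hours) needed to train on target_tokens."""
    steps = math.ceil(target_tokens / TOKENS_PER_STEP)
    hours = steps * SECONDS_PER_STEP / 3600
    return steps, hours

used_tokens = 10_000 * TOKENS_PER_STEP
print(f"used so far: {used_tokens / 1e9:.3f}B tokens")
print(f"share of 142.6B on disk: {used_tokens / 142.6e9:.2%}")
print(f"share of 5.7B Chinchilla budget: {used_tokens / 5.7e9:.1%}")

for label, target in (("Chinchilla", 5.7e9), ("10B", 10e9)):
    steps, hours = plan(target)
    print(f"{label}: {steps:,} steps, ~{hours:.1f} h")
```

The hour figures scale linearly with step count, so they only hold if throughput stays where it was in the current run.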
Want me to create an updated run script with more iterations? What’s your target: Chinchilla-optimal (87k steps / ~28.5h) or something more aggressive?