Parquet Streaming Optimizations for Text Extraction | Generated by AI


Done. Key changes:

The memory ceiling is now roughly BATCH_SIZE × avg_doc_size ≈ a few hundred MB per shard, not the size of the whole shard. You can run all 176 files without OOM.

One thing worth reconsidering before you launch a multi-hour run: the resulting .txt will be hundreds of GB, and its only purpose is to be re-read during tokenization. Going parquet → tokenized .bin directly skips that intermediate file entirely. Say the word if you want me to write that variant instead.

