Memory Overload Crash Fixed | Generated by AI
Already finished. Here’s the full picture:
COMPREHENSIVE SUMMARY
PROBLEM
The tokenize script hung/crashed the workstation (192.168.1.36), making it so overloaded that sshd couldn’t even complete the SSH banner exchange — TCP port 22 accepted connections but never responded. Required a physical reboot.
ROOT CAUSE
The original script used multiprocessing.Pool(16) with pool.imap(). Each of the 16 worker processes:
- Loaded an ENTIRE parquet file into memory via pq.read_table() (~280 MB on disk, larger decompressed)
- Iterated rows and accumulated tokens into a Python list (all_tokens.extend())
- One parquet file produces ~369M tokens. A Python list of 369M int objects = ~10 GB RAM (Python int overhead ~28 bytes each)
- 16 workers running in parallel = 160+ GB RAM needed on a 62 GB machine
Result: kernel OOM killer or catastrophic swap thrashing. The machine froze hard enough that sshd was starved of CPU/memory to send its banner.
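The memory estimate can be reproduced with a few lines of arithmetic. This is a rough sketch using the figures quoted in this summary. The 8-byte list pointer is an added assumption (64-bit CPython), and real usage may be somewhat lower because CPython shares cached objects for very small integers.

```python
# Back-of-the-envelope check of the root-cause numbers (figures from the summary).
TOKENS_PER_FILE = 369_000_000
INT_OBJECT_BYTES = 28    # per-int overhead quoted in the summary
LIST_POINTER_BYTES = 8   # assumption: one 8-byte pointer per list slot (64-bit CPython)
UINT16_BYTES = 2         # size of one token in a numpy uint16 array
WORKERS = 16
MACHINE_RAM_GB = 62

int_objects_gb = TOKENS_PER_FILE * INT_OBJECT_BYTES / 1e9
list_pointers_gb = TOKENS_PER_FILE * LIST_POINTER_BYTES / 1e9
all_workers_gb = WORKERS * int_objects_gb
uint16_gb = TOKENS_PER_FILE * UINT16_BYTES / 1e9

print(f"int objects, one worker:  {int_objects_gb:.1f} GB")
print(f"list pointers, one worker: {list_pointers_gb:.1f} GB (on top of the ints)")
print(f"int objects, {WORKERS} workers:  {all_workers_gb:.0f} GB vs {MACHINE_RAM_GB} GB of RAM")
print(f"same tokens as uint16:    {uint16_gb:.2f} GB per file")
```

The last line shows why a uint16 array helps: the same tokens take well under 1 GB per file instead of roughly 10 GB.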
Additionally, when workers get OOM-killed, pool.imap() hangs silently in the main process — no error, no output, just “Tokenizing with 16 workers…” followed by eternal silence.
THE FIX
Rewrote the script with three key changes:
- NO MULTIPROCESSING: Single-process. Each parquet file is already huge (~370M tokens), no benefit from parallelism when the bottleneck is memory, not CPU.
- STREAMING PARQUET READS: Replaced pq.read_table() (loads entire file) with pq.ParquetFile.iter_batches(batch_size=8192), which processes 8192 rows at a time and never holds a full file in memory.
- NUMPY BUFFER ACCUMULATION: Replaced Python list (all_tokens.extend()) with a pre-allocated numpy uint16 array of 100M tokens (200 MB). Tokens are written directly into this buffer with index tracking. When full, flush to disk as .npy shard. No Python list intermediate at all.
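A minimal sketch of how the streaming read and the buffer accumulation might fit together. This is not the actual rewritten script: the names ShardWriter and tokenize_file, the encode callable, the "text" column, and the shard file naming are all hypothetical. It also assumes every token id fits in uint16 (below 65536).

```python
import os

import numpy as np

SHARD_TOKENS = 100_000_000  # 100M uint16 tokens = 200 MB, as described above


class ShardWriter:
    """Accumulate token ids in a fixed uint16 buffer and flush full shards to disk."""

    def __init__(self, out_dir, shard_tokens=SHARD_TOKENS, prefix="shard"):
        self.out_dir = out_dir
        self.prefix = prefix          # hypothetical file naming scheme
        self.buf = np.empty(shard_tokens, dtype=np.uint16)
        self.pos = 0                  # next free index in the buffer
        self.shard_idx = 0
        os.makedirs(out_dir, exist_ok=True)

    def add(self, tokens):
        """Copy tokens into the buffer, flushing each time it fills up."""
        arr = np.asarray(tokens, dtype=np.uint16)  # assumes ids < 65536
        start = 0
        while start < len(arr):
            n = min(len(arr) - start, len(self.buf) - self.pos)
            self.buf[self.pos:self.pos + n] = arr[start:start + n]
            self.pos += n
            start += n
            if self.pos == len(self.buf):
                self.flush()

    def flush(self):
        """Write the filled part of the buffer as one .npy shard."""
        if self.pos == 0:
            return
        name = f"{self.prefix}_{self.shard_idx:06d}.npy"
        np.save(os.path.join(self.out_dir, name), self.buf[:self.pos])
        self.shard_idx += 1
        self.pos = 0


def tokenize_file(parquet_path, encode, writer, column="text", batch_size=8192):
    """Stream one parquet file in row batches and feed the tokens to the writer."""
    import pyarrow.parquet as pq  # local import: ShardWriter itself only needs numpy

    pf = pq.ParquetFile(parquet_path)
    for batch in pf.iter_batches(batch_size=batch_size, columns=[column]):
        for text in batch.column(0).to_pylist():
            if text is not None:      # skip null rows
                writer.add(encode(text))
```

In this sketch the caller would run tokenize_file once per parquet file with a single shared writer, then call writer.flush() at the end to write the final partial shard. Memory stays bounded by the 200 MB buffer plus one batch of rows.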
RESULT
- Input: 38 parquet files, 10.2 GB
- Output: 141 shards (1 val + 140 train), 27 GB on disk
- Tokens: ~14.07 billion
- Time: 41.1 minutes (~5.7M tok/s)
- Peak RAM: ~600 MB (vs 160+ GB with old script)
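As a quick consistency check, the reported figures can be cross-checked against each other. This is a sketch that assumes 2 bytes per token (uint16) and 100M tokens per shard, both taken from the description of the fix. The token count is approximate, so the results are only expected to land near the reported values.

```python
# Cross-check of the reported results (input figures from the summary).
TOTAL_TOKENS = 14_070_000_000   # ~14.07 billion
MINUTES = 41.1
SHARD_TOKENS = 100_000_000      # buffer size described in the fix
BYTES_PER_TOKEN = 2             # uint16

tok_per_s = TOTAL_TOKENS / (MINUTES * 60)
size_gb = TOTAL_TOKENS * BYTES_PER_TOKEN / 1e9     # decimal gigabytes
size_gib = TOTAL_TOKENS * BYTES_PER_TOKEN / 2**30  # binary gibibytes
shards = -(-TOTAL_TOKENS // SHARD_TOKENS)          # ceiling division

print(f"throughput: {tok_per_s / 1e6:.1f}M tok/s")
print(f"raw token bytes: {size_gb:.1f} GB ({size_gib:.1f} GiB)")
print(f"shards at {SHARD_TOKENS:,} tokens each: {shards}")
```

The throughput and shard count line up with the reported ~5.7M tok/s and 141 shards, and the raw token size brackets the reported 27 GB on disk.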
Next step: training. The output is in /mnt/data/zz/datasets/github-code-tok/ in nanoGPT binary format, ready to use.