SPGISpeech Dataset Loaded and Explored

Generated by AI

2026.07

Done. The SPGISpeech dataset is loaded and explored. Here’s what we found:

Structure

Directory	Files	Rows (approx., by split)	Size
`S/`	12 parquet	~77k train + 39k val + 39k test	45 GB
`M/`	34 parquet	~138k train + …	134 GB
`L/`	146 parquet	~633k train + …	580 GB
`dev/`	3 parquet	validation split only	11 GB
`test/`	3 parquet	test split only	11 GB

Schema (ParquetFile.schema_arrow)

wav_filename: string       # e.g. "ed71da8f216a4f688c495b1be3050c5f/62.wav"
audio: struct<bytes: binary, path: string>  # raw WAV bytes in 'bytes', filename in 'path'
wav_filesize: int32
transcript: string         # formatted text with caps/punctuation/numbers

Audio sample: mono, 16-bit, 16 kHz WAV. The inspected clip is 12.7 s long and decodes directly with soundfile.
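
The format of a clip can also be checked from the raw bytes with Python's standard-library `wave` module, which handles plain PCM WAV like the sample described here. This is a minimal sketch that swaps in `wave` for soundfile to avoid extra dependencies; `wav_info` is a hypothetical helper name.

```python
import io
import wave


def wav_info(wav_bytes):
    """Return (channels, bits_per_sample, sample_rate_hz, duration_s) for PCM WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        duration = w.getnframes() / w.getframerate()
        return w.getnchannels(), 8 * w.getsampwidth(), w.getframerate(), duration
```

Passing the `bytes` field of one `audio` struct to `wav_info` returns the channel count, bit depth, sample rate, and duration for that clip.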

The issue with the datasets library (datasets 5.x): it needs torchcodec for Audio decoding, which in turn needs a CUDA build of torch in this setup. Reading the Parquet files directly with pyarrow (pq.ParquetFile.read_row_group(…) + soundfile.read(BytesIO)) works fine and avoids all that.

Your setup: /mnt/data/zz/spgispeech/explore.py is ready to run:

cd /mnt/data/zz/spgispeech
source .venv/bin/activate
python3 explore.py S    # or M, L, dev, test
