SPGISpeech Dataset Overview | Generated by AI

2026.07

Question: Give an introduction to the SPGISpeech dataset on Hugging Face (kensho/spgispeech).

Answer:

What it is

SPGISpeech is a corpus of 5,000 hours of professionally transcribed financial audio, drawn from company earnings calls. It was released by Kensho Technologies (an S&P Global subsidiary) and described in the SPGISpeech paper (O’Neill et al., 2021). It is a widely used “formal business speech” benchmark in the ASR world: it is one of the 8 datasets in the Hugging Face ESB (End-to-end Speech Benchmark) and appears in hf-audio/open_asr_leaderboard, so many recent Whisper/Conformer/wav2vec2 evaluations report WER (word error rate) on it.

Why it’s interesting for ASR training

Gotchas

  1. Gated dataset. You must accept Kensho’s Terms of Usage on the HF page before load_dataset will work. This is more than a formality: it is a legal agreement restricting redistribution and re-identification attempts. Once accepted, standard HF auth (huggingface-cli login or the HF_TOKEN env var) unlocks the download.
  2. Size. Total is 781 GB, but there are three train configs so you don’t have to pull it all:

     Subset   Size
     S        22 GB
     M        107 GB
     L        530 GB (superset of M, which is a superset of S)
     dev      11 GB
     test     11 GB
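
As a rough sketch of how the size table can drive the choice of config, the snippet below picks the largest train config that fits on the local disk. The helper names (pick_train_config, free_disk_gb) and the 2x headroom factor are assumptions made for illustration, not part of the dataset or the datasets library; the headroom is a guess to leave room for cached or extracted copies.

```python
import shutil

# Approximate download sizes in GB, copied from the table above.
CONFIG_SIZES_GB = {"S": 22, "M": 107, "L": 530}


def pick_train_config(free_gb, headroom=2.0):
    """Return the largest train config that fits, or None if none does.

    `headroom` is an assumed safety factor (not an official figure) that
    reserves extra space beyond the raw download size.
    """
    fitting = [
        name for name, size in CONFIG_SIZES_GB.items()
        if size * headroom <= free_gb
    ]
    return max(fitting, key=CONFIG_SIZES_GB.get) if fitting else None


def free_disk_gb(path="."):
    """Free space, in GB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    print("Suggested config:", pick_train_config(free_disk_gb()))
```

With the default headroom, 300 GB free selects M (107 GB x 2 fits, 530 GB x 2 does not), and anything under 44 GB returns None.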

Minimal usage

from datasets import load_dataset

# S config for quick iteration/fine-tuning experiments
spgi = load_dataset("kensho/spgispeech", "S")

sample = spgi["train"][0]
audio_array = sample["audio"]["array"]     # np.float32, 16kHz mono
sampling_rate = sample["audio"]["sampling_rate"]
transcript = sample["transcript"]

# dev/test only, without pulling any train split
spgi_test = load_dataset("kensho/spgispeech", "test")
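
A common next step is to check clip lengths before batching. The sketch below assumes each row has the same shape as the sample above (an "audio" dict with "array" and "sampling_rate"); clip_duration_seconds is a hypothetical helper, and fake_sample is a synthetic stand-in so the snippet runs without downloading anything.

```python
def clip_duration_seconds(sample):
    """Duration of one row's audio: number of samples / sampling rate."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for one dataset row: 32,000 zero samples at 16 kHz.
fake_sample = {
    "audio": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "transcript": "Placeholder transcript.",
}

print(clip_duration_seconds(fake_sample))  # 32000 / 16000 = 2.0
```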

Given your training work on GPT-2/MoE from scratch, this dataset is a decent choice if you ever want to fine-tune or evaluate a Whisper- or Conformer-style model on formatted-text ASR. Because the transcripts keep punctuation and capitalization, the task is harder than producing normalized (lowercased, unpunctuated) text, and closer to what a production dictation or meeting-transcription product needs.
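
To make the formatted-text point concrete, here is a minimal sketch of word error rate (word-level edit distance divided by reference length) scored with and without normalization. The sentence pair is invented for illustration and is not taken from the dataset, and normalize is a deliberately crude stand-in for the text normalizers that real benchmarks use.

```python
import string


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def normalize(text):
    """Crude normalizer (illustrative only): lowercase, strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


# Invented example pair: the words match, the formatting does not.
reference = "Revenue grew 12% year over year, driven by cloud."
hypothesis = "revenue grew 12% year over year driven by cloud"

# Formatted scoring: "Revenue", "year," and "cloud." count as errors (3 of 9).
print(round(word_error_rate(reference, hypothesis), 3))
# Normalized scoring: the same hypothesis is a perfect match.
print(word_error_rate(normalize(reference), normalize(hypothesis)))
```

A model that gets every word right but ignores casing and punctuation still loses a third of the words on this sentence under formatted scoring, which is the gap the paragraph above describes.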
