Comparing Google Cloud Speech-to-Text Models
Here’s a clear breakdown of the differences between the Speech-to-Text models in Google Cloud’s API V2, organized by model family and use case:
Chirp Family (Universal Speech Models - USM)
chirp
- What it is: Google’s original Universal Speech Model (USM), supporting multilingual transcription across diverse languages in non-streaming contexts.(Google Cloud)
- Usage: Works with synchronous (`Recognize`) and batch (`BatchRecognize`) methods; does not support streaming.(Google Cloud)
- Limitations:
  - No streaming support
  - Lacks confidence scores, diarization, adaptation, forced normalization, and word-level confidence(Google Cloud)
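To make the usage concrete, here is a minimal synchronous `Recognize` sketch using the `google-cloud-speech` Python client (V2 API). The project ID, region, and audio path are placeholder assumptions; Chirp is served from specific regional endpoints, so verify availability for your region:

```python
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

def transcribe_chirp(project_id: str, audio_file: str) -> None:
    # Chirp is served from regional endpoints; us-central1 is an assumption here.
    client = SpeechClient(
        client_options=ClientOptions(api_endpoint="us-central1-speech.googleapis.com")
    )
    with open(audio_file, "rb") as f:
        content = f.read()

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],
        model="chirp",  # non-streaming only: Recognize or BatchRecognize
    )
    request = cloud_speech.RecognizeRequest(
        # "_" selects the default recognizer with an inline config.
        recognizer=f"projects/{project_id}/locations/us-central1/recognizers/_",
        config=config,
        content=content,
    )
    response = client.recognize(request=request)
    for result in response.results:
        print(result.alternatives[0].transcript)
```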
chirp_2
- What it is: Next-gen Universal Speech Model, more accurate and efficient than the original, with streaming, synchronous, and batch support. Offers multilingual transcription and translation, as well as model adaptation.(Google Cloud, Medium)
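Since chirp_2 adds streaming, here is a hedged sketch of a `StreamingRecognize` call with the same Python client. The chunk size and region are assumptions (streaming messages have a size cap, and regional availability varies):

```python
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

def streaming_transcribe_chirp2(project_id: str, audio_file: str) -> None:
    client = SpeechClient(
        client_options=ClientOptions(api_endpoint="us-central1-speech.googleapis.com")
    )
    with open(audio_file, "rb") as f:
        audio = f.read()
    # Streaming requests are size-limited, so feed the audio in small chunks.
    chunks = [audio[i : i + 25600] for i in range(0, len(audio), 25600)]

    streaming_config = cloud_speech.StreamingRecognitionConfig(
        config=cloud_speech.RecognitionConfig(
            auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
            language_codes=["en-US"],
            model="chirp_2",
        )
    )

    def requests():
        # The first message carries the config; the rest carry audio.
        yield cloud_speech.StreamingRecognizeRequest(
            recognizer=f"projects/{project_id}/locations/us-central1/recognizers/_",
            streaming_config=streaming_config,
        )
        for chunk in chunks:
            yield cloud_speech.StreamingRecognizeRequest(audio=chunk)

    for response in client.streaming_recognize(requests=requests()):
        for result in response.results:
            print(result.alternatives[0].transcript)
```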
chirp_3
- What it is: The latest generation with further improvements in accuracy and speed. Supports streaming, synchronous, and batch recognition, plus speaker diarization and automatic language detection.(Google Cloud)
- Feature support:
  - Streaming (`StreamingRecognize`), synchronous (`Recognize`), and batch (`BatchRecognize`) are all supported(Google Cloud)
  - Supports diarization and language detection(Google Cloud)
  - Doesn’t support word-level timestamps or adaptation(Google Cloud)
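Since chirp_3 adds speaker diarization and language detection, here is a sketch of how those options are wired into `RecognitionFeatures`. The region, speaker counts, and the `"auto"` language code are assumptions to verify against the docs:

```python
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

def transcribe_chirp3_diarized(project_id: str, audio_file: str) -> None:
    # Region is an assumption; check the docs for where chirp_3 is available.
    client = SpeechClient(
        client_options=ClientOptions(api_endpoint="us-central1-speech.googleapis.com")
    )
    with open(audio_file, "rb") as f:
        content = f.read()

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],  # ["auto"] for language detection (assumption)
        model="chirp_3",
        features=cloud_speech.RecognitionFeatures(
            diarization_config=cloud_speech.SpeakerDiarizationConfig(
                min_speaker_count=1,
                max_speaker_count=4,
            ),
        ),
    )
    response = client.recognize(
        request=cloud_speech.RecognizeRequest(
            recognizer=f"projects/{project_id}/locations/us-central1/recognizers/_",
            config=config,
            content=content,
        )
    )
    for result in response.results:
        alternative = result.alternatives[0]
        print(alternative.transcript)
        # Speaker labels, when returned, are attached per word.
        for word in alternative.words:
            print(word.word, word.speaker_label)
```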
Legacy / General-Purpose Models
These are older architecture models maintained primarily for backward compatibility:
- `long`: Good for long-form content like media or spontaneous conversations (see the batch sketch after this list).(Google Cloud)
- `short`: Optimized for very short utterances (a few seconds), ideal for commands.(Google Cloud)
- `telephony` / `telephony_short`: Tailored for phone-call audio (typically 8 kHz). The “short” variant handles short or single-word utterances.(Google Cloud)
- Medical models:
  - `medical_conversation`: For dialogs between provider and patient, with speaker separation.
  - `medical_dictation`: For single-speaker dictated notes.(Google Cloud)
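For the legacy `long` model, batch recognition over a Cloud Storage file is the typical pattern. Here is a minimal sketch; the bucket URI and timeout are placeholder assumptions:

```python
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

def batch_transcribe_long(project_id: str, gcs_uri: str) -> None:
    client = SpeechClient()
    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],
        model="long",  # legacy long-form model
    )
    request = cloud_speech.BatchRecognizeRequest(
        recognizer=f"projects/{project_id}/locations/global/recognizers/_",
        config=config,
        files=[cloud_speech.BatchRecognizeFileMetadata(uri=gcs_uri)],
        # Return transcripts inline rather than writing them back to GCS.
        recognition_output_config=cloud_speech.RecognitionOutputConfig(
            inline_response_config=cloud_speech.InlineOutputConfig(),
        ),
    )
    # BatchRecognize is a long-running operation; block until it completes.
    operation = client.batch_recognize(request=request)
    response = operation.result(timeout=300)
    for file_result in response.results.values():
        for result in file_result.transcript.results:
            if result.alternatives:
                print(result.alternatives[0].transcript)
```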
Summary of Model Differences by Use Case
- Need a non-streaming multilingual baseline? → `chirp`
- Need streaming support + multilingual + adaptation? → `chirp_2`
- Want streaming, diarization, auto-language detection, and the highest accuracy? → `chirp_3`
- Just simple long-content transcription? → `long`
- Processing short spoken commands? → `short`
- Phone call recordings? → `telephony` or `telephony_short`
- Medical use cases? → `medical_conversation` or `medical_dictation`
Context from the Google Documentation
Google’s “Select a transcription model” page (updated September 4, 2025) includes all of the above model identifiers and descriptions. It also notes that the legacy models are not actively maintained but retained for backward compatibility.(Google Cloud)
An earlier note about `latest_long` / `latest_short` refers to a separate set of conformer-based models; these are outside the Chirp lineage and not relevant if you’re focused specifically on the Chirp or legacy models.(Google Cloud)
TL;DR: How to Pick
- Highest quality + new features (diarization, streaming, etc.) → `chirp_3`
- Streaming & model adaptation support → `chirp_2`
- Non-streaming multilingual baseline → `chirp`
- Short utterances, long audio, phone, or medical → the appropriate legacy model (`short`, `long`, `telephony`, `medical_*`)
Let me know if you’d like help choosing the right model based on your specific audio type, language, or streaming-vs-batch processing needs!