Adding Speaker Diarization to Whisper | Generated by AI

2026.06

Question: How to add speaker diarization (speaker labels) to my OpenAI Whisper transcription script?

Answer:

Your current script calls the whisper CLI directly, so it produces no speaker labels: Whisper itself has no diarization built in. The standard approach is to run a separate diarization model on the same audio and merge its speaker turns with Whisper’s timestamped segments. Two practical options:


Option 1: pyannote.audio + Whisper (most accurate, open-source)

This is the most widely used open-source route. pyannote runs speaker diarization on the audio independently of Whisper; you then match its speaker turns to Whisper’s segments by timestamp.

Architecture

audio → Whisper (text segments + timestamps)
audio → pyannote (speaker turns + timestamps)
both outputs → merge by timestamp overlap
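To make the merge step concrete, here is a minimal sketch of the max-overlap rule on plain tuples. The turns and segments are made-up illustration data, and the helper names `overlap` and `dominant_speaker` are hypothetical; the full script below applies the same idea to real pyannote and Whisper output.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def dominant_speaker(segment, turns):
    """Return the speaker whose turns overlap the segment the most."""
    totals = {}
    for start, end, speaker in turns:
        shared = overlap(segment[0], segment[1], start, end)
        if shared > 0:
            totals[speaker] = totals.get(speaker, 0.0) + shared
    return max(totals, key=totals.get) if totals else "UNKNOWN"


# Hypothetical data: (start, end, speaker) turns and (start, end) segments.
turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
segments = [(0.5, 3.5), (3.0, 8.0), (10.0, 12.0)]

labels = [dominant_speaker(seg, turns) for seg in segments]
print(labels)  # ['SPEAKER_00', 'SPEAKER_01', 'UNKNOWN']
```

The second segment straddles the speaker change at 4.0 s; it shares 1.0 s with SPEAKER_00 and 4.0 s with SPEAKER_01, so the latter wins. The third segment overlaps no turn and falls back to UNKNOWN.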

Install

pip install pyannote.audio openai-whisper
# Requires a HuggingFace token. Accept the model licenses at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0  (used by the 3.1 pipeline)

Drop-in replacement for your script

#!/usr/bin/env python3
"""
whisper_diarize.py — Whisper transcription with pyannote speaker labels
"""

import argparse
import os
import sys
import whisper
from pyannote.audio import Pipeline
import torch

def assign_speakers(diarization, whisper_segments):
    """
    For each Whisper segment, find the dominant speaker from pyannote diarization.
    Uses max overlap strategy.
    """
    results = []
    for seg in whisper_segments:
        seg_start, seg_end = seg["start"], seg["end"]
        speaker_times = {}

        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap_start = max(turn.start, seg_start)
            overlap_end = min(turn.end, seg_end)
            overlap = max(0, overlap_end - overlap_start)
            if overlap > 0:
                speaker_times[speaker] = speaker_times.get(speaker, 0) + overlap

        dominant = max(speaker_times, key=speaker_times.get) if speaker_times else "UNKNOWN"
        results.append({
            "start": seg_start,
            "end": seg_end,
            "speaker": dominant,
            "text": seg["text"].strip(),
        })
    return results


def main():
    parser = argparse.ArgumentParser(description="Whisper + speaker diarization")
    parser.add_argument("input_file")
    parser.add_argument("--model", default="large")
    parser.add_argument("--language", default="Chinese")
    parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
    parser.add_argument("--hf-token", required=True, help="HuggingFace token for pyannote")
    parser.add_argument("--num-speakers", type=int, default=None, help="Hint: number of speakers")
    args = parser.parse_args()

    if not os.path.isfile(args.input_file):
        print(f"Error: {args.input_file} not found", file=sys.stderr)
        sys.exit(1)

    print(f"[1/3] Loading Whisper {args.model} on {args.device}...")
    model = whisper.load_model(args.model, device=args.device)

    print("[2/3] Transcribing...")
    result = model.transcribe(args.input_file, language=args.language, verbose=False)

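    print("[3/3] Running speaker diarization...")
    # use_auth_token is the pyannote.audio 3.x keyword; other releases may name it differently.
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=args.hf_token,
    )
    if pipeline is None:  # may happen when the token is invalid or the license is not accepted
        print("Error: could not load the pyannote pipeline", file=sys.stderr); sys.exit(1)
    pipeline.to(torch.device(args.device))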

    diarize_kwargs = {}
    if args.num_speakers:
        diarize_kwargs["num_speakers"] = args.num_speakers

    diarization = pipeline(args.input_file, **diarize_kwargs)

    segments = assign_speakers(diarization, result["segments"])

    output_path = os.path.splitext(args.input_file)[0] + "_diarized.txt"
    with open(output_path, "w", encoding="utf-8") as f:
        for seg in segments:
            line = f"[{seg['start']:.1f}s → {seg['end']:.1f}s] {seg['speaker']}: {seg['text']}"
            print(line)
            f.write(line + "\n")

    print(f"\nSaved to {output_path}")


if __name__ == "__main__":
    main()

Usage:

python whisper_diarize.py meeting.mp4 \
  --model large \
  --language Chinese \
  --device cuda \
  --hf-token hf_xxx \
  --num-speakers 2
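The usage above passes an .mp4. Whisper decodes it through ffmpeg, but pyannote's audio loader may not read video containers, depending on which audio backend is installed (an assumption worth checking in your environment). If diarization fails on the input, one workaround is to extract a 16 kHz mono WAV first and feed that file to both steps. The sketch below assumes ffmpeg is on your PATH; `build_ffmpeg_cmd` and `extract_wav` are hypothetical helper names, not part of either library.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(input_file, wav_file):
    """Build an ffmpeg command that extracts 16 kHz mono audio."""
    return [
        "ffmpeg", "-y",      # overwrite the output file if it exists
        "-i", str(input_file),
        "-vn",               # drop any video stream
        "-ac", "1",          # downmix to mono
        "-ar", "16000",      # resample to 16 kHz
        str(wav_file),
    ]


def extract_wav(input_file):
    """Write <stem>_16k.wav next to the input and return its path."""
    src = Path(input_file)
    wav_file = src.with_name(src.stem + "_16k.wav")  # never overwrites the input
    subprocess.run(build_ffmpeg_cmd(src, wav_file), check=True)
    return wav_file
```

You would then call `extract_wav("meeting.mp4")` once and pass the returned path to the script instead of the original file.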

Option 2: whisperx (fastest, single package)

whisperx combines faster-whisper (a faster Whisper backend), wav2vec2 alignment, and pyannote diarization behind one CLI. Much simpler if you just want it to work.

pip install whisperx
whisperx meeting.mp4 \
  --model large-v3 \
  --language zh \
  --diarize \
  --hf_token hf_xxx \
  --output_dir ./out

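The output directory gets transcript files (such as .json, .srt, and .txt) with segments tagged by speaker labels such as SPEAKER_00 and SPEAKER_01. Exact file names and formats depend on the whisperx version.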
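If you want to post-process the JSON output, a sketch like the following may help. It assumes the file has a top-level "segments" list whose items carry "start", "end", "text", and an optional "speaker" key; check a real output file from your whisperx version before relying on that layout. `format_diarized` is a hypothetical helper name.

```python
import json


def format_diarized(json_path):
    """Return one '[start] SPEAKER: text' line per segment of a whisperx-style JSON file."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)

    lines = []
    for seg in data.get("segments", []):
        # The speaker key may be missing when no speaker could be assigned.
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(f"[{seg['start']:.1f}s] {speaker}: {seg['text'].strip()}")
    return lines
```

Calling `format_diarized("out/meeting.json")` (a hypothetical path) would return lines you can print or write to your own transcript format.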

To integrate into your existing script pattern:

import subprocess

def run_whisperx(input_file, hf_token, model="large-v3", language="zh",
                 output_dir=".", device="cuda"):
    """Run the whisperx CLI with diarization; raises CalledProcessError on failure."""
    cmd = [
        "whisperx", input_file,
        "--model", model,
        "--language", language,
        "--diarize",
        "--hf_token", hf_token,
        "--output_dir", output_dir,
        "--device", device,  # pass "cpu" on machines without a CUDA GPU
    ]
    subprocess.run(cmd, check=True)

whisperx also does word-level alignment using wav2vec2, which gives tighter timestamps than vanilla Whisper’s segment-level ones. That matters when assigning text near a speaker change, where one Whisper segment can span two speakers.
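To illustrate why word-level timing helps at speaker boundaries, here is a small sketch that labels each word by the speaker turn containing its midpoint and then regroups the words. It is an illustration on made-up data, not whisperx's actual implementation; `speaker_at` and `split_by_speaker` are hypothetical names.

```python
def speaker_at(t, turns):
    """Return the speaker whose turn contains time t, else 'UNKNOWN'."""
    for start, end, speaker in turns:
        if start <= t < end:
            return speaker
    return "UNKNOWN"


def split_by_speaker(words, turns):
    """Group consecutive words that share a speaker, using each word's midpoint."""
    groups = []
    for word, start, end in words:
        speaker = speaker_at((start + end) / 2, turns)
        if groups and groups[-1][0] == speaker:
            groups[-1][1].append(word)
        else:
            groups.append((speaker, [word]))
    return [(speaker, " ".join(ws)) for speaker, ws in groups]


# Hypothetical data: one transcript segment (0.2 s to 3.2 s) spanning a speaker change at 2.0 s.
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 5.0, "SPEAKER_01")]
words = [("Are", 0.2, 0.5), ("you", 0.5, 0.8), ("ready?", 0.8, 1.6),
         ("Yes,", 2.1, 2.5), ("ready.", 2.5, 3.2)]

result = split_by_speaker(words, turns)
print(result)  # [('SPEAKER_00', 'Are you ready?'), ('SPEAKER_01', 'Yes, ready.')]
```

With segment-level timestamps only, the whole 0.2 s to 3.2 s segment would go to SPEAKER_00 under a max-overlap rule (1.8 s of overlap versus 1.2 s), so the reply would be attributed to the wrong speaker.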


Comparison

                           pyannote DIY      whisperx
Control                    Full              Opinionated
Speed                      Slow (2 passes)   Fast (faster-whisper backend)
Word alignment             Manual            Built-in
CLI-friendly               No                Yes
Custom diarization models  Yes               Partially

Key gotcha

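Both options require accepting the pyannote model license on HuggingFace (see the model page linked in the Install step for Option 1).

You need a HF account and an access token. Pass it with --hf-token for the Option 1 script or --hf_token for whisperx; huggingface-cli login can also store the token locally.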
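Passing the token on the command line leaves it in your shell history. A hedged alternative is to read it from an environment variable and fall back to that when no flag is given. The sketch below uses the variable name HF_TOKEN as an assumption, and `resolve_hf_token` is a hypothetical helper, not part of pyannote or whisperx.

```python
import os


def resolve_hf_token(cli_token=None, env_var="HF_TOKEN"):
    """Prefer an explicitly passed token, then fall back to an environment variable."""
    token = cli_token or os.environ.get(env_var)
    if not token:
        raise SystemExit(f"No HuggingFace token: pass one explicitly or set {env_var}")
    return token
```

In the Option 1 script this would mean making --hf-token optional and calling `resolve_hf_token(args.hf_token)` before loading the pipeline.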


My recommendation for your setup: use whisperx on your RTX 4070 workstation. It is one command, the faster-whisper backend is significantly faster than vanilla Whisper for long meetings, and diarization is built in. With large-v3 on CUDA, a 1-hour meeting should take on the order of a few minutes, though the exact time depends on the audio and batch settings.
