whisper : add speaker diarization support #3732
Add speaker diarization based on ECAPA-TDNN speaker embeddings.
When enabled via --diarize, each transcription segment gets assigned a
speaker ID. The pipeline works by computing a 192-dim speaker embedding
per segment using a ported SpeechBrain ECAPA-TDNN model, then clustering
them with agglomerative hierarchical clustering.
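For orientation, here is a minimal sketch of the clustering step (average-linkage agglomerative clustering over cosine distances between the 192-dim embeddings). The names, structure, and threshold handling are illustrative, not the actual src/whisper-diarize.cpp code:

```cpp
#include <cmath>
#include <vector>

// Cosine distance between two embeddings (1 - cosine similarity).
static float cosine_distance(const std::vector<float> & a, const std::vector<float> & b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i]*b[i]; na += a[i]*a[i]; nb += b[i]*b[i];
    }
    return 1.0f - dot/(std::sqrt(na)*std::sqrt(nb) + 1e-9f);
}

// Agglomerative clustering, average linkage: start with one cluster per
// segment, repeatedly merge the closest pair until the smallest average
// inter-cluster distance exceeds the threshold. Returns a speaker id per segment.
static std::vector<int> cluster_embeddings(const std::vector<std::vector<float>> & emb, float threshold) {
    std::vector<std::vector<int>> clusters;
    for (int i = 0; i < (int) emb.size(); ++i) clusters.push_back({i});

    while (clusters.size() > 1) {
        float best = 1e9f; int bi = -1, bj = -1;
        for (size_t i = 0; i < clusters.size(); ++i) {
            for (size_t j = i + 1; j < clusters.size(); ++j) {
                float sum = 0.0f; // average pairwise distance between the two clusters
                for (int a : clusters[i]) for (int b : clusters[j]) sum += cosine_distance(emb[a], emb[b]);
                const float d = sum/(clusters[i].size()*clusters[j].size());
                if (d < best) { best = d; bi = (int) i; bj = (int) j; }
            }
        }
        if (best > threshold) break; // remaining clusters are distinct speakers
        clusters[bi].insert(clusters[bi].end(), clusters[bj].begin(), clusters[bj].end());
        clusters.erase(clusters.begin() + bj);
    }

    std::vector<int> speaker(emb.size());
    for (size_t c = 0; c < clusters.size(); ++c) {
        for (int s : clusters[c]) speaker[s] = (int) c;
    }
    return speaker;
}
```

The quadratic pairwise loops mirror the O(n^2) clustering cost called out under known limitations; the threshold acts as the speaker-separation cutoff.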
New files:
- src/whisper-diarize.cpp/h: mel computation, ECAPA-TDNN forward pass, clustering
- src/whisper-speaker.cpp/h: GGML model loader
- models/convert-speaker-to-ggml.py: SpeechBrain model converter
Usage:
python models/convert-speaker-to-ggml.py --output models/ggml-speaker-ecapa-tdnn.bin
./whisper-cli -m models/ggml-base.en.bin \
--diarize --diarize-model models/ggml-speaker-ecapa-tdnn.bin -f input.wav
The feature is compile-gated behind WHISPER_DIARIZE and has zero overhead
when disabled. Embeddings match SpeechBrain PyTorch output (cosine distance
< 0.05).
Known limitations: ~200MB memory per encoder context, no GPU backend,
O(n^2) clustering.
Resolves: ggml-org#64
I recommend creating a synthetic dataset with multiple speakers using text-to-speech models, to benchmark locally and verify the method works.
Good idea. I tested with real multi-speaker audio and the embeddings discriminate well (cosine distance > 0.7 across speakers, < 0.3 within the same speaker), but a reproducible TTS benchmark would be nice to have. Will look into it.
whisper_clustering_context_create() overwrites the old pointer without freeing it first. When the same whisper_state is reused across multiple inference runs, the previous clustering context leaks. Free it before creating a new one.
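A minimal sketch of the suggested fix, assuming whisper_state holds the clustering context as a raw pointer; the field name and the free function are assumptions based on the comment above:

```cpp
// Free any previous clustering context before overwriting the pointer,
// so that reusing the same whisper_state across runs does not leak.
if (state->clustering_ctx != nullptr) {
    whisper_clustering_context_free(state->clustering_ctx); // assumed counterpart to _create()
    state->clustering_ctx = nullptr;
}
state->clustering_ctx = whisper_clustering_context_create(params);
```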
- cli/server: add --diarize-model, --diarize-threshold, --diarize-speakers
- unify speaker label logic across all output formats (txt/vtt/srt/csv/json/lrc/wts)
- fall back to stereo diarization when no model is provided
- fix memory leak in whisper_compute_mel_80, move allocs out of the hot loop
- thread-safe static init with std::call_once (sketch below)
- rename hann → hamming (the window was actually Hamming), remove dead code
- dynamic ggml context sizing, WHISPER_LOG_* macros in the speaker loader
- fix n_channels 512 → 1024 in the Python converter
- server: ARGV_NEXT bounds checking for all args
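For the std::call_once item above, a minimal example of the pattern, assuming a lazily built mel filterbank; g_mel_filters and build_mel_filterbank are illustrative names, not the actual whisper.cpp symbols:

```cpp
#include <mutex>
#include <vector>

// Illustrative stand-in for the one-time initialization work.
static std::vector<float> build_mel_filterbank(int n_mel, int n_fft) {
    return std::vector<float>(n_mel * (n_fft/2 + 1), 0.0f); // placeholder values
}

static std::once_flag     g_mel_filters_once;
static std::vector<float> g_mel_filters;

// Safe to call from multiple threads: the lambda runs exactly once and
// every caller observes the fully initialized filterbank afterwards.
static const std::vector<float> & get_mel_filters() {
    std::call_once(g_mel_filters_once, [] {
        g_mel_filters = build_mel_filterbank(80, 400);
    });
    return g_mel_filters;
}
```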
PM note: this PR is active and in review, not backlog. The current thread is discussing a reproducible benchmark approach for multi-speaker testing, and the author said they will look into it. Next step is to produce that benchmark/repro so review can continue cleanly. — little John
Diarization benchmark (VoxConverse dev subset)

Ran a quick benchmark against pyannote.audio 3.1 on 8 files from the VoxConverse dev set (2-5 speakers, 68-664 s each). Apple M3, 16 GB.

Results

whisper.cpp: RTF = 0.11 (265 s for 2310 s of audio), ~5.2x faster than pyannote, single binary, ~200 MB vs ~3 GB memory.

Approach: 2 s sliding-window embeddings with 1 s hop, energy-based silence filtering, agglomerative clustering (average linkage, cosine distance threshold 0.70), and token-level speaker assignment with majority voting (see the sketch below).

Works well on 2-4 speaker scenarios (asxwr 2.0%, bxpwa 4.1%, akthc 5.9%). Main weakness is dense multi-speaker audio with similar voices (afjiv, 5 speakers). bkwns has a speaker with only 2.5 s of speech, which is hard for any embedding-based approach.

Eval setup: collar = 0.25 s, skip_overlap = False, pyannote.metrics.
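A minimal sketch of the token-level majority voting described above, assuming the clustering step yields one speaker label per sliding window; the struct and names are illustrative:

```cpp
#include <algorithm>
#include <vector>

// One clustered sliding window: [t0, t1) in seconds plus its speaker id.
struct spk_window { float t0, t1; int speaker; };

// Assign a token the majority speaker among all windows that overlap
// [tok_t0, tok_t1); returns -1 when nothing overlaps (e.g. pure silence).
static int token_speaker(const std::vector<spk_window> & wins,
                         float tok_t0, float tok_t1, int n_speakers) {
    std::vector<int> votes(n_speakers, 0);
    for (const auto & w : wins) {
        const float overlap = std::min(w.t1, tok_t1) - std::max(w.t0, tok_t0);
        if (overlap > 0.0f) {
            votes[w.speaker]++;
        }
    }
    const auto best = std::max_element(votes.begin(), votes.end());
    return (best == votes.end() || *best == 0) ? -1 : (int)(best - votes.begin());
}
```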
The ggml context pool could run out of space for some segment lengths where the estimate was a few MB short. Add 10% margin to the allocation.
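A sketch of the suggested fix, padding the estimate before ggml_init; estimate_diarize_ctx_size is a hypothetical stand-in for the PR's actual estimator:

```cpp
// Pad the estimated context size by 10% so that segment lengths whose
// estimate lands a few MB short still fit in the pool.
size_t mem_size = estimate_diarize_ctx_size(n_frames); // hypothetical estimator
mem_size += mem_size/10;                               // 10% safety margin

struct ggml_init_params gparams = {
    /*.mem_size   =*/ mem_size,
    /*.mem_buffer =*/ nullptr,
    /*.no_alloc   =*/ false,
};
struct ggml_context * gctx = ggml_init(gparams);
```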
Validation set (9 files outside the original subset, 2-10 speakers)

whisper.cpp: RTF = 0.12 (203 s for 1762 s of audio)
DER improvement exploration

Spent some time trying to push DER lower.

Tried:
- Silero VAD replacing energy-based silence filtering

Not pursued (out of scope):

Current implementation works well for typical use cases (2-5 speakers, clear speech). Tightening DER further would be a separate effort.
Hi @MoonMao42
No, it's too complicated. Meeting thesis-level standards would require too many controlled conditions, and possibly training a model. That's not something a single binary can solve, and I'm not sure I have the ability to build a better model.