whisper : add speaker diarization support #3732
Add speaker diarization based on ECAPA-TDNN speaker embeddings.
When enabled via --diarize, each transcription segment gets assigned a
speaker ID. The pipeline works by computing a 192-dim speaker embedding
per segment using a ported SpeechBrain ECAPA-TDNN model, then clustering
them with agglomerative hierarchical clustering.
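For orientation, here is a minimal sketch of the clustering step (average-linkage agglomerative clustering over cosine distances between the 192-dim embeddings). The names, structure, and threshold handling are illustrative, not the actual src/whisper-diarize.cpp code:

```cpp
#include <cmath>
#include <vector>

// Cosine distance between two embeddings (1 - cosine similarity).
static float cosine_distance(const std::vector<float> & a, const std::vector<float> & b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i]*b[i]; na += a[i]*a[i]; nb += b[i]*b[i];
    }
    return 1.0f - dot/(std::sqrt(na)*std::sqrt(nb) + 1e-9f);
}

// Agglomerative clustering, average linkage: start with one cluster per
// segment, repeatedly merge the closest pair until the smallest average
// inter-cluster distance exceeds the threshold. Returns a speaker id per segment.
static std::vector<int> cluster_embeddings(const std::vector<std::vector<float>> & emb, float threshold) {
    std::vector<std::vector<int>> clusters;
    for (int i = 0; i < (int) emb.size(); ++i) clusters.push_back({i});

    while (clusters.size() > 1) {
        float best = 1e9f; int bi = -1, bj = -1;
        for (size_t i = 0; i < clusters.size(); ++i) {
            for (size_t j = i + 1; j < clusters.size(); ++j) {
                float sum = 0.0f; // average pairwise distance between the two clusters
                for (int a : clusters[i]) for (int b : clusters[j]) sum += cosine_distance(emb[a], emb[b]);
                const float d = sum/(clusters[i].size()*clusters[j].size());
                if (d < best) { best = d; bi = (int) i; bj = (int) j; }
            }
        }
        if (best > threshold) break; // remaining clusters are distinct speakers
        clusters[bi].insert(clusters[bi].end(), clusters[bj].begin(), clusters[bj].end());
        clusters.erase(clusters.begin() + bj);
    }

    std::vector<int> speaker(emb.size());
    for (size_t c = 0; c < clusters.size(); ++c) {
        for (int s : clusters[c]) speaker[s] = (int) c;
    }
    return speaker;
}
```

The quadratic pairwise loops mirror the O(n^2) clustering cost called out under known limitations; the threshold acts as the speaker-separation cutoff.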
New files:
- src/whisper-diarize.cpp/h: mel computation, ECAPA-TDNN forward pass, clustering
- src/whisper-speaker.cpp/h: GGML model loader
- models/convert-speaker-to-ggml.py: SpeechBrain model converter
Usage:
python models/convert-speaker-to-ggml.py --output models/ggml-speaker-ecapa-tdnn.bin
./whisper-cli -m models/ggml-base.en.bin \
--diarize --diarize-model models/ggml-speaker-ecapa-tdnn.bin -f input.wav
The feature is compile-gated behind WHISPER_DIARIZE and has zero overhead
when disabled. Embeddings match SpeechBrain PyTorch output (cosine distance
< 0.05).
Known limitations: ~200MB memory per encoder context, no GPU backend,
O(n^2) clustering.
Resolves: ggml-org#64
I recommend creating a synthetic dataset with multiple speakers using text-to-speech models, to benchmark locally and verify the method works.
Good idea. I tested with real multi-speaker audio and the embeddings discriminate well (cosine distance > 0.7 across speakers, < 0.3 within the same speaker), but a reproducible TTS benchmark would be nice to have. Will look into it.
whisper_clustering_context_create() overwrites the old pointer without freeing it first. When the same whisper_state is reused across multiple inference runs, the previous clustering context leaks. Free it before creating a new one.
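A minimal sketch of the suggested fix, assuming whisper_state holds the clustering context as a raw pointer; the field name and the free function are assumptions based on the comment above:

```cpp
// Free any previous clustering context before overwriting the pointer,
// so that reusing the same whisper_state across runs does not leak.
if (state->clustering_ctx != nullptr) {
    whisper_clustering_context_free(state->clustering_ctx); // assumed counterpart to _create()
    state->clustering_ctx = nullptr;
}
state->clustering_ctx = whisper_clustering_context_create(params);
```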
- cli/server: add --diarize-model, --diarize-threshold, --diarize-speakers
- unify speaker label logic across all output formats (txt/vtt/srt/csv/json/lrc/wts)
- fall back to stereo diarization when no model is provided
- fix memory leak in whisper_compute_mel_80, move allocs out of the hot loop
- thread-safe static init with std::call_once (sketch below)
- rename hann → hamming (the window was actually Hamming), remove dead code
- dynamic ggml context sizing, WHISPER_LOG_* macros in the speaker loader
- fix n_channels 512 → 1024 in the Python converter
- server: ARGV_NEXT bounds checking for all args
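For the std::call_once item above, a minimal example of the pattern, assuming a lazily built mel filterbank; g_mel_filters and build_mel_filterbank are illustrative names, not the actual whisper.cpp symbols:

```cpp
#include <mutex>
#include <vector>

// Illustrative stand-in for the one-time initialization work.
static std::vector<float> build_mel_filterbank(int n_mel, int n_fft) {
    return std::vector<float>(n_mel * (n_fft/2 + 1), 0.0f); // placeholder values
}

static std::once_flag     g_mel_filters_once;
static std::vector<float> g_mel_filters;

// Safe to call from multiple threads: the lambda runs exactly once and
// every caller observes the fully initialized filterbank afterwards.
static const std::vector<float> & get_mel_filters() {
    std::call_once(g_mel_filters_once, [] {
        g_mel_filters = build_mel_filterbank(80, 400);
    });
    return g_mel_filters;
}
```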
PM note: this PR is active and in review, not backlog. The current thread is discussing a reproducible benchmark approach for multi-speaker testing, and the author said they will look into it. Next step is to produce that benchmark/repro so review can continue cleanly. — little John
Diarization benchmark (VoxConverse dev subset)

Ran a quick benchmark against pyannote.audio 3.1 on 8 files from the VoxConverse dev set (2-5 speakers, 68-664 s each). Apple M3, 16 GB.

Results

whisper.cpp: RTF = 0.11 (265 s for 2310 s of audio), ~5.2x faster than pyannote, single binary, ~200 MB vs ~3 GB memory.

Approach: 2 s sliding-window embeddings with 1 s hop, energy-based silence filtering, agglomerative clustering (average linkage, cosine distance threshold 0.70), and token-level speaker assignment with majority voting (see the sketch below).

Works well on 2-4 speaker scenarios (asxwr 2.0%, bxpwa 4.1%, akthc 5.9%). Main weakness is dense multi-speaker audio with similar voices (afjiv, 5 speakers). bkwns has a speaker with only 2.5 s of speech, which is hard for any embedding-based approach.

Eval setup: collar = 0.25 s, skip_overlap = False, pyannote.metrics.
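A minimal sketch of the token-level majority voting described above, assuming the clustering step yields one speaker label per sliding window; the struct and names are illustrative:

```cpp
#include <algorithm>
#include <vector>

// One clustered sliding window: [t0, t1) in seconds plus its speaker id.
struct spk_window { float t0, t1; int speaker; };

// Assign a token the majority speaker among all windows that overlap
// [tok_t0, tok_t1); returns -1 when nothing overlaps (e.g. pure silence).
static int token_speaker(const std::vector<spk_window> & wins,
                         float tok_t0, float tok_t1, int n_speakers) {
    std::vector<int> votes(n_speakers, 0);
    for (const auto & w : wins) {
        const float overlap = std::min(w.t1, tok_t1) - std::max(w.t0, tok_t0);
        if (overlap > 0.0f) {
            votes[w.speaker]++;
        }
    }
    const auto best = std::max_element(votes.begin(), votes.end());
    return (best == votes.end() || *best == 0) ? -1 : (int)(best - votes.begin());
}
```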
The ggml context pool could run out of space for some segment lengths where the estimate was a few MB short. Add 10% margin to the allocation.
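A sketch of the suggested fix, padding the estimate before ggml_init; estimate_diarize_ctx_size is a hypothetical stand-in for the PR's actual estimator:

```cpp
// Pad the estimated context size by 10% so that segment lengths whose
// estimate lands a few MB short still fit in the pool.
size_t mem_size = estimate_diarize_ctx_size(n_frames); // hypothetical estimator
mem_size += mem_size/10;                               // 10% safety margin

struct ggml_init_params gparams = {
    /*.mem_size   =*/ mem_size,
    /*.mem_buffer =*/ nullptr,
    /*.no_alloc   =*/ false,
};
struct ggml_context * gctx = ggml_init(gparams);
```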
Validation set (9 files outside the original subset, 2-10 speakers)

whisper.cpp: RTF = 0.12 (203 s for 1762 s of audio)
DER improvement exploration

Spent some time trying to push DER lower.

Tried:
- Silero VAD replacing energy-based silence filtering

Not pursued (out of scope):

Current implementation works well for typical use cases (2-5 speakers, clear speech). Tightening DER further would be a separate effort.
Hi @MoonMao42
No, it's too complicated. Meeting thesis-level standards would require too many controlled conditions, and possibly training a model. That's not something a single binary can solve, and I'm not sure I have the ability to build a better model.