whisper : guard against hallucinations on zero-filled input (#1881) by achyutbenz19 · Pull Request #3762 · ggml-org/whisper.cpp

achyutbenz19 · 2026-04-18T23:58:36Z

Summary

When a language is forced (e.g. -l ru, -l es), a zero-filled WAV produces a language-specific hallucination: [музыка] on -l ru, [Música] on -l es, [Musica] on -l it, etc. The auto-detect path does not have this problem: it naturally emits [BLANK_AUDIO] for silent input.

Root cause: forcing a language biases the decoder toward language-specific fallback tokens (bracketed music tags, filler words) rather than the blank-audio pattern the auto-detect path can reach. The is_no_speech gate at line 7585 does not fire because no_speech_prob stays below the 0.6 threshold when the model confidently emits a language-specific "music" token.

The fix adds a small early-return inside whisper_full_with_state: when a specific language is forced and the input PCM is entirely zero-valued, emit a single [BLANK_AUDIO] segment spanning the input duration and return. The auto-detect path is intentionally not touched.
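
The guard described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patch: `blank_segment` and `zero_input_guard` are hypothetical names, the 10 ms timestamp unit and the handling of empty input are assumptions, and the real check sits inside whisper_full_with_state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a transcription segment; not a whisper.cpp type.
struct blank_segment {
    int64_t     t0;   // start time, assumed to be in 10 ms units
    int64_t     t1;   // end time, assumed to be in 10 ms units
    std::string text;
};

// True when every PCM sample compares equal to zero (strict check, no epsilon).
static bool is_all_zero(const float * samples, size_t n_samples) {
    for (size_t i = 0; i < n_samples; ++i) {
        if (samples[i] != 0.0f) {
            return false;
        }
    }
    return true;
}

// Returns true and fills `out` when the guard applies; returns false when the
// caller should fall through to the normal decode path.
static bool zero_input_guard(bool language_is_forced,
                             const float * samples, size_t n_samples,
                             int sample_rate, blank_segment & out) {
    assert(sample_rate > 0);
    // Assumption: empty input is left to the existing code path.
    if (!language_is_forced || n_samples == 0 || !is_all_zero(samples, n_samples)) {
        return false;
    }
    // The segment spans the real input duration, not the decoder window.
    const int64_t t1 = (int64_t) n_samples * 100 / sample_rate;
    out = blank_segment{0, t1, "[BLANK_AUDIO]"};
    return true;
}
```

In this sketch the check runs before any mel-spectrogram work, so the cost on normal audio is a scan that stops at the first non-zero sample.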

Reproduction

Against current master (166c20b), with ggml-base.bin and a 1.2 s zero-filled 16 kHz WAV:

whisper-cli -m ggml-base.bin -l ru -mc 0 zero-1.2s-16k.wav
[00:00:00.000 --> 00:00:10.000]   [музыка]

whisper-cli -m ggml-base.bin -l es -mc 0 zero-1.2s-16k.wav
[00:00:00.000 --> 00:00:10.000]   [Música]

whisper-cli -m ggml-base.bin -l it -mc 0 zero-1.2s-16k.wav
[00:00:00.000 --> 00:00:07.000]   [Musica]

whisper-cli -m ggml-base.bin -mc 0 zero-1.2s-16k.wav   # control: auto-detect
[00:00:00.000 --> 00:00:10.000]   [BLANK_AUDIO]

With this patch:

whisper-cli -m ggml-base.bin -l ru -mc 0 zero-1.2s-16k.wav
[00:00:00.000 --> 00:00:01.200]   [BLANK_AUDIO]

whisper-cli -m ggml-base.bin -l es -mc 0 zero-1.2s-16k.wav
[00:00:00.000 --> 00:00:01.200]   [BLANK_AUDIO]

Duration now reflects the actual audio length rather than the 10 s decoder window, addressing the second symptom in the original report.
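
The 00:00:01.200 end timestamp follows directly from the sample count: 1.2 s at 16 kHz is 19200 samples. A minimal sketch of that arithmetic, with hypothetical helper names (`duration_ms`, `format_timestamp`) that are not taken from whisper.cpp:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Duration of a PCM buffer in milliseconds, rounded down.
static int64_t duration_ms(size_t n_samples, int sample_rate) {
    assert(sample_rate > 0);
    return (int64_t) n_samples * 1000 / sample_rate;
}

// Format a millisecond count as HH:MM:SS.mmm, matching the CLI output style.
static std::string format_timestamp(int64_t ms) {
    assert(ms >= 0);
    const int hh  = (int) (ms / 3600000);
    const int mm  = (int) ((ms / 60000) % 60);
    const int ss  = (int) ((ms / 1000) % 60);
    const int mmm = (int) (ms % 1000);
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%02d:%02d:%02d.%03d", hh, mm, ss, mmm);
    return std::string(buf);
}
```

For example, `format_timestamp(duration_ms(19200, 16000))` yields `00:00:01.200`, while a fixed 10 s window corresponds to `format_timestamp(10000)`.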

Differential matrix

Ran the patched build against master across (model) x (fixture) x (lang) to confirm non-target code paths do not change:

cells	target cells	target improved	target equal-length change	non-target unchanged	non-target changed
56	18	13	5	38	0

Axes: model ∈ {base, small}, fixture ∈ {zero-1.2s, silence-3s, silence-10s, speech-en, speech-ru, long-en-70s, en-speech+10s-silence}, lang ∈ {auto, ru, en, es}.
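
The cell counts in the table can be cross-checked by enumerating the axes. The sketch below assumes that a "target" cell is one with a forced language and one of the three silent fixtures (zero-1.2s, silence-3s, silence-10s). That definition is inferred from the counts and is not stated explicitly here. `is_target` and `count_cells` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct matrix_counts {
    size_t cells;
    size_t target;
    size_t non_target;
};

// Assumption: a cell is a target when the language is forced and the fixture is silent.
static bool is_target(const std::string & fixture, const std::string & lang) {
    const bool forced = lang != "auto";
    const bool silent = fixture == "zero-1.2s" ||
                        fixture == "silence-3s" ||
                        fixture == "silence-10s";
    return forced && silent;
}

static matrix_counts count_cells() {
    const std::vector<std::string> models   = {"base", "small"};
    const std::vector<std::string> fixtures = {"zero-1.2s", "silence-3s", "silence-10s",
                                               "speech-en", "speech-ru", "long-en-70s",
                                               "en-speech+10s-silence"};
    const std::vector<std::string> langs    = {"auto", "ru", "en", "es"};

    matrix_counts c = {0, 0, 0};
    for (size_t m = 0; m < models.size(); ++m) {
        for (const std::string & fixture : fixtures) {
            for (const std::string & lang : langs) {
                ++c.cells;
                if (is_target(fixture, lang)) {
                    ++c.target;
                } else {
                    ++c.non_target;
                }
            }
        }
    }
    assert(c.target + c.non_target == c.cells);
    return c;
}
```

Under that assumption the enumeration gives 2 x 7 x 4 = 56 cells, of which 2 x 3 x 3 = 18 are targets and 38 are not, matching the table.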

Non-target cells (38 of them) are unchanged byte-for-byte between master and this patch. Real speech, pink-noise adjacent, long-form, speech-with-trailing-silence: none are affected. The guard only fires when language_is_forced && is_all_zero(samples).

The 5 cells flagged as "equal-length change" are cases where the before-and-after strings are the same length but different content. A length-only heuristic cannot tell a correct replacement from a regression, so they are listed here for manual inspection:

model	fixture	lang	before	after
base	zero-1.2s	en	`[00:00:00.000 --> 00:00:10.000] [BLANK_AUDIO]`	`[00:00:00.000 --> 00:00:01.200] [BLANK_AUDIO]`
base	silence-3s	en	`[00:00:00.000 --> 00:00:10.000] [BLANK_AUDIO]`	`[00:00:00.000 --> 00:00:03.000] [BLANK_AUDIO]`
small	zero-1.2s	ru	`[00:00:00.000 --> 00:00:02.000] Редактор субтитров А.Семкин`	`[00:00:00.000 --> 00:00:01.200] [BLANK_AUDIO]`
small	silence-3s	ru	`[00:00:00.000 --> 00:00:10.000] [музыка]`	`[00:00:00.000 --> 00:00:03.000] [BLANK_AUDIO]`
small	silence-10s	ru	`[00:00:00.000 --> 00:00:05.000] Редактор субтитров А.Синецкая Корректор А.Егорова`	`[00:00:00.000 --> 00:00:10.000] [BLANK_AUDIO]`

Two cases are timestamp corrections where the text was already [BLANK_AUDIO]; three replace a hallucination with [BLANK_AUDIO]. All five are the intended behavior of the fix.

What this does not do

The auto-detect path is left untouched. Zero-filled input with no language flag still goes through the existing decode path, which emits [BLANK_AUDIO] with the 10 s window timestamp. Narrowing that is a separate change.
Near-zero but non-zero input is not affected. The guard uses a strict != 0.0f check so floating-point drift in quieter audio is not mistaken for silence.
Partially-silent audio is not affected. The guard only fires when the entire input is zero-valued, so a real recording with a silent tail still goes through normal decoding.
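
The boundaries listed above come down to how a strict comparison against `0.0f` behaves. A minimal sketch, assuming the predicate is equivalent to an exact equality test on every sample (the helper below is illustrative and may differ from the patch):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Illustrative predicate: true only when every sample compares equal to zero.
static bool is_all_zero(const float * samples, size_t n_samples) {
    assert(samples != nullptr || n_samples == 0);
    return std::all_of(samples, samples + n_samples,
                       [](float s) { return s == 0.0f; });
}
```

With exact comparison, a buffer containing `-0.0f` still counts as zero-filled because negative zero compares equal to zero, a buffer containing a single tiny value such as `1e-30f` does not, and neither does a recording whose leading samples are non-zero even if the tail is silent. A NaN sample also fails the check, since NaN never compares equal to anything.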

Tools used

git, cmake, whisper-cli, and audiokit for the differential matrix.

Disclosure

I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. The matrix numbers above come from actual runs on an Apple Silicon Mac against commit 166c20b of this repo and a patched build. Reproducer fixtures and commands are available; happy to share the regress config.

The auto-detect path naturally emits [BLANK_AUDIO] for silent input, but a forced language (-l ru, -l es, etc.) biases the decoder toward language-specific fallback tokens (Cyrillic music tags on -l ru, [Música] on -l es, etc.) instead.

Detect zero-filled input up-front inside whisper_full_with_state and emit a single [BLANK_AUDIO] segment spanning the input duration, before mel-spectrogram computation and decoding.

Fixes ggml-org#1881

achyutbenz19 mentioned this pull request Apr 19, 2026

whisper : skip decoding of zero-filled chunks on forced-language path (#1724) #3763

Open

achyutbenz19 wants to merge 1 commit into ggml-org:master from achyutbenz19:fix/1881-zero-wav-guard
