whisper : guard against hallucinations on zero-filled input (#1881) #3762

Open
achyutbenz19 wants to merge 1 commit into ggml-org:master from achyutbenz19:fix/1881-zero-wav-guard

@achyutbenz19

Summary

Fixes #1881.

When a language is forced (e.g. -l ru, -l es), a zero-filled WAV produces a language-specific hallucination: [музыка] on -l ru, [Música] on -l es, [Musica] on -l it, etc. The auto-detect path does not have this problem: it naturally emits [BLANK_AUDIO] for silent input.

Root cause: forcing a language biases the decoder toward language-specific fallback tokens (bracketed music tags, filler words) rather than the blank-audio pattern the auto-detect path can reach. The is_no_speech gate at line 7585 does not fire because no_speech_prob stays below the 0.6 threshold when the model confidently emits a language-specific "music" token.
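As a rough illustration of why the gate stays closed, the check can be modeled as below. This is a simplified sketch, not the code at line 7585: the function signature, the additional average log-probability condition, and its -1.0 default are assumptions made for illustration; only the 0.6 threshold comes from the description above.

```cpp
#include <cassert>

// Hypothetical, simplified model of the no-speech gate (assumption, not the
// library's actual code). A segment is treated as "no speech" only when the
// no-speech probability is above its threshold AND the decoder's average
// log-probability is below its own threshold.
static bool is_no_speech(float no_speech_prob,
                         float avg_logprob,
                         float no_speech_thold = 0.6f,
                         float logprob_thold   = -1.0f) {
    assert(no_speech_thold >= 0.0f && no_speech_thold <= 1.0f);
    return no_speech_prob > no_speech_thold && avg_logprob < logprob_thold;
}
```

Under this model, a confidently decoded language-specific music tag produces a no_speech_prob below 0.6, so the function returns false and the hallucinated segment is kept.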

The fix adds a small early-return inside whisper_full_with_state: when a specific language is forced and the input PCM is entirely zero-valued, emit a single [BLANK_AUDIO] segment spanning the input duration and return. The auto-detect path is intentionally not touched.
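A minimal, self-contained sketch of the guard logic follows. It is not the patch itself: `blank_segment`, `try_blank_audio_guard`, the 10 ms timestamp unit, and the choice to skip empty input are assumptions made for illustration; `language_is_forced` and `is_all_zero` mirror the condition named later in this description.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for one output segment. Timestamps are in 10 ms
// units (an assumption of this sketch).
struct blank_segment {
    int64_t     t0 = 0;
    int64_t     t1 = 0;
    std::string text;
};

// True only when every sample is exactly zero.
static bool is_all_zero(const float * samples, int n_samples) {
    for (int i = 0; i < n_samples; ++i) {
        if (samples[i] != 0.0f) {
            return false;
        }
    }
    return true;
}

// Returns true and fills `out` when the early-return applies: a language is
// forced and the whole input is zero-valued. Returns false otherwise, in
// which case the caller would continue with normal decoding.
static bool try_blank_audio_guard(const float * samples,
                                  int           n_samples,
                                  int           sample_rate,
                                  bool          language_is_forced,
                                  blank_segment & out) {
    assert(sample_rate > 0);
    // Empty input is left to the normal path (a choice made for this sketch).
    if (!language_is_forced || n_samples <= 0 || !is_all_zero(samples, n_samples)) {
        return false;
    }
    out.t0   = 0;
    out.t1   = (100LL * n_samples) / sample_rate; // duration in 10 ms units
    out.text = "[BLANK_AUDIO]";
    return true;
}
```

For 19200 zero samples at 16000 Hz, `t1` is 120, which corresponds to a segment ending at 00:00:01.200.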

Reproduction

Against current master (166c20b), with ggml-base.bin and a 1.2 s zero-filled 16 kHz WAV:

whisper-cli -m ggml-base.bin -l ru -mc 0 zero-1.2s-16k.wav
[00:00:00.000 --> 00:00:10.000]   [музыка]

whisper-cli -m ggml-base.bin -l es -mc 0 zero-1.2s-16k.wav
[00:00:00.000 --> 00:00:10.000]   [Música]

whisper-cli -m ggml-base.bin -l it -mc 0 zero-1.2s-16k.wav
[00:00:00.000 --> 00:00:07.000]   [Musica]

whisper-cli -m ggml-base.bin -mc 0 zero-1.2s-16k.wav   # control: auto-detect
[00:00:00.000 --> 00:00:10.000]   [BLANK_AUDIO]

With this patch:

whisper-cli -m ggml-base.bin -l ru -mc 0 zero-1.2s-16k.wav
[00:00:00.000 --> 00:00:01.200]   [BLANK_AUDIO]

whisper-cli -m ggml-base.bin -l es -mc 0 zero-1.2s-16k.wav
[00:00:00.000 --> 00:00:01.200]   [BLANK_AUDIO]

Duration now reflects the actual audio length rather than the 10 s decoder window, addressing the second symptom in the original report.
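For anyone who wants to recreate a fixture like zero-1.2s-16k.wav, the sketch below builds a zero-filled WAV in memory and writes it to disk. The sample format (16-bit PCM, mono) and the helper names are assumptions; the fixture actually used for the runs above may have been produced differently.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Append `n_bytes` of `value` in little-endian order.
static void put_le(std::vector<uint8_t> & buf, uint32_t value, int n_bytes) {
    for (int i = 0; i < n_bytes; ++i) {
        buf.push_back(static_cast<uint8_t>((value >> (8 * i)) & 0xFFu));
    }
}

// Append a 4-character chunk tag such as "RIFF".
static void put_tag(std::vector<uint8_t> & buf, const char * tag) {
    buf.insert(buf.end(), tag, tag + 4);
}

// Build a mono 16-bit PCM WAV whose samples are all zero.
static std::vector<uint8_t> make_zero_wav(uint32_t n_samples, uint32_t sample_rate) {
    assert(sample_rate > 0);
    const uint32_t channels   = 1;
    const uint32_t bits       = 16;
    const uint32_t block      = channels * bits / 8;
    const uint32_t data_bytes = n_samples * block;

    std::vector<uint8_t> wav;
    put_tag(wav, "RIFF"); put_le(wav, 36 + data_bytes, 4); put_tag(wav, "WAVE");
    put_tag(wav, "fmt "); put_le(wav, 16, 4);   // size of the fmt chunk
    put_le(wav, 1, 2);                          // format tag 1 = PCM
    put_le(wav, channels, 2);
    put_le(wav, sample_rate, 4);
    put_le(wav, sample_rate * block, 4);        // byte rate
    put_le(wav, block, 2);                      // block align
    put_le(wav, bits, 2);
    put_tag(wav, "data"); put_le(wav, data_bytes, 4);
    wav.resize(wav.size() + data_bytes, 0);     // zero-valued samples
    return wav;
}

// Write the bytes to `path`; returns false on I/O failure.
static bool write_file(const std::string & path, const std::vector<uint8_t> & bytes) {
    std::ofstream f(path, std::ios::binary);
    f.write(reinterpret_cast<const char *>(bytes.data()),
            static_cast<std::streamsize>(bytes.size()));
    return f.good();
}
```

Calling `write_file("zero-1.2s-16k.wav", make_zero_wav(19200, 16000))` produces 1.2 s of digital silence at 16 kHz (19200 samples).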

Differential matrix

Ran the patched build against master across (model) x (fixture) x (lang) to confirm non-target code paths do not change:

| cells | target cells | target improved | target equal-length change | non-target unchanged | non-target changed |
|---|---|---|---|---|---|
| 56 | 18 | 13 | 5 | 38 | 0 |

Axes: model ∈ {base, small}, fixture ∈ {zero-1.2s, silence-3s, silence-10s, speech-en, speech-ru, long-en-70s, en-speech+10s-silence}, lang ∈ {auto, ru, en, es}.
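The cell count follows directly from the axes: 2 models x 7 fixtures x 4 languages = 56. A small sketch of that enumeration, with a hypothetical `cell` struct and `enumerate_cells` helper that are not part of the actual regress config:

```cpp
#include <cassert>
#include <string>
#include <vector>

// One (model, fixture, lang) combination of the differential matrix.
struct cell {
    std::string model;
    std::string fixture;
    std::string lang;
};

// Enumerate the full Cartesian product of the three axes listed above.
static std::vector<cell> enumerate_cells() {
    const char * const models[]   = {"base", "small"};
    const char * const fixtures[] = {"zero-1.2s", "silence-3s", "silence-10s",
                                     "speech-en", "speech-ru", "long-en-70s",
                                     "en-speech+10s-silence"};
    const char * const langs[]    = {"auto", "ru", "en", "es"};

    std::vector<cell> cells;
    for (const char * m : models) {
        for (const char * f : fixtures) {
            for (const char * l : langs) {
                cells.push_back({m, f, l});
            }
        }
    }
    assert(cells.size() == 2u * 7u * 4u);
    return cells;
}
```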

Non-target cells (38 of them) are unchanged byte-for-byte between master and this patch. Real speech, pink-noise adjacent, long-form, speech-with-trailing-silence: none are affected. The guard only fires when language_is_forced && is_all_zero(samples).

The 5 cells flagged as "equal-length change" are cases where the before and after strings have the same length but different content. A length-only heuristic cannot tell a correct replacement from a regression, so they are listed here for manual inspection:

| model | fixture | lang | before | after |
|---|---|---|---|---|
| base | zero-1.2s | en | [00:00:00.000 --> 00:00:10.000] [BLANK_AUDIO] | [00:00:00.000 --> 00:00:01.200] [BLANK_AUDIO] |
| base | silence-3s | en | [00:00:00.000 --> 00:00:10.000] [BLANK_AUDIO] | [00:00:00.000 --> 00:00:03.000] [BLANK_AUDIO] |
| small | zero-1.2s | ru | [00:00:00.000 --> 00:00:02.000] Редактор субтитров А.Семкин | [00:00:00.000 --> 00:00:01.200] [BLANK_AUDIO] |
| small | silence-3s | ru | [00:00:00.000 --> 00:00:10.000] [музыка] | [00:00:00.000 --> 00:00:03.000] [BLANK_AUDIO] |
| small | silence-10s | ru | [00:00:00.000 --> 00:00:05.000] Редактор субтитров А.Синецкая Корректор А.Егорова | [00:00:00.000 --> 00:00:10.000] [BLANK_AUDIO] |

Two cases are timestamp corrections where the text was already [BLANK_AUDIO]; three replace a hallucination with [BLANK_AUDIO]. All five are the intended behavior of the fix.

What this does not do

  • The auto-detect path is left untouched. Zero-filled input with no language flag still goes through the existing decode path, which emits [BLANK_AUDIO] with the 10 s window timestamp. Narrowing that is a separate change.
  • Near-zero but non-zero input is not affected. The guard uses a strict != 0.0f check so floating-point drift in quieter audio is not mistaken for silence.
  • Partially-silent audio is not affected. The guard only fires when the entire input is zero-valued, so a real recording with a silent tail still goes through normal decoding.
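The boundaries in the list above can be checked with a small sketch of a strict zero test. The `std::vector`-based signature is an assumption for illustration; the patch operates on the raw PCM buffer passed to whisper_full_with_state.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// True when no sample compares unequal to 0.0f. An empty buffer is
// vacuously all-zero under this definition. Negative zero (-0.0f) compares
// equal to 0.0f and therefore counts as zero; NaN compares unequal to
// everything and therefore counts as non-zero.
static bool is_all_zero(const std::vector<float> & pcm) {
    return std::none_of(pcm.begin(), pcm.end(),
                        [](float s) { return s != 0.0f; });
}
```

A buffer of exact zeros passes, a buffer with a single 1e-7f sample does not, and a recording whose first second is non-zero followed by a silent tail does not either, so only fully zero-valued input would take the early return.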

Tools used

git, cmake, whisper-cli, and audiokit for the differential matrix.

Disclosure

I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. The matrix numbers above come from actual runs on an Apple Silicon Mac against commit 166c20b of this repo and a patched build. Reproducer fixtures and commands are available; happy to share the regress config.

The auto-detect path naturally emits [BLANK_AUDIO] for silent input,
but a forced language (-l ru, -l es, etc.) biases the decoder toward
language-specific fallback tokens (Cyrillic music tags on -l ru,
[Música] on -l es, etc.) instead.

Detect zero-filled input up-front inside whisper_full_with_state and
emit a single [BLANK_AUDIO] segment spanning the input duration,
before mel-spectrogram computation and decoding.

Fixes ggml-org#1881

Successfully merging this pull request may close these issues.

Zero-filled WAV give hallucination and wrong duration
