whisper : guard against hallucinations on zero-filled input (#1881)#3762
Open
achyutbenz19 wants to merge 1 commit into
Open
whisper : guard against hallucinations on zero-filled input (#1881)#3762achyutbenz19 wants to merge 1 commit into
achyutbenz19 wants to merge 1 commit into
Conversation
The auto-detect path naturally emits [BLANK_AUDIO] for silent input, but a forced language (-l ru, -l es, etc.) biases the decoder toward language-specific fallback tokens (Cyrillic music tags on -l ru, [Música] on -l es, etc.) instead. Detect zero-filled input up-front inside whisper_full_with_state and emit a single [BLANK_AUDIO] segment spanning the input duration, before mel-spectrogram computation and decoding. Fixes ggml-org#1881
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1881.
When a language is forced (e.g.
-l ru,-l es), a zero-filled WAV produces a language-specific hallucination:[музыка]on-l ru,[Música]on-l es,[Musica]on-l it, etc. The auto-detect path does not have this problem: it naturally emits[BLANK_AUDIO]for silent input.Root cause: forcing a language biases the decoder toward language-specific fallback tokens (bracketed music tags, filler words) rather than the blank-audio pattern the auto-detect path can reach. The
is_no_speechgate at line 7585 does not fire becauseno_speech_probstays below the 0.6 threshold when the model confidently emits a language-specific "music" token.The fix adds a small early-return inside
whisper_full_with_state: when a specific language is forced and the input PCM is entirely zero-valued, emit a single[BLANK_AUDIO]segment spanning the input duration and return. The auto-detect path is intentionally not touched.Reproduction
Against current master (
166c20b), withggml-base.binand a 1.2 s zero-filled 16 kHz WAV:With this patch:
Duration now reflects the actual audio length rather than the 10 s decoder window, addressing the second symptom in the original report.
Differential matrix
Ran the patched build against
masteracross(model) x (fixture) x (lang)to confirm non-target code paths do not change:Axes:
model ∈ {base, small},fixture ∈ {zero-1.2s, silence-3s, silence-10s, speech-en, speech-ru, long-en-70s, en-speech+10s-silence},lang ∈ {auto, ru, en, es}.Non-target cells (38 of them) are unchanged byte-for-byte between
masterand this patch. Real speech, pink-noise adjacent, long-form, speech-with-trailing-silence: none are affected. The guard only fires whenlanguage_is_forced && is_all_zero(samples).The 5 cells flagged as "equal-length change" are cases where the before-and-after strings are the same length but different content; a length-only heuristic cannot tell a correct replacement from a regression, so listing them here for manual inspection:
[00:00:00.000 --> 00:00:10.000] [BLANK_AUDIO][00:00:00.000 --> 00:00:01.200] [BLANK_AUDIO][00:00:00.000 --> 00:00:10.000] [BLANK_AUDIO][00:00:00.000 --> 00:00:03.000] [BLANK_AUDIO][00:00:00.000 --> 00:00:02.000] Редактор субтитров А.Семкин[00:00:00.000 --> 00:00:01.200] [BLANK_AUDIO][00:00:00.000 --> 00:00:10.000] [музыка][00:00:00.000 --> 00:00:03.000] [BLANK_AUDIO][00:00:00.000 --> 00:00:05.000] Редактор субтитров А.Синецкая Корректор А.Егорова[00:00:00.000 --> 00:00:10.000] [BLANK_AUDIO]Two cases are timestamp corrections where the text was already
[BLANK_AUDIO]; three replace a hallucination with[BLANK_AUDIO]. All five are the intended behavior of the fix.What this does not do
[BLANK_AUDIO]with the 10 s window timestamp. Narrowing that is a separate change.!= 0.0fcheck so floating-point drift in quieter audio is not mistaken for silence.Tools used
git,cmake,whisper-cli, andaudiokitfor the differential matrix.Disclosure
I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. The matrix numbers above come from actual runs on an Apple Silicon Mac against commit
166c20bof this repo and a patched build. Reproducer fixtures and commands are available; happy to share the regress config.