
whisper : validate vocab size and per-token length when loading model (#3674) #3767

Open
achyutbenz19 wants to merge 1 commit into
ggml-org:master from achyutbenz19:fix/3674-vocab-bounds

Conversation

@achyutbenz19

Summary

Fixes #3674.

whisper_model_load reads n_vocab (int32_t) and the per-token length (uint32_t) directly from the model file using read_safe, which reads raw bytes without any bounds check. A malformed or fuzzed model file (the reporter found one with AFL++ at 8 bytes) can set these values to e.g. 999999999 and 0xFFFFFFFF, which then feed into std::vector::resize. The allocation fails with std::bad_alloc, nothing catches it, and the process terminates via SIGABRT before any error is reported to the caller.

Scope of the change

src/whisper.cpp, whisper_model_load vocab block, +19/-5.

  1. Add two constexpr upper bounds:

    • max_n_vocab = 1 << 20 (1,048,576). Largest real Whisper models use ~52,000 tokens, so a million is generous.
    • max_word_len = 1 << 16 (65,536). Real vocab entries are typically a few bytes of BPE.
  2. After reading n_vocab, reject values outside [0, max_n_vocab] with a clear log line and return false.

  3. Inside the per-token loop, after reading len, reject values greater than max_word_len with a clear log line and return false.

Returning false is the documented failure path from whisper_model_load; the caller (whisper_init_from_file_with_params_no_state) already handles it and emits "failed to load model" to stderr.

Reproduction

Craft a small (~64-byte) malformed model file with a huge n_vocab (the original AFL++ finding was only 8 bytes):

import struct

with open('malformed.bin', 'wb') as f:
    f.write(struct.pack('<I', 0x67676d6c))   # GGML_FILE_MAGIC
    for _ in range(11):
        f.write(struct.pack('<i', 0))        # hparams, all zero
    f.write(struct.pack('<i', 0))            # mel filters: n_mel
    f.write(struct.pack('<i', 0))            # mel filters: n_fft
    f.write(struct.pack('<i', 999999999))    # n_vocab (huge)
    f.write(struct.pack('<I', 0xFFFFFFFF))   # first vocab len (huge)

On current master (166c20b):

$ whisper-cli -m malformed.bin -f speech.wav
...
whisper_model_load: n_mels        = 0
whisper_model_load: ftype         = 0
whisper_model_load: type          = 0 (unknown)
^C    (hangs indefinitely on vocab vector resize or eventually SIGABRT)

With this patch:

$ whisper-cli -m malformed.bin -f speech.wav
...
whisper_model_load: type          = 0 (unknown)
whisper_model_load: invalid vocab size 999999999 (expected 0..1048576); malformed model file
whisper_init_with_params_no_state: failed to load model
error: failed to initialize whisper context

Returns cleanly. No SIGABRT, no hang.

Differential matrix

model ∈ {base, small}, fixture ∈ {speech-en, speech-ru, long-en-70s, zero-1.2s-16k}. 8 cells per build, all valid-model runs (no target cells: malformed-model handling is tested separately above).

| cells | target cells | target improved | target regressed | non-target unchanged | non-target changed |
|-------|--------------|-----------------|------------------|----------------------|--------------------|
| 8     | 0            | 0               | 0                | 8                    | 0                  |

Every valid-model transcription is byte-identical before and after the patch. The bounds are wide enough that no real model can hit them.

What this does not do

  • Does not add validation to hparams or mel-filter reads upstream of vocab. Those are separate potential hardening targets; keeping scope to the reported crash site.
  • Does not catch every possible malformed-file shape. A file that passes the vocab bounds but has a truncated tensor section will still fail later; that is the correct failure (no aborts, returns false).
  • Does not change read_safe itself. It is still a raw read; the bounds live at the caller where semantic context is available.

Tools used

git, cmake, whisper-cli, and audiokit for the differential matrix on valid models.

Disclosure

I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. Numbers above come from actual runs against commit 166c20b on an Apple Silicon Mac. The malformed-model fixture and regress config are available; happy to share.

whisper_model_load reads n_vocab (int32) and per-token length (uint32)
directly from the model file with no bounds check. A malformed or fuzzed
model (e.g. an 8-byte AFL++ finding) can set these to values that cause
std::vector::resize to throw bad_alloc, which is uncaught and terminates
the process with SIGABRT (signal 6) before any error is reported.

Cap n_vocab at 2^20 tokens (real models top out around 52k) and each
per-token length at 2^16 bytes. On violation, log a clear error message
and return false so whisper_init_from_file_with_params_no_state can fail
gracefully.

Fixes ggml-org#3674

Successfully merging this pull request may close these issues.

SIGABRT in whisper_model_load: string constructed from invalid pointer when loading malformed model file
