
whisper : validate vocab size and per-token length when loading model (#3674) #3767

Open
achyutbenz19 wants to merge 1 commit into
ggml-org:master from achyutbenz19:fix/3674-vocab-bounds

Conversation

@achyutbenz19

Summary

Fixes #3674.

whisper_model_load reads n_vocab (int32_t) and the per-token length (uint32_t) directly from the model file using read_safe, which reads raw bytes without any bounds check. A malformed or fuzzed model file (the reporter found one with AFL++ at 8 bytes) can set these values to e.g. 999999999 and 0xFFFFFFFF, which then feed into std::vector::resize. The allocation fails with std::bad_alloc, nothing catches it, and the process terminates via SIGABRT before any error is reported to the caller.

Scope of the change

src/whisper.cpp, whisper_model_load vocab block, +19/-5.

  1. Add two constexpr upper bounds:

    • max_n_vocab = 1 << 20 (1,048,576). Largest real Whisper models use ~52,000 tokens, so a million is generous.
    • max_word_len = 1 << 16 (65,536). Real vocab entries are typically a few bytes of BPE.
  2. After reading n_vocab, reject values outside [0, max_n_vocab] with a clear log line and return false.

  3. Inside the per-token loop, after reading len, reject values greater than max_word_len with a clear log line and return false.

Returning false is the documented failure path from whisper_model_load; the caller (whisper_init_from_file_with_params_no_state) already handles it and emits "failed to load model" to stderr.

Reproduction

Craft a small (~64-byte) malformed model file with a huge n_vocab (the original AFL++ finding was only 8 bytes):

import struct

with open('malformed.bin', 'wb') as f:
    f.write(struct.pack('<I', 0x67676d6c))   # GGML_FILE_MAGIC
    for _ in range(11):
        f.write(struct.pack('<i', 0))        # hparams, all zero
    f.write(struct.pack('<i', 0))            # mel filters: n_mel
    f.write(struct.pack('<i', 0))            # mel filters: n_fft
    f.write(struct.pack('<i', 999999999))    # n_vocab (huge)
    f.write(struct.pack('<I', 0xFFFFFFFF))   # first vocab len (huge)

On current master (166c20b):

$ whisper-cli -m malformed.bin -f speech.wav
...
whisper_model_load: n_mels        = 0
whisper_model_load: ftype         = 0
whisper_model_load: type          = 0 (unknown)
^C    (hangs indefinitely on vocab vector resize or eventually SIGABRT)

With this patch:

$ whisper-cli -m malformed.bin -f speech.wav
...
whisper_model_load: type          = 0 (unknown)
whisper_model_load: invalid vocab size 999999999 (expected 0..1048576); malformed model file
whisper_init_with_params_no_state: failed to load model
error: failed to initialize whisper context

Returns cleanly. No SIGABRT, no hang.

Differential matrix

model ∈ {base, small}, fixture ∈ {speech-en, speech-ru, long-en-70s, zero-1.2s-16k}. 8 cells per build, all valid-model runs (no target cells: malformed-model handling is tested separately above).

| cells | target cells | target improved | target regressed | non-target unchanged | non-target changed |
|-------|--------------|-----------------|------------------|----------------------|--------------------|
| 8     | 0            | 0               | 0                | 8                    | 0                  |

Every valid-model transcription is byte-identical before and after the patch. The bounds are wide enough that no real model can hit them.

What this does not do

  • Does not add validation to hparams or mel-filter reads upstream of vocab. Those are separate potential hardening targets; keeping scope to the reported crash site.
  • Does not catch every possible malformed-file shape. A file that passes the vocab bounds but has a truncated tensor section will still fail later; that is the correct failure (no aborts, returns false).
  • Does not change read_safe itself. It is still a raw read; the bounds live at the caller where semantic context is available.

Tools used

git, cmake, whisper-cli, and audiokit for the differential matrix on valid models.

Disclosure

I am an AI assistant (Anthropic's Claude) helping a user contribute this fix. Numbers above come from actual runs against commit 166c20b on an Apple Silicon Mac. The malformed-model fixture and regress config are available; happy to share.

whisper_model_load reads n_vocab (int32) and per-token length (uint32)
directly from the model file with no bounds check. A malformed or fuzzed
model (e.g. an 8-byte AFL++ finding) can set these to values that cause
std::vector::resize to throw bad_alloc, which is uncaught and terminates
the process with SIGABRT (signal 6) before any error is reported.

Cap n_vocab at 2^20 tokens (real models top out around 52k) and each
per-token length at 2^16 bytes. On violation, log a clear error message
and return false so whisper_init_from_file_with_params_no_state can fail
gracefully.

Fixes ggml-org#3674

Successfully merging this pull request may close these issues.

SIGABRT in whisper_model_load: string constructed from invalid pointer when loading malformed model file
