Prevent the sum of the dequantized activation in q8_1 from overflowing #21652
bartowski1182 wants to merge 3 commits into ggml-org:master
Can you dump the BF16 values of the problematic tensor? I also noticed some irregularities in this specific model in #20668 (comment). To me it looks like the model data is not sound, so I don't think patching the code is warranted.
@ggerganov Yeah sure, and it's the same tensor that you noted in that eval bug. I added the debugging code back so you can see this:

For the BF16 weights, I ran a similar command to what you ran in the linked report:

./build/bin/llama-debug -m Mistral-Small-4-bf16.gguf -p "[SYSTEM_PROMPT] You are a helpful assistant[/SYSTEM_PROMPT][INST] Hello[/INST]" -n 1 --tensor-filter "ffn_moe_weighted-32"

And with Q4_0 (with ffn_down set to Q4_1) in case it's relevant:

./build/bin/llama-debug -m Mistral-Small-4-Q4_0.gguf -p "[SYSTEM_PROMPT] You are a helpful assistant[/SYSTEM_PROMPT][INST] Hello[/INST]" -n 1 --tensor-filter "ffn_moe_weighted-32"

(this was without my changes, so it asserted)

If the model data is not sound, I'm not sure where to go from here, though this clamping does make it run and doesn't affect any sound model. But I totally understand not wanting to add arbitrary code that masks bugs in the model itself, so I'm more than happy to hear your personal judgement.
Overview
During Mistral 4 small quantization and subsequent testing, I found that the PPL of Q4_1 ended up as NaN.

When investigating the reason, I found it only happened when the later FFN_DOWN layers were quantized to Q4_1, i.e. one run works as expected, but another (note the --tensor-type ffn_down=q4_1) gets NaN for PPL.

After digging around with Claude and debug code, I found that 16 Q8_1 blocks have s = Inf because the fp16 value is overflowing.

In Claude's words:
Additional information
I ran the same model with the updated activation code and got a PPL of 5.5535 +/- 0.1235.

For completeness, I also tested ignoring the pre-computed s value and recalculating the result as f32, and got a PPL of 5.5725 +/- 0.12469.

Note that in either case, the PPL without this change was NaN, so while this clamping is lossy, it does result in a model that produces literally anything at all instead of failing spectacularly.

Note that this only updates the reference, AVX2, AVX1, and CUDA implementations; I'm not familiar enough with the other archs to touch those.
Mistral 4 small PPL before these changes
Mistral 4 small PPL after these changes
Also tested on a Q4_1 quant of Qwen 3.5 9B and got identical PPL results both with and without this change.
Qwen 3.5 9B before these changes
Qwen 3.5 9B after these changes
Requirements