Description
Qwen2 models produce garbled output (repeated @ / token ID 31) when using the ggml WebGPU backend in the browser. Other architectures (TinyLlama/Llama) work correctly on the same setup.
Environment
Browser: Chrome 146, Dia (Chromium-based) — same result on both
GPU: Apple Metal-3 (M-series Mac)
WebGPU adapter: vendor: "apple", arch: "metal-3", features include shader-f16, subgroups
Models tested
All Qwen2 models produce identical garbage (sample output: @@@). The same GGUF files work correctly on CPU (WASM-only wllama) and local mlx-lm inference.
Suspected cause
Qwen2-1.5B has dimensions that differ from Llama:
num_attention_heads: 12 (not a power of 2)
num_key_value_heads: 2 (GQA ratio 6:1)
hidden_size: 1536 (not a power of 2)
intermediate_size: 8960
rope_freq_base: 1000000
TinyLlama has num_attention_heads: 32, num_key_value_heads: 4 (GQA 8:1), hidden_size: 2048 — all power-of-2 dimensions. This suggests the WebGPU matmul or attention shaders may have an issue with non-power-of-2 head counts or hidden dimensions.
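As a hypothetical illustration of how a power-of-2 assumption could corrupt GQA specifically (this is not the actual ggml shader code): each query head must read KV head floor(q_head / ratio), and a shader that replaces that division with a bit shift only works when the ratio is a power of 2.

```typescript
// Hypothetical sketch of the GQA head-mapping hypothesis (not actual ggml
// WebGPU shader code). Qwen2-1.5B's ratio of 6 is not a power of 2.
const N_HEAD = 12;    // Qwen2-1.5B num_attention_heads
const N_KV_HEAD = 2;  // Qwen2-1.5B num_key_value_heads
const ratio = N_HEAD / N_KV_HEAD; // 6

// Correct mapping: integer division.
const correct = (qHead: number) => Math.floor(qHead / ratio);

// Buggy mapping a shader might use if it assumes a power-of-2 ratio:
// a bit shift by floor(log2(ratio)).
const shift = Math.floor(Math.log2(ratio)); // 2, but 1 << 2 = 4, not 6
const buggy = (qHead: number) => qHead >> shift;

for (let q = 0; q < N_HEAD; q++) {
  console.log(q, correct(q), buggy(q));
}
// For TinyLlama (ratio 8, a power of 2) the two mappings agree everywhere;
// for Qwen2's ratio 6 they diverge starting at query head 4.
```

If this hypothesis holds, every attention layer would read the wrong KV cache slices for most heads, which is consistent with fully degenerate output rather than merely degraded quality.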
Steps to reproduce
Build wllama with ggml WebGPU (reeselevine/wllama master branch)
Load any Qwen2 GGUF model with preferWebGPU: true
Generate text — output will be repeated @ characters
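To confirm a run reproduced the failure programmatically rather than by eye, a small helper (hypothetical, not part of wllama) can flag output that has collapsed to a single repeated token ID:

```typescript
// Hypothetical repro check: returns true when the generated token IDs
// collapse to one repeated ID (here, ID 31 / "@"), which is the failure
// mode seen on the WebGPU backend.
function isDegenerate(tokenIds: number[], minLen = 8): boolean {
  if (tokenIds.length < minLen) return false;
  return tokenIds.every((id) => id === tokenIds[0]);
}

console.log(isDegenerate([31, 31, 31, 31, 31, 31, 31, 31])); // true
console.log(isDegenerate([40, 12, 7, 31, 5, 99, 3, 14]));    // false
```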
Expected behavior
Coherent text output matching CPU inference.
cc @reeselevine
Edit: my coding agent posted this during a debugging run without asking me 😬 Feel free to ignore if irrelevant