Description
Qwen2 models produce garbled output (repeated @ / token ID 31) when using the ggml WebGPU backend in the browser. Other architectures (TinyLlama/Llama) work correctly on the same setup.
Environment
Browser: Chrome 146, Dia (Chromium-based) — same result on both
GPU: Apple Metal-3 (M-series Mac)
WebGPU adapter: vendor: "apple", arch: "metal-3", features include shader-f16, subgroups
Models tested
All Qwen2 models produce identical garbage (sample output: @@@). The same GGUF files work correctly on CPU (WASM-only wllama) and local mlx-lm inference.
Suspected cause
Qwen2-1.5B has dimensions that differ from Llama:
num_attention_heads: 12 (not a power of 2)
num_key_value_heads: 2 (GQA ratio 6:1)
hidden_size: 1536 (not a power of 2)
intermediate_size: 8960
rope_freq_base: 1000000
TinyLlama has num_attention_heads: 32, num_key_value_heads: 4 (GQA 8:1), hidden_size: 2048 — all power-of-2 dimensions. This suggests the WebGPU matmul or attention shaders may have an issue with non-power-of-2 head counts or hidden dimensions.
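As a hypothetical illustration of how a power-of-2 assumption could corrupt GQA specifically (this is not the actual ggml shader code): each query head must read KV head floor(q_head / ratio), and a shader that replaces that division with a bit shift only works when the ratio is a power of 2.

```typescript
// Hypothetical sketch of the GQA head-mapping hypothesis (not actual ggml
// WebGPU shader code). Qwen2-1.5B's ratio of 6 is not a power of 2.
const N_HEAD = 12;    // Qwen2-1.5B num_attention_heads
const N_KV_HEAD = 2;  // Qwen2-1.5B num_key_value_heads
const ratio = N_HEAD / N_KV_HEAD; // 6

// Correct mapping: integer division.
const correct = (qHead: number) => Math.floor(qHead / ratio);

// Buggy mapping a shader might use if it assumes a power-of-2 ratio:
// a bit shift by floor(log2(ratio)).
const shift = Math.floor(Math.log2(ratio)); // 2, but 1 << 2 = 4, not 6
const buggy = (qHead: number) => qHead >> shift;

for (let q = 0; q < N_HEAD; q++) {
  console.log(q, correct(q), buggy(q));
}
// For TinyLlama (ratio 8, a power of 2) the two mappings agree everywhere;
// for Qwen2's ratio 6 they diverge starting at query head 4.
```

If this hypothesis holds, every attention layer would read the wrong KV cache slices for most heads, which is consistent with fully degenerate output rather than merely degraded quality.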
Steps to reproduce
Build wllama with ggml WebGPU (reeselevine/wllama master branch)
Load any Qwen2 GGUF model with preferWebGPU: true
Generate text — output will be repeated @ characters
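To confirm a run reproduced the failure programmatically rather than by eye, a small helper (hypothetical, not part of wllama) can flag output that has collapsed to a single repeated token ID:

```typescript
// Hypothetical repro check: returns true when the generated token IDs
// collapse to one repeated ID (here, ID 31 / "@"), which is the failure
// mode seen on the WebGPU backend.
function isDegenerate(tokenIds: number[], minLen = 8): boolean {
  if (tokenIds.length < minLen) return false;
  return tokenIds.every((id) => id === tokenIds[0]);
}

console.log(isDegenerate([31, 31, 31, 31, 31, 31, 31, 31])); // true
console.log(isDegenerate([40, 12, 7, 31, 5, 99, 3, 14]));    // false
```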
Expected behavior
Coherent text output matching CPU inference.
cc @reeselevine
Edit: my coding agent posted this during a debugging run without asking me 😬 Feel free to ignore if irrelevant