Summary

After upgrading from b8679 to b8685, llama-server crashes repeatedly with NVIDIA Xid 43 errors on an RTX 5090 (Blackwell, sm_120a). Rolling back to b8679 resolves the issue completely. The only CUDA-related change in this range is #21159 (`flash_attn_stream_k_fixup` kernel optimization, merged as b8680).
dmesg output
2026-04-07T10:11:08,080132-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1049772, name=llama-server, channel 0x00000002
2026-04-07T10:11:29,913774-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1108679, name=llama-server, channel 0x00000002
2026-04-07T10:12:37,588699-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1108875, name=llama-server, channel 0x00000002
2026-04-07T10:13:21,041572-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1109236, name=llama-server, channel 0x00000002
2026-04-07T10:13:41,759143-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1109611, name=llama-server, channel 0x00000002
2026-04-07T10:14:48,131891-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1109788, name=llama-server, channel 0x00000002
2026-04-07T10:19:15,071737-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1110167, name=llama-server, channel 0x00000002
2026-04-07T10:19:35,554712-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1112450, name=llama-server, channel 0x00000002
Each PID is a different llama-server instance (process manager auto-restarts after crash). 8 crashes in ~10 minutes.
b8679 (94ca829b6) — STABLE, no crashes
b8680 (15f786e65) — first tag with #21159 [CUDA] flash_attn_stream_k_fixup
b8685 (0988accf8) — crashes immediately
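The tag range above is small enough to bisect automatically. A sketch of how that could be scripted, assuming the crash reproduces within a couple of minutes; the build and probe commands are illustrative, not the reporter's actual script, and the model path is a placeholder:

```shell
# git bisect between the known-bad and known-good tags from this report.
# A revision counts as "good" if llama-server survives the probe window.
git bisect start b8685 b8679   # bad first, then good
git bisect run sh -c '
  cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120" \
    && cmake --build build -j || exit 125          # 125 tells bisect to skip unbuildable revisions
  timeout 120 ./build/bin/llama-server -m model.gguf --flash-attn on
  test $? -eq 124                                  # 124 = timeout fired, i.e. server was still alive
'
```

The `test $? -eq 124` line converts "survived the full window" into exit 0 (good) and any crash exit code into exit 1 (bad), matching `git bisect run` conventions.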
Environment
CUDA toolkit: /usr/local/cuda-12.8
Build flags: -DCMAKE_CUDA_ARCHITECTURES="120" (auto-converts to 120a), -DGGML_CUDA_FA_ALL_QUANTS=ON, -DGGML_NATIVE=ON
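For reference, a build invocation consistent with these flags might look like the following. Only the `-D` options above come from the report; the build directory, `-DGGML_CUDA=ON`, and the CUDA PATH line are assumptions:

```shell
# Hypothetical reconstruction of the reporter's build; only the three
# architecture/quant/native flags are taken from the report.
export PATH=/usr/local/cuda-12.8/bin:$PATH
cmake -B build \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120" \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DGGML_NATIVE=ON
cmake --build build --config Release -j
```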
Reproduction

Run llama-server with flash attention enabled and any model.

With b8679, the same configuration runs indefinitely without errors.
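A minimal launch along those lines; the model path and port are placeholders, and the exact flash-attention flag syntax varies between llama.cpp builds:

```shell
# Hypothetical reproduction command; the report says any model triggers it.
./build/bin/llama-server \
    -m /path/to/model.gguf \
    --flash-attn on \
    --port 8080
# Drive it with completion requests until dmesg reports Xid 43, e.g.:
# curl http://localhost:8080/completion -d '{"prompt": "Hello", "n_predict": 64}'
```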
Bisect
Commits between b8679 and b8685:
0988accf8 — [SYCL] Q8_0 reorder ([SYCL] Add Q8_0 reorder optimization for Intel GPUs (~3x token generation speedup) #21527) — not CUDA
0033f53a0 — docs typo fix
d0a6dfeb2 — WebGPU MUL_MAT_ID (ggml-webgpu: Add the support of MUL_MAT_ID #21147) — not CUDA
2e1f0a889 — Q1_0 1-bit quant CPU (ggml: add Q1_0 1-bit quantization support (CPU) #21273) — not CUDA
506200cf8 — CLI newline fix (console: fix stripping of \n in multiline input #21485) — not CUDA
15f786e65 — [CUDA] flash_attn_stream_k_fixup ([CUDA] Write an optimized flash_attn_stream_k_fixup kernel #21159) — only CUDA change

Notes