
Xid 43 GPU crash on RTX 5090 (Blackwell) after b8680 — flash_attn_stream_k_fixup kernel #21564

@lance0

Description


Summary

After upgrading from b8679 to b8685, llama-server crashes repeatedly with NVIDIA Xid 43 errors on RTX 5090 (Blackwell, sm_120a). Rolling back to b8679 resolves the issue completely. The only CUDA-related change in this range is #21159 (flash_attn_stream_k_fixup kernel optimization, merged as b8680).

Environment

  • GPU: NVIDIA RTX 5090 (GB202, sm_120a, 32GB GDDR7)
  • CUDA Toolkit: 12.8 (/usr/local/cuda-12.8)
  • Driver: nvidia-driver-580-open (580.126.20)
  • Build flags: -DCMAKE_CUDA_ARCHITECTURES="120" (auto-converts to 120a), -DGGML_CUDA_FA_ALL_QUANTS=ON, -DGGML_NATIVE=ON
  • OS: Ubuntu 24.04, kernel 6.17.0-20-generic
  • Model: Qwen3.5-27B UD-Q6_K_XL (25.7GB), 131K context, q4_0 KV cache, flash attention enabled

Reproduction

  1. Build llama.cpp at b8685 (or any tag >= b8680) with CUDA 12.8 + sm_120a
  2. Start llama-server with flash attention enabled and any model
  3. Send a chat completion request
  4. Server crashes with Xid 43

With b8679, the same configuration runs indefinitely without errors.
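For reference, the steps above roughly correspond to the following commands (a sketch based on the environment section: flag names for llama-server vary slightly between builds, and the model path is a placeholder):

```shell
# Build llama.cpp at the affected tag with native sm_120a support (CUDA 12.8).
git checkout b8685
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="120" \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_NATIVE=ON
cmake --build build --config Release -j

# Launch with flash attention and a q4_0 KV cache at 131K context,
# matching the reported configuration. Model path is a placeholder.
./build/bin/llama-server \
  -m /path/to/Qwen3.5-27B-UD-Q6_K_XL.gguf \
  -c 131072 \
  --flash-attn \
  --cache-type-k q4_0 --cache-type-v q4_0
```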

dmesg output

2026-04-07T10:11:08,080132-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1049772, name=llama-server, channel 0x00000002
2026-04-07T10:11:29,913774-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1108679, name=llama-server, channel 0x00000002
2026-04-07T10:12:37,588699-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1108875, name=llama-server, channel 0x00000002
2026-04-07T10:13:21,041572-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1109236, name=llama-server, channel 0x00000002
2026-04-07T10:13:41,759143-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1109611, name=llama-server, channel 0x00000002
2026-04-07T10:14:48,131891-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1109788, name=llama-server, channel 0x00000002
2026-04-07T10:19:15,071737-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1110167, name=llama-server, channel 0x00000002
2026-04-07T10:19:35,554712-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1112450, name=llama-server, channel 0x00000002

Each PID is a different llama-server instance; the process manager auto-restarts the server after each crash. That is 8 crashes in ~10 minutes.
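The dmesg lines above can be tallied with a small script (a hypothetical helper, not part of llama.cpp) to confirm that every event is Xid 43 and that each comes from a distinct PID:

```python
import re

# Matches the NVRM Xid lines emitted to dmesg, e.g.:
#   NVRM: Xid (PCI:0000:41:00): 43, pid=1049772, name=llama-server, channel 0x00000002
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>\S+), channel (?P<channel>0x[0-9a-fA-F]+)"
)

def parse_xid_events(dmesg_text: str):
    """Return (xid, pid, name, channel) for every NVRM Xid line in dmesg output."""
    events = []
    for line in dmesg_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append((int(m.group("xid")), int(m.group("pid")),
                           m.group("name"), m.group("channel")))
    return events
```

Running this over the log above yields 8 events, all Xid 43, all on channel 0x00000002, with 8 distinct PIDs.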

Bisect

  • b8679 (94ca829b6) — STABLE, no crashes
  • b8680 (15f786e65) — first tag with #21159 ([CUDA] flash_attn_stream_k_fixup)
  • b8685 (0988accf8) — crashes immediately

Notes

  • Xid 43 = illegal memory access on the GPU (not OOM, not driver timeout)
  • The Xid 8 errors from the previous day (different PIDs, different channel) were a separate issue with Gemma 4 mmproj, unrelated
  • CUDA 12.8 builds with native sm_120a — no PTX JIT, no FORCE_CUBLAS
  • This GPU has been stable on b8665–b8679 for weeks with the same model and configuration
