Summary

After upgrading from b8679 to b8685, llama-server crashes repeatedly with NVIDIA Xid 43 errors on an RTX 5090 (Blackwell, sm_120a). Rolling back to b8679 resolves the issue completely. The only CUDA-related change in this range is #21159 (`flash_attn_stream_k_fixup` kernel optimization, merged as b8680).
dmesg output
2026-04-07T10:11:08,080132-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1049772, name=llama-server, channel 0x00000002
2026-04-07T10:11:29,913774-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1108679, name=llama-server, channel 0x00000002
2026-04-07T10:12:37,588699-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1108875, name=llama-server, channel 0x00000002
2026-04-07T10:13:21,041572-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1109236, name=llama-server, channel 0x00000002
2026-04-07T10:13:41,759143-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1109611, name=llama-server, channel 0x00000002
2026-04-07T10:14:48,131891-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1109788, name=llama-server, channel 0x00000002
2026-04-07T10:19:15,071737-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1110167, name=llama-server, channel 0x00000002
2026-04-07T10:19:35,554712-04:00 NVRM: Xid (PCI:0000:41:00): 43, pid=1112450, name=llama-server, channel 0x00000002
Each PID is a different llama-server instance (process manager auto-restarts after crash). 8 crashes in ~10 minutes.
b8679 (94ca829b6) — STABLE, no crashes
b8680 (15f786e65) — first tag with #21159 [CUDA] flash_attn_stream_k_fixup
b8685 (0988accf8) — crashes immediately
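The tag range above is small enough to bisect automatically. A sketch of how that could be scripted, assuming the crash reproduces within a couple of minutes; the build and probe commands are illustrative, not the reporter's actual script, and the model path is a placeholder:

```shell
# git bisect between the known-bad and known-good tags from this report.
# A revision counts as "good" if llama-server survives the probe window.
git bisect start b8685 b8679   # bad first, then good
git bisect run sh -c '
  cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120" \
    && cmake --build build -j || exit 125          # 125 tells bisect to skip unbuildable revisions
  timeout 120 ./build/bin/llama-server -m model.gguf --flash-attn on
  test $? -eq 124                                  # 124 = timeout fired, i.e. server was still alive
'
```

The `test $? -eq 124` line converts "survived the full window" into exit 0 (good) and any crash exit code into exit 1 (bad), matching `git bisect run` conventions.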
Environment
CUDA toolkit: /usr/local/cuda-12.8
Build flags: -DCMAKE_CUDA_ARCHITECTURES="120" (auto-converts to 120a), -DGGML_CUDA_FA_ALL_QUANTS=ON, -DGGML_NATIVE=ON
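For reference, a build invocation consistent with these flags might look like the following. Only the `-D` options above come from the report; the build directory, `-DGGML_CUDA=ON`, and the CUDA PATH line are assumptions:

```shell
# Hypothetical reconstruction of the reporter's build; only the three
# architecture/quant/native flags are taken from the report.
export PATH=/usr/local/cuda-12.8/bin:$PATH
cmake -B build \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120" \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DGGML_NATIVE=ON
cmake --build build --config Release -j
```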
Reproduction

Run llama-server with flash attention enabled and any model.

With b8679, the same configuration runs indefinitely without errors.
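A minimal launch along those lines; the model path and port are placeholders, and the exact flash-attention flag syntax varies between llama.cpp builds:

```shell
# Hypothetical reproduction command; the report says any model triggers it.
./build/bin/llama-server \
    -m /path/to/model.gguf \
    --flash-attn on \
    --port 8080
# Drive it with completion requests until dmesg reports Xid 43, e.g.:
# curl http://localhost:8080/completion -d '{"prompt": "Hello", "n_predict": 64}'
```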
Bisect
Commits between b8679 and b8685:
0988accf8 — [SYCL] Q8_0 reorder ([SYCL] Add Q8_0 reorder optimization for Intel GPUs (~3x token generation speedup) #21527) — not CUDA
0033f53a0 — docs typo fix
d0a6dfeb2 — WebGPU MUL_MAT_ID (ggml-webgpu: Add the support of MUL_MAT_ID #21147) — not CUDA
2e1f0a889 — Q1_0 1-bit quant CPU (ggml: add Q1_0 1-bit quantization support (CPU) #21273) — not CUDA
506200cf8 — CLI newline fix (console: fix stripping of \n in multiline input #21485) — not CUDA
15f786e65 — [CUDA] flash_attn_stream_k_fixup ([CUDA] Write an optimized flash_attn_stream_k_fixup kernel #21159) — only CUDA change

Notes