
SYCL: fix reorder crash when device memory is full #21618

Closed

PMZFX wants to merge 1 commit into ggml-org:master from PMZFX:sycl-fix-reorder-oom

Conversation

Contributor

@PMZFX PMZFX commented Apr 8, 2026

Summary

  • The reorder optimization (AoS to SoA weight layout, added in [SYCL] Add Q8_0 reorder optimization for Intel GPUs (~3x token generation speedup) #21527) allocates a temp buffer the size of the entire weight tensor. On GPUs where the model fills most of VRAM, this allocation fails and crashes with a NULL pointer memcpy.
  • The fix adds a host-memory fallback: if device allocation fails, the temp buffer is allocated in system RAM instead. The reorder kernel then reads from host memory over PCIe, which is slower for the one-time transform but preserves the optimization for all subsequent tokens.
  • Also fixes a bug where opt_for_reorder() unconditionally set the reorder flag even when the reorder was skipped. This caused the DMMV/MMVQ kernels to interpret AoS data as SoA, producing garbage output.
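The flag bug in the last bullet can be sketched in a few lines. This is a minimal, hypothetical illustration, not the actual ggml SYCL code: only the name `opt_for_reorder()` comes from the PR, and the `tensor` struct, `reordered` field, and `reorder_weights()` helper are stand-ins invented for the example.

```cpp
#include <cassert>

// Hypothetical stand-in for a weight tensor; only opt_for_reorder()
// is a name taken from this PR.
struct tensor {
    bool reordered = false;  // when set, DMMV/MMVQ kernels assume SoA layout
};

// Hypothetical reorder step: fails when no temp buffer could be allocated.
static bool reorder_weights(tensor & /*t*/, bool alloc_ok) {
    if (!alloc_ok) {
        return false;  // allocation failed: weights stay in AoS layout
    }
    // ... the AoS -> SoA transform would run here ...
    return true;
}

static void opt_for_reorder(tensor & t, bool alloc_ok) {
    // Before the fix the flag was set unconditionally, so a skipped
    // reorder still made the kernels read AoS data as if it were SoA.
    t.reordered = reorder_weights(t, alloc_ok);
}
```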

Details

Reported in #20478. Root cause is that sycl_ext_malloc_device returns NULL when VRAM is too full for the tensor-sized temp buffer. The crash happens on the first token generation (DMMV/MMVQ path), not during prompt processing (GEMM path, which doesn't reorder).

The fix affects all four reorder functions (Q4_0, Q8_0, Q4_K, Q6_K). Two new helpers (sycl_ext_malloc_with_fallback, sycl_ext_free_fallback) keep the per-function changes minimal.
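The fallback logic the two helpers encapsulate can be sketched as follows. Only the helper names `sycl_ext_malloc_with_fallback` and `sycl_ext_free_fallback` come from this PR; the real versions would call SYCL device/host allocators, while here plain `std::malloc` stands in for both and a flag simulates a full device, so the control flow compiles standalone. Everything else (the enum, struct, and signatures) is an assumption for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Where the buffer actually landed; "none" means both allocations failed.
enum class alloc_kind { device, host, none };

struct fallback_buffer {
    void *     ptr  = nullptr;
    alloc_kind kind = alloc_kind::none;
};

// Try device memory first; if that fails, fall back to host memory so the
// one-time reorder can still run (the kernel reads over PCIe). If both
// fail, the caller must skip the reorder and keep the unoptimized path.
static fallback_buffer sycl_ext_malloc_with_fallback(std::size_t bytes,
                                                     bool device_full) {
    fallback_buffer buf;
    void * p = device_full ? nullptr : std::malloc(bytes);  // stand-in for device alloc
    if (p != nullptr) {
        buf.ptr  = p;
        buf.kind = alloc_kind::device;
        return buf;
    }
    p = std::malloc(bytes);  // stand-in for host alloc
    if (p != nullptr) {
        buf.ptr  = p;
        buf.kind = alloc_kind::host;
    }
    return buf;  // kind == none signals "skip the reorder"
}

static void sycl_ext_free_fallback(fallback_buffer & buf) {
    std::free(buf.ptr);  // the real helper would dispatch on buf.kind
    buf = {};
}
```

Keeping the device/host decision inside one helper is what lets the four per-type reorder functions stay almost unchanged: each just checks the returned `kind` instead of duplicating the retry logic.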

Test plan

  • Q8_0 9B, single GPU, device reorder path (37.7 t/s tg)
  • Q8_0 9B, forced host fallback path (21.3 t/s tg, correct output)
  • Q4_K_M 27B, single GPU (16.0 t/s tg)
  • Q8_0 9B, llama-bench pp512/tg32 (2554/48 t/s, GEMM path unaffected)
  • Q8_0 9B with GGML_SYCL_DISABLE_OPT=1 (16.9 t/s tg, reorder skipped correctly)

Tested on Intel Arc Pro B70 (32GB). AI-assisted development (Claude). Code reviewed and tested on my hardware.

The reorder optimization allocates a temporary buffer the full size of
the weight tensor on the device. When VRAM is nearly full (large models
on a single GPU), this allocation fails and the subsequent memcpy crashes
on a NULL pointer.

Fix: try device allocation first, fall back to host memory if device
memory is full. The reorder kernel still works correctly reading from
host memory over PCIe. This is slower for the one-time reorder (~21 t/s
vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for
all subsequent inference. If both device and host allocation fail, skip
the reorder and fall back to the unoptimized kernel path.

Also fixes a bug where opt_for_reorder() marked tensors as reordered
even when the reorder was skipped due to allocation failure. This caused
DMMV/MMVQ kernels to read the original AoS data as if it were SoA,
producing garbage output or NaN results.

Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models. Coding was
AI-assisted (Claude), reviewed and tested on hardware by a human.

Fixes ggml-org#20478
@PMZFX PMZFX requested a review from a team as a code owner April 8, 2026 11:18
Contributor Author

PMZFX commented Apr 8, 2026

Folded into #21638 along with the GEMM dequantize fix. One PR for both Q8_0 reorder issues.

@PMZFX PMZFX closed this Apr 8, 2026
