
SYCL: fix reorder crash when device memory is full #21618

Closed

PMZFX wants to merge 1 commit into ggml-org:master from PMZFX:sycl-fix-reorder-oom

Conversation

Contributor

@PMZFX PMZFX commented Apr 8, 2026

Summary

  • The reorder optimization (AoS to SoA weight layout, added in [SYCL] Add Q8_0 reorder optimization for Intel GPUs (~3x token generation speedup) #21527) allocates a temp buffer the size of the entire weight tensor. On GPUs where the model fills most of VRAM, this allocation fails and crashes with a NULL pointer memcpy.
  • The fix adds a host-memory fallback: if device allocation fails, the temp buffer is allocated in system RAM instead. The reorder kernel then reads from host memory over PCIe, which is slower for the one-time transform but preserves the optimization for all subsequent tokens.
  • Also fixes a bug where opt_for_reorder() unconditionally set the reorder flag even when the reorder was skipped. This caused the DMMV/MMVQ kernels to interpret AoS data as SoA, producing garbage output.
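The flag bug in the last bullet can be sketched in a few lines. This is a minimal, hypothetical illustration, not the actual ggml SYCL code: only the name `opt_for_reorder()` comes from the PR, and the `tensor` struct, `reordered` field, and `reorder_weights()` helper are stand-ins invented for the example.

```cpp
#include <cassert>

// Hypothetical stand-in for a weight tensor; only opt_for_reorder()
// is a name taken from this PR.
struct tensor {
    bool reordered = false;  // when set, DMMV/MMVQ kernels assume SoA layout
};

// Hypothetical reorder step: fails when no temp buffer could be allocated.
static bool reorder_weights(tensor & /*t*/, bool alloc_ok) {
    if (!alloc_ok) {
        return false;  // allocation failed: weights stay in AoS layout
    }
    // ... the AoS -> SoA transform would run here ...
    return true;
}

static void opt_for_reorder(tensor & t, bool alloc_ok) {
    // Before the fix the flag was set unconditionally, so a skipped
    // reorder still made the kernels read AoS data as if it were SoA.
    t.reordered = reorder_weights(t, alloc_ok);
}
```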

Details

Reported in #20478. Root cause is that sycl_ext_malloc_device returns NULL when VRAM is too full for the tensor-sized temp buffer. The crash happens on the first token generation (DMMV/MMVQ path), not during prompt processing (GEMM path, which doesn't reorder).

The fix affects all four reorder functions (Q4_0, Q8_0, Q4_K, Q6_K). Two new helpers (sycl_ext_malloc_with_fallback, sycl_ext_free_fallback) keep the per-function changes minimal.
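The fallback logic the two helpers encapsulate can be sketched as follows. Only the helper names `sycl_ext_malloc_with_fallback` and `sycl_ext_free_fallback` come from this PR; the real versions would call SYCL device/host allocators, while here plain `std::malloc` stands in for both and a flag simulates a full device, so the control flow compiles standalone. Everything else (the enum, struct, and signatures) is an assumption for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Where the buffer actually landed; "none" means both allocations failed.
enum class alloc_kind { device, host, none };

struct fallback_buffer {
    void *     ptr  = nullptr;
    alloc_kind kind = alloc_kind::none;
};

// Try device memory first; if that fails, fall back to host memory so the
// one-time reorder can still run (the kernel reads over PCIe). If both
// fail, the caller must skip the reorder and keep the unoptimized path.
static fallback_buffer sycl_ext_malloc_with_fallback(std::size_t bytes,
                                                     bool device_full) {
    fallback_buffer buf;
    void * p = device_full ? nullptr : std::malloc(bytes);  // stand-in for device alloc
    if (p != nullptr) {
        buf.ptr  = p;
        buf.kind = alloc_kind::device;
        return buf;
    }
    p = std::malloc(bytes);  // stand-in for host alloc
    if (p != nullptr) {
        buf.ptr  = p;
        buf.kind = alloc_kind::host;
    }
    return buf;  // kind == none signals "skip the reorder"
}

static void sycl_ext_free_fallback(fallback_buffer & buf) {
    std::free(buf.ptr);  // the real helper would dispatch on buf.kind
    buf = {};
}
```

Keeping the device/host decision inside one helper is what lets the four per-type reorder functions stay almost unchanged: each just checks the returned `kind` instead of duplicating the retry logic.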

Test plan

  • Q8_0 9B, single GPU, device reorder path (37.7 t/s tg)
  • Q8_0 9B, forced host fallback path (21.3 t/s tg, correct output)
  • Q4_K_M 27B, single GPU (16.0 t/s tg)
  • Q8_0 9B, llama-bench pp512/tg32 (2554/48 t/s, GEMM path unaffected)
  • Q8_0 9B with GGML_SYCL_DISABLE_OPT=1 (16.9 t/s tg, reorder skipped correctly)

Tested on Intel Arc Pro B70 (32GB). AI-assisted development (Claude). Code reviewed and tested on my hardware.

The reorder optimization allocates a temporary buffer the full size of
the weight tensor on the device. When VRAM is nearly full (large models
on a single GPU), this allocation fails and the subsequent memcpy crashes
on a NULL pointer.

Fix: try device allocation first, fall back to host memory if device
memory is full. The reorder kernel still works correctly reading from
host memory over PCIe. This is slower for the one-time reorder (~21 t/s
vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for
all subsequent inference. If both device and host allocation fail, skip
the reorder and fall back to the unoptimized kernel path.

Also fixes a bug where opt_for_reorder() marked tensors as reordered
even when the reorder was skipped due to allocation failure. This caused
DMMV/MMVQ kernels to read the original AoS data as if it were SoA,
producing garbage output or NaN results.

Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models. Coding was
AI-assisted (Claude), reviewed and tested on hardware by a human.

Fixes ggml-org#20478
@PMZFX PMZFX requested a review from a team as a code owner April 8, 2026 11:18
Contributor Author

PMZFX commented Apr 8, 2026

Folded into #21638 along with the GEMM dequantize fix. One PR for both Q8_0 reorder issues.

@PMZFX PMZFX closed this Apr 8, 2026
