
[SYCL] Fix Q8_0 reorder: garbage on 2nd prompt + crash on full VRAM#21638

Open
PMZFX wants to merge 5 commits into ggml-org:master from PMZFX:fix/q8_0-reorder-dequantize

Conversation

Contributor

@PMZFX PMZFX commented Apr 8, 2026

Summary

Fixes two issues with the Q8_0 reorder optimization introduced in #21527.

Bug 1: Garbage output from second prompt onward (#21589)

The Q8_0 reorder optimization rearranges weight data during token generation (batch=1, via DMMV/MMVQ), but the general GEMM dequantization path used during prompt processing was missing a reorder-aware variant for Q8_0. After the first token-generation (tg) pass reordered the weights, subsequent prompt processing read them with the standard dequantizer, producing corrupt output.

Q4_0, Q4_K, and Q6_K already had _reorder dequantizers in convert.cpp. Q8_0 was missing them.

Fix: Add dequantize_block_q8_0_reorder() in dequantize.hpp and wire it into both ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl() in convert.cpp.
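To illustrate the layout mismatch, here is a minimal host-side C++ sketch (not the actual llama.cpp code): the standard dequantizer walks an array of `block_q8_0` structs (AoS), while the reordered buffer stores all quants first and all scales after them (SoA). The block struct is simplified — a plain float scale instead of the real fp16 — and all names besides `block_q8_0`/`QK8_0` are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified Q8_0 block: 32 int8 quants plus one scale.
// (The real block_q8_0 stores the scale as fp16, not float.)
constexpr int QK8_0 = 32;
struct block_q8_0 {
    float  d;          // scale
    int8_t qs[QK8_0];  // quantized values
};

// Reorder nblocks AoS blocks into SoA: all quants first, then all scales.
static void reorder_q8_0(const block_q8_0 * src, int nblocks, std::vector<uint8_t> & dst) {
    dst.resize(nblocks * sizeof(block_q8_0));
    int8_t * qs_ptr = reinterpret_cast<int8_t *>(dst.data());
    float  * d_ptr  = reinterpret_cast<float *>(dst.data() + nblocks * QK8_0);
    for (int b = 0; b < nblocks; ++b) {
        for (int i = 0; i < QK8_0; ++i) qs_ptr[b * QK8_0 + i] = src[b].qs[i];
        d_ptr[b] = src[b].d;
    }
}

// Reorder-aware dequantizer: quants and scales live in separate regions,
// so the standard AoS dequantizer would read garbage from this buffer.
static float dequant_reordered(const uint8_t * data, int nblocks, int idx) {
    const int8_t * qs_ptr = reinterpret_cast<const int8_t *>(data);
    const float  * d_ptr  = reinterpret_cast<const float *>(data + nblocks * QK8_0);
    return d_ptr[idx / QK8_0] * qs_ptr[idx];
}
```

The fix in the PR is essentially the SoA-indexing half of this sketch, wired into the existing GEMM conversion tables.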

Bug 2: Crash when device memory is full

The reorder functions allocate a temporary buffer the full size of the weight tensor via sycl_ext_malloc_device(). When VRAM is nearly full (large models on smaller cards), this returns NULL and the subsequent memcpy crashes.

Fix: Add a host memory fallback (sycl::malloc_host) when device allocation fails, and skip the reorder gracefully if both fail. The reorder flag is only set when the reorder actually succeeds.
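The fallback control flow can be sketched in plain, GPU-free C++. The injectable allocators below stand in for `sycl_ext_malloc_device()` and `sycl::malloc_host`, and `malloc_with_fallback` is an illustrative name, not the PR's exact function:

```cpp
#include <cstddef>
#include <cstdlib>

// Stand-ins for the SYCL device/host allocators; injectable so the
// out-of-memory path can be exercised without a GPU.
using alloc_fn = void * (*)(size_t);

// Try the device pool first; on failure fall back to host memory and record
// which pool the pointer came from so the caller can free it correctly.
// Returns nullptr only when both pools are exhausted, in which case the
// caller must skip the reorder and leave the tensor's reorder flag unset.
static void * malloc_with_fallback(size_t size, alloc_fn device_alloc,
                                   alloc_fn host_alloc, bool & host_fallback) {
    host_fallback = false;
    if (void * p = device_alloc(size)) {
        return p;
    }
    if (void * p = host_alloc(size)) {
        host_fallback = true;
        return p;
    }
    return nullptr;
}
```

The key detail from the PR is the last case: a null return must leave the reorder flag untouched, otherwise the MMVQ/DMMV kernels would later read unreordered data as if it were SoA.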

Fixes #21589

Testing

Tested on Intel Arc Pro B70 (Xe2 Battlemage), oneAPI DPC++ 2025.3.3:

  • 5 sequential prompts on Qwen3.5-9B Q8_0 via llama-server: all coherent
  • 5 sequential prompts on Qwen3.5-35B-A3B Q4_K_M (mixed quant with Q8_0 tensors): all coherent
  • 3 sequential prompts on Qwen2.5-1.5B Q8_0: all coherent
  • llama-bench confirms Q8_0 tg speedup is preserved (180 t/s)

Code was written with AI assistance (Claude), reviewed and tested by me on real hardware.

The Q8_0 reorder optimization (ggml-org#21527) was missing a reorder-aware
dequantizer for the GEMM code path used during prompt processing.
After token generation reordered Q8_0 weights (via DMMV/MMVQ), the
next prompt processing pass would read them with the standard
dequantizer, producing garbage output.

Add dequantize_block_q8_0_reorder() and wire it into both
ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl(), matching the
pattern already used by Q4_0, Q4_K, and Q6_K.

Fixes ggml-org#21589

AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.
The reorder optimization allocates a temporary buffer the full size of
the weight tensor on the device. When VRAM is nearly full (large models
on a single GPU), this allocation fails and the subsequent memcpy crashes
on a NULL pointer.

Fix: try device allocation first, fall back to host memory if device
memory is full. The reorder kernel still works correctly reading from
host memory over PCIe. This is slower for the one-time reorder (~21 t/s
vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for
all subsequent inference. If both device and host allocation fail, skip
the reorder and fall back to the unoptimized kernel path.

Also fixes a bug where opt_for_reorder() marked tensors as reordered
even when the reorder was skipped due to allocation failure. This caused
DMMV/MMVQ kernels to read the original AoS data as if it were SoA,
producing garbage output or NaN results.

Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models. Coding was
AI-assisted (Claude), reviewed and tested on hardware by a human.

Fixes ggml-org#20478
@PMZFX PMZFX changed the title [SYCL] Fix Q8_0 reorder: add missing dequantize path for GEMM [SYCL] Fix Q8_0 reorder: garbage on 2nd prompt + crash on full VRAM Apr 8, 2026
static void reorder_qw_q4_0(uint8_t * data_device, const int ncols, const int nrows, size_t size, size_t offset,
// Try device allocation first; if VRAM is full, fall back to host memory so the
// reorder kernel can still run (reading over PCIe instead of device-local).
static inline void * sycl_ext_malloc_with_fallback(dpct::queue_ptr stream, size_t size, bool & host_fallback) {
Contributor


  1. Device access to host memory.
     This feature will be supported by Ubuntu 26 (new Linux kernel).
     I suggest adding a macro so this code path is only enabled on newer Linux kernels, to avoid impacting users on older kernels.

  2. host_fallback
     I suggest defining a new class to handle the memory allocation and freeing, with host_fallback as an internal member variable, to simplify the usage.
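A hypothetical host-only sketch of such an RAII wrapper (the real class in the PR takes a SYCL queue and uses the SYCL allocators; here injectable allocators stand in, and the class shape is a guess at the idea, not the PR's code):

```cpp
#include <cstddef>
#include <cstdlib>

using alloc_fn = void * (*)(size_t);
using free_fn  = void (*)(void *);

// RAII temporary buffer: try the device pool, fall back to the host pool,
// and free from whichever pool succeeded automatically at scope exit.
// host_fallback is a private member, as the review suggests.
class reorder_temp_buffer {
  public:
    reorder_temp_buffer(size_t size,
                        alloc_fn device_alloc, free_fn device_free,
                        alloc_fn host_alloc,   free_fn host_free)
        : device_free_(device_free), host_free_(host_free) {
        ptr_ = device_alloc(size);
        if (!ptr_) {
            ptr_ = host_alloc(size);
            host_fallback_ = (ptr_ != nullptr);
        }
    }
    ~reorder_temp_buffer() {
        if (!ptr_) return;
        if (host_fallback_) host_free_(ptr_); else device_free_(ptr_);
    }
    // A buffer owning raw memory must not be copyable.
    reorder_temp_buffer(const reorder_temp_buffer &) = delete;
    reorder_temp_buffer & operator=(const reorder_temp_buffer &) = delete;

    void * get() const { return ptr_; }
    bool   valid() const { return ptr_ != nullptr; }  // false => skip the reorder

  private:
    void *  ptr_ = nullptr;
    bool    host_fallback_ = false;
    free_fn device_free_;
    free_fn host_free_;
};
```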

Replace sycl_ext_malloc_with_fallback/sycl_ext_free_fallback free
functions with sycl_reorder_temp_buffer RAII class. The host_fallback
bool is now a private member, and cleanup happens automatically at
scope exit.

Add GGML_SYCL_HOST_MEM_FALLBACK cmake option (default ON) to guard
the host memory fallback code path. Device access to host memory
requires Linux kernel 6.8+ (Ubuntu 26.04+); users on older kernels
can set -DGGML_SYCL_HOST_MEM_FALLBACK=OFF to disable it.

Addresses arthw's review on PR ggml-org#21638.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Apr 9, 2026
markussiebert pushed a commit to markussiebert/llama.cpp that referenced this pull request Apr 10, 2026

@tunmaker tunmaker left a comment


Just tested on an Intel Arc A380 with Qwen3.5 2B Q8_0.
Works with thinking both on and off.
17 t/s

Contributor

@NeoZhangJianyu NeoZhangJianyu left a comment


Good job!

Thank you!

Contributor

@arthw arthw left a comment


Only one comment:
Please update the description of GGML_SYCL_HOST_MEM_FALLBACK in SYCL.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 11, 2026
Contributor

@arthw arthw left a comment


Good job!
The Q8_0 reorder is attracting more attention; the issue has been reported in several JIRA tickets, and this PR fixes it quickly.

Additional suggestion:
It's encouraged to keep a PR focused on a single issue. That keeps the changed code smaller and easier to review and approve.

Thank you!

@DanStory

tested on Intel Arc Pro B70 (x4)
👍 unsloth/Qwen3-Coder-Next-GGUF:Q8_0 26.53 t/s
👍 unsloth/GLM-4.7-Flash-GGUF:Q8_0 23.30 t/s
👍 unsloth/Qwen3.5-35B-A3B-GGUF:Q8_0 31.81 t/s
👍 unsloth/gemma-4-31B-it-GGUF:Q8_0 11.89 t/s

@arthw
Contributor

arthw commented Apr 13, 2026

@ggerganov
Could you review this PR?

Thank you!

@arthw arthw requested a review from ggerganov April 13, 2026 13:41
@PMZFX
Contributor Author

PMZFX commented Apr 13, 2026

Thanks for catching this! Investigated and fixed.

Added reorder-aware DMMV dequantizers for Q4_K. Also found and fixed the same gap in Q6_K while I was in there.

Tested on B70:

  • test-backend-ops -o MUL_MAT passes for all 4 reorder types (q4_0, q8_0, q4_K, q6_K)
  • Q4_K_M model (same one you tested): pp512 1039 t/s, tg128 58.5 t/s, multi-turn works
  • Q6_K and Q8_0 benchmarks clean, no regressions

Could you retest with your Q4_K_M model when you get a chance?

Q4_K and Q6_K had reorder support for MMVQ and GEMM paths but not
DMMV. When the DMMV path encountered reordered data it would abort.

Add DMMV kernels that read from the SOA reorder layout for both
types. Same math as the non-reorder versions, different memory
access pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
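The commit's point — same math, different memory access pattern — can be illustrated with a simplified host-side dot product over the SoA layout. This uses a Q8_0-style layout with float scales for brevity; the actual Q4_K/Q6_K DMMV kernels are considerably more involved:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int QK = 32;  // values per quantization block (simplified)

// Dot product of one reordered (SoA) quantized row with a float vector.
// qs holds all quants for the row and d holds one scale per block; in the
// AoS version the same values would be read out of per-block structs.
static float vec_dot_reordered(const int8_t * qs, const float * d,
                               const float * x, int ncols) {
    float sum = 0.0f;
    for (int b = 0; b < ncols / QK; ++b) {
        float block_sum = 0.0f;
        for (int i = 0; i < QK; ++i) {
            block_sum += qs[b * QK + i] * x[b * QK + i];
        }
        sum += d[b] * block_sum;  // identical math to the AoS kernel
    }
    return sum;
}
```

Only the index arithmetic changes between the two layouts, which is why the fix could reuse the existing per-block math unchanged.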
@markussiebert

Works great, hope to see this merged soon :-)


Development

Successfully merging this pull request may close these issues.

Eval bug: SYCL: Qwen3.5 spitting garbage on the second prompt

6 participants