
[SYCL] Fix Q8_0 reorder: garbage on 2nd prompt + crash on full VRAM#21638

Open
PMZFX wants to merge 5 commits into ggml-org:master from PMZFX:fix/q8_0-reorder-dequantize

Conversation

Contributor

@PMZFX PMZFX commented Apr 8, 2026

Summary

Fixes two issues with the Q8_0 reorder optimization introduced in #21527.

Bug 1: Garbage output from second prompt onward (#21589)

The Q8_0 reorder optimization rearranges weight data during token generation (batch=1, via DMMV/MMVQ), but the general GEMM dequantization path used during prompt processing was missing a reorder-aware variant for Q8_0. After the first token-generation (tg) pass reordered the weights, subsequent prompt processing read them with the standard dequantizer, producing corrupt output.

Q4_0, Q4_K, and Q6_K already had _reorder dequantizers in convert.cpp. Q8_0 was missing them.

Fix: Add dequantize_block_q8_0_reorder() in dequantize.hpp and wire it into both ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl() in convert.cpp.
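To illustrate the layout mismatch, here is a minimal host-side C++ sketch (not the actual llama.cpp code): the standard dequantizer walks an array of `block_q8_0` structs (AoS), while the reordered buffer stores all quants first and all scales after them (SoA). The block struct is simplified — a plain float scale instead of the real fp16 — and all names besides `block_q8_0`/`QK8_0` are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified Q8_0 block: 32 int8 quants plus one scale.
// (The real block_q8_0 stores the scale as fp16, not float.)
constexpr int QK8_0 = 32;
struct block_q8_0 {
    float  d;          // scale
    int8_t qs[QK8_0];  // quantized values
};

// Reorder nblocks AoS blocks into SoA: all quants first, then all scales.
static void reorder_q8_0(const block_q8_0 * src, int nblocks, std::vector<uint8_t> & dst) {
    dst.resize(nblocks * sizeof(block_q8_0));
    int8_t * qs_ptr = reinterpret_cast<int8_t *>(dst.data());
    float  * d_ptr  = reinterpret_cast<float *>(dst.data() + nblocks * QK8_0);
    for (int b = 0; b < nblocks; ++b) {
        for (int i = 0; i < QK8_0; ++i) qs_ptr[b * QK8_0 + i] = src[b].qs[i];
        d_ptr[b] = src[b].d;
    }
}

// Reorder-aware dequantizer: quants and scales live in separate regions,
// so the standard AoS dequantizer would read garbage from this buffer.
static float dequant_reordered(const uint8_t * data, int nblocks, int idx) {
    const int8_t * qs_ptr = reinterpret_cast<const int8_t *>(data);
    const float  * d_ptr  = reinterpret_cast<const float *>(data + nblocks * QK8_0);
    return d_ptr[idx / QK8_0] * qs_ptr[idx];
}
```

The fix in the PR is essentially the SoA-indexing half of this sketch, wired into the existing GEMM conversion tables.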

Bug 2: Crash when device memory is full

The reorder functions allocate a temporary buffer the full size of the weight tensor via sycl_ext_malloc_device(). When VRAM is nearly full (large models on smaller cards), this returns NULL and the subsequent memcpy crashes.

Fix: Add a host memory fallback (sycl::malloc_host) when device allocation fails, and skip the reorder gracefully if both fail. The reorder flag is only set when the reorder actually succeeds.
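The fallback control flow can be sketched in plain, GPU-free C++. The injectable allocators below stand in for `sycl_ext_malloc_device()` and `sycl::malloc_host`, and `malloc_with_fallback` is an illustrative name, not the PR's exact function:

```cpp
#include <cstddef>
#include <cstdlib>

// Stand-ins for the SYCL device/host allocators; injectable so the
// out-of-memory path can be exercised without a GPU.
using alloc_fn = void * (*)(size_t);

// Try the device pool first; on failure fall back to host memory and record
// which pool the pointer came from so the caller can free it correctly.
// Returns nullptr only when both pools are exhausted, in which case the
// caller must skip the reorder and leave the tensor's reorder flag unset.
static void * malloc_with_fallback(size_t size, alloc_fn device_alloc,
                                   alloc_fn host_alloc, bool & host_fallback) {
    host_fallback = false;
    if (void * p = device_alloc(size)) {
        return p;
    }
    if (void * p = host_alloc(size)) {
        host_fallback = true;
        return p;
    }
    return nullptr;
}
```

The key detail from the PR is the last case: a null return must leave the reorder flag untouched, otherwise the MMVQ/DMMV kernels would later read unreordered data as if it were SoA.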

Fixes #21589

Testing

Tested on Intel Arc Pro B70 (Xe2 Battlemage), oneAPI DPC++ 2025.3.3:

  • 5 sequential prompts on Qwen3.5-9B Q8_0 via llama-server: all coherent
  • 5 sequential prompts on Qwen3.5-35B-A3B Q4_K_M (mixed quant with Q8_0 tensors): all coherent
  • 3 sequential prompts on Qwen2.5-1.5B Q8_0: all coherent
  • llama-bench confirms Q8_0 tg speedup is preserved (180 t/s)

Code was written with AI assistance (Claude), reviewed and tested by me on real hardware.

The Q8_0 reorder optimization (ggml-org#21527) was missing a reorder-aware
dequantizer for the GEMM code path used during prompt processing.
After token generation reordered Q8_0 weights (via DMMV/MMVQ), the
next prompt processing pass would read them with the standard
dequantizer, producing garbage output.

Add dequantize_block_q8_0_reorder() and wire it into both
ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl(), matching the
pattern already used by Q4_0, Q4_K, and Q6_K.

Fixes ggml-org#21589

AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.
The reorder optimization allocates a temporary buffer the full size of
the weight tensor on the device. When VRAM is nearly full (large models
on a single GPU), this allocation fails and the subsequent memcpy crashes
on a NULL pointer.

Fix: try device allocation first, fall back to host memory if device
memory is full. The reorder kernel still works correctly reading from
host memory over PCIe. This is slower for the one-time reorder (~21 t/s
vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for
all subsequent inference. If both device and host allocation fail, skip
the reorder and fall back to the unoptimized kernel path.

Also fixes a bug where opt_for_reorder() marked tensors as reordered
even when the reorder was skipped due to allocation failure. This caused
DMMV/MMVQ kernels to read the original AoS data as if it were SoA,
producing garbage output or NaN results.

Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models. Coding was
AI-assisted (Claude), reviewed and tested on hardware by a human.

Fixes ggml-org#20478
@PMZFX PMZFX changed the title [SYCL] Fix Q8_0 reorder: add missing dequantize path for GEMM [SYCL] Fix Q8_0 reorder: garbage on 2nd prompt + crash on full VRAM Apr 8, 2026
static void reorder_qw_q4_0(uint8_t * data_device, const int ncols, const int nrows, size_t size, size_t offset,
// Try device allocation first; if VRAM is full, fall back to host memory so the
// reorder kernel can still run (reading over PCIe instead of device-local).
static inline void * sycl_ext_malloc_with_fallback(dpct::queue_ptr stream, size_t size, bool & host_fallback) {
Contributor


  1. Device access to host memory.
     This feature will be supported by Ubuntu 26 (new Linux kernel).
     I suggest adding a macro so this code path is only enabled on newer Linux kernels, to avoid impacting users on older kernels.

  2. host_fallback
     I suggest defining a new class to handle the memory allocation and freeing, with host_fallback as an internal member variable, to simplify the usage.
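A hypothetical host-only sketch of such an RAII wrapper (the real class in the PR takes a SYCL queue and uses the SYCL allocators; here injectable allocators stand in, and the class shape is a guess at the idea, not the PR's code):

```cpp
#include <cstddef>
#include <cstdlib>

using alloc_fn = void * (*)(size_t);
using free_fn  = void (*)(void *);

// RAII temporary buffer: try the device pool, fall back to the host pool,
// and free from whichever pool succeeded automatically at scope exit.
// host_fallback is a private member, as the review suggests.
class reorder_temp_buffer {
  public:
    reorder_temp_buffer(size_t size,
                        alloc_fn device_alloc, free_fn device_free,
                        alloc_fn host_alloc,   free_fn host_free)
        : device_free_(device_free), host_free_(host_free) {
        ptr_ = device_alloc(size);
        if (!ptr_) {
            ptr_ = host_alloc(size);
            host_fallback_ = (ptr_ != nullptr);
        }
    }
    ~reorder_temp_buffer() {
        if (!ptr_) return;
        if (host_fallback_) host_free_(ptr_); else device_free_(ptr_);
    }
    // A buffer owning raw memory must not be copyable.
    reorder_temp_buffer(const reorder_temp_buffer &) = delete;
    reorder_temp_buffer & operator=(const reorder_temp_buffer &) = delete;

    void * get() const { return ptr_; }
    bool   valid() const { return ptr_ != nullptr; }  // false => skip the reorder

  private:
    void *  ptr_ = nullptr;
    bool    host_fallback_ = false;
    free_fn device_free_;
    free_fn host_free_;
};
```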

Replace sycl_ext_malloc_with_fallback/sycl_ext_free_fallback free
functions with sycl_reorder_temp_buffer RAII class. The host_fallback
bool is now a private member, and cleanup happens automatically at
scope exit.

Add GGML_SYCL_HOST_MEM_FALLBACK cmake option (default ON) to guard
the host memory fallback code path. Device access to host memory
requires Linux kernel 6.8+ (Ubuntu 26.04+); users on older kernels
can set -DGGML_SYCL_HOST_MEM_FALLBACK=OFF to disable it.

Addresses arthw's review on PR ggml-org#21638.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Apr 9, 2026
markussiebert pushed a commit to markussiebert/llama.cpp that referenced this pull request Apr 10, 2026

@tunmaker tunmaker left a comment


Just tested on an Intel Arc A380 with Qwen3.5 2B Q8_0.
Works with thinking both on and off.
17 t/s

Contributor

@NeoZhangJianyu NeoZhangJianyu left a comment


Good job!

Thank you!

Contributor

@arthw arthw left a comment


Only one comment:
Please update the description of GGML_SYCL_HOST_MEM_FALLBACK in SYCL.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 11, 2026
Contributor

@arthw arthw left a comment


Good job!
The Q8_0 reorder is attracting more attention; the issue has been reported in several JIRA tickets, and this PR fixes it quickly.

Additional suggestion:
It's encouraged to keep a PR focused on a single issue. That keeps the changed code smaller and easier to review and approve.

Thank you!

@DanStory

tested on Intel Arc Pro B70 (x4)
👍 unsloth/Qwen3-Coder-Next-GGUF:Q8_0 26.53 t/s
👍 unsloth/GLM-4.7-Flash-GGUF:Q8_0 23.30 t/s
👍 unsloth/Qwen3.5-35B-A3B-GGUF:Q8_0 31.81 t/s
👍 unsloth/gemma-4-31B-it-GGUF:Q8_0 11.89 t/s

@arthw
Contributor

arthw commented Apr 13, 2026

@ggerganov
Could you review this PR?

Thank you!

@arthw arthw requested a review from ggerganov April 13, 2026 13:41
@PMZFX
Contributor Author

PMZFX commented Apr 13, 2026

Thanks for catching this! Investigated and fixed.

Added reorder-aware DMMV dequantizers for Q4_K. Also found and fixed the same gap in Q6_K while I was in there.

Tested on B70:

  • test-backend-ops -o MUL_MAT passes for all 4 reorder types (q4_0, q8_0, q4_K, q6_K)
  • Q4_K_M model (same one you tested): pp512 1039 t/s, tg128 58.5 t/s, multi-turn works
  • Q6_K and Q8_0 benchmarks clean, no regressions

Could you retest with your Q4_K_M model when you get a chance?

Q4_K and Q6_K had reorder support for MMVQ and GEMM paths but not
DMMV. When the DMMV path encountered reordered data it would abort.

Add DMMV kernels that read from the SOA reorder layout for both
types. Same math as the non-reorder versions, different memory
access pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
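The commit's point — same math, different memory access pattern — can be illustrated with a simplified host-side dot product over the SoA layout. This uses a Q8_0-style layout with float scales for brevity; the actual Q4_K/Q6_K DMMV kernels are considerably more involved:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int QK = 32;  // values per quantization block (simplified)

// Dot product of one reordered (SoA) quantized row with a float vector.
// qs holds all quants for the row and d holds one scale per block; in the
// AoS version the same values would be read out of per-block structs.
static float vec_dot_reordered(const int8_t * qs, const float * d,
                               const float * x, int ncols) {
    float sum = 0.0f;
    for (int b = 0; b < ncols / QK; ++b) {
        float block_sum = 0.0f;
        for (int i = 0; i < QK; ++i) {
            block_sum += qs[b * QK + i] * x[b * QK + i];
        }
        sum += d[b] * block_sum;  // identical math to the AoS kernel
    }
    return sum;
}
```

Only the index arithmetic changes between the two layouts, which is why the fix could reuse the existing per-block math unchanged.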
@markussiebert

Works great, hope to see this merged soon :-)


Development

Successfully merging this pull request may close these issues.

Eval bug: SYCL: Qwen3.5 spitting garbage on the second prompt

6 participants