[SYCL] Fix Q8_0 reorder: garbage on 2nd prompt + crash on full VRAM #21638
PMZFX wants to merge 5 commits into ggml-org:master
Conversation
The Q8_0 reorder optimization (ggml-org#21527) was missing a reorder-aware dequantizer for the GEMM code path used during prompt processing. After token generation reordered the Q8_0 weights (via DMMV/MMVQ), the next prompt processing pass would read them with the standard dequantizer, producing garbage output.

Add dequantize_block_q8_0_reorder() and wire it into both ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl(), matching the pattern already used by Q4_0, Q4_K, and Q6_K.

Fixes ggml-org#21589

AI (Claude) was used to assist with root cause investigation and writing the kernel code. All code was human-reviewed and tested on real hardware.
The reorder optimization allocates a temporary buffer the full size of the weight tensor on the device. When VRAM is nearly full (large models on a single GPU), this allocation fails and the subsequent memcpy crashes on a NULL pointer.

Fix: try device allocation first, fall back to host memory if device memory is full. The reorder kernel still works correctly reading from host memory over PCIe. This is slower for the one-time reorder (~21 t/s vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for all subsequent inference. If both device and host allocation fail, skip the reorder and fall back to the unoptimized kernel path.

Also fixes a bug where opt_for_reorder() marked tensors as reordered even when the reorder was skipped due to allocation failure. This caused DMMV/MMVQ kernels to read the original AoS data as if it were SoA, producing garbage output or NaN results.

Tested on Intel Arc Pro B70 (32GB) with Q8_0 and Q4_K_M models. Coding was AI-assisted (Claude), reviewed and tested on hardware by a human.

Fixes ggml-org#20478
ggml/src/ggml-sycl/ggml-sycl.cpp
static void reorder_qw_q4_0(uint8_t * data_device, const int ncols, const int nrows, size_t size, size_t offset,
// Try device allocation first; if VRAM is full, fall back to host memory so the
// reorder kernel can still run (reading over PCIe instead of device-local).
static inline void * sycl_ext_malloc_with_fallback(dpct::queue_ptr stream, size_t size, bool & host_fallback) {
1. Device access to host memory: this feature will be supported by Ubuntu 26 (with a new Linux kernel). I suggest adding a macro to limit this code path to newer Linux kernels, to avoid impacting users on older Linux.
2. host_fallback: I suggest defining a new class to handle the memory malloc and free, with host_fallback as an internal member variable, to simplify the usage.
Replace the sycl_ext_malloc_with_fallback/sycl_ext_free_fallback free functions with a sycl_reorder_temp_buffer RAII class. The host_fallback bool is now a private member, and cleanup happens automatically at scope exit.

Add a GGML_SYCL_HOST_MEM_FALLBACK cmake option (default ON) to guard the host memory fallback code path. Device access to host memory requires Linux kernel 6.8+ (Ubuntu 26.04+); users on older kernels can set -DGGML_SYCL_HOST_MEM_FALLBACK=OFF to disable it.

Addresses arthw's review on PR ggml-org#21638.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tunmaker
left a comment
Just tested on an Intel Arc A380 with Qwen3.5 2B Q8.
Works for both thinking on and off.
17 t/s
NeoZhangJianyu
left a comment
OK, good job!
Thank you!
arthw
left a comment
Only one comment:
Please update the description of GGML_SYCL_HOST_MEM_FALLBACK in SYCL.md
arthw
left a comment
Good job!
The reorder of Q8_0 is attracting more attention; the issue has been reported in several JIRAs.
This PR fixes it quickly.
One additional suggestion:
It's encouraged to keep a PR focused on a single issue. That keeps the changed code smaller and easier to review and approve.
Thank you!
Tested on Intel Arc Pro B70 (x4)

@ggerganov Thank you!
Thanks for catching this! Investigated and fixed. Added reorder-aware DMMV dequantizers for Q4_K. Also found and fixed the same gap in Q6_K while I was in there. Tested on B70:
Could you retest with your Q4_K_M model when you get a chance?
Q4_K and Q6_K had reorder support for MMVQ and GEMM paths but not DMMV. When the DMMV path encountered reordered data it would abort. Add DMMV kernels that read from the SOA reorder layout for both types. Same math as the non-reorder versions, different memory access pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Works great, hope to see this merged soon :-)
Summary
Fixes two issues with the Q8_0 reorder optimization introduced in #21527.
Bug 1: Garbage output from second prompt onward (#21589)
The Q8_0 reorder optimization rearranges weight data during token generation (batch=1, via DMMV/MMVQ), but the general GEMM dequantization path used during prompt processing was missing a reorder-aware variant for Q8_0. After the first tg pass reordered the weights, subsequent prompt processing read them with the standard dequantizer, producing corrupt output.
Q4_0, Q4_K, and Q6_K already had `_reorder` dequantizers in `convert.cpp`. Q8_0 was missing them.

Fix: Add `dequantize_block_q8_0_reorder()` in `dequantize.hpp` and wire it into both `ggml_get_to_fp16_sycl()` and `ggml_get_to_fp32_sycl()` in `convert.cpp`.

Bug 2: Crash when device memory is full

The reorder functions allocate a temporary buffer the full size of the weight tensor via `sycl_ext_malloc_device()`. When VRAM is nearly full (large models on smaller cards), this returns NULL and the subsequent memcpy crashes.

Fix: Add a host memory fallback (`sycl::malloc_host`) when device allocation fails, and skip the reorder gracefully if both fail. The reorder flag is only set when the reorder actually succeeds.

Fixes #21589
Testing
Tested on Intel Arc Pro B70 (Xe2 Battlemage), oneAPI DPC++ 2025.3.3:
Code was written with AI assistance (Claude), reviewed and tested by me on real hardware.