SYCL: fix reorder crash when device memory is full #21618
Closed
PMZFX wants to merge 1 commit into ggml-org:master
The reorder optimization allocates a temporary buffer the full size of the weight tensor on the device. When VRAM is nearly full (large models on a single GPU), this allocation fails and the subsequent memcpy crashes on a NULL pointer.

Fix: try device allocation first, fall back to host memory if device memory is full. The reorder kernel still works correctly reading from host memory over PCIe. This is slower for the one-time reorder (~21 t/s vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for all subsequent inference. If both device and host allocation fail, skip the reorder and fall back to the unoptimized kernel path.

Also fixes a bug where opt_for_reorder() marked tensors as reordered even when the reorder was skipped due to allocation failure. This caused DMMV/MMVQ kernels to read the original AoS data as if it were SoA, producing garbage output or NaN results.

Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models.

Coding was AI-assisted (Claude), reviewed and tested on hardware by a human.

Fixes ggml-org#20478
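The device-then-host fallback described above can be sketched in plain C++. This is only an illustration under stated assumptions, not the code from this PR: `alloc_with_fallback`, `TempBuffer`, `TempLocation`, and the mock allocators are hypothetical names, and the allocators are ordinary functions standing in for the real SYCL device and host allocation calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>

// Where the temporary reorder buffer ended up (hypothetical enum).
enum class TempLocation { Device, Host, None };

struct TempBuffer {
    void *       ptr;
    TempLocation location;
};

// Hypothetical allocator signature: returns nullptr on failure.
using AllocFn = std::function<void *(std::size_t)>;

// Mock allocators standing in for real device/host allocation calls.
void * mock_alloc_ok(std::size_t n) { return std::malloc(n); }
void * mock_alloc_fail(std::size_t) { return nullptr; }

// Try the device allocator first, then the host allocator.
// Returns {nullptr, None} when both fail so the caller can skip the
// reorder and stay on the unoptimized kernel path.
TempBuffer alloc_with_fallback(std::size_t size,
                               const AllocFn & alloc_device,
                               const AllocFn & alloc_host) {
    assert(size > 0);
    if (void * p = alloc_device(size)) {
        return TempBuffer{ p, TempLocation::Device };
    }
    if (void * p = alloc_host(size)) {
        return TempBuffer{ p, TempLocation::Host };
    }
    return TempBuffer{ nullptr, TempLocation::None };
}
```

Calling `alloc_with_fallback(size, mock_alloc_fail, mock_alloc_ok)` models the "VRAM is full" case: the result points at host memory, and the caller still has to check for a null pointer before copying into it.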
Folded into #21638 along with the GEMM dequantize fix. One PR for both Q8_0 reorder issues.
Summary

opt_for_reorder() unconditionally set the reorder flag even when the reorder was skipped. This caused the DMMV/MMVQ kernels to interpret AoS data as SoA, producing garbage output.

Details

Reported in #20478. Root cause is that sycl_ext_malloc_device returns NULL when VRAM is too full for the tensor-sized temp buffer. The crash happens on the first token generation (DMMV/MMVQ path), not during prompt processing (GEMM path, which doesn't reorder).

The fix affects all four reorder functions (Q4_0, Q8_0, Q4_K, Q6_K). Two new helpers (sycl_ext_malloc_with_fallback, sycl_ext_free_fallback) keep the per-function changes minimal.

Test plan

Tested on Intel Arc Pro B70 (32GB). AI-assisted development (Claude). Code reviewed and tested on my hardware.
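To illustrate why the reorder flag has to match the actual data layout, here is a simplified host-side sketch. Everything in it is an assumption made for illustration: the block format (a `float` scale and 4 quants, where real Q8_0 uses an fp16 scale and 32 quants), the "all quants, then all scales" SoA layout, and the names `Tensor`, `reorder`, `opt_for_reorder_sketch`, and `dequantize`. It is not the backend's real code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t QK = 4;  // quants per block (simplified)

// AoS layout: each block keeps its scale next to its quants.
struct Block {
    float       d;       // per-block scale
    std::int8_t qs[QK];  // quantized values
};

struct Tensor {
    std::vector<Block>       aos;     // original layout
    std::vector<std::int8_t> soa_qs;  // reordered: all quants first...
    std::vector<float>       soa_d;   // ...then all scales
    bool                     reordered = false;
};

// Returns false and leaves the tensor untouched when the reorder is
// skipped (for example, because no temporary buffer could be allocated).
bool reorder(Tensor & t, bool temp_buffer_available) {
    if (!temp_buffer_available) {
        return false;
    }
    t.soa_qs.clear();
    t.soa_d.clear();
    for (const Block & b : t.aos) {
        t.soa_qs.insert(t.soa_qs.end(), b.qs, b.qs + QK);
        t.soa_d.push_back(b.d);
    }
    return true;
}

// Set the flag only when the data was actually rearranged. Setting it
// unconditionally would make dequantize() read a layout that is not there.
void opt_for_reorder_sketch(Tensor & t, bool temp_buffer_available) {
    t.reordered = reorder(t, temp_buffer_available);
}

// Layout-dependent read: the flag selects which layout is assumed.
float dequantize(const Tensor & t, std::size_t i) {
    assert(i < t.aos.size() * QK);
    const std::size_t block = i / QK;
    if (t.reordered) {
        return t.soa_d[block] * static_cast<float>(t.soa_qs[i]);
    }
    return t.aos[block].d * static_cast<float>(t.aos[block].qs[i % QK]);
}
```

Because `dequantize` trusts the flag, a skipped reorder that still set `reordered = true` would send it down the SoA branch with no SoA data behind it, which is the class of mismatch the summary describes.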