Extend the existing SYCL reorder optimization to support Q8_0
quantization. The reorder separates scale factors from weight data
for coalesced memory access. Previously, only Q4_0, Q4_K, and Q6_K
were supported.
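
As a rough illustration of the reorder described above, the sketch below
moves an array of interleaved Q8_0 blocks into a separated layout: all
int8 weights first, then all scales. The block shape (one fp16 scale plus
32 int8 weights) matches ggml's Q8_0 format. The helper name
reorder_q8_0_sketch, the exact ordering of the two regions, and the use of
raw uint16_t bits for the fp16 scale are assumptions made for this
host-side example, not the code added by this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr int QK8_0 = 32;  // weights per Q8_0 block, as in ggml

// Interleaved Q8_0 block: one fp16 scale followed by 32 int8 weights.
// The scale is held as raw fp16 bits so the sketch needs no half type.
struct block_q8_0 {
    uint16_t d;
    int8_t   qs[QK8_0];
};
static_assert(sizeof(block_q8_0) == 34, "unexpected padding in block_q8_0");

// Hypothetical helper (assumed layout): [all qs][all d].
// Consecutive weights become contiguous in memory, which is what lets
// neighbouring work-items issue coalesced loads.
std::vector<uint8_t> reorder_q8_0_sketch(const std::vector<block_q8_0>& blocks) {
    const size_t nb = blocks.size();
    std::vector<uint8_t> out(nb * sizeof(block_q8_0));
    assert(out.size() == nb * (QK8_0 + sizeof(uint16_t)));

    uint8_t* qs_out = out.data();               // weights region
    uint8_t* d_out  = out.data() + nb * QK8_0;  // scales region

    for (size_t i = 0; i < nb; ++i) {
        std::memcpy(qs_out + i * QK8_0, blocks[i].qs, QK8_0);
        std::memcpy(d_out + i * sizeof(uint16_t), &blocks[i].d, sizeof(uint16_t));
    }
    return out;
}
```

The total size is unchanged; only the placement of the scales differs.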
On Intel Arc Pro B70 (Xe2/Battlemage), Q8_0 token generation improves
from 4.88 t/s to 15.24 t/s (3.1x) on Qwen3.5-27B. Memory bandwidth
utilization rises from 21% to 66% of theoretical maximum. Q8_0 is now
faster than Q6_K (15.24 vs 13.83 t/s) while offering higher precision.
Changes:
- quants.hpp: Add block_q_t<GGML_TYPE_Q8_0> reorder traits
- dequantize.hpp: Add dequantize_q8_0_reorder() for separated layout
- dmmv.cpp: Add Q8_0 DMMV reorder kernel and dispatch
- vecdotq.hpp: Add reorder_vec_dot_q_sycl<GGML_TYPE_Q8_0>
- mmvq.cpp: Add Q8_0 MMVQ reorder kernel and dispatch
- ggml-sycl.cpp: Add reorder_qw_q8_0(), update dispatch and extra
  allocation gate in ggml_backend_sycl_buffer_init_tensor()
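
To make the read side of the separated layout concrete, here is a minimal
scalar sketch of dequantizing one element and of a dot product against a
float vector, closer in spirit to the DMMV path than to MMVQ. The names
dequantize_sketch and vec_dot_sketch are hypothetical, the scales are
stored as float rather than fp16 for brevity, and the weights and scales
are passed as two separate pointers. None of this is the kernel code from
the files listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int QK8_0 = 32;  // weights per Q8_0 block, as in ggml

// Hypothetical helper: element i is its int8 weight times the scale of
// the block it belongs to. qs and d are the two separated regions.
inline float dequantize_sketch(const int8_t* qs, const float* d, size_t i) {
    return d[i / QK8_0] * static_cast<float>(qs[i]);
}

// Hypothetical helper: dot product of one reordered Q8_0 row with x.
// The scale is applied once per block, after summing the block's products.
float vec_dot_sketch(const int8_t* qs, const float* d, const float* x, size_t n) {
    assert(n % QK8_0 == 0);
    float sum = 0.0f;
    for (size_t ib = 0; ib < n / QK8_0; ++ib) {
        float partial = 0.0f;
        for (int j = 0; j < QK8_0; ++j) {
            const size_t i = ib * QK8_0 + j;
            partial += static_cast<float>(qs[i]) * x[i];
        }
        sum += d[ib] * partial;
    }
    return sum;
}
```

Because the weights are contiguous, the inner loop walks a plain int8
array with no scale bytes interleaved between blocks.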
Fixes: #21517