feat: Add SYCL kernels for Q1_0 and Q1_0_g128 quantization types (Intel Arc / oneAPI) #1
- vecdotq.hpp: vec_dot_q1_0_q8_1, vec_dot_q1_0_g128_q8_1
- mmvq.cpp: dispatch functions + switch cases for both types
- convert.cpp: dequantize functions for fp16/fp32 conversion paths

Tested on Intel Arc Pro B50 (BMG G21) with Bonsai-8B Q1_0_g128 GGUF. Achieves ~46 tok/s generation with 37/37 layers on SYCL0.

Fixes: fatal error: unsupport data type=q1_0_g128
@timothycarambat happy to discuss any changes needed.
Hi, I will give your two PRs a try locally on the Arc 770, as I am only getting around 23 tokens/s compared to ~63 tokens/s on an RX 6600 through HIP (ROCm). It took a while to get the original code building for my Arc 770: I had to upgrade to the latest oneAPI toolkit and remove the opencl-mesa package, which seemed to clash. I am not sure yet whether it is deferring to the Intel OpenCL backends over oneAPI. I will report back after testing your patches against the latest toolkit, intel-oneapi-toolkit-2026.0.0.198.
Got 43-60 tokens/s. oneAPI version: intel-oneapi-toolkit-2026.0.0.198. The variation in token rate is possibly due to the GPU clocking up and down, since Linux has limited SMI controls for consumer Intel Arc GPUs (there are SMI controls for Flex cards, which share the same or a similar architecture; I might try them later). Longer tasks get ~60 tokens/s.
These changes allow you to run not only PrismML 1-bit files but also 1.58-bit files. I have tested it with both Bonsai models, the 1-bit one and the 1.58-bit one. The 1.58-bit one is a little slower. I have not tested it with other 1.58-bit BitNet models, but I guess it would work. I currently use it with the 1.58-bit 8B Bonsai model. In the near future we should also get models larger than 8B with this tech. Imagine a Qwen 3.6 27B on this tech: it would be fast and accurate enough on a lower-end GPU.
Indeed. Spotted this for IQ1_S: https://huggingface.co/Tom9000/TheProfessor-155b-GUFF-Q1-v02
@timothycarambat, bumping this for review. The PR has independent community validation: @wancoder built it on an Arc 770 with Artix Linux and oneAPI 2026.0.0.198 and measured 43–60 tok/s on Bonsai-8B, up from ~23 tok/s on the base. That is roughly a 2–3× speedup with no accuracy change. It is also confirmed working with the 1.58-bit Bonsai variant. The PR is mergeable, with no conflicts.
Summary
Adds SYCL/oneAPI compute kernels for the Q1_0 and Q1_0_g128 quantization types used by PrismML Bonsai 8B models, enabling full GPU offload on Intel Arc GPUs via the Intel oneAPI Level Zero backend.
Without this patch, the run aborts with:
fatal error: unsupport data type=q1_0_g128
With this patch, all 37 model layers offload to the GPU and inference runs at ~43-46 tok/s on an Intel Arc Pro B50 (BMG G21).
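As a rough illustration of what these one-bit types encode (this is not the PR's kernel code), the sign convention listed under Changes, bit=1 → +d and bit=0 → -d, can be sketched in plain C++. The function name `dequantize_q1_sketch`, the LSB-first bit packing, and the plain `float` scale are assumptions made for this example; the actual ggml block layout may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Expand n packed sign bits into weights: bit=1 -> +d, bit=0 -> -d.
// ASSUMPTION: bits are packed LSB-first, 8 per byte. The real ggml
// structs may order bits differently and use a half-precision scale.
std::vector<float> dequantize_q1_sketch(const uint8_t *qs, int n, float d) {
    assert(n % 8 == 0);  // whole bytes only
    std::vector<float> out(n);
    for (int i = 0; i < n; ++i) {
        const int bit = (qs[i / 8] >> (i % 8)) & 1;
        out[i] = bit ? d : -d;
    }
    return out;
}
```

Under these assumptions, n = 32 would correspond to one Q1_0 block and n = 128 to one Q1_0_g128 block, each sharing a single scale d.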
Changes
ggml/src/ggml-sycl/vecdotq.hpp
- vec_dot_q1_0_q8_1: dot product for Q1_0 (32-weight blocks)
- vec_dot_q1_0_g128_q8_1: dot product for Q1_0_g128 (128-weight blocks)
- ggml-cuda/vecdotq.cuh
- bit=1 → +d, bit=0 → -d; Q8_1 scale factor applied correctly per block
ggml/src/ggml-sycl/mmvq.cpp
- mul_mat_vec_q1_0_q8_1_sycl dispatch function
- mul_mat_vec_q1_0_g128_q8_1_sycl dispatch function
- GGML_TYPE_Q1_0 and GGML_TYPE_Q1_0_g128 cases added to the switch in ggml_sycl_op_mul_mat_vec_q
ggml/src/ggml-sycl/convert.cpp
- dequantize_row_q1_0_sycl
- dequantize_row_q1_0_g128_sycl
- ggml_get_to_fp16_sycl and ggml_get_to_fp32_sycl
Test Hardware
Test Model
prism-ml/Bonsai-8B-gguf (Q1_0_g128, 1.08 GB)
Results
Notes
- Battlemage lacks VK_KHR_shader_integer_dot_product. SYCL is currently the only working GPU path on Arc for Bonsai models.
- --reasoning off is required to suppress the embedded Qwen3 thinking template in Bonsai GGUFs.