
feat: Add SYCL kernels for Q1_0 and Q1_0_g128 quantization types (Intel Arc / oneAPI) #1

Open
vrwallace wants to merge 2 commits into Mintplex-Labs:prism from vrwallace:feat/sycl-q1_0-q1_0_g128-kernels


@vrwallace

Summary

Adds SYCL/oneAPI compute kernels for the Q1_0 and Q1_0_g128 quantization
types used by PrismML Bonsai 8B models, enabling full GPU offload on Intel Arc
GPUs via the Intel oneAPI Level Zero backend.

Without this patch:

fatal error: unsupport data type=q1_0_g128
Aborted

With this patch, all 37 model layers offload to the GPU and inference runs at
~43-46 tok/s on an Intel Arc Pro B50 (BMG G21).


Changes

ggml/src/ggml-sycl/vecdotq.hpp

  • Added vec_dot_q1_0_q8_1 — dot product for Q1_0 (32-weight blocks)
  • Added vec_dot_q1_0_g128_q8_1 — dot product for Q1_0_g128 (128-weight blocks)
  • Ported from PrismML CUDA kernels in ggml-cuda/vecdotq.cuh
  • bit=1 → +d, bit=0 → -d; Q8_1 scale factor applied correctly per block
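As a rough illustration of the sign convention above, here is a scalar host-side reference written in plain C++ rather than SYCL. It is a sketch under stated assumptions, not the kernel code from this PR: the struct layouts, the `float` scales (the real blocks presumably use fp16), the LSB-first bit order, and the names `BlockQ1`, `BlockQ1G128`, `BlockQ8`, and the `_ref` functions are all hypothetical.

```cpp
#include <cstdint>

constexpr int QK = 32;  // weights per Q1_0 block and per Q8_1 block

// Hypothetical simplified layouts; float scales are used for brevity.
struct BlockQ1 {            // one scale + 32 sign bits
    float   d;
    uint8_t qs[QK / 8];     // 1 bit per weight, LSB-first (assumed order)
};

struct BlockQ1G128 {        // one scale shared by 128 sign bits
    float   d;
    uint8_t qs[128 / 8];
};

struct BlockQ8 {            // one scale + 32 int8 activations
    float  d;
    int8_t qs[QK];
};

// Q1_0: bit=1 contributes +d, bit=0 contributes -d.
inline float vec_dot_q1_q8_ref(const BlockQ1 &w, const BlockQ8 &a) {
    int32_t sumi = 0;
    for (int i = 0; i < QK; ++i) {
        const int bit = (w.qs[i / 8] >> (i % 8)) & 1;
        sumi += (2 * bit - 1) * a.qs[i];   // maps 1 -> +1, 0 -> -1
    }
    return w.d * a.d * static_cast<float>(sumi);
}

// Q1_0_g128: one weight block spans four activation blocks, so each
// activation scale is applied to its own 32-element partial sum.
inline float vec_dot_q1_g128_q8_ref(const BlockQ1G128 &w, const BlockQ8 *a) {
    float acc = 0.0f;
    for (int k = 0; k < 4; ++k) {
        int32_t sumi = 0;
        for (int i = 0; i < QK; ++i) {
            const int idx = k * QK + i;
            const int bit = (w.qs[idx / 8] >> (idx % 8)) & 1;
            sumi += (2 * bit - 1) * a[k].qs[i];
        }
        acc += a[k].d * static_cast<float>(sumi);
    }
    return w.d * acc;
}
```

A reference like this could be used to spot-check GPU output on a handful of blocks, assuming the bit order is adjusted to match the real layout.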

ggml/src/ggml-sycl/mmvq.cpp

  • Added mul_mat_vec_q1_0_q8_1_sycl dispatch function
  • Added mul_mat_vec_q1_0_g128_q8_1_sycl dispatch function
  • Added GGML_TYPE_Q1_0 and GGML_TYPE_Q1_0_g128 cases to the switch in
    ggml_sycl_op_mul_mat_vec_q
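For readers unfamiliar with the dispatch pattern, the sketch below shows why a missing switch case surfaces as an "unsupported type" failure. It is a toy model only: the `QuantType` enum and `select_mmvq_kernel` are hypothetical and do not reproduce the actual body of `ggml_sycl_op_mul_mat_vec_q`. Only the two kernel names are taken from this PR.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the tensor type enum.
enum class QuantType { Q1_0, Q1_0_G128, Other };

// Returns the name of the dispatch function that would handle the type.
// Any type without a case falls through to the default branch and fails,
// which is the situation the new cases are meant to fix.
inline std::string select_mmvq_kernel(QuantType t) {
    switch (t) {
        case QuantType::Q1_0:
            return "mul_mat_vec_q1_0_q8_1_sycl";
        case QuantType::Q1_0_G128:
            return "mul_mat_vec_q1_0_g128_q8_1_sycl";
        default:
            throw std::invalid_argument("unsupported data type");
    }
}
```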

ggml/src/ggml-sycl/convert.cpp

  • Added dequantize_row_q1_0_sycl
  • Added dequantize_row_q1_0_g128_sycl
  • Added cases to both ggml_get_to_fp16_sycl and ggml_get_to_fp32_sycl
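The dequantize path can be pictured with a similar host-side sketch. As before, this is an assumption-laden illustration rather than the SYCL code in this PR: the `BlockQ1G128` layout, the `float` scale, the LSB-first bit order, and the function name `dequantize_row_q1_g128_ref` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified layout: one scale shared by 128 sign bits.
struct BlockQ1G128 {
    float   d;
    uint8_t qs[128 / 8];   // 1 bit per weight, LSB-first (assumed order)
};

// Expands nblocks blocks into 128 * nblocks floats:
// bit=1 becomes +d and bit=0 becomes -d.
inline std::vector<float> dequantize_row_q1_g128_ref(const BlockQ1G128 *x,
                                                     std::size_t nblocks) {
    std::vector<float> y(nblocks * 128);
    for (std::size_t b = 0; b < nblocks; ++b) {
        for (int i = 0; i < 128; ++i) {
            const int bit = (x[b].qs[i / 8] >> (i % 8)) & 1;
            y[b * 128 + i] = bit ? x[b].d : -x[b].d;
        }
    }
    return y;
}
```

An fp16 variant would differ only in the output element type.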

Test Hardware

  • GPU: Intel Arc Pro B50 (BMG G21, Battlemage)
  • Driver: Mesa ANV 25.2.8 / Intel oneAPI Level Zero
  • OS: Ubuntu Noble 24.04, kernel 6.17

Test Model

  • prism-ml/Bonsai-8B-gguf (Q1_0_g128, 1.08 GB)

Results

load_tensors:   CPU_Mapped  =    83.31 MiB  (non-quantized tensors only)
load_tensors:   SYCL0       =  1015.99 MiB  (all Q1_0_g128 layers on GPU)
offloaded 37/37 layers to GPU

prompt:     ~55 tok/s
generation: ~46 tok/s

Notes

  • Vulkan backend does NOT work for Q1_0_g128 on Intel Arc — Mesa ANV for
    Battlemage lacks VK_KHR_shader_integer_dot_product. SYCL is currently
    the only working GPU path on Arc for Bonsai models.
  • --reasoning off is required to suppress the embedded Qwen3 thinking
    template in Bonsai GGUFs.

vrwallace added 2 commits May 3, 2026 12:46
- vecdotq.hpp: vec_dot_q1_0_q8_1, vec_dot_q1_0_g128_q8_1
- mmvq.cpp: dispatch functions + switch cases for both types
- convert.cpp: dequantize functions for fp16/fp32 conversion paths

Tested on Intel Arc Pro B50 (BMG G21) with Bonsai-8B Q1_0_g128 GGUF.
Achieves ~46 tok/s generation with 37/37 layers on SYCL0.

Fixes: fatal error: unsupport data type=q1_0_g128
@vrwallace
Author

@timothycarambat happy to discuss any changes needed

@wancoder

wancoder commented May 15, 2026

Hi, I will try your two PRs locally on the Arc 770, as I am only getting around 23 tokens/s compared to ~63 tokens/s on an RX 6600 through HIP/ROCm.

It took a while to get the original code building for testing on my Arc 770. I had to upgrade to the latest oneAPI toolkit and remove the opencl-mesa package, which seemed to clash. I am not yet sure whether it is deferring to the Intel OpenCL backends over oneAPI.

I will report back after testing your patches against the latest toolkit, intel-oneapi-toolkit-2026.0.0.198.

@wancoder

wancoder commented May 16, 2026

Got 43-60 tokens/s:

  • GPU: Arc 770
  • Model: prism-ml/Bonsai-8B-gguf
  • OS: Artix Linux
  • oneAPI version: intel-oneapi-toolkit-2026.0.0.198

The variation in token rate is possibly due to the GPU clocking up and down, since Linux has limited SMI controls for consumer Intel Arc GPUs. (SMI controls exist for Flex cards, which share the same or a similar architecture; I might try them later.) Longer tasks get ~60 tokens/s.

@vrwallace
Author

These changes allow you to run not only PrismML 1-bit files but also 1.58-bit files. I have tested it with both Bonsai models, the 1-bit and the 1.58-bit one. The 1.58-bit one is a little slower. I have not tested it with other 1.58-bit BitNet models, but I guess it would work.

I currently use it with the 1.58-bit 8B Bonsai model.

In the near future we should also get models larger than 8B with this tech. Imagine a Qwen 3.6 27B on this tech.

It would be fast and accurate enough on a lower-end GPU.

@wancoder

Indeed.

Spotted this for IQ1_S, https://huggingface.co/Tom9000/TheProfessor-155b-GUFF-Q1-v02

@vrwallace
Author

@timothycarambat, bumping this for review. The PR has independent community validation: @wancoder built it on Arc 770 + Artix Linux with oneAPI 2026.0.0.198 and measured 43–60 tok/s on Bonsai-8B, up from ~23 tok/s on the base. That is roughly a 1.9–2.6× speedup (43/23 ≈ 1.9, 60/23 ≈ 2.6) with no accuracy change.

Also confirmed working with the 1.58-bit Bonsai variant. The PR is mergeable, with no conflicts.
