feat: Add SYCL kernels for Q1_0 and Q1_0_g128 quantization types (Intel Arc / oneAPI) by vrwallace · Pull Request #1 · Mintplex-Labs/prism-ml-llama.cpp

vrwallace · 2026-05-03T17:51:17Z

Summary

Adds SYCL/oneAPI compute kernels for the Q1_0 and Q1_0_g128 quantization
types used by PrismML Bonsai 8B models, enabling full GPU offload on Intel Arc
GPUs via the Intel oneAPI Level Zero backend.

Without this patch:

fatal error: unsupport data type=q1_0_g128
Aborted

With this patch, all 37 model layers offload to the GPU and inference runs at
~43-46 tok/s on an Intel Arc Pro B50 (BMG G21).

Changes

`ggml/src/ggml-sycl/vecdotq.hpp`

- Added `vec_dot_q1_0_q8_1`: dot product for Q1_0 (32-weight blocks)
- Added `vec_dot_q1_0_g128_q8_1`: dot product for Q1_0_g128 (128-weight blocks)
- Ported from PrismML CUDA kernels in `ggml-cuda/vecdotq.cuh`
- Weight mapping: bit=1 → +d, bit=0 → -d; the Q8_1 scale factor is applied per block
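To make the bit mapping concrete, here is a minimal scalar sketch of the two dot products in plain C++. It is not the SYCL kernel from this PR. The struct layouts, the LSB-first bit order, the use of `float` scales instead of fp16, and all names (`BlockQ1`, `BlockQ8`, `BlockQ1G128`, `vec_dot_q1_q8_ref`, `vec_dot_q1_g128_q8_ref`) are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified block layouts (float scales instead of fp16).
constexpr int kBlock = 32;

struct BlockQ1 {             // stand-in for a 32-weight Q1_0 block
    float   d;               // per-block weight scale
    uint8_t qs[kBlock / 8];  // one sign bit per weight, LSB first (assumed)
};

struct BlockQ1G128 {         // stand-in for a 128-weight Q1_0_g128 block
    float   d;               // one scale shared by all 128 weights
    uint8_t qs[128 / 8];     // one sign bit per weight, LSB first (assumed)
};

struct BlockQ8 {             // stand-in for a Q8_1 activation block
    float  d;                // per-block activation scale
    int8_t qs[kBlock];       // quantized activations
};

// Sum over i of (bit_i ? +d_w : -d_w) * (d_x * q_i) for one 32-weight block.
inline float vec_dot_q1_q8_ref(const BlockQ1 & w, const BlockQ8 & x) {
    int32_t acc = 0;
    for (int i = 0; i < kBlock; ++i) {
        const int bit = (w.qs[i / 8] >> (i % 8)) & 1;
        acc += bit ? x.qs[i] : -x.qs[i];
    }
    return w.d * x.d * static_cast<float>(acc);
}

// One 128-weight block spans four Q8_1 blocks, each with its own scale,
// so the activation scale is applied per 32-element sub-block.
inline float vec_dot_q1_g128_q8_ref(const BlockQ1G128 & w, const BlockQ8 (&x)[4]) {
    float sum = 0.0f;
    for (int j = 0; j < 4; ++j) {
        int32_t acc = 0;
        for (int i = 0; i < kBlock; ++i) {
            const int k   = j * kBlock + i;  // weight index inside the group
            const int bit = (w.qs[k / 8] >> (k % 8)) & 1;
            acc += bit ? x[j].qs[i] : -x[j].qs[i];
        }
        sum += x[j].d * static_cast<float>(acc);
    }
    return w.d * sum;
}
```

A scalar reference like this can be useful for checking a GPU kernel's output on a handful of blocks, provided the real block layout is substituted for the assumed one.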

`ggml/src/ggml-sycl/mmvq.cpp`

- Added `mul_mat_vec_q1_0_q8_1_sycl` dispatch function
- Added `mul_mat_vec_q1_0_g128_q8_1_sycl` dispatch function
- Added `GGML_TYPE_Q1_0` and `GGML_TYPE_Q1_0_g128` cases to the switch in
  `ggml_sycl_op_mul_mat_vec_q`
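The effect of the new switch cases can be pictured with a toy dispatcher. This is a sketch only: the `QType` enum and `select_mmvq_kernel` are hypothetical, the function returns kernel names as strings rather than launching anything, and the idea that an unhandled type falls through to an error path is an assumption about why the unpatched build aborts.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical subset of quantization types, for illustration only.
enum class QType { Q4_0, Q8_0, Q1_0, Q1_0_g128 };

// Maps a type to the name of the dispatch function that would handle it.
inline std::string select_mmvq_kernel(QType t) {
    switch (t) {
        case QType::Q4_0:      return "mul_mat_vec_q4_0_q8_1_sycl";
        case QType::Q8_0:      return "mul_mat_vec_q8_0_q8_1_sycl";
        // The two cases this PR adds:
        case QType::Q1_0:      return "mul_mat_vec_q1_0_q8_1_sycl";
        case QType::Q1_0_g128: return "mul_mat_vec_q1_0_g128_q8_1_sycl";
    }
    // Without a matching case, a type ends up on an error path like this one.
    throw std::runtime_error("unsupported data type");
}
```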

`ggml/src/ggml-sycl/convert.cpp`

- Added `dequantize_row_q1_0_sycl`
- Added `dequantize_row_q1_0_g128_sycl`
- Added cases to both `ggml_get_to_fp16_sycl` and `ggml_get_to_fp32_sycl`
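Dequantization follows the same rule as the dot product: each stored bit expands to +d or -d. The sketch below shows that expansion for 128-weight groups on the CPU. The `Q1G128Block` layout, the LSB-first bit order, the `float` scale, and the function name `dequantize_row_q1_ref` are assumptions for illustration, not the code added in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kGroupSize = 128;

// Hypothetical stand-in for a Q1_0_g128 block (float scale instead of fp16).
struct Q1G128Block {
    float   d;                   // scale shared by the whole group
    uint8_t qs[kGroupSize / 8];  // one sign bit per weight, LSB first (assumed)
};

// Expands n_blocks packed blocks into n_blocks * 128 floats.
inline std::vector<float> dequantize_row_q1_ref(const Q1G128Block * blocks,
                                                std::size_t n_blocks) {
    std::vector<float> out;
    out.reserve(n_blocks * kGroupSize);
    for (std::size_t b = 0; b < n_blocks; ++b) {
        for (int i = 0; i < kGroupSize; ++i) {
            const int bit = (blocks[b].qs[i / 8] >> (i % 8)) & 1;
            out.push_back(bit ? blocks[b].d : -blocks[b].d);  // bit=1 -> +d, bit=0 -> -d
        }
    }
    return out;
}
```

An fp16 conversion path would do the same expansion and narrow each value to half precision on store.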

Test Hardware

GPU: Intel Arc Pro B50 (BMG G21, Battlemage)
Driver: Mesa ANV 25.2.8 / Intel oneAPI Level Zero
OS: Ubuntu Noble 24.04, kernel 6.17

Test Model

prism-ml/Bonsai-8B-gguf (Q1_0_g128, 1.08 GB)

Results

load_tensors:   CPU_Mapped  =    83.31 MiB  (non-quantized tensors only)
load_tensors:   SYCL0       =  1015.99 MiB  (all Q1_0_g128 layers on GPU)
offloaded 37/37 layers to GPU

prompt:     ~55 tok/s
generation: ~46 tok/s

Notes

- The Vulkan backend does NOT work for Q1_0_g128 on Intel Arc: Mesa ANV for
  Battlemage lacks VK_KHR_shader_integer_dot_product. SYCL is currently
  the only working GPU path on Arc for Bonsai models.
- `--reasoning off` is required to suppress the embedded Qwen3 thinking
  template in Bonsai GGUFs.

- vecdotq.hpp: vec_dot_q1_0_q8_1, vec_dot_q1_0_g128_q8_1
- mmvq.cpp: dispatch functions + switch cases for both types
- convert.cpp: dequantize functions for fp16/fp32 conversion paths

Tested on Intel Arc Pro B50 (BMG G21) with Bonsai-8B Q1_0_g128 GGUF.
Achieves ~46 tok/s generation with 37/37 layers on SYCL0.

Fixes: fatal error: unsupport data type=q1_0_g128

vrwallace · 2026-05-03T22:50:43Z

@timothycarambat happy to discuss any changes needed

wancoder · 2026-05-15T19:08:15Z

Hi, I will give your two PRs a try locally on the Arc 770, as I am only getting around 23 tokens/s compared to ~63 tokens/s on an RX 6600 through HIP/ROCm.

It took a while to get the original code building for testing on my Arc 770: I had to upgrade to the latest oneAPI toolkit and remove the opencl-mesa package, which seemed to clash. I am not yet sure whether it is deferring to the Intel OpenCL backends over oneAPI.

I will report back after testing your patches against the latest toolkit, intel-oneapi-toolkit-2026.0.0.198.

wancoder · 2026-05-16T01:45:58Z

Got 43-60 tokens/s

GPU: Arc 770
Model: prism-ml/Bonsai-8B-gguf
OS: Artix Linux
oneAPI version: intel-oneapi-toolkit-2026.0.0.198

The token-rate variation is possibly due to the GPU clocking up and down, since Linux has limited SMI controls for consumer Intel Arc GPUs (there are SMI controls for Flex cards, which share the same or a similar architecture; I might try them later). Longer tasks get ~60 tokens/s.

vrwallace · 2026-05-17T17:19:23Z

These changes allow you to run not only PrismML 1-bit files but also 1.58-bit files. I have tested it with both Bonsai models, the 1-bit one and the 1.58-bit one. The 1.58-bit one is a little slower. I have not tested it with other 1.58-bit BitNet models, but I would guess it works.

I currently use it with the 1.58-bit 8B Bonsai model.

In the near future we should also get models larger than 8B with this tech. Imagine a Qwen 3.6 27B on this tech.

It would be fast and accurate enough on a lower-end GPU.

wancoder · 2026-05-18T04:51:15Z

Indeed.

Spotted this for IQ1_S, https://huggingface.co/Tom9000/TheProfessor-155b-GUFF-Q1-v02

vrwallace · 2026-05-18T20:22:55Z

@timothycarambat, bumping this for review. The PR has independent community validation: @wancoder built it on Arc 770 + Artix Linux with oneAPI 2026.0.0.198 and measured 43–60 tok/s on Bonsai-8B, up from ~23 tok/s on the base. That is roughly a 1.9–2.6× speedup with no accuracy change.

Also confirmed working with the 1.58-bit Bonsai variant. The PR is mergeable, with no conflicts.

vrwallace added 2 commits May 3, 2026 12:46

vrwallace mentioned this pull request May 3, 2026

docs: Add Linux Intel Arc SYCL build and run instructions #2

Open

vrwallace wants to merge 2 commits into Mintplex-Labs:prism from vrwallace:feat/sycl-q1_0-q1_0_g128-kernels

2 participants
