fix: Q1_0_g128 x86 CPU kernel — float truncation + AVX2 vectorization #7

wildcattrio wants to merge 1 commit into PrismML-Eng:prism
Conversation
The Q1_0_g128 x86 kernel has two bugs causing gibberish output at 0.25 tok/s on Intel CPUs:

1. **Float-to-int truncation:** the per-block accumulator was `int`, truncating `d1 * sumi_block` (float * int → float → int). Each Q8_0 block's scale factor was rounded to 0 or ±1, destroying the output. Fix: a `float block_sum` accumulator.
2. **No SIMD:** the x86 path was scalar-only while ARM NEON had full vectorization. Added AVX2 using the same broadcast/shuffle/cmpeq pattern from the existing Q1_0 kernel plus `mul_sum_i8_pairs_float`.

Results on i5-1135G7 with Bonsai 8B:

| Build | Speed | Output |
|---|---|---|
| Before (MSVC) | 0.25 tok/s | gibberish |
| Bug fix only | 3.7 tok/s | correct |
| Bug fix + AVX2 | 6.9 tok/s | correct |

Both the x86-specific kernel (`arch/x86/quants.c`) and the generic fallback (`quants.c`) are fixed.
This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I have removed the buggy x86 path and will merge one of the correct AVX ones. Could you run the KL divergence tests described here: #8
Running the 8B on an i5 box with this PR, I get consistent After swapping out I get consistent The alternative code below on an i5 Broadwell gives: The 0.0002 KLD seems persistent on AVX2 across different basic implementations.
The SSE path improves from 0.1 tps to 0.7~0.9 tps on an N2840 Atom. AI suggested that meaningful acceleration for 1bitnet on CPUs lacking AVX instructions could only be achieved by implementing the dot product as
KL Divergence Results — x86 AVX2 (PR #7)

Ran the KL divergence tests from PR #8 on the AVX2 kernel fix from this PR.

Hardware: Intel i5-1135G7 (Tiger Lake), 32GB RAM, Windows 11.
Build:
System info:

Test setup
x86 AVX2 Divergences
Comparison with PR #8 reference (ARM NEON / generic scalar)
The AVX2 kernel shows measurably higher divergence compared to the NEON/scalar reference. The likely cause is floating-point operation ordering: our AVX2 path pre-multiplies the scale into each partial sum. Note: @zcattacz's XOR+SUB approach posted above uses the two-level accumulation pattern. Output quality is still good despite the divergence — text generation is coherent and the PPL difference is only 0.057 (24.09 vs 24.04).
Hi @wildcattrio, I updated the implementation and here is the combined result. You were right: xor+sub gives slightly better KLD with good tps. I also tried other implementations for tps; the best were on par, but this is the simplest.
Pull request overview

Fixes incorrect output and improves performance for the Q1_0_g128 × Q8_0 x86 CPU dot-product kernel by correcting float accumulation and adding an AVX2 vectorized implementation aligned with existing bit-expansion patterns in the x86 quant kernels.

Changes:
- Fix float-to-int truncation by switching per-block accumulation to `float` in the generic kernel and x86 scalar fallback.
- Add an AVX2 implementation for `ggml_vec_dot_q1_0_g128_q8_0` using broadcast/shuffle/bit-test expansion and `mul_sum_i8_pairs_float()`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `ggml/src/ggml-cpu/quants.c` | Fixes scalar generic accumulation type to prevent truncation and incorrect results. |
| `ggml/src/ggml-cpu/arch/x86/quants.c` | Adds AVX2 vectorized path and fixes scalar fallback accumulation type for x86. |
```c
const __m256i qy = _mm256_loadu_si256((const __m256i *)yb->qs);

// Get 4 bytes of bits for this Q8_0 block
const uint32_t bits32 = *(const uint32_t *)&x[ib].qs[k * 4];
```
`bits32` is loaded via a `uint32_t *` cast from `x[ib].qs` (`*(const uint32_t *)&x[ib].qs[k * 4]`), which can violate strict-aliasing rules and may be unaligned. Prefer copying into a local `uint32_t` with `memcpy` (similar to `bytes_from_bits_32()` earlier in this file) to safely preserve the bit pattern under optimization.

Suggested change:

```c
uint32_t bits32;
memcpy(&bits32, &x[ib].qs[k * 4], sizeof(bits32));
```
Good news: our first CPU PR just got merged into the llama.cpp master branch. If you are still working on this, please rebase with PrismML's master (it just pulled in the main llama.cpp): https://github.com/PrismML-Eng/llama.cpp/tree/master Changes: the Q1_0_g128 naming is gone now; the original Q1_0 with group size 32 was deleted, and Q1_0_g128 was renamed to Q1_0, which now has group size 128 by default. That branch only has the generic CPU path (slow) and the ARM NEON path; planning to gather the best x86 kernels from here and send a PR there (and tag all the contributors).
There are a lot of CPU PRs; planning to gather them all in one and then send to the main llama.cpp.
Summary

The Q1_0_g128 x86 CPU kernel produces gibberish output at 0.25 tok/s on Intel CPUs. Two bugs:

Bug 1: Float-to-int truncation (causes gibberish)

The per-block accumulator is `int`, but `d1 * sumi_block` produces a `float`. The implicit cast truncates every Q8_0 block's scale factor to 0 or ±1, destroying the output.

Bug 2: No SIMD (causes 0.25 tok/s)

The x86 kernel is scalar-only while the ARM NEON version has full vectorization. Added AVX2 using the same broadcast → shuffle → cmpeq → `mul_sum_i8_pairs_float` pattern from the existing `ggml_vec_dot_q1_0_q8_0` kernel.

Results (i5-1135G7, 32GB, Bonsai 8B)

| Build | Speed | Output |
|---|---|---|
| Before (MSVC) | 0.25 tok/s | gibberish (", with is it. and the. and the.... in.........") |
| Bug fix only | 3.7 tok/s | correct |
| Bug fix + AVX2 | 6.9 tok/s | correct |

Files changed

- `ggml/src/ggml-cpu/arch/x86/quants.c` — AVX2 kernel + scalar fix
- `ggml/src/ggml-cpu/quants.c` — generic scalar fallback fix

Test plan