
(Performance; ggml-cpu) Optimized x86 and generic cpu q1_0 dot (follow up) #21636

Open
pl752 wants to merge 2 commits into ggml-org:master from pl752:perf/q1_0_g128_no_nofma

Conversation

@pl752
Contributor

@pl752 pl752 commented Apr 8, 2026

Hello, I have prepared an optimized implementation of the CPU q1_0 dot product (mainly for Bonsai LLM models). This is a continuation of PR PrismML-Eng#10; the list of experiments conducted and some other benchmark results can be found there.

This PR implements:

  • A more efficient generic implementation (less bit math, fewer multiplications) of the (q1_0; q8_0) dot product
  • x86 SIMD-specific implementations of the (q1_0; q8_0) dot product for most realistic x86_64 targets (from SSSE3 to AVX2)
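For context, these blocked dot products all follow the same overall shape: integer accumulation inside each block, then one multiply by both block scales. Below is a minimal scalar sketch of that structure with an *invented* block layout (32 sign bits plus a float scale) — the actual q1_0 packing and group size in this PR differ:

```c
#include <assert.h>
#include <stdint.h>

// Illustrative only: a stand-in 1-bit weight block (32 sign bits plus one
// float scale) dotted against a q8_0-shaped activation block. The real
// q1_0 layout in this PR differs; this just shows the blocked structure.
typedef struct { float d; uint32_t signs; } block_q1_demo;  // bit i: 1 -> +1, 0 -> -1
typedef struct { float d; int8_t qs[32]; } block_q8_demo;   // scale + 32 int8 values

static float vec_dot_q1_q8_demo(int n, const block_q1_demo *x, const block_q8_demo *y) {
    float acc = 0.0f;
    for (int b = 0; b < n/32; ++b) {
        int32_t isum = 0; // integer accumulation inside the block, as in the SIMD paths
        for (int i = 0; i < 32; ++i) {
            const int32_t s = (int32_t)((x[b].signs >> i) & 1u)*2 - 1; // branch-free +-1
            isum += s * y[b].qs[i];
        }
        acc += x[b].d * y[b].d * (float)isum; // both scales applied once per block
    }
    return acc;
}
```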

Checks performed so far:

  • test-quantization-fns passes
  • model behaves well
  • perplexity runs completed for 5x512 batches of wikitext-2-test (unpacked gguf as a reference, Bonsai 1.7B)
  • llama-bench runs for Bonsai 1.7B
  • verified that the generated assembly is efficient (no register spills, good pipeline pressure)
Benchmark results for Bonsai 1.7B

Benchmarks were performed with:

  • CPU: AMD Ryzen 5 7640HS (at 65w)
  • WSL vm
  • LPDDR5 @ 6400 MT/s, JEDEC timings
  • Threads: 10

| Flow | pp 512 t/s | tg 128 t/s | Speedup |
|---|---|---|---|
| Initial* | 2.05 | 1.32 | 1.0x / 1.0x |
| Scalar | 13.07 | 9.38 | 6.4x / 7.1x |
| SSSE3 | 43.43 | 32.56 | 21.2x / 24.6x |
| AVX | 53.54 | 40.70 | 26.1x / 30.8x |
| AVX + F16C** | 73.87 | 45.94 | 36.0x / 34.7x |
| AVX2 + FMA | 131.03 | 73.85 | 63.9x / 55.9x |
| AVX512 | 137.75 | 76.91 | 67.1x / 58.2x |

"*": Results for the current mainline variant were extrapolated due to me being impatient
"**": F16C is also enabled for AVX2/AVX512, and disabled for the earlier rows (to reflect CPU ISA generations)
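On the F16C footnote: F16C provides hardware fp16→fp32 conversion (e.g. `_mm256_cvtph_ps`), whereas without it each fp16 scale has to be widened in scalar bit math along these lines (a standard bit-manipulation sketch, not the exact ggml fallback helper):

```c
#include <stdint.h>
#include <string.h>

// Portable fp16 -> fp32 conversion, roughly the work a non-F16C fallback has
// to do per fp16 scale; hardware F16C replaces all of this with one
// instruction. Illustrative sketch, not the exact ggml helper.
static float fp16_to_fp32_demo(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);          // inf / NaN
    } else if (exp != 0) {
        bits = sign | ((exp + 112u) << 23) | (mant << 13); // normal: rebias 15 -> 127
    } else if (mant == 0) {
        bits = sign;                                       // signed zero
    } else {
        // subnormal: shift until the implicit bit appears, adjust the exponent
        uint32_t e = 0;
        while (!(mant & 0x400u)) { mant <<= 1; ++e; }
        bits = sign | ((113u - e) << 23) | ((mant & 0x3FFu) << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```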

Perplexity summary for Bonsai 1.7B

| Metric | Scalar | SSSE3 | AVX | AVX2 + FMA |
|---|---|---|---|---|
| Same top p | 99.451 ± 0.207 % | 99.059 ± 0.271 % | 99.373 ± 0.221 % | 99.686 ± 0.157 % |
| Mean KLD | 0.000213 ± 0.000008 | 0.000228 ± 0.000010 | 0.000235 ± 0.000010 | 0.000218 ± 0.000009 |
| Maximum KLD | 0.004783 | 0.004070 | 0.004658 | 0.005173 |
| 99.9% KLD | 0.002648 | 0.003666 | 0.003888 | 0.003778 |
| 99.0% KLD | 0.001295 | 0.001730 | 0.001676 | 0.001318 |
| Median KLD | 0.000129 | 0.000141 | 0.000143 | 0.000134 |
| 1.0% KLD | -0.000012 | -0.000009 | -0.000007 | -0.000006 |
| Minimum KLD | -0.000051 | -0.000040 | -0.000057 | -0.000045 |
| Mean Δp | 0.000 ± 0.009 % | 0.011 ± 0.010 % | 0.000 ± 0.010 % | 0.011 ± 0.010 % |
| Maximum Δp | 2.770 % | 2.917 % | 2.709 % | 3.366 % |
| 99.9% Δp | 1.851 % | 2.036 % | 2.166 % | 2.707 % |
| 99.0% Δp | 1.192 % | 1.359 % | 1.314 % | 1.268 % |
| 95.0% Δp | 0.486 % | 0.534 % | 0.540 % | 0.551 % |
| Median Δp | -0.000 % | 0.000 % | 0.000 % | 0.000 % |
| 5.0% Δp | -0.465 % | -0.558 % | -0.576 % | -0.494 % |
| 1.0% Δp | -1.020 % | -1.034 % | -1.099 % | -0.989 % |
| 0.1% Δp | -1.888 % | -1.412 % | -1.783 % | -1.675 % |
| Minimum Δp | -2.109 % | -1.823 % | -1.859 % | -2.133 % |
| RMS Δp | 0.334 ± 0.017 % | 0.360 ± 0.018 % | 0.362 ± 0.017 % | 0.364 ± 0.022 % |

Things still to be done:

  • Awaiting @khosravipasha approval
  • AVX512 implementation for Zen 4 (I was unable to achieve meaningful improvements beyond the compiler's own optimizations)
  • Implementation for Zen 5 or modern Xeons, as they have a faster AVX512 pipeline
  • Implementing branches for nrc==2, as it shows potential for further speedup (the pipeline is already pretty hot in terms of memory bandwidth); probably in the next PR
  • Maybe some experiments beyond this (repack -> specialized mmvq/mmq; experimenting with scratch buffer configurations)
  • I have a RISC-V SBC with a vector size of 256 and fp16/bf16 support (SpacemiT K1), so maybe a future PR for RISC-V SIMD (or even SpacemiT MMA?)
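On the nrc==2 point above: the idea is to compute two output rows per vec_dot call so that each loaded activation value is reused across both rows, roughly halving activation memory traffic. A toy float sketch of that access pattern (the real kernel would apply it to quantized blocks with SIMD; names are invented):

```c
// Toy illustration of the nrc==2 pattern: two output rows per call,
// each activation loaded once and used twice. Plain floats here; a real
// kernel would do this over quantized blocks with SIMD accumulators.
static void vec_dot2_demo(int n, float out[2],
                          const float *row0, const float *row1, const float *act) {
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float a = act[i]; // one load, two uses
        acc0 += row0[i] * a;
        acc1 += row1[i] * a;
    }
    out[0] = acc0;
    out[1] = acc1;
}
```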

People who have also contributed

(other people who provided useful insights or experimented themselves)

AI usage disclosure

  • AI was used for automating benchmarks, some of the tests, and creating tables
  • It was NOT used to write any other text for the PR or for human interaction
  • It was used for prototyping and iteration (guided by me; the final code was mostly manually refined and tested)

@pl752 pl752 marked this pull request as ready for review April 8, 2026 19:52
@pl752 pl752 requested a review from ggerganov as a code owner April 8, 2026 19:52
@pl752
Contributor Author

pl752 commented Apr 8, 2026

Aaand, we are live. Okay, reviews, requests, and questions are welcome.

@khosravipasha
Contributor

Tested this on an x86 CPU I have access to, an "AMD EPYC 7543 32-Core Processor" (it's on the cloud).

Before this PR it ran at <1 tok/s for the smallest model, so this is a decent speedup. I'm not sure how the speed compares with other quantization formats for models of similar size on CPU only; I have not actively tried them.

CPU Benchmarks (fa=1, CPU-only build)

| Model | Threads | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| Bonsai-1.7B | 4 | 65.0 ± 3.8 | 41.1 ± 1.2 |
| Bonsai-1.7B | 8 | 128.5 ± 6.5 | 52.2 ± 0.2 |
| Bonsai-1.7B | 10 | 153.1 ± 5.6 | 57.4 ± 3.0 |
| Bonsai-4B | 4 | 27.0 ± 1.8 | 20.0 ± 0.6 |
| Bonsai-4B | 8 | 50.0 ± 3.3 | 34.0 ± 0.6 |
| Bonsai-4B | 10 | 59.7 ± 2.1 | 34.8 ± 0.3 |
| Bonsai-8B | 4 | 14.9 ± 0.3 | 12.2 ± 0.2 |
| Bonsai-8B | 8 | 27.6 ± 1.1 | 20.4 ± 1.0 |
| Bonsai-8B | 10 | 33.9 ± 1.3 | 22.9 ± 0.5 |

KL divergence with unpacked version:

| Build | Model | Mean KLD | Same Top Token | Status |
|---|---|---|---|---|
| CPU | 1.7B | 0.000261 ± 0.000009 | 99.22% | PASS |
| CPU | 4B | 0.000214 ± 0.000014 | 99.14% | PASS |
| CPU | 8B | 0.000200 ± 0.000008 | 99.61% | PASS |
