
(Performance; ggml-cpu) Optimized x86 and generic cpu q1_0 dot (follow up) #21636

Open
pl752 wants to merge 2 commits into ggml-org:master from pl752:perf/q1_0_g128_no_nofma

Conversation

@pl752
Contributor

@pl752 pl752 commented Apr 8, 2026

Hello, I have prepared an optimized implementation of the CPU q1_0 dot product (mainly for Bonsai LLM models). This is a continuation of PR PrismML-Eng#10; the list of experiments conducted and some other benchmark results can be found there.

This PR implements:

  • A more efficient generic implementation (less bit math, fewer multiplications) of the (q1_0; q8_0) dot product
  • x86 SIMD-specific implementations of the (q1_0; q8_0) dot product for most realistic x86_64 targets (from SSSE3 to AVX2)
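For context, these blocked dot products all follow the same overall shape: integer accumulation inside each block, then one multiply by both block scales. Below is a minimal scalar sketch of that structure with an *invented* block layout (32 sign bits plus a float scale) — the actual q1_0 packing and group size in this PR differ:

```c
#include <assert.h>
#include <stdint.h>

// Illustrative only: a stand-in 1-bit weight block (32 sign bits plus one
// float scale) dotted against a q8_0-shaped activation block. The real
// q1_0 layout in this PR differs; this just shows the blocked structure.
typedef struct { float d; uint32_t signs; } block_q1_demo;  // bit i: 1 -> +1, 0 -> -1
typedef struct { float d; int8_t qs[32]; } block_q8_demo;   // scale + 32 int8 values

static float vec_dot_q1_q8_demo(int n, const block_q1_demo *x, const block_q8_demo *y) {
    float acc = 0.0f;
    for (int b = 0; b < n/32; ++b) {
        int32_t isum = 0; // integer accumulation inside the block, as in the SIMD paths
        for (int i = 0; i < 32; ++i) {
            const int32_t s = (int32_t)((x[b].signs >> i) & 1u)*2 - 1; // branch-free +-1
            isum += s * y[b].qs[i];
        }
        acc += x[b].d * y[b].d * (float)isum; // both scales applied once per block
    }
    return acc;
}
```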

Checks performed so far:

  • test-quantization-fns passes
  • model behaves well
  • perplexity runs completed for 5x512 batches of wikitext-2-test (unpacked gguf as a reference, Bonsai 1.7B)
  • llama-bench runs for Bonsai 1.7B
  • verified that the generated assembly is efficient (no register spills, good pipeline pressure)
Benchmark results for Bonsai 1.7B

Benchmarks were performed with:

  • CPU: AMD Ryzen 5 7640HS (at 65w)
  • WSL vm
  • LPDDR5 @ 6400 MT/s, JEDEC timings
  • Threads: 10

| Flow | pp 512 t/s | tg 128 t/s | Speedup |
|---|---|---|---|
| Initial* | 2.05 | 1.32 | 1.0x / 1.0x |
| Scalar | 13.07 | 9.38 | 6.4x / 7.1x |
| SSSE3 | 43.43 | 32.56 | 21.2x / 24.6x |
| AVX | 53.54 | 40.70 | 26.1x / 30.8x |
| AVX + F16C** | 73.87 | 45.94 | 36.0x / 34.7x |
| AVX2 + FMA | 131.03 | 73.85 | 63.9x / 55.9x |
| AVX512 | 137.75 | 76.91 | 67.1x / 58.2x |

"*": Results for the current mainline variant were extrapolated due to me being impatient
"**": F16C is also enabled for AVX2/AVX512, and disabled for the earlier rows (to reflect CPU ISA generations)
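On the F16C footnote: F16C provides hardware fp16→fp32 conversion (e.g. `_mm256_cvtph_ps`), whereas without it each fp16 scale has to be widened in scalar bit math along these lines (a standard bit-manipulation sketch, not the exact ggml fallback helper):

```c
#include <stdint.h>
#include <string.h>

// Portable fp16 -> fp32 conversion, roughly the work a non-F16C fallback has
// to do per fp16 scale; hardware F16C replaces all of this with one
// instruction. Illustrative sketch, not the exact ggml helper.
static float fp16_to_fp32_demo(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);          // inf / NaN
    } else if (exp != 0) {
        bits = sign | ((exp + 112u) << 23) | (mant << 13); // normal: rebias 15 -> 127
    } else if (mant == 0) {
        bits = sign;                                       // signed zero
    } else {
        // subnormal: shift until the implicit bit appears, adjust the exponent
        uint32_t e = 0;
        while (!(mant & 0x400u)) { mant <<= 1; ++e; }
        bits = sign | ((113u - e) << 23) | ((mant & 0x3FFu) << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```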

Perplexity summary for Bonsai 1.7B

| Metric | Scalar | SSSE3 | AVX | AVX2 + FMA |
|---|---|---|---|---|
| Same top p | 99.451 ± 0.207 % | 99.059 ± 0.271 % | 99.373 ± 0.221 % | 99.686 ± 0.157 % |
| Mean KLD | 0.000213 ± 0.000008 | 0.000228 ± 0.000010 | 0.000235 ± 0.000010 | 0.000218 ± 0.000009 |
| Maximum KLD | 0.004783 | 0.004070 | 0.004658 | 0.005173 |
| 99.9% KLD | 0.002648 | 0.003666 | 0.003888 | 0.003778 |
| 99.0% KLD | 0.001295 | 0.001730 | 0.001676 | 0.001318 |
| Median KLD | 0.000129 | 0.000141 | 0.000143 | 0.000134 |
| 1.0% KLD | -0.000012 | -0.000009 | -0.000007 | -0.000006 |
| Minimum KLD | -0.000051 | -0.000040 | -0.000057 | -0.000045 |
| Mean Δp | 0.000 ± 0.009 % | 0.011 ± 0.010 % | 0.000 ± 0.010 % | 0.011 ± 0.010 % |
| Maximum Δp | 2.770 % | 2.917 % | 2.709 % | 3.366 % |
| 99.9% Δp | 1.851 % | 2.036 % | 2.166 % | 2.707 % |
| 99.0% Δp | 1.192 % | 1.359 % | 1.314 % | 1.268 % |
| 95.0% Δp | 0.486 % | 0.534 % | 0.540 % | 0.551 % |
| Median Δp | -0.000 % | 0.000 % | 0.000 % | 0.000 % |
| 5.0% Δp | -0.465 % | -0.558 % | -0.576 % | -0.494 % |
| 1.0% Δp | -1.020 % | -1.034 % | -1.099 % | -0.989 % |
| 0.1% Δp | -1.888 % | -1.412 % | -1.783 % | -1.675 % |
| Minimum Δp | -2.109 % | -1.823 % | -1.859 % | -2.133 % |
| RMS Δp | 0.334 ± 0.017 % | 0.360 ± 0.018 % | 0.362 ± 0.017 % | 0.364 ± 0.022 % |

Things still to be done:

  • Awaiting @khosravipasha approval
  • AVX512 implementation for Zen 4 (I was unable to achieve meaningful improvements beyond the compiler's own optimizations)
  • Implementation for Zen 5 or modern Xeons, as they have a faster AVX512 pipeline
  • Implementing branches for nrc==2, as it shows potential for further speedup (the pipeline is already pretty hot in terms of memory bandwidth); probably in the next PR
  • Maybe some experiments beyond this (repack -> specialized mmvq/mmq; experimenting with scratch buffer configurations)
  • I have a RISC-V SBC with a vector size of 256 and fp16/bf16 support (SpacemiT K1), so maybe a future PR for RISC-V SIMD (or even SpacemiT MMA?)
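On the nrc==2 point above: the idea is to compute two output rows per vec_dot call so that each loaded activation value is reused across both rows, roughly halving activation memory traffic. A toy float sketch of that access pattern (the real kernel would apply it to quantized blocks with SIMD; names are invented):

```c
// Toy illustration of the nrc==2 pattern: two output rows per call,
// each activation loaded once and used twice. Plain floats here; a real
// kernel would do this over quantized blocks with SIMD accumulators.
static void vec_dot2_demo(int n, float out[2],
                          const float *row0, const float *row1, const float *act) {
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float a = act[i]; // one load, two uses
        acc0 += row0[i] * a;
        acc1 += row1[i] * a;
    }
    out[0] = acc0;
    out[1] = acc1;
}
```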

People who have also contributed

(other people who provided useful insights or experimented themselves)

AI usage disclosure

  • AI was used for automating benchmarks, some of the tests, and creating tables
  • It was NOT used to write any other text for the PR or for human interaction
  • It was used for prototyping and iteration (guided by me; the final code was mostly manually refined and tested)

@pl752 pl752 marked this pull request as ready for review April 8, 2026 19:52
@pl752 pl752 requested a review from ggerganov as a code owner April 8, 2026 19:52
@pl752
Contributor Author

pl752 commented Apr 8, 2026

Aaand, we are live. Okay, reviews, requests, and questions are welcome.

@khosravipasha
Contributor

Tested this on an x86 CPU I have access to, an "AMD EPYC 7543 32-Core Processor" (it's on the cloud).

Before this PR it ran at <1 tok/s for the smallest model, so this is a decent speedup. I'm not sure how the speed compares with other quantization formats for models of similar size on CPU only; I have not actively tried them.

CPU Benchmarks (fa=1, CPU-only build)

| Model | Threads | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| Bonsai-1.7B | 4 | 65.0 ± 3.8 | 41.1 ± 1.2 |
| Bonsai-1.7B | 8 | 128.5 ± 6.5 | 52.2 ± 0.2 |
| Bonsai-1.7B | 10 | 153.1 ± 5.6 | 57.4 ± 3.0 |
| Bonsai-4B | 4 | 27.0 ± 1.8 | 20.0 ± 0.6 |
| Bonsai-4B | 8 | 50.0 ± 3.3 | 34.0 ± 0.6 |
| Bonsai-4B | 10 | 59.7 ± 2.1 | 34.8 ± 0.3 |
| Bonsai-8B | 4 | 14.9 ± 0.3 | 12.2 ± 0.2 |
| Bonsai-8B | 8 | 27.6 ± 1.1 | 20.4 ± 1.0 |
| Bonsai-8B | 10 | 33.9 ± 1.3 | 22.9 ± 0.5 |

KL divergence with unpacked version:

| Build | Model | Mean KLD | Same Top Token | Status |
|---|---|---|---|---|
| CPU | 1.7B | 0.000261 ± 0.000009 | 99.22% | PASS |
| CPU | 4B | 0.000214 ± 0.000014 | 99.14% | PASS |
| CPU | 8B | 0.000200 ± 0.000008 | 99.61% | PASS |
