(Performance) Optimized x86 and generic q1_0(_g128) dot #10
pl752 wants to merge 3 commits into PrismML-Eng:master from
Conversation
|
Thanks, this looks great, nice write-up. I am not too familiar with SIMD/AVX stuff; what CPUs does this support? |
|
@khosravipasha You are welcome :)
As for perplexity, I have performed a run for a single 64-token wikitext-2-test chunk with the 1.7B model
I will perform more runs |
|
I have run 5 chunks of 512 tokens; it looks better, I think. Will run 100 chunks:
|
|
I am somewhat in doubt now; it seems to be something around the effect of comparing CPU to CUDA, or something in between fp32->fp16 and fp32->q8_0, or maybe it is from using a smaller model |
|
@pl752 Awesome, thanks for the explanations. https://github.com/ggml-org/llama.cpp/tree/master/tools/perplexity Yeah, I ran the model in fp16 as the baseline using these https://huggingface.co/collections/prism-ml/bonsai-auxiliary |
|
Okay, don't forget to thank the user from whom I've hijacked the AVX-512 implementation |
|
@pl752 good idea, which one was it? We can tag them here. After that's merged, we can all send a PR together to main llama.cpp, with everyone that contributed tagged, maybe. Note that there will be some naming changes (in summary, Q1_0_g128 is renamed to Q1_0, and the original Q1_0 will be deleted). Should not affect running the current models. |
|
|
Performed an additional 5×512 run against the unpacked gguf
|
|
UPD: I have reviewed how I was interleaving instructions when testing various register pressure options and found issues resulting in register spilling, so I just relied on the compiler doing its job properly and simply unrolled the inner loop with individual accumulators for SSSE3 (the compiler already did pretty well for the other flows); I also tried the same thing for AVX-512, but it resulted in a tiny performance regression. It had almost no effect on perplexity. Effects on performance (baseline has drifted due to using
|
| flow | run | baseline | updated | delta |
|---|---|---|---|---|
| SSSE3 | pp512 | 33.38 t/s | 39.18 t/s | +17.36% |
| SSSE3 | tg128 | 24.61 t/s | 29.24 t/s | +18.81% |
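The individual-accumulator unrolling mentioned in the UPD can be sketched in plain C. The function name and shapes here are hypothetical, not the PR's actual code; the point is that independent partial sums break the dependency chain and let the compiler keep each in its own register:

```c
#include <stdint.h>

// Sketch of inner-loop unrolling with independent accumulators.
// Each acc* forms its own dependency chain, so the adds can overlap.
static int32_t dot_unrolled(const int8_t *x, const int8_t *y, int n) {
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (int i = 0; i < n; i += 4) {       // n assumed a multiple of 4
        acc0 += x[i+0] * y[i+0];
        acc1 += x[i+1] * y[i+1];
        acc2 += x[i+2] * y[i+2];
        acc3 += x[i+3] * y[i+3];
    }
    return acc0 + acc1 + acc2 + acc3;      // single reduction at the end
}
```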
Pull request overview
This PR focuses on improving CPU inference throughput by optimizing the q1_0 / q1_0_g128 dot-product kernels against q8_0, reducing bit-twiddling overhead in portable fallbacks and introducing additional optimized x86 SIMD execution paths.
Changes:
- Reworked generic fallbacks to process packed sign bits in a byte-oriented way (4 × 8-value groups per 32-element sub-block), eliminating per-element bit index arithmetic.
- Implemented x86-specialized kernels for `ggml_vec_dot_q1_0_q8_0` and `ggml_vec_dot_q1_0_g128_q8_0` with multiple SIMD paths (SSSE3 / AVX / AVX2 / AVX-512BW) plus a scalar byte-oriented fallback.
- Added small SSSE3 helpers to expand packed sign bits into byte masks and to reduce vector accumulators.
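The byte-oriented sign decoding can be illustrated with a small scalar sketch. This is a hypothetical model of the scheme the summary describes (4 sign bytes per 32-element sub-block, 8 values per byte), not the actual fallback code; the bit-set-means-negative convention is assumed:

```c
#include <stdint.h>

// Byte-oriented sign decode: each sign byte covers 8 values, so a
// 32-element sub-block needs exactly 4 bytes and no per-element
// bit-index arithmetic.
static int32_t dot_signed_block32(const uint8_t signs[4], const int8_t y[32]) {
    int32_t sum = 0;
    for (int b = 0; b < 4; ++b) {          // 4 sign bytes per sub-block
        uint8_t s = signs[b];
        for (int j = 0; j < 8; ++j) {      // 8 values per sign byte
            // assumed convention: bit set -> negative weight
            sum += (s & (1u << j)) ? -y[8*b + j] : y[8*b + j];
        }
    }
    return sum;
}
```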
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| ggml/src/ggml-cpu/quants.c | Optimizes portable q1_0 and q1_0_g128 generic dot fallbacks by switching to explicit byte-oriented sign decoding and removing per-element bit math. |
| ggml/src/ggml-cpu/arch/x86/quants.c | Replaces x86 dispatch to generic kernels with specialized SIMD implementations across AVX-512BW/AVX2/AVX/SSSE3, keeping a byte-oriented scalar fallback. |
|
|
I tested the AVX2 impl, slightly faster than #7 (see the est. full test time) but slower than xor+sub. Maybe the reported 0.00022 KLD is arch-related (Tiger Lake and Broadwell are both Intel CPUs). I have tried several impls on Broadwell; all hit the same KLD after the first few chunks, so there's little point in running the full test later just to confirm the KLD. |
|
@zcattacz Thank you for the hint, it worked at least for AVX2. I will revise my current kernels and post updates |
|
@pl752, oh, my bad, I misread your KLD. Are they all tested on AMD? It's also around 0.00022. The xor+sub is adapted from PR4. If you are after speed, please give it a try. You can find the code I tested for AVX2 in my comment in #7. Even the shadowed variable gives it a 5%~10% boost. I also tested a double-accumulator impl, but it didn't give any edge. The compiler seems to be doing some magic here. |
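For readers unfamiliar with the xor+sub trick being discussed: it relies on the two's-complement identity `(x ^ m) - m`, which negates `x` when the mask byte `m` is all-ones and leaves it unchanged when `m` is zero. A scalar model (the vector form would use `_mm256_xor_si256` + `_mm256_sub_epi8`; the function name here is made up):

```c
#include <stdint.h>

// Scalar model of xor+sub sign application: m is 0x00 (keep) or 0xFF
// (negate). x ^ 0xFF == ~x, and ~x + 1 == -x in two's complement, so
// (x ^ m) - m selects between x and -x without a branch or multiply.
static int8_t apply_sign_xor_sub(int8_t x, int8_t m) {
    return (int8_t)((x ^ m) - m);
}
```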
|
@zcattacz They were all tested on an AMD Ryzen 5 7640HS (Zen 4) |
|
UPD2: Okay, I have applied the advice and it resulted in positive performance changes and no significant perplexity changes. However, I removed the AVX-512 branch, now relying on the compiler taking advantage of some of the register and instruction layout changes, as I failed to achieve a meaningful performance increase past the current AVX2 flow on my Zen 4; moreover, the AVX2 variant is even slightly faster. Somebody with Zen 5 or a modern Intel Xeon should take a look and experiment. Performance changes (t=10)
|
|
Hi @pl752, nice improvement. If you want to squeeze a bit more juice, pls try the code in #7 (comment) , it's simpler but the compiler makes the single accumulator impl even faster than the double accumulator (gives another 2~3tps for free on an i5). |
|
That's interesting: two accumulators gave better performance before the xor+sub change, now it's the other way around; the baselines have drifted once more though
Couldn't confirm the KLD changes though, @zcattacz, check that I haven't made a mistake there |
|
Good news: our first CPU PR just got merged into the llama.cpp master branch. If you are still working on this, please rebase with PrismML's master (just pulled the main llama.cpp). Changes: the Q1_0_g128 naming is gone now; the original Q1_0 with group size 32 was deleted, and Q1_0_g128 was renamed to Q1_0, which now has group size 128 by default. https://github.com/PrismML-Eng/llama.cpp/tree/master This one only has the generic CPU path (slow) and the ARM NEON path; planning to gather the best x86 kernels from here and to send a PR there (and tag all the contributors). |
|
@pl752, yeah, I also tested the double accumulator with a fully unrolled version, but the speed is still a marginal net loss. Looks like FMA is not the bottleneck. Could you try this and see if it works better for you? Since it's slow, I only ran the full perplexity test in the initial tuning. Can't find the numbers, but I recall my max KLD was way higher than what was reported in PR7. It drops down after the initial FMA is replaced with a simple MUL. |
|
That (sign alt) was a few percent slower in my case too, unfortunately |
|
Brought the code to a uniform structure; insignificant changes in the ASM for AVX, no measurable changes in perplexity or performance. Will prepare for rebase |
b793ed1 to
195593b
|
I think yes, we can write a draft and then send it to the main tree. However I am going to sleep currently, so I will help later. I have been testing perplexity all the way through to avoid breaking things, but additional tests won't hurt. Then benchmarks for the scalar and SIMD branches need to be redone and a cleaned-up summary added. In my current implementation SSSE3 covers most CPUs; AVX helps with the fp32 accumulation part, but the difference in performance is questionable; AVX2 handles modern-ish CPUs as the most performant path; and the AVX-512-specific branch was discarded (due to me failing to obtain any improvement over AVX2 with the AVX-512 flag set on my Zen 4). It is still mentioned in my latest benchmarks, as the compiler still produces more optimized code for AVX2 with AVX-512 enabled, because AVX-512 provides 32 SIMD registers instead of 16 (aside from the fact that their maximum length extends to 512 bits, which isn't used here), allowing some additional freedom when applying O3 optimizations. |
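The path hierarchy described here (AVX2 for modern CPUs, AVX, SSSE3 for most others, scalar as the last resort) can be mirrored with the usual compile-time feature macros; a hypothetical sketch, not the PR's actual dispatch code:

```c
#include <string.h>

// Hypothetical compile-time selection mirroring the described hierarchy.
// The real kernels live in ggml/src/ggml-cpu/arch/x86/quants.c.
#if defined(__AVX2__)
#define Q1_0_PATH "AVX2"
#elif defined(__AVX__)
#define Q1_0_PATH "AVX"
#elif defined(__SSSE3__)
#define Q1_0_PATH "SSSE3"
#else
#define Q1_0_PATH "scalar"
#endif

static const char *q1_0_path(void) { return Q1_0_PATH; }
```

Which branch is taken depends on the compiler's target flags (e.g. `-march=native`), so the selected path varies by build.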
|
So I can try creating the PR draft myself tomorrow, tag everybody from the discussion, and credit any remaining code that isn't my own, if any is left; then we will look into the next steps. The code itself seems to be pretty clean. |
|
Thanks @pl752 of course take your time, sleep is more important :D Closed the other CPU PRs. We can collectively send a PR to main llama.cpp when this branch is ready. Tagging people that helped in other PRs (let me know if I missed any) |
|
Comet Lake data point from my i7-10510U (4C/8T, laptop, Windows MinGW-w64 UCRT gcc 15.2), same
Speedups vs generic scalar on the same branch: One pattern worth noting: on this chip |
|
@Marxist-Leninist Thank you for an additional insight. Using SMT is known to significantly increase memory pressure; this is the reason I used the physical core count in the initial benchmarks, to avoid a memory bottleneck. Then through benchmarks I found out that for my system 10 threads (logical thread count - 2) yielded max tg and near-max pp, while setting 12 threads (all threads) didn't significantly increase pp and slightly reduced tg. That's a pretty common thing for systems with many threads (there also was/is a recommendation to use nproc - 2, or even ncore - 2 in case the system has many cores), as memory in system designs usually scales worse than compute power: 16-core Ryzens get the same dual-channel DDR4/5 bandwidth as lower core counts, and systems like Threadrippers can have an even higher core-to-memory-bandwidth ratio. |
|
Thanks @pl752 — that matches exactly. Just ran the same branch on
Same pattern holds at 8B scale: Continuity note with my closed #4: my original d603bf4 AVX2 kernel on the same chip + same 8B model measured 4.7 pp / 3.1 tg at |
|
Two more data points on the same i7-10510U, this branch at Bonsai-4B, completing the 1.7B → 4B → 8B series:
Consistent with the 1.7B and 8B results I posted earlier — the Intel UHD has CPU-side micro-optimizations on 8B:
Two things worth flagging for the upstream PR:
The measurable hard ceiling on this chip is ~4.7 t/s on 8B tg128, bandwidth-bound by single-channel DDR4-2667. Notably, 4B and 8B run at identical ~3.8 t/s on this chip because both are already hitting the RAM bandwidth wall — zero throughput penalty for picking 8B over 4B on this hardware class. |
Runtime tuning notes from testing this branch (Windows / MinGW, Comet Lake)
Kicked the tires on this branch against 8B Q1_0_g128 on a laptop CPU and a Gen9.5 iGPU — posting the numbers in case they help triage follow-ups or adjacent PRs (#9 Vulkan, #18 CUDA).
Results (8B Q1_0_g128, -t 4, 512-prompt / 128-gen, llama-server /v1/chat)
Observations worth flagging
Environment
Happy to rerun with specific flags if any of these numbers look off — the --mlock vs no-mlock gap is the most reproducible of the bunch. |
|
One more round of data on this PR — trying to find speedup on 8B tg and hitting a ceiling that isn't in the kernel.
tl;dr
The kernel in this PR isn't the bottleneck on either of the CPUs I tested. On a power-limited laptop it's PL1 throttling, and on a Skylake-SP Xeon VM the kernel is still fine — a hand-written AVX-512BW variant was within noise on 8B and slower on 1.7B. The signal seems to be that future wins live at the framework/scheduling layer, not in the SIMD inner loop.
i7-10510U (Comet Lake, 4C/8T, 15W, DDR4-2667 single-channel)
Thread sweep (Bonsai-8B Q1_0, tg128,
Memory bandwidth vs achieved throughput (AVX2 aligned-load microbench on the 1.07 GB Q1_0 file):
One thread already saturates ~88% of the practical DDR4-2667 single-channel ceiling (~16.9 GB/s). But llama.cpp Q1_0 tops out at ~35% of that. There's ~2.8× theoretical headroom that isn't in the kernel — it's in the per-layer orchestration (OpenMP barriers between the ~32 layers, softmax/layernorm serialization, non-weight traffic, scale loads, KV reads). CPU frequency under load:
Instant recovery rules out thermal — this is the i7-10510U firmware PL1=15W clamp kicking in after the ~28s Tau window. Theoretical unlocked ceiling is 3.4 / 1.95 ≈ 1.74×, which would put 8B at ~8.2 t/s. Needs admin + ThrottleStop/XTU MSR writes; can't be fixed from llama.cpp. Null-result tweaks (all within noise of 4.72 t/s baseline):
Skylake-SP Xeon VM (AVX2 + AVX-512F/BW/CD/DQ/VL, no VNNI, 16 cores)
Baseline on
Tried a hand-written AVX-512BW variant of
The 1.7B regression is statistically significant. Two suspected causes:
A VNNI (
What I think this PR actually does
Scott's xor+sub kernel (now the hot path here) is extracting everything the clock will give. On both test machines the bottleneck is outside the vec_dot — PL1 on the laptop, parallel-sync overhead + per-layer serial work on the Xeon. Worth keeping in mind if anyone else shows up with "let me try AVX-512 / NEON SVE2 / etc": the 2.8× theoretical ceiling from membw is there, but you have to go looking for it in the layer loop, not in the dot product. |
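The bandwidth arithmetic behind the "hard ceiling" reasoning in this comment can be written down explicitly. The inputs below are the numbers quoted in the post (~16.9 GB/s practical single-channel DDR4-2667, 1.07 GB Q1_0 weight file); the helper name is made up for illustration:

```c
// Every generated token must stream the full weight file from RAM once,
// so token generation throughput is bounded by bandwidth / model size.
static double tg_ceiling_tps(double bw_gb_per_s, double model_gb) {
    return bw_gb_per_s / model_gb;
}
```

Plugging in 16.9 GB/s and 1.07 GB gives a ceiling of roughly 15.8 t/s; comparing that against the achieved tg is where the several-fold headroom estimate comes from.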
|
Okay, @khosravipasha, so what do we need to do next? Do I just create a PR for main llama.cpp, or does something else need to be done? I think the code is pretty much ready; I have also acquired the final benchmark numbers and tested perplexity and test-quantize-fns once more |
|
I have opened a draft, ggml-org#21636, so it is visible on the main repo |
|
Sounds good thanks for putting it all together. |
|
So, what are the next steps? |
|
I also think (due to the things @Marxist-Leninist is describing) that some alternative geometry options can be explored (nrows > 1, or even more ambitious things like repacking and specialized kernels) to try to alleviate the suspected memory pressure, at least for |
|
I have vibe-coded a small experiment (
Patch
Benchmark
The main purpose there was to use the activation matrix twice (which is heavier on bandwidth) |
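A minimal scalar sketch of the nrows > 1 idea (hypothetical function, not the experiment's code): two weight rows consume the same streamed activation vector, so activation bandwidth is paid once per pair of output values instead of once per value:

```c
#include <stdint.h>

// nrows = 2 sketch: process two weight rows against one activation
// vector in a single pass, so y is streamed from memory only once.
static void dot_two_rows(const int8_t *w0, const int8_t *w1,
                         const int8_t *y, int n, int32_t out[2]) {
    int32_t s0 = 0, s1 = 0;
    for (int i = 0; i < n; ++i) {
        int8_t yi = y[i];      // loaded once, used for both rows
        s0 += w0[i] * yi;
        s1 += w1[i] * yi;
    }
    out[0] = s0;
    out[1] = s1;
}
```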
|
@khosravipasha, I would really like to know what else (if anything) we should do before we push the PR for review (remove draft status), as I feel a little bit awkward for some reason (or maybe I am just hurrying too much)? |
|
This is a good start I feel and much better than falling back to generic. I guess people can do separate PRs if they get massive improvements. |
|
I think that, since it performs significantly better than the current options, I will undraft the PR and then maybe open a second one, as I want to add a second round of branches with nrc == 2 |
|
Have opened draft PR #21 (mostly for demonstration and discussion) related to nrows and geometry optimizations. |
Hello
This is yet another PR about the truncation fix and optimization of CPU inference. In this case I have:
Note that this PR is built on top of #3 by @jordankzf, who implemented the AVX-512 workflow
Benchmarks were performed with Bonsai-1.7B.gguf (Q1_0_g128), measuring pp 512 t/s and tg 128 t/s across the SSSE3, AVX, AVX2+FMA, and AVX512BW flows.

* extrapolated from pp 32 / tg 16: 1.659 t/s pp and 0.862 t/s tg, as I was impatient.

** new SIMD instruction kinds improve performance even on AMD Zen 4's implementation of AVX-512, which uses a 256-bit pipeline twice instead of implementing a full 512-bit one

I would appreciate your feedback