ggml-cpu: add Q1_0 AVX2 fast path #21562
Conversation
Hi @elusznik, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an AVX2 SIMD fast path for the ggml_vec_dot_q1_0_q8_0_generic() (Q1_0 × Q8_0) dot product to avoid falling back to the scalar implementation on x86 CPUs.
Changes:
- Introduces AVX2 helper intrinsics for horizontal float reduction, bit expansion (32 bits → 32 bytes), and packed int8 dot accumulation.
- Adds an AVX2-accelerated inner loop for `ggml_vec_dot_q1_0_q8_0_generic()` with a scalar fallback for non-AVX2 builds.
fixed the copilot-reported issues
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
```c
#if defined(__AVX2__) && defined(__FMA__)
    acc = _mm256_fmadd_ps(d, q, acc);
#else
    acc = _mm256_add_ps(acc, _mm256_mul_ps(d, q));
#endif
```
Inside an existing #if defined(__AVX2__) region, the nested defined(__AVX2__) && check is redundant. Simplify this branch to only check __FMA__ to improve readability and reduce preprocessor clutter.
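Concretely, the simplification the review suggests would read something like this (a sketch of the reviewed hunk, not a tested patch):

```c
// Already inside an #if defined(__AVX2__) region, so only FMA needs checking.
#if defined(__FMA__)
    acc = _mm256_fmadd_ps(d, q, acc);
#else
    acc = _mm256_add_ps(acc, _mm256_mul_ps(d, q));
#endif
```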
```c
res = _mm_add_ps(res, _mm_movehl_ps(res, res));
res = _mm_add_ss(res, _mm_movehdup_ps(res));
```
_mm_movehdup_ps is an SSE3 intrinsic, but this block is only guarded by __AVX2__. If the build configuration ever enables AVX2 without enabling SSE3 intrinsics (toolchain/flags mismatch), this can become a build issue. Consider rewriting the final reduction step using only SSE/SSE2 shuffles (or add an explicit compile-time requirement) so the AVX2 guard is sufficient.
```diff
-res = _mm_add_ps(res, _mm_movehl_ps(res, res));
-res = _mm_add_ss(res, _mm_movehdup_ps(res));
+res = _mm_add_ps(res, _mm_shuffle_ps(res, res, _MM_SHUFFLE(2, 3, 0, 1)));
+res = _mm_add_ss(res, _mm_shuffle_ps(res, res, _MM_SHUFFLE(1, 0, 3, 2)));
```
You're adding to the wrong file, you need to add in
We have a few PR contributions in our public fork for x86 variants; we are planning to test them, choose the best ones, and send a PR there. If anyone is curious, in the meantime there is more discussion here:
Overview
Adds an AVX2 SIMD fast path for `ggml_vec_dot_q1_0_q8_0()` in `ggml/src/ggml-cpu/quants.c`. Q1_0 was missing an x86 kernel and fell back to a scalar loop.

This patch implements the fast path using the `bytes_from_bits_32()` and `mul_sum_i8_pairs_float()` helpers (added in this commit alongside the fast path itself), keeping it minimal and consistent with the q4/q5 kernel style. The scalar fallback remains intact for non-AVX2 builds, and the AVX2 path degrades gracefully to a mul+add sequence when FMA is not available.

Benchmark (AMD Ryzen 7 5800X, Bonsai-8B Q1_0, 16 threads):
`test-quantize-perf --type q1_0 --op vec_dot_q -4`:

`llama-server --threads 16 --ctx-size 512`:

Additional information
Follow-up to the existing ARM NEON Q1_0 implementation. The x86 AVX2 path uses the same algorithm adapted for x86 intrinsics.
Requirements