ggml-cpu: add Q1_0 AVX2 fast path#21562

Closed
elusznik wants to merge 1 commit into ggml-org:master from elusznik:q1_0-x86-avx2

Conversation

@elusznik elusznik commented Apr 7, 2026

Overview

Adds an AVX2 SIMD fast path for ggml_vec_dot_q1_0_q8_0() in ggml/src/ggml-cpu/quants.c.

Q1_0 was missing an x86 kernel and fell back to a scalar loop. This patch implements the fast path using bytes_from_bits_32() and mul_sum_i8_pairs_float() helpers — added in this commit alongside the fast path itself — keeping it minimal and consistent with the q4/q5 kernel style. The scalar fallback remains intact for non-AVX2 builds, and the AVX2 path degrades gracefully to a mul+add sequence when FMA is not available.

Benchmark (AMD Ryzen 7 5800X, Bonsai-8B Q1_0, 16 threads):

`test-quantize-perf --type q1_0 --op vec_dot_q -4`:

| Size | Baseline (cycles/32) | AVX2 (cycles/32) | Speedup |
|---|---|---|---|
| 4 KB | 104.4 | 6.3 | ~16.5x |
| 64 KB | 103.6 | 5.6 | ~18.5x |
| 2.5 MB | 104.8 | 5.8 | ~18.1x |
| 250 MB | 105.6 | 6.0 | ~17.6x |

llama-server --threads 16 --ctx-size 512:

| Metric | Baseline | AVX2 | Speedup |
|---|---|---|---|
| Prompt eval | 1.24/s | 18.64/s | ~15x |
| Generation | 1.13/s | 18.01/s | ~16x |

Additional information

Follow-up to the existing ARM NEON Q1_0 implementation. The x86 AVX2 path uses the same algorithm adapted for x86 intrinsics.

Requirements

Copilot AI review requested due to automatic review settings April 7, 2026 14:19
@elusznik elusznik requested a review from ggerganov as a code owner April 7, 2026 14:19
@ggml-gh-bot

ggml-gh-bot bot commented Apr 7, 2026

Hi @elusznik, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.


Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an AVX2 SIMD fast path for the ggml_vec_dot_q1_0_q8_0_generic() (Q1_0 × Q8_0) dot product to avoid falling back to the scalar implementation on x86 CPUs.

Changes:

  • Introduces AVX2 helper intrinsics for horizontal float reduction, bit expansion (32 bits → 32 bytes), and packed int8 dot accumulation.
  • Adds an AVX2-accelerated inner loop for ggml_vec_dot_q1_0_q8_0_generic() with a scalar fallback for non-AVX2 builds.


elusznik commented Apr 7, 2026

Fixed the Copilot-reported issues.

@elusznik elusznik requested a review from Copilot April 7, 2026 14:57

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.



Comment on lines +194 to +198
#if defined(__AVX2__) && defined(__FMA__)
acc = _mm256_fmadd_ps(d, q, acc);
#else
acc = _mm256_add_ps(acc, _mm256_mul_ps(d, q));
#endif

Copilot AI Apr 7, 2026


Inside an existing #if defined(__AVX2__) region, the nested defined(__AVX2__) && check is redundant. Simplify this branch to only check __FMA__ to improve readability and reduce preprocessor clutter.

Comment on lines +32 to +33
res = _mm_add_ps(res, _mm_movehl_ps(res, res));
res = _mm_add_ss(res, _mm_movehdup_ps(res));

Copilot AI Apr 7, 2026


_mm_movehdup_ps is an SSE3 intrinsic, but this block is only guarded by __AVX2__. If the build configuration ever enables AVX2 without enabling SSE3 intrinsics (toolchain/flags mismatch), this can become a build issue. Consider rewriting the final reduction step using only SSE/SSE2 shuffles (or add an explicit compile-time requirement) so the AVX2 guard is sufficient.

Suggested change
res = _mm_add_ps(res, _mm_movehl_ps(res, res));
res = _mm_add_ss(res, _mm_movehdup_ps(res));
res = _mm_add_ps(res, _mm_shuffle_ps(res, res, _MM_SHUFFLE(2, 3, 0, 1)));
res = _mm_add_ss(res, _mm_shuffle_ps(res, res, _MM_SHUFFLE(1, 0, 3, 2)));

am17an commented Apr 7, 2026

You're adding to the wrong file, you need to add in ggml/src/ggml-cpu/arch/x86/quants.c, and some of the functions already exist there

@elusznik elusznik closed this Apr 7, 2026
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 7, 2026
@khosravipasha

We have a few PR contributions for x86 variants in our public fork. We are planning to test them, choose the best ones, and send a PR here. In the meantime, if anyone is curious, there is more discussion here:
PrismML-Eng#10

