ggml-cpu: add Q1_0 AVX2 fast path #21562
Conversation
Hi @elusznik, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an AVX2 SIMD fast path for the ggml_vec_dot_q1_0_q8_0_generic() (Q1_0 × Q8_0) dot product to avoid falling back to the scalar implementation on x86 CPUs.
Changes:
- Introduces AVX2 helper intrinsics for horizontal float reduction, bit expansion (32 bits → 32 bytes), and packed int8 dot accumulation.
- Adds an AVX2-accelerated inner loop for `ggml_vec_dot_q1_0_q8_0_generic()` with a scalar fallback for non-AVX2 builds.
fixed the copilot-reported issues
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
```c
#if defined(__AVX2__) && defined(__FMA__)
    acc = _mm256_fmadd_ps(d, q, acc);
#else
    acc = _mm256_add_ps(acc, _mm256_mul_ps(d, q));
#endif
```
Inside an existing #if defined(__AVX2__) region, the nested defined(__AVX2__) && check is redundant. Simplify this branch to only check __FMA__ to improve readability and reduce preprocessor clutter.
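Concretely, the simplification the review suggests would read something like this (a sketch of the reviewed hunk, not a tested patch):

```c
// Already inside an #if defined(__AVX2__) region, so only FMA needs checking.
#if defined(__FMA__)
    acc = _mm256_fmadd_ps(d, q, acc);
#else
    acc = _mm256_add_ps(acc, _mm256_mul_ps(d, q));
#endif
```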
```c
res = _mm_add_ps(res, _mm_movehl_ps(res, res));
res = _mm_add_ss(res, _mm_movehdup_ps(res));
```
_mm_movehdup_ps is an SSE3 intrinsic, but this block is only guarded by __AVX2__. If the build configuration ever enables AVX2 without enabling SSE3 intrinsics (toolchain/flags mismatch), this can become a build issue. Consider rewriting the final reduction step using only SSE/SSE2 shuffles (or add an explicit compile-time requirement) so the AVX2 guard is sufficient.
```diff
-res = _mm_add_ps(res, _mm_movehl_ps(res, res));
-res = _mm_add_ss(res, _mm_movehdup_ps(res));
+res = _mm_add_ps(res, _mm_shuffle_ps(res, res, _MM_SHUFFLE(2, 3, 0, 1)));
+res = _mm_add_ss(res, _mm_shuffle_ps(res, res, _MM_SHUFFLE(1, 0, 3, 2)));
```
You're adding to the wrong file, you need to add in
We have a few PR contributions in our public fork for x86 variants; we are planning to test them, choose the best ones, and send a PR there. If anyone is curious, in the meantime there is more discussion here:
Overview
Adds an AVX2 SIMD fast path for `ggml_vec_dot_q1_0_q8_0()` in `ggml/src/ggml-cpu/quants.c`. Q1_0 was missing an x86 kernel and fell back to a scalar loop.

This patch implements the fast path using the `bytes_from_bits_32()` and `mul_sum_i8_pairs_float()` helpers (added in this commit alongside the fast path itself), keeping it minimal and consistent with the q4/q5 kernel style. The scalar fallback remains intact for non-AVX2 builds, and the AVX2 path degrades gracefully to a mul+add sequence when FMA is not available.

Benchmark (AMD Ryzen 7 5800X, Bonsai-8B Q1_0, 16 threads):
`test-quantize-perf --type q1_0 --op vec_dot_q -4`:

`llama-server --threads 16 --ctx-size 512`:

Additional information
Follow-up to the existing ARM NEON Q1_0 implementation. The x86 AVX2 path uses the same algorithm adapted for x86 intrinsics.
Requirements