Commit ba0e521
fix: Q1_0_g128 x86 CPU kernel - correct output + AVX2/AVX-512 VNNI
The Q1_0_g128 vec_dot kernel for x86 produces garbage output due to a
float-to-int truncation bug: `sumi += d1 * sumi_block` accumulates a
float product into an int, silently truncating the result to zero for
small scale factors. This affects both the generic scalar fallback and
the x86 arch-specific implementation.
The ARM NEON implementation was correct and unaffected.
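
The truncation can be reproduced in isolation. The following is a minimal sketch, not the kernel from this commit: the function names `dot_buggy` and `dot_fixed` are hypothetical, and it assumes each large block is split into sub-blocks that each carry their own activation scale `d1[k]` and an integer partial sum `sumi_block[k]`.

```c
/* Buggy pattern: the float product is added into an int accumulator.
 * `sumi += f` is evaluated as `sumi = (int)((float)sumi + f)`, so any
 * per-step contribution with magnitude below 1 is truncated away. */
static float dot_buggy(float d0, const float *d1, const int *sumi_block, int n) {
    int sumi = 0;
    for (int k = 0; k < n; k++) {
        sumi += d1[k] * sumi_block[k];   /* silent float-to-int truncation */
    }
    return d0 * (float)sumi;
}

/* Corrected pattern: keep the scaled partial sums in a float accumulator. */
static float dot_fixed(float d0, const float *d1, const int *sumi_block, int n) {
    float sumf = 0.0f;
    for (int k = 0; k < n; k++) {
        sumf += d0 * d1[k] * (float)sumi_block[k];
    }
    return sumf;
}
```

With `d0 = 2.0f`, every `d1[k] = 1.0f / 128` and every `sumi_block[k] = 64` over four sub-blocks, each product is 0.5, so `dot_buggy` returns 0 while `dot_fixed` returns 4.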
Changes:
- Fix generic scalar kernel (quants.c): accumulate `d0 * d1 * sumi`
  into float, matching the working ARM scalar fallback pattern
- Replace x86 scalar-only kernel with three-tier implementation:
  1. AVX-512 VNNI (BW+VL+VNNI): uses mask registers for
     single-instruction bit expansion + VPDPBUSD for dot product
  2. AVX2: shuffle-based bit expansion + sign_epi8 multiply
  3. Scalar fallback: corrected accumulation
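
The tiers listed above can be sketched for the scalar and AVX2 cases as follows. This is an illustration under stated assumptions, not the code from this commit: it assumes a set bit encodes +1 and a clear bit encodes -1, that bits are stored LSB-first within each byte, and that a 128-weight block is paired with four 32-element int8 sub-blocks that each have their own scale. All function names (`dot32_scalar`, `dot32_avx2`, `dot32`, `dot_g128`) are hypothetical. The AVX-512 VNNI tier is omitted.

```c
#include <stdint.h>
#include <string.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Scalar reference: 32 one-bit weights (assumed: set = +1, clear = -1,
 * LSB-first per byte) dotted with 32 int8 activations. */
static int dot32_scalar(const uint8_t *bits, const int8_t *y) {
    int sum = 0;
    for (int j = 0; j < 32; j++) {
        int bit = (bits[j / 8] >> (j % 8)) & 1;
        sum += bit ? y[j] : -y[j];
    }
    return sum;
}

#if defined(__AVX2__)
/* AVX2 sketch: shuffle-based bit expansion, then a sign_epi8 "multiply".
 * Assumes activations lie in [-127, 127]; -128 cannot be negated in int8. */
static int dot32_avx2(const uint8_t *bits, const int8_t *y) {
    uint32_t x32;
    memcpy(&x32, bits, sizeof x32);
    /* Output byte j receives source byte j / 8 of the broadcast word. */
    const __m256i shuf = _mm256_set_epi64x(0x0303030303030303LL, 0x0202020202020202LL,
                                           0x0101010101010101LL, 0x0000000000000000LL);
    __m256i v = _mm256_shuffle_epi8(_mm256_set1_epi32((int)x32), shuf);
    /* OR with the complement of (1 << (j % 8)): a byte becomes 0xFF iff its bit is set. */
    v = _mm256_or_si256(v, _mm256_set1_epi64x(0x7fbfdfeff7fbfdfeLL));
    const __m256i set = _mm256_cmpeq_epi8(v, _mm256_set1_epi8(-1));
    /* sign = +1 where the bit is set, -1 where it is clear. */
    const __m256i sign = _mm256_or_si256(_mm256_andnot_si256(set, _mm256_set1_epi8(-1)),
                                         _mm256_set1_epi8(1));
    const __m256i yv = _mm256_loadu_si256((const __m256i *)y);
    const __m256i sy = _mm256_sign_epi8(yv, sign);  /* negate y where sign < 0 */
    /* Widen and reduce: int8 -> int16 pair sums -> int32 -> one scalar. */
    const __m256i s16 = _mm256_maddubs_epi16(_mm256_set1_epi8(1), sy);
    const __m256i s32 = _mm256_madd_epi16(s16, _mm256_set1_epi16(1));
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(s32),
                              _mm256_extracti128_si256(s32, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1));
    return _mm_cvtsi128_si32(s);
}
#endif

/* Compile-time tier selection; falls back to scalar without AVX2. */
static int dot32(const uint8_t *bits, const int8_t *y) {
#if defined(__AVX2__)
    return dot32_avx2(bits, y);
#else
    return dot32_scalar(bits, y);
#endif
}

/* One 128-weight block: integer dot per 32-element sub-block, scaled and
 * accumulated in float (the corrected accumulation pattern). */
static float dot_g128(float d0, const uint8_t *bits, const float *d1, const int8_t *y) {
    float sumf = 0.0f;
    for (int k = 0; k < 4; k++) {
        sumf += d0 * d1[k] * (float)dot32(bits + 4 * k, y + 32 * k);
    }
    return sumf;
}
```

Keeping a scalar reference next to the SIMD path makes it cheap to check that each tier produces the same integer sums before any scaling is applied.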
Benchmarks on AMD EPYC (Zen 4, 12 vCPU shared):
  Before (broken): garbage output at ~0.5 tok/s
  Scalar fix:      correct output at ~3 tok/s
  AVX2:            correct output at ~28 tok/s
  AVX-512 VNNI:    correct output at ~50 tok/s (1.7B model)

1 parent 1179bfc
2 files changed: +111 -50 lines changed