fix: Q1_0_g128 x86 CPU kernel — float truncation + AVX2 vectorization #7

wildcattrio wants to merge 1 commit into PrismML-Eng:prism
Conversation
The Q1_0_g128 x86 kernel has two bugs causing gibberish output at 0.25 tok/s on Intel CPUs:

1. **Float-to-int truncation:** the per-block accumulator was `int`, truncating `d1 * sumi_block` (float * int → float → int). Each Q8_0 block's scale factor was rounded to 0 or ±1, destroying the output. Fix: a `float block_sum` accumulator.
2. **No SIMD:** the x86 path was scalar-only while ARM NEON had full vectorization. Added AVX2 using the same broadcast/shuffle/cmpeq pattern from the existing Q1_0 kernel plus `mul_sum_i8_pairs_float`.

Results on i5-1135G7 with Bonsai 8B:

| Build | Speed | Output |
|---|---|---|
| Before (MSVC) | 0.25 tok/s | gibberish |
| Bug fix only | 3.7 tok/s | correct |
| Bug fix + AVX2 | 6.9 tok/s | correct |

Both the x86-specific kernel (`arch/x86/quants.c`) and the generic fallback (`quants.c`) are fixed.
This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I have removed the buggy x86 path and will merge one of the correct AVX ones. Could you run the KL divergence tests described here: #8
Running the 8B on an i5 box with this PR, I get consistent After swapping out I get consistent The alternative code below on an i5 Broadwell gives: The 0.0002 KLD seems persistent on AVX2 across different basic implementations.
The SSE path improves from 0.1 tps to 0.7~0.9 tps on an N2840 Atom. AI suggested that meaningful acceleration for 1bitnet on CPUs lacking AVX instructions could only be achieved by implementing the dot product as
KL Divergence Results — x86 AVX2 (PR #7)

Ran the KL divergence tests from PR #8 on the AVX2 kernel fix from this PR.

Hardware: Intel i5-1135G7 (Tiger Lake), 32GB RAM, Windows 11.
Build:
System info:

Test setup
x86 AVX2 Divergences
Comparison with PR #8 reference (ARM NEON / generic scalar)
The AVX2 kernel shows measurably higher divergence compared to the NEON/scalar reference. The likely cause is floating-point operation ordering: our AVX2 path pre-multiplies the scale into each partial sum. Note: @zcattacz's XOR+SUB approach posted above uses the two-level accumulation pattern. Output quality is still good despite the divergence — text generation is coherent and the PPL difference is only 0.057 (24.09 vs 24.04).
Hi @wildcattrio, I updated the implementation and here is the combined result. You were right: xor+sub gives slightly better KLD with good tps. I also tried other implementations for tps; the best were on par, but this is the simplest.
Pull request overview

Fixes incorrect output and improves performance for the Q1_0_g128 × Q8_0 x86 CPU dot-product kernel by correcting float accumulation and adding an AVX2 vectorized implementation aligned with existing bit-expansion patterns in the x86 quant kernels.

Changes:
- Fix float-to-int truncation by switching per-block accumulation to `float` in the generic kernel and x86 scalar fallback.
- Add an AVX2 implementation for `ggml_vec_dot_q1_0_g128_q8_0` using broadcast/shuffle/bit-test expansion and `mul_sum_i8_pairs_float()`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `ggml/src/ggml-cpu/quants.c` | Fixes scalar generic accumulation type to prevent truncation and incorrect results. |
| `ggml/src/ggml-cpu/arch/x86/quants.c` | Adds AVX2 vectorized path and fixes scalar fallback accumulation type for x86. |
```c
const __m256i qy = _mm256_loadu_si256((const __m256i *)yb->qs);

// Get 4 bytes of bits for this Q8_0 block
const uint32_t bits32 = *(const uint32_t *)&x[ib].qs[k * 4];
```
`bits32` is loaded via a `uint32_t *` cast from `x[ib].qs` (`*(const uint32_t *)&x[ib].qs[k * 4]`), which can violate strict-aliasing rules and may be unaligned. Prefer copying into a local `uint32_t` with `memcpy` (similar to `bytes_from_bits_32()` earlier in this file) to safely preserve the bit pattern under optimization.

Suggested change:

```c
uint32_t bits32;
memcpy(&bits32, &x[ib].qs[k * 4], sizeof(bits32));
```
Good news: our first CPU PR just got merged into the llama.cpp master branch. If you are still working on this, please rebase with PrismML's master (it just pulled in the main llama.cpp): https://github.com/PrismML-Eng/llama.cpp/tree/master Changes: the Q1_0_g128 naming is gone now; the original Q1_0 with group size 32 was deleted, and Q1_0_g128 was renamed to Q1_0, which now has group size 128 by default. That branch only has the generic CPU path (slow) and the ARM NEON path; planning to gather the best x86 kernels from here and send a PR there (and tag all the contributors).
There are a lot of CPU PRs; planning to gather them all in one and then send to the main llama.cpp.
Summary

The Q1_0_g128 x86 CPU kernel produces gibberish output at 0.25 tok/s on Intel CPUs. Two bugs:

Bug 1: Float-to-int truncation (causes gibberish)

The per-block accumulator is `int`, but `d1 * sumi_block` produces a `float`. The implicit cast truncates every Q8_0 block's scale factor to 0 or ±1, destroying the output.

Bug 2: No SIMD (causes 0.25 tok/s)

The x86 kernel is scalar-only while the ARM NEON version has full vectorization. Added AVX2 using the same broadcast → shuffle → cmpeq → `mul_sum_i8_pairs_float` pattern from the existing `ggml_vec_dot_q1_0_q8_0` kernel.

Results (i5-1135G7, 32GB, Bonsai 8B)

| Build | Speed | Output |
|---|---|---|
| Before (MSVC) | 0.25 tok/s | gibberish (", with is it. and the. and the.... in.........") |
| Bug fix only | 3.7 tok/s | correct |
| Bug fix + AVX2 | 6.9 tok/s | correct |

Files changed

- `ggml/src/ggml-cpu/arch/x86/quants.c` — AVX2 kernel + scalar fix
- `ggml/src/ggml-cpu/quants.c` — generic scalar fallback fix

Test plan