ggml/2_QuantizationSystem.md

The quantization flow in GGML is organized around three layers:

- **reference packing** in `src/ggml-quants.c`
- **format metadata and conversion hooks** through the core type-traits table
- **backend-specific execution hooks** through the CPU trait table, especially `from_float` and `vec_dot`

## Vector Dot Products

The most performance-critical quantized operation in GGML is the **direct dot product on packed quantized blocks**. Instead of dequantizing both inputs to FP32 first, GGML uses per-format CPU kernels through the `vec_dot` hook in `struct ggml_type_traits_cpu`. This is the mechanism that lets formats such as `Q4_0`, `Q4_K`, `IQ4_NL`, `MXFP4`, and `TQ1_0` participate in fast low-level arithmetic on packed data.

### `ggml_vec_dot_t`

The CPU dot-product callback type is declared in `include/ggml-cpu.h` as:

```c
typedef void (*ggml_vec_dot_t)(
    int          n,
    float      * GGML_RESTRICT s,
    size_t       bs,
    const void * GGML_RESTRICT x,
    size_t       bx,
    const void * GGML_RESTRICT y,
    size_t       by,
    int          nrc
);
```

This interface takes the logical element count `n`, writes the result to `s`, and accepts explicit byte strides (`bs` for the output, `bx` and `by` for the two inputs). The final parameter, `nrc`, is the number of rows to compute; it is part of the CPU trait interface that lets kernels process multiple rows per call.
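
To make the packed-block idea concrete, here is a minimal scalar sketch in the spirit of the `Q8_0 × Q8_0` reference kernel. It is illustrative only: real GGML kernels store the per-block scale as FP16 (`ggml_half`) and use SIMD, while this sketch uses a plain `float` scale and a simplified signature.

```c
#include <stdint.h>

#define QK8_0 32                  // elements per Q8_0 block

typedef struct {
    float  d;                     // per-block scale (FP16 in real GGML)
    int8_t qs[QK8_0];             // 32 signed 8-bit quantized values
} block_q8_0_sketch;

// Scalar dot product over packed blocks: integer multiply-accumulate
// inside each block, one float multiply per block to apply the scales.
static void vec_dot_q8_0_sketch(int n, float *s,
                                const block_q8_0_sketch *x,
                                const block_q8_0_sketch *y) {
    const int nb = n / QK8_0;     // number of blocks
    float sumf = 0.0f;
    for (int i = 0; i < nb; ++i) {
        int sumi = 0;
        for (int j = 0; j < QK8_0; ++j) {
            sumi += (int)x[i].qs[j] * (int)y[i].qs[j];
        }
        sumf += (float)sumi * x[i].d * y[i].d;   // rescale once per block
    }
    *s = sumf;
}
```

The point to notice is that the inner loop never leaves 8-bit integer arithmetic; dequantization collapses to a single scale multiplication per 32-element block.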

### Common Type Pairings

The CPU type-traits table in `src/ggml-cpu/ggml-cpu.c` records the preferred dot-product partner for each quantized format in its `vec_dot_type` field (a lookup sketch follows the list). Common pairings include:

- `Q4_0 × Q8_0` via `ggml_vec_dot_q4_0_q8_0`
- `Q4_1 × Q8_1` via `ggml_vec_dot_q4_1_q8_1`
- `Q4_K × Q8_K` via `ggml_vec_dot_q4_K_q8_K`
- `Q5_K × Q8_K` via `ggml_vec_dot_q5_K_q8_K`
- `Q6_K × Q8_K` via `ggml_vec_dot_q6_K_q8_K`
- `IQ4_NL × Q8_0` via `ggml_vec_dot_iq4_nl_q8_0`
- `IQ4_XS × Q8_K` via `ggml_vec_dot_iq4_xs_q8_K`
- `MXFP4 × Q8_0` via `ggml_vec_dot_mxfp4_q8_0`
- `NVFP4 × Q8_0` via `ggml_vec_dot_nvfp4_q8_0`
- `TQ1_0 × Q8_K` via `ggml_vec_dot_tq1_0_q8_K`
- `TQ2_0 × Q8_K` via `ggml_vec_dot_tq2_0_q8_K`
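
As a usage sketch, the pairing can be queried at runtime through the CPU trait accessor. This assumes the `ggml_get_type_traits_cpu` accessor and the `vec_dot`/`vec_dot_type` fields described in this section; error handling is omitted.

```c
#include "ggml.h"
#include "ggml-cpu.h"

// Look up which partner type and kernel the CPU traits table
// assigns to Q4_0.
void show_q4_0_pairing(void) {
    const struct ggml_type_traits_cpu *traits =
        ggml_get_type_traits_cpu(GGML_TYPE_Q4_0);

    enum ggml_type partner = traits->vec_dot_type;  // GGML_TYPE_Q8_0 for Q4_0
    ggml_vec_dot_t kernel  = traits->vec_dot;       // ggml_vec_dot_q4_0_q8_0

    (void)partner; (void)kernel;                    // silence unused warnings
}
```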

### Why the Second Operand Uses `Q8_*`

The second operand is commonly stored or converted into a higher-precision quantized partner such as `Q8_0`, `Q8_1`, or `Q8_K`; that pattern is visible directly in the `vec_dot_type` field of the CPU traits table. The idea is to keep one operand (typically the weights) in the compact low-bit storage format while quantizing the other (typically the activations) to a more accurate 8-bit format for the multiply-accumulate path.
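
For intuition, here is how a `Q8_0`-style block for the partner operand is produced. This mirrors the per-block scheme of the reference `quantize_row_q8_0` (scale `d = max|x| / 127`, then `q_j = round(x_j / d)`), again simplified to a plain `float` scale.

```c
#include <math.h>
#include <stdint.h>

#define QK8_0 32

// Quantize one 32-element block of floats to int8 values plus a scale.
static void quantize_block_q8_0_sketch(const float *x, int8_t *qs, float *d) {
    float amax = 0.0f;                       // absolute maximum in the block
    for (int j = 0; j < QK8_0; ++j) {
        const float ax = fabsf(x[j]);
        if (ax > amax) amax = ax;
    }
    *d = amax / 127.0f;
    const float id = (*d != 0.0f) ? 1.0f / *d : 0.0f;
    for (int j = 0; j < QK8_0; ++j) {
        qs[j] = (int8_t)roundf(x[j] * id);   // values land in [-127, 127]
    }
}
```

With 8 bits per value and a fresh scale every 32 elements, the partner side adds comparatively little error on top of the low-bit weight format.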

### Role in the Quantization Pipeline

This direct `vec_dot` path is what makes GGML quantization practical for inference workloads. The flow is usually:
1. store weights in a compact quantized format such as `Q4_0`, `Q4_K`, or an `IQ*` format
2. convert the other operand to the backend’s preferred partner type (`vec_dot_type`)
3. run the dot product directly on packed blocks through the format-specific `vec_dot` kernel

This avoids a full dequantize-to-FP32 step before every inner-product computation.
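
Putting the pieces together, the flow above can be sketched for a single row pair. This is a simplified illustration, not GGML's actual matmul code: it assumes the `ggml_get_type_traits_cpu` accessor, the `from_float`/`vec_dot` hooks discussed earlier, and `ggml_row_size` for buffer sizing, and it ignores threading and alignment.

```c
#include <stdlib.h>
#include "ggml.h"
#include "ggml-cpu.h"

// Dot one packed Q4_0 weight row against a float activation row:
// (1) quantize the activations to the partner type (Q8_0 for Q4_0),
// (2) call the format-specific packed-block kernel.
float dot_row_q4_0(const void *w_row, const float *act, int n) {
    const struct ggml_type_traits_cpu *wt =
        ggml_get_type_traits_cpu(GGML_TYPE_Q4_0);
    const struct ggml_type_traits_cpu *pt =
        ggml_get_type_traits_cpu(wt->vec_dot_type);

    // Scratch buffer for the activations in the partner format.
    void *q_act = malloc(ggml_row_size(wt->vec_dot_type, n));
    pt->from_float(act, q_act, n);                 // float -> Q8_0 blocks

    float s = 0.0f;
    wt->vec_dot(n, &s, 0, w_row, 0, q_act, 0, 1);  // dot on packed blocks

    free(q_act);
    return s;
}
```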