Commit dcb8e2f

docs(ggml): Vector Dot Products
1 parent 3f2c958 commit dcb8e2f

1 file changed: ggml/2_QuantizationSystem.md (53 additions, 0 deletions)
@@ -460,3 +460,56 @@ The quantization flow in GGML is organized around three layers:

- **reference packing** in `src/ggml-quants.c`
- **format metadata and conversion hooks** through the core type-traits table
- **backend-specific execution hooks** through the CPU trait table, especially `from_float` and `vec_dot`

## Vector Dot Products

The most performance-critical quantized operation in GGML is the **direct dot product on packed quantized blocks**. Instead of dequantizing both inputs to FP32 first, GGML uses per-format CPU kernels through the `vec_dot` hook in `struct ggml_type_traits_cpu`. This is the mechanism that lets formats such as `Q4_0`, `Q4_K`, `IQ4_NL`, `MXFP4`, and `TQ1_0` participate in fast low-level arithmetic on packed data.

### `ggml_vec_dot_t`

The CPU dot-product callback type is declared in `include/ggml-cpu.h` as:
```c
typedef void (*ggml_vec_dot_t)(
    int n,
    float * GGML_RESTRICT s,
    size_t bs,
    const void * GGML_RESTRICT x,
    size_t bx,
    const void * GGML_RESTRICT y,
    size_t by,
    int nrc
);
```

This interface takes the logical element count `n`, writes the result to `s`, and accepts explicit byte strides for the output (`bs`) and both inputs (`bx`, `by`). The final parameter, `nrc`, is the number of result rows a single call computes; it exists so that kernels which process multiple rows together can plug into the same CPU trait interface, and the strides only come into play when `nrc` is greater than one.
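
To make the packed-block arithmetic concrete, here is a scalar sketch in the spirit of the `Q4_0 × Q8_0` reference kernel. The block layouts are simplified (plain `float` scales instead of the FP16 scales the real formats store) and the helper name is illustrative, so treat this as a sketch of the access pattern rather than the exact ggml code:

```c
#include <stdint.h>

#define QK 32  // block size shared by Q4_0 and Q8_0

// Simplified block layouts (the real ggml structs store FP16 scales).
typedef struct { float d; uint8_t qs[QK / 2]; } blk_q4_0; // 32 4-bit values, two per byte
typedef struct { float d; int8_t  qs[QK];     } blk_q8_0; // 32 8-bit values

// Scalar sketch of a Q4_0 x Q8_0 dot product over n elements (n % QK == 0),
// corresponding to a ggml_vec_dot_t call with nrc == 1.
static void vec_dot_q4_0_q8_0_sketch(int n, float * s,
                                     const blk_q4_0 * x, const blk_q8_0 * y) {
    float sumf = 0.0f;
    for (int i = 0; i < n / QK; i++) {
        int sumi = 0;
        for (int j = 0; j < QK / 2; j++) {
            // Each byte packs two 4-bit values stored with an offset of 8:
            // the low nibble is element j, the high nibble element j + QK/2.
            const int v0 = (x[i].qs[j] & 0x0F) - 8;
            const int v1 = (x[i].qs[j] >> 4) - 8;
            sumi += v0 * y[i].qs[j] + v1 * y[i].qs[j + QK / 2];
        }
        // One multiplication by the two block scales per 32-element block.
        sumf += (float) sumi * x[i].d * y[i].d;
    }
    *s = sumf;
}
```

The shape is the same in the real kernels: integer multiply-accumulate inside each block, then a single scale multiplication per block, which is exactly the loop the SIMD implementations vectorize.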
### Common Type Pairings

The CPU type-traits table in `src/ggml-cpu/ggml-cpu.c` records the preferred dot-product partner for each quantized format in its `vec_dot_type` field (a lookup sketch follows the list). Common pairings include:
- `Q4_0 × Q8_0` via `ggml_vec_dot_q4_0_q8_0`
- `Q4_1 × Q8_1` via `ggml_vec_dot_q4_1_q8_1`
- `Q4_K × Q8_K` via `ggml_vec_dot_q4_K_q8_K`
- `Q5_K × Q8_K` via `ggml_vec_dot_q5_K_q8_K`
- `Q6_K × Q8_K` via `ggml_vec_dot_q6_K_q8_K`
- `IQ4_NL × Q8_0` via `ggml_vec_dot_iq4_nl_q8_0`
- `IQ4_XS × Q8_K` via `ggml_vec_dot_iq4_xs_q8_K`
- `MXFP4 × Q8_0` via `ggml_vec_dot_mxfp4_q8_0`
- `NVFP4 × Q8_0` via `ggml_vec_dot_nvfp4_q8_0`
- `TQ1_0 × Q8_K` via `ggml_vec_dot_tq1_0_q8_K`
- `TQ2_0 × Q8_K` via `ggml_vec_dot_tq2_0_q8_K`
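
These pairings can also be read programmatically. A minimal sketch, assuming a recent ggml tree where `ggml_get_type_traits_cpu` is exported from `ggml-cpu.h`:

```c
#include <stdio.h>
#include "ggml.h"
#include "ggml-cpu.h"

int main(void) {
    // Ask the CPU trait table which partner format Q4_0 expects for vec_dot.
    const struct ggml_type_traits_cpu * tt = ggml_get_type_traits_cpu(GGML_TYPE_Q4_0);

    // Prints "q8_0", matching the first pairing in the list above;
    // tt->vec_dot is the corresponding kernel (ggml_vec_dot_q4_0_q8_0).
    printf("Q4_0 pairs with %s\n", ggml_type_name(tt->vec_dot_type));
    return 0;
}
```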
### Why the Second Operand Uses `Q8_*`
The second operand is commonly stored or converted into a higher-precision quantized partner such as `Q8_0`, `Q8_1`, or `Q8_K`; that pattern is visible directly in the `vec_dot_type` field of the CPU traits table. The idea is to keep one operand in the compact low-bit storage format while using a more accurate partner format for the multiply-accumulate path. In practice the weights stay in the low-bit format, while the FP32 activations are requantized to the `Q8_*` partner on the fly (through the partner type's `from_float` hook) just before the matrix multiplication runs.
### Role in the Quantization Pipeline
This direct `vec_dot` path is what makes GGML quantization practical for inference workloads. The flow is usually:
1. store weights in a compact quantized format such as `Q4_0`, `Q4_K`, or an `IQ*` format
2. use the backend’s preferred partner type for the other operand
3. run the dot product directly on packed blocks through the format-specific `vec_dot` kernel
This avoids a full dequantize-to-FP32 step before every inner-product computation.
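
Putting those three steps together through the trait table looks roughly like the following. The helper name `quantized_row_dot` and its scratch-buffer convention are hypothetical, not ggml API; the trait accessors are the ones discussed above:

```c
#include "ggml.h"
#include "ggml-cpu.h"

// Hypothetical helper: dot one packed weight row against an FP32 activation
// row, following steps 1-3. The caller sizes `scratch` with
// ggml_row_size(partner_type, n); n must be a multiple of the block size.
static float quantized_row_dot(enum ggml_type wtype,
                               const void  * w_row,   // step 1: packed weights (e.g. Q4_0)
                               const float * act,     // FP32 activations
                               void        * scratch, // receives the quantized activations
                               int n) {
    const struct ggml_type_traits_cpu * wt = ggml_get_type_traits_cpu(wtype);

    // Step 2: requantize the activations into the preferred partner format.
    const struct ggml_type_traits_cpu * pt = ggml_get_type_traits_cpu(wt->vec_dot_type);
    pt->from_float(act, scratch, n);

    // Step 3: run the dot product directly on the packed blocks.
    // The strides are irrelevant for a single-row call, so they are passed as 0.
    float result;
    wt->vec_dot(n, &result, 0, w_row, 0, scratch, 0, /*nrc=*/1);
    return result;
}
```

Broadly speaking, the real `mul_mat` path amortizes step 2 further: the activation tensor is converted into a `vec_dot_type` buffer once and then reused across every weight row, rather than being requantized per dot product.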
