Commit cf9543a — docs(ggml): I-Quants
1 parent a5e9eec

1 file changed: ggml/2_QuantizationSystem.md (55 additions, 0 deletions)

A useful way to read this family is:

- `Q2_K`, `Q4_K`, and `Q5_K` are **affine** formats, where weights are reconstructed as `x = a*q + b`.
- `Q3_K` and `Q6_K` are **scale-only** formats, where weights are reconstructed as `x = a*q` (both reconstruction styles are sketched below).
- `Q8_K` is marked in the code as a helper format for **intermediate quantization and dot products**, not as a general-purpose model storage format.
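
As a minimal illustration of the two reconstruction styles, here are two one-liners; these are illustrative sketches, not the actual ggml dequantization kernels:

```c
#include <stdint.h>

// Affine reconstruction (Q2_K / Q4_K / Q5_K style): scale plus offset.
static inline float dequant_affine(float a, float b, int8_t q) {
    return a * (float)q + b;
}

// Scale-only reconstruction (Q3_K / Q6_K style): no offset term.
static inline float dequant_scale_only(float a, int8_t q) {
    return a * (float)q;
}
```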

### I-Quants (Importance-Oriented Quantization)

- `src/ggml-common.h`
- `src/ggml-quants.c`
- `src/ggml.c`

The IQ family uses grid-based and lookup-based encodings to push compression below the classic `Q*_K` formats. In `src/ggml-common.h`, these formats are defined as `IQ1_S`, `IQ1_M`, `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, `IQ3_S`, `IQ4_NL`, and `IQ4_XS`, each with its own packed block layout. In `src/ggml-quants.c`, their quantizers use precomputed grids, lookup tables, neighbor searches, or non-linear codebooks rather than only uniform linear buckets.

| Type      | Effective bits / value | Packed structure                                                                    | Quantization style                         |
| --------- | ---------------------: | ----------------------------------------------------------------------------------- | ------------------------------------------ |
| `IQ1_S`   |                  1.5625 | `ggml_half d` + `qs[QK_K/8]` + `qh[QK_K/32]`                                        | 1-bit grid / indexed encoding              |
| `IQ1_M`   |                    1.75 | `qs[QK_K/8]` + `qh[QK_K/16]` + `scales[QK_K/32]`                                    | extended 1-bit grid encoding               |
| `IQ2_XXS` |                  2.0625 | `ggml_half d` + `uint16_t qs[QK_K/8]`                                               | compact 2-bit lookup-grid format           |
| `IQ2_XS`  |                  2.3125 | `ggml_half d` + `uint16_t qs[QK_K/8]` + `scales[QK_K/32]`                           | 2-bit lookup-grid format with extra scales |
| `IQ2_S`   |                  2.5625 | `ggml_half d` + `qs[QK_K/4]` + `qh[QK_K/32]` + `scales[QK_K/32]`                    | signed 2-bit grid format                   |
| `IQ3_XXS` |                  3.0625 | `ggml_half d` + `qs[3*QK_K/8]`                                                      | compact 3-bit lookup-grid format           |
| `IQ3_S`   |                  3.4375 | `ggml_half d` + `qs[QK_K/4]` + `qh[QK_K/32]` + `signs[QK_K/8]` + `scales[QK_K/64]`  | enhanced 3-bit grid format                 |
| `IQ4_NL`  |                     4.5 | `ggml_half d` + `qs[QK4_NL/2]`                                                      | non-linear 4-bit codebook                  |
| `IQ4_XS`  |                    4.25 | `ggml_half d` + `scales_h` + `scales_l[QK_K/64]` + `qs[QK_K/2]`                     | super-block non-linear 4-bit format        |

The effective bits-per-value numbers above come directly from the packed block sizes in `src/ggml-common.h`. For example, `block_iq2_xxs` is documented as “(Almost) true 2-bit quantization” but the block layout adds one FP16 scale per 256-value block, so the effective cost is 2.0625 bits per weight rather than exactly 2.0. The same pattern appears for `IQ3_XXS`, which is documented as “(Almost) true 3-bit quantization” and packs to 3.0625 bits per weight.
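
These figures are easy to recompute from the struct layouts. Here is a minimal sketch, assuming `QK_K = 256` and a 2-byte `ggml_half`, that rederives two of the table's bits-per-value numbers from raw byte counts:

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define QK_K 256              // super-block size used by most IQ formats
#define SIZEOF_GGML_HALF 2    // ggml_half is a 16-bit float

int main(void) {
    // block_iq2_xxs: ggml_half d + uint16_t qs[QK_K/8]  ->  2 + 64 = 66 bytes
    size_t iq2_xxs_bytes = SIZEOF_GGML_HALF + sizeof(uint16_t) * (QK_K / 8);
    // block_iq3_xxs: ggml_half d + uint8_t qs[3*QK_K/8] ->  2 + 96 = 98 bytes
    size_t iq3_xxs_bytes = SIZEOF_GGML_HALF + sizeof(uint8_t) * (3 * QK_K / 8);

    printf("IQ2_XXS: %.4f bits/value\n", 8.0 * iq2_xxs_bytes / QK_K); // 2.0625
    printf("IQ3_XXS: %.4f bits/value\n", 8.0 * iq3_xxs_bytes / QK_K); // 3.0625
    return 0;
}
```
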
#### Grid and Lookup Structure

The low-bit IQ formats are implemented around precomputed grids and maps in `src/ggml-quants.c`:

- `IQ2_XXS`, `IQ2_XS`, `IQ1_S`, `IQ1_M`, and `IQ2_S` share helpers such as `iq2_data_index(...)` and `iq2_grid_size(...)`, and select different grids such as `kgrid_2bit_256`, `kgrid_2bit_512`, `kgrid_1bit_2048`, and `kgrid_2bit_1024`.
- `IQ3_XXS` and `IQ3_S` use `iq3_data[...]` tables together with maps and neighbor tables to snap local blocks to valid 3-bit grid points.
- `IQ4_NL` and `IQ4_XS` use the non-linear value table `kvalues_iq4nl` through `quantize_row_iq4_nl_impl(...)`.

This is why the IQ family is better described as **grid-coded quantization** rather than plain linear quantization. The quantizer is not just scaling and rounding into evenly spaced buckets; it is selecting encodings from structured low-bit codebooks.
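
To make "grid-coded" concrete, here is a minimal, self-contained sketch of snapping a group of values onto the nearest entry of a small codebook. The toy grid `kgrid_toy`, its size, and the group width are invented for illustration; the real grids (`kgrid_2bit_256` and friends) are far larger and are searched with precomputed maps and neighbor tables in `src/ggml-quants.c`.

```c
#include <float.h>
#include <stddef.h>
#include <stdint.h>

// Toy 4-entry grid of 4-value groups (illustrative only; real IQ grids
// hold hundreds to thousands of entries).
static const int8_t kgrid_toy[4][4] = {
    { 1,  1,  1,  1},
    { 1, -1,  1, -1},
    {-1,  1, -1,  1},
    {-1, -1, -1, -1},
};

// Snap a group of 4 scaled values to the grid point with the smallest
// (optionally importance-weighted) squared error, returning its index.
static size_t snap_to_grid(const float x[4], const float w[4]) {
    size_t best = 0;
    float best_err = FLT_MAX;
    for (size_t g = 0; g < 4; ++g) {
        float err = 0.0f;
        for (int i = 0; i < 4; ++i) {
            float diff = x[i] - (float)kgrid_toy[g][i];
            err += (w ? w[i] : 1.0f) * diff * diff;
        }
        if (err < best_err) { best_err = err; best = g; }
    }
    return best; // the group is encoded as one small index, not per-value codes
}
```
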
#### Importance Weights

At the GGML layer, the IQ quantizers accept an optional weighting input through a trailing `quant_weights` parameter:

```c
size_t quantize_iq2_xxs(
        const float * src,            // input values, row-major
        void        * dst,            // output buffer of packed IQ2_XXS blocks
        int64_t       nrow,           // number of rows to quantize
        int64_t       n_per_row,      // values per row
        const float * quant_weights); // optional importance weights; may be NULL
```

The same `const float * quant_weights` parameter is used by `quantize_iq2_xs`, `quantize_iq2_s`, `quantize_iq3_xxs`, `quantize_iq3_s`, `quantize_iq1_s`, `quantize_iq1_m`, `quantize_iq4_nl`, and `quantize_iq4_xs`. Inside the quantizers, when `quant_weights` is present, it influences the per-element error weighting; when it is `NULL`, the routines fall back to internally derived weights such as `x[i] * x[i]` or related magnitude-based heuristics.

So, in GGML terms, IQ quantization supports **importance-weighted quantization**, but the API is phrased in terms of `quant_weights` rather than a hard-coded `imatrix` requirement. Some IQ formats also expose only dequantization in the generic type-traits table: for example, `IQ2_XXS`, `IQ2_XS`, `IQ1_S`, and `IQ1_M` have `from_float_ref = NULL`, while `IQ3_XXS`, `IQ3_S`, `IQ2_S`, `IQ4_NL`, and `IQ4_XS` register reference quantizers.
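
As a sketch of what the weighting actually changes, assume a simple round-to-nearest quantizer whose block scale is fitted by weighted least squares. The helper `fit_scale` below is hypothetical, and the real IQ quantizers combine this kind of fit with grid searches and sign handling, but the `NULL` fallback mirrors the behavior described above:

```c
#include <stdint.h>

// Fit a scale d so that d*q[i] approximates x[i], minimizing
//     sum_i w[i] * (x[i] - d*q[i])^2,
// whose closed-form solution is d = sum(w*x*q) / sum(w*q*q).
// When quant_weights is NULL, fall back to the magnitude heuristic
// w[i] = x[i]*x[i], as described for the GGML quantizers above.
static float fit_scale(const float * x, const int8_t * q, int n,
                       const float * quant_weights) {
    float sumqx = 0.0f, sumq2 = 0.0f;
    for (int i = 0; i < n; ++i) {
        float w = quant_weights ? quant_weights[i] : x[i] * x[i];
        sumqx += w * x[i] * (float)q[i];
        sumq2 += w * (float)q[i] * (float)q[i];
    }
    return sumq2 > 0.0f ? sumqx / sumq2 : 0.0f;
}
```
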
#### Notes

- `IQ4_NL` is called a “non-linear” 4-bit format, but its packed storage cost is 4.5 bits per value because each 32-value block also stores one FP16 scale (see the sketch after this list).
- `IQ4_XS` uses a 256-value super-block and stores both high and low parts of the scale metadata, which is why its effective storage cost is 4.25 bits per value.
- The IQ family mixes two block granularities: most IQ formats use `QK_K = 256`, while `IQ4_NL` uses `QK4_NL = 32`.
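
To illustrate the non-linear codebook idea from the first note, here is a minimal sketch of IQ4_NL-style dequantization: one table lookup plus one multiply per value, instead of a uniform `a*q + b` mapping. The 16-entry table below uses placeholder values; the actual table is `kvalues_iq4nl` in `src/ggml-common.h`.

```c
#include <stdint.h>

// Placeholder 16-entry non-linear codebook (illustrative values only;
// the real table is kvalues_iq4nl in src/ggml-common.h).
static const int8_t kvalues_toy[16] = {
    -127, -100, -75, -55, -40, -28, -18,  -8,
       1,   10,  22,  36,  52,  70,  90, 113,
};

// Reconstruct one weight from a 4-bit code q (0..15) and the block scale d.
// Note the unevenly spaced levels, unlike a uniform linear grid.
static inline float dequant_iq4nl_like(float d, uint8_t q) {
    return d * (float)kvalues_toy[q & 0x0F];
}
```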
