A useful way to read this family is:

- `Q2_K`, `Q4_K`, and `Q5_K` are **affine** formats, where weights are reconstructed as `x = a*q + b`.
- `Q3_K` and `Q6_K` are **scale-only** formats, where weights are reconstructed as `x = a*q`.
- `Q8_K` is marked in the code as a helper format for **intermediate quantization and dot products**, not as a general-purpose model storage format.

### I-Quants (Importance-Oriented Quantization)

- `src/ggml-common.h`
- `src/ggml-quants.c`
- `src/ggml.c`

The IQ family uses grid-based and lookup-based encodings to push compression below the classic `Q*_K` formats. In `src/ggml-common.h`, these formats are defined as `IQ1_S`, `IQ1_M`, `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, `IQ3_S`, `IQ4_NL`, and `IQ4_XS`, each with its own packed block layout. In `src/ggml-quants.c`, their quantizers use precomputed grids, lookup tables, neighbor searches, or non-linear codebooks rather than only uniform linear buckets.

| Type | Effective bits / value | Packed structure | Quantization style |

The effective bits-per-value numbers above come directly from the packed block sizes in `src/ggml-common.h`. For example, `block_iq2_xxs` is documented as “(Almost) true 2-bit quantization”, but the block layout adds one FP16 scale per 256-value block, so the effective cost is 2.0625 bits per weight rather than exactly 2.0. The same pattern appears for `IQ3_XXS`, which is documented as “(Almost) true 3-bit quantization” and packs to 3.0625 bits per weight.
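These effective-bit figures can be checked by hand from the block layouts. A minimal sketch of the arithmetic (the helper function is illustrative, not part of ggml; the payload and metadata sizes follow the block descriptions above):

```c
/* Effective cost = (payload bits + per-block metadata bits) / values.
 * Illustrative helper, not a ggml API. */
static double effective_bits(int payload_bits_per_value,
                             int values_per_block,
                             int metadata_bits_per_block) {
    int total = payload_bits_per_value * values_per_block
              + metadata_bits_per_block;
    return (double)total / values_per_block;
}
```

For `IQ2_XXS`, 2 bits × 256 values plus one 16-bit FP16 scale gives 528 / 256 = 2.0625 bits per weight, matching the number quoted above.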

#### Grid and Lookup Structure

The low-bit IQ formats are implemented around precomputed grids and maps in `src/ggml-quants.c`:
- `IQ2_XXS`, `IQ2_XS`, `IQ1_S`, `IQ1_M`, and `IQ2_S` share helpers such as `iq2_data_index(...)` and `iq2_grid_size(...)`, and select different grids such as `kgrid_2bit_256`, `kgrid_2bit_512`, `kgrid_1bit_2048`, and `kgrid_2bit_1024`.
- `IQ3_XXS` and `IQ3_S` use `iq3_data[...]` tables together with maps and neighbor tables to snap local blocks to valid 3-bit grid points.
- `IQ4_NL` and `IQ4_XS` use the non-linear value table `kvalues_iq4nl` through `quantize_row_iq4_nl_impl(...)`.
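The "snap to a grid point" step can be sketched as a nearest-codeword search. The 4-entry codebook below is invented for illustration; real grids such as `kgrid_2bit_256` contain hundreds of packed entries and the real search uses maps and neighbor tables rather than a full scan:

```c
/* Toy version of grid-coded quantization: a whole group of D values is
 * snapped to the closest codebook entry, and only the entry's index is
 * stored. Codebook is made up for illustration. */
#define D 4

static const float toy_grid[4][D] = {
    {  1.f,  1.f,  1.f,  1.f },
    {  1.f, -1.f,  1.f, -1.f },
    { -1.f, -1.f, -1.f, -1.f },
    {  1.f,  1.f, -1.f, -1.f },
};

static int snap_to_grid(const float * x) {
    int   best     = 0;
    float best_err = 1e30f;
    for (int g = 0; g < 4; ++g) {
        float err = 0.f;
        for (int i = 0; i < D; ++i) {
            float d = x[i] - toy_grid[g][i];
            err += d * d;
        }
        if (err < best_err) { best_err = err; best = g; }
    }
    return best; /* the packed block stores this index, not the raw values */
}
```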

This is why the IQ family is better described as **grid-coded quantization** than as plain linear quantization. The quantizer is not just scaling and rounding into evenly spaced buckets; it selects encodings from structured low-bit codebooks.

#### Importance Weights

At the GGML layer, the IQ quantizers accept an optional weighting input through this parameter:

```c
size_t quantize_iq2_xxs(
    const float * src,
    void        * dst,
    int64_t       nrow,
    int64_t       n_per_row,
    const float * quant_weights);
```

The same `const float * quant_weights` parameter is used by `quantize_iq2_xs`, `quantize_iq2_s`, `quantize_iq3_xxs`, `quantize_iq3_s`, `quantize_iq1_s`, `quantize_iq1_m`, `quantize_iq4_nl`, and `quantize_iq4_xs`. Inside the quantizers, when `quant_weights` is present, it influences the per-element error weighting; when it is `NULL`, the routines fall back to internally derived weights such as `x[i] * x[i]` or related magnitude-based heuristics.
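The fallback behaviour can be sketched as follows. Only the weighting rule matches the heuristic mentioned above; the surrounding search loop in ggml is far more involved, so this helper is a simplified stand-in, not the actual quantizer:

```c
#include <stddef.h>

/* Simplified stand-in for the per-element error weighting inside the IQ
 * quantizers: with quant_weights present it scales the squared error,
 * and with quant_weights == NULL it falls back to the magnitude-based
 * heuristic w[i] = x[i]*x[i]. */
static float weighted_sq_error(const float * x, const float * xq, int n,
                               const float * quant_weights) {
    float err = 0.f;
    for (int i = 0; i < n; ++i) {
        float w = quant_weights ? quant_weights[i] : x[i] * x[i];
        float d = x[i] - xq[i];
        err += w * d * d;
    }
    return err;
}
```

With weights supplied, an error on a high-importance element costs more than the same error on a low-importance one, which is exactly what lets calibration data steer where precision is spent.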

So, in GGML terms, IQ quantization supports **importance-weighted quantization**, but the API is phrased in terms of `quant_weights` rather than a hard-coded `imatrix` requirement. Some IQ formats also expose only dequantization in the generic type-traits table: for example, `IQ2_XXS`, `IQ2_XS`, `IQ1_S`, and `IQ1_M` have `from_float_ref = NULL`, while `IQ3_XXS`, `IQ3_S`, `IQ2_S`, `IQ4_NL`, and `IQ4_XS` register reference quantizers.
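The consequence of a `NULL` entry can be sketched with a stripped-down mirror of the type-traits pattern. The struct, table, and helper below are illustrative stand-ins, not the real ggml definitions; only the idea that a `NULL` `from_float_ref` means "dequantize-only through the generic table" comes from the source:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified traits table. */
typedef void (*from_float_fn)(const float * src, void * dst, int64_t n);

static void toy_quantize_ref(const float * src, void * dst, int64_t n) {
    (void)src; (void)dst; (void)n; /* placeholder body */
}

struct toy_type_traits {
    const char    * name;
    from_float_fn   from_float_ref; /* NULL => no generic reference quantizer */
};

static const struct toy_type_traits toy_traits[] = {
    { "IQ2_XXS", NULL },             /* dequantize-only in the generic table */
    { "IQ4_NL",  toy_quantize_ref }, /* registers a reference quantizer */
};

/* Callers must check for NULL before dispatching through the table. */
static int has_generic_quantizer(const char * name) {
    for (size_t i = 0; i < sizeof(toy_traits) / sizeof(toy_traits[0]); ++i)
        if (strcmp(toy_traits[i].name, name) == 0)
            return toy_traits[i].from_float_ref != NULL;
    return 0;
}
```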

#### Notes

- `IQ4_NL` is called a “non-linear” 4-bit format, but its packed storage cost is 4.5 bits per value because each 32-value block also stores one FP16 scale.
- `IQ4_XS` uses a 256-value super-block and stores both high and low parts of its scale metadata, which is why its effective storage cost is 4.25 bits per value.
- The IQ family mixes two block granularities: most IQ formats use `QK_K = 256`, while `IQ4_NL` uses `QK4_NL = 32`.
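The 4.5 and 4.25 figures in the notes can be reproduced from toy packed layouts. Field widths below follow the block contents described in the notes; the real ggml structs use `ggml_half` and explicit packing pragmas, so these are illustrative approximations:

```c
#include <stdint.h>

/* Toy layouts: 18 bytes per 32 values and 136 bytes per 256 values
 * reproduce the 4.5 and 4.25 bits-per-value figures quoted above. */
struct toy_iq4_nl {
    uint16_t d;           /* one FP16 scale per 32-value block */
    uint8_t  qs[32 / 2];  /* 32 packed 4-bit values */
};

struct toy_iq4_xs {
    uint16_t d;            /* FP16 super-block scale */
    uint16_t scales_h;     /* high bits of sub-block scales */
    uint8_t  scales_l[4];  /* low bits of sub-block scales */
    uint8_t  qs[256 / 2];  /* 256 packed 4-bit values */
};
```

Dividing total block bits by values per block: 18 × 8 / 32 = 4.5 and 136 × 8 / 256 = 4.25.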