ColBERT-style late-interaction MaxSim scoring for multi-vector retrieval — pure Rust, zero dependencies, builds anywhere offline.
Most dense retrievers pool a whole query or document into a single vector and compare them with one dot product. That early compression is fast but lossy: token-level structure — which word matched which — is gone before scoring even begins. Late interaction keeps one embedding per token and defers the comparison to scoring time, recovering that fidelity.
The scoring rule is MaxSim. Given a query Q and document D, each represented as a bag of
per-token vectors:
score(Q, D) = Σ_i max_j sim(q_i, d_j)
For each query token q_i, find its single best-matching document token (the max over the
document's tokens), then sum those maxima across the query. Every query term independently
"finds its evidence" anywhere in the document, and the score measures how completely the
document covers the query, term by term. Because it is a sum over query tokens, the score is
unbounded above and only comparable across documents for a fixed query — which is exactly what
ranking needs.
This per-token matching is why late-interaction reranking is consistently more faithful than cosine over pooled embeddings: a single pooled vector cannot represent "this passage nails term A but is weak on term B," whereas MaxSim can. The approach was introduced by ColBERT (Khattab & Zaharia, SIGIR 2020) and later extended to images by ColPali (Faysse et al., 2024), which scores document-image patch embeddings against query tokens with the same MaxSim aggregate.
use maxsim::{max_sim, max_sim_normalized, l2_normalize_tokens, rank};
// Each token is a vector; a query or document is a sequence of token vectors.
let query = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
let docs = vec![
vec![vec![1.0, 0.0]], // doc 0: covers one query token
vec![vec![1.0, 0.0], vec![0.0, 1.0]], // doc 1: covers both
];
// Score a single document.
let s = max_sim(&query, &docs[1]);
// Rank a whole corpus — returns (doc_index, score), best first.
let ranking = rank(&query, &docs);
assert_eq!(ranking[0].0, 1);
// Fast path for pre-normalized embeddings: cosine reduces to a dot product,
// so normalize once and reuse.
let qn = l2_normalize_tokens(&query);
let dn: Vec<_> = docs.iter().map(|d| l2_normalize_tokens(d)).collect();
let s_fast = max_sim_normalized(&qn, &dn[1]);| Function | Purpose |
|---|---|
cosine(a, b) |
Cosine similarity between two &[f32]; safe on zero-norm and length mismatch. |
max_sim(query, doc) |
MaxSim using cosine as the per-pair similarity. |
max_sim_normalized(query, doc) |
MaxSim using the dot product — the fast path for L2-normalized inputs. |
l2_normalize_tokens(tokens) |
L2-normalize each token vector for use with max_sim_normalized. |
rank(query, docs) |
Rank documents by descending MaxSim; returns Vec<(index, score)>. |
Fast-path tradeoff. When embeddings are already L2-normalized (the usual ColBERT setup),
cosine(a, b) == dot(a, b), so max_sim_normalized skips recomputing norms for every
(query_token, doc_token) pair. The caller owns that invariant: feed it un-normalized vectors
and you get a raw dot-product aggregate, not a cosine MaxSim. Normalize once with
l2_normalize_tokens and reuse the result across many scorings.
- Empty query or empty document → score
0.0. - Mismatched per-token dimensions → that pair contributes
0.0(treated as orthogonal), so one malformed token can't poison the whole aggregate. - Zero-norm token → cosine
0.0against everything (no division by zero). - Empty corpus →
rankreturns an emptyVec; ties preserve input order (stable sort).
cargo build
cargo test # unit tests + doctests
cargo run --example rerank # toy reranking demoNo network and no dependencies are required — the crate is pure std.
MIT. See LICENSE. Copyright (c) 2026 Max Baluev.