Skip to content

maxbaluev/maxsim-rs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

maxsim

ColBERT-style late-interaction MaxSim scoring for multi-vector retrieval — pure Rust, zero dependencies, builds anywhere offline.

Late interaction in 60 seconds

Most dense retrievers pool a whole query or document into a single vector and compare them with one dot product. That early compression is fast but lossy: token-level structure — which word matched which — is gone before scoring even begins. Late interaction keeps one embedding per token and defers the comparison to scoring time, recovering that fidelity.

The scoring rule is MaxSim. Given a query Q and document D, each represented as a bag of per-token vectors:

score(Q, D) = Σ_i  max_j  sim(q_i, d_j)

For each query token q_i, find its single best-matching document token (the max over the document's tokens), then sum those maxima across the query. Every query term independently "finds its evidence" anywhere in the document, and the score measures how completely the document covers the query, term by term. Because it is a sum over query tokens, the score is unbounded above and only comparable across documents for a fixed query — which is exactly what ranking needs.

This per-token matching is why late-interaction reranking is consistently more faithful than cosine over pooled embeddings: a single pooled vector cannot represent "this passage nails term A but is weak on term B," whereas MaxSim can. The approach was introduced by ColBERT (Khattab & Zaharia, SIGIR 2020) and later extended to images by ColPali (Faysse et al., 2024), which scores document-image patch embeddings against query tokens with the same MaxSim aggregate.

Usage

use maxsim::{max_sim, max_sim_normalized, l2_normalize_tokens, rank};

// Each token is a vector; a query or document is a sequence of token vectors.
let query = vec![vec![1.0, 0.0], vec![0.0, 1.0]];

let docs = vec![
    vec![vec![1.0, 0.0]],                  // doc 0: covers one query token
    vec![vec![1.0, 0.0], vec![0.0, 1.0]],  // doc 1: covers both
];

// Score a single document.
let s = max_sim(&query, &docs[1]);

// Rank a whole corpus — returns (doc_index, score), best first.
let ranking = rank(&query, &docs);
assert_eq!(ranking[0].0, 1);

// Fast path for pre-normalized embeddings: cosine reduces to a dot product,
// so normalize once and reuse.
let qn = l2_normalize_tokens(&query);
let dn: Vec<_> = docs.iter().map(|d| l2_normalize_tokens(d)).collect();
let s_fast = max_sim_normalized(&qn, &dn[1]);

API

Function Purpose
cosine(a, b) Cosine similarity between two &[f32]; safe on zero-norm and length mismatch.
max_sim(query, doc) MaxSim using cosine as the per-pair similarity.
max_sim_normalized(query, doc) MaxSim using the dot product — the fast path for L2-normalized inputs.
l2_normalize_tokens(tokens) L2-normalize each token vector for use with max_sim_normalized.
rank(query, docs) Rank documents by descending MaxSim; returns Vec<(index, score)>.

Fast-path tradeoff. When embeddings are already L2-normalized (the usual ColBERT setup), cosine(a, b) == dot(a, b), so max_sim_normalized skips recomputing norms for every (query_token, doc_token) pair. The caller owns that invariant: feed it un-normalized vectors and you get a raw dot-product aggregate, not a cosine MaxSim. Normalize once with l2_normalize_tokens and reuse the result across many scorings.

Edge-case policy (never panics)

  • Empty query or empty document → score 0.0.
  • Mismatched per-token dimensions → that pair contributes 0.0 (treated as orthogonal), so one malformed token can't poison the whole aggregate.
  • Zero-norm token → cosine 0.0 against everything (no division by zero).
  • Empty corpus → rank returns an empty Vec; ties preserve input order (stable sort).

Build and test

cargo build
cargo test                 # unit tests + doctests
cargo run --example rerank # toy reranking demo

No network and no dependencies are required — the crate is pure std.

License

MIT. See LICENSE. Copyright (c) 2026 Max Baluev.

About

ColBERT-style late-interaction MaxSim scoring for multi-vector retrieval — tiny, zero-dependency Rust crate

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages