perf(vector)!: add dedicated SIMD kernels for RaBitQ ex-code reranking by BubbleCal · Pull Request #7205 · lance-format/lance

BubbleCal · 2026-06-10T14:06:01Z

Problem

Multi-bit RaBitQ (num_bits >= 2) reranking computes sum_d query[d] * ex_code[d] per candidate.
Before this change it had two slow paths:

- The top-k heap pruning path called compute_single_rq_ex_distance per candidate: scalar
  bit extraction per dimension plus a gather from a dim * 2^ex_bits f32 table (1 MiB at
  ex_bits=8, dim=1024, which is cache hostile).
- FastScan only covered ex_bits in {2, 4, 8} and re-quantized the ex table to u8, adding
  quantization error. For ex_bits in {1, 3, 5, 6, 7} everything went through the scalar path.
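
For readers unfamiliar with the old path, here is a minimal sketch of a table-gather ex distance over a sequential LSB-first packed row. It is not the Lance implementation: the function names (extract_code, build_table, table_gather_dot, table_bytes) are hypothetical, and the table contents (query[d] * code) are an assumption for illustration.

```rust
/// Read the `ex_bits`-wide code for dimension `d` from a sequential,
/// LSB-first packed row. Codes may straddle byte boundaries, so this
/// walks bit by bit (hypothetical helper, assumed layout).
fn extract_code(packed: &[u8], d: usize, ex_bits: usize) -> usize {
    let mut code = 0usize;
    for b in 0..ex_bits {
        let bit = d * ex_bits + b;
        let v = (packed[bit / 8] >> (bit % 8)) & 1;
        code |= (v as usize) << b;
    }
    code
}

/// Build the per-query table: table[d * 2^ex_bits + c] = query[d] * c.
fn build_table(query: &[f32], ex_bits: usize) -> Vec<f32> {
    let levels = 1usize << ex_bits;
    let mut table = Vec::with_capacity(query.len() * levels);
    for &q in query {
        for c in 0..levels {
            table.push(q * c as f32);
        }
    }
    table
}

/// Table-gather distance: one scalar bit extraction and one table
/// lookup per dimension.
fn table_gather_dot(table: &[f32], packed: &[u8], dim: usize, ex_bits: usize) -> f32 {
    let levels = 1usize << ex_bits;
    (0..dim)
        .map(|d| table[d * levels + extract_code(packed, d, ex_bits)])
        .sum()
}

/// Size in bytes of the f32 table for a given dim and ex_bits.
fn table_bytes(dim: usize, ex_bits: usize) -> usize {
    dim * (1usize << ex_bits) * std::mem::size_of::<f32>()
}
```

With these definitions, table_bytes(1024, 8) is 1 MiB, which matches the figure quoted above.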

Change

Following the kernel design of the RaBitQ reference library
(VectorDB-NTU/RaBitQ-Library, Apache-2.0),
this adds dedicated inner-product kernels (vector/bq/ex_dot.rs) that unpack packed ex codes
with shifts/masks and FMA them against the f32 query directly — no LUT, no extra quantization
error:

- All ex_bits 1..=8, with scalar, AVX2, AVX-512, and NEON variants (runtime dispatch,
  resolved once per dist calculator).
- ex_bits in {1, 2, 4, 8}: the sequential LSB-first rows are byte-aligned and consumed
  as stored.
- ex_bits in {3, 5, 6, 7}: rows are repacked once at load time into bit-plane layouts
  (3 = 2+1, 5 = 4+1, 6 = 6|2, 7 = 6|2+1 per 64-dim group) that unpack with byte-wise shifts.
- The rotated query is permuted once per search into the kernels' unpack order
  (build_ex_query_into) and zero-padded to a multiple of 64, so kernels read both sides
  sequentially with no per-candidate shuffles. Arbitrary dims are supported.
- The quantized ex dist table is now only built for the FastScan widths {2, 4, 8}; the other
  widths skip the table entirely (e.g. saves a 768 KiB table build per query at ex_bits=7,
  dim=1536).
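
As a rough illustration of the "unpack with shifts/masks and FMA against the f32 query" idea for the byte-aligned widths, here is a scalar sketch. The names ex_dot_aligned and pad_to_64 are hypothetical, the real kernels live in vector/bq/ex_dot.rs and use SIMD, and the query permutation step (build_ex_query_into) is not modeled here.

```rust
/// Dot product of an f32 query against a sequential LSB-first packed row,
/// for the byte-aligned widths only (ex_bits in {1, 2, 4, 8}).
/// Scalar sketch: no table, no re-quantization of the query.
fn ex_dot_aligned(query: &[f32], packed: &[u8], ex_bits: u32) -> f32 {
    assert!(matches!(ex_bits, 1 | 2 | 4 | 8), "byte-aligned widths only");
    let per_byte = (8 / ex_bits) as usize;
    let mask = ((1u16 << ex_bits) - 1) as u8;
    assert!(query.len() <= packed.len() * per_byte, "packed row too short");
    let mut acc = 0.0f32;
    // Each byte holds `per_byte` codes; code j sits at bit offset j * ex_bits.
    for (chunk, &byte) in query.chunks(per_byte).zip(packed) {
        for (j, &q) in chunk.iter().enumerate() {
            let code = (byte >> (j as u32 * ex_bits)) & mask;
            // Fused multiply-add: acc = q * code + acc.
            acc = q.mul_add(code as f32, acc);
        }
    }
    acc
}

/// Zero-pad a query to the next multiple of 64 so a kernel can always
/// process whole 64-dim groups; padded lanes contribute nothing.
fn pad_to_64(query: &[f32]) -> Vec<f32> {
    let padded = query.len().div_ceil(64) * 64;
    let mut out = query.to_vec();
    out.resize(padded, 0.0);
    out
}
```

Because the codes are multiplied against the original f32 query values, the only rounding comes from f32 accumulation, not from an intermediate u8 table.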

The on-disk index format is unchanged. The bit-plane repack is an in-memory derived copy
(mirroring the existing packed_ex_codes precedent); old indexes load as-is and new writes
are byte-identical to before.
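
To make the "3 = 2+1" bit-plane idea concrete, here is a sketch of one possible in-memory layout for a 64-dim group of 3-bit codes. The exact plane order and byte positions are assumptions made for illustration (the description above only states the split), and pack_3bit_group, unpack_3bit_group and ex_dot_3bit_group are hypothetical names.

```rust
const GROUP: usize = 64;

/// Assumed bit-plane layout for one 64-dim group of 3-bit codes:
/// bytes 0..16 hold the low 2 bits (4 codes per byte),
/// bytes 16..24 hold the high bit (8 codes per byte).
fn pack_3bit_group(codes: &[u8; GROUP]) -> [u8; 24] {
    let mut out = [0u8; 24];
    for (d, &c) in codes.iter().enumerate() {
        debug_assert!(c < 8, "codes must fit in 3 bits");
        out[d / 4] |= (c & 0b11) << ((d % 4) * 2);
        out[16 + d / 8] |= ((c >> 2) & 1) << (d % 8);
    }
    out
}

/// Inverse of `pack_3bit_group`: both planes unpack with byte-wise
/// shifts and masks, with no code straddling a byte boundary.
fn unpack_3bit_group(packed: &[u8; 24]) -> [u8; GROUP] {
    let mut codes = [0u8; GROUP];
    for (d, code) in codes.iter_mut().enumerate() {
        let low = (packed[d / 4] >> ((d % 4) * 2)) & 0b11;
        let high = (packed[16 + d / 8] >> (d % 8)) & 1;
        *code = low | (high << 2);
    }
    codes
}

/// Scalar dot product of a 64-dim query group against the packed group.
fn ex_dot_3bit_group(query: &[f32; GROUP], packed: &[u8; 24]) -> f32 {
    let codes = unpack_3bit_group(packed);
    query
        .iter()
        .zip(codes.iter())
        .fold(0.0f32, |acc, (&q, &c)| q.mul_add(c as f32, acc))
}
```

The packed group is still 24 bytes (64 * 3 / 8), so a derived copy in this style costs the same memory as the sequential row it was built from.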

Benchmarks

Per-candidate ex rerank, 1024 rows, cargo bench --bench rq -- ex_dot (kernel vs the previous
table-gather on identical data).

AVX-512 (GCP c3-standard-8, Sapphire Rapids), dim=1024:

ex_bits	kernel	table-gather	speedup
1	110.9 µs	1.543 ms	13.9x
2	60.5 µs	1.556 ms	25.7x
3	91.5 µs	1.554 ms	17.0x
4	74.1 µs	1.639 ms	22.1x
5	87.8 µs	1.651 ms	18.8x
6	72.6 µs	1.654 ms	22.8x
7	101.6 µs	1.708 ms	16.8x
8	60.9 µs	1.757 ms	28.9x

AVX-512, production embedding dims (bit-plane widths):

ex_bits	dim	kernel	table-gather	speedup
3	1536	172.6 µs	2.464 ms	14.3x
5	1536	187.9 µs	2.534 ms	13.5x
3	2048	234.8 µs	3.264 ms	13.9x
5	2048	232.6 µs	3.377 ms	14.5x

NEON (Apple M-series), dim=1024:

ex_bits	kernel	table-gather	speedup
1	90.0 µs	787 µs	8.7x
2	70.5 µs	775 µs	11.0x
3	81.1 µs	777 µs	9.6x
4	63.9 µs	784 µs	12.3x
5	78.3 µs	802 µs	10.2x
6	71.9 µs	880 µs	12.2x
7	85.4 µs	887 µs	10.4x
8	62.2 µs	888 µs	14.3x

The gather baseline runs with a hot LUT here; on real queries the large tables are cold, so the
gap is larger in practice.

Testing

- Differential tests: every kernel variant (scalar/AVX2/AVX-512/NEON) x ex_bits 1..=8 x dims
  {7, 16, 60, 64, 100, 128, 1024, 1536, 2048} against an f64 reference, plus a dense dim sweep
  (1..=160 and 255/256/1000/1536/2048) for the bit-plane widths 3 and 5, plus pack/unpack
  round-trips. Run on both aarch64 (NEON) and Sapphire Rapids (AVX2 + AVX-512 paths exercised).
- Storage-level reference test across all widths (num_bits 2..=9) at dims 72 and 1536,
  covering the exact per-candidate path and the bulk path.
- remap test extended to the bit-plane widths (num_bits 4/6/8) with a remap == fresh-reload
  equivalence assertion; end-to-end IVF_RQ build+search+recall tests extended to num_bits 4
  and 6 (no-FastScan widths).
- Full lance-index (487) and lance IVF suites pass on both architectures; the bq suite also
  passes under AddressSanitizer on x86_64 (no OOB in the unsafe SIMD paths).
- cargo fmt --all and cargo clippy --all --tests --benches -- -D warnings are clean.
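
A differential test of the kind listed above can be sketched as follows. This is not the PR's test code: kernel_under_test is a scalar stand-in for whichever variant is being checked, the xorshift generator exists only to avoid external crates, and the error scaling is an assumption about how one might set the tolerance.

```rust
/// Tiny deterministic xorshift64 generator (seed must be nonzero).
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform f32 in [-1, 1), built from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        ((self.next_u64() >> 40) as f32 / (1u64 << 24) as f32) * 2.0 - 1.0
    }
}

/// f64 reference: the ground truth every kernel is compared against.
fn reference_dot_f64(query: &[f32], codes: &[u8]) -> f64 {
    query.iter().zip(codes).map(|(&q, &c)| q as f64 * c as f64).sum()
}

/// Stand-in for the kernel under test (scalar f32 multiply-add loop).
fn kernel_under_test(query: &[f32], codes: &[u8]) -> f32 {
    query
        .iter()
        .zip(codes)
        .fold(0.0f32, |acc, (&q, &c)| q.mul_add(c as f32, acc))
}

/// Worst error over `dims`, relative to the sum of |q * c| (floored at 1)
/// so that cancellation in the dot product does not inflate the ratio.
fn max_relative_error(ex_bits: u32, dims: &[usize], seed: u64) -> f64 {
    let mut rng = XorShift(seed);
    let mut worst = 0.0f64;
    for &dim in dims {
        let query: Vec<f32> = (0..dim).map(|_| rng.next_f32()).collect();
        let codes: Vec<u8> = (0..dim)
            .map(|_| (rng.next_u64() % (1u64 << ex_bits)) as u8)
            .collect();
        let expected = reference_dot_f64(&query, &codes);
        let got = kernel_under_test(&query, &codes) as f64;
        let scale = query
            .iter()
            .zip(&codes)
            .map(|(&q, &c)| (q as f64 * c as f64).abs())
            .sum::<f64>()
            .max(1.0);
        worst = worst.max((got - expected).abs() / scale);
    }
    worst
}
```

Odd dims such as 7, 60 and 100 matter because they exercise the tail handling that a 64-wide SIMD loop would otherwise skip.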

Breaking changes

On-disk format: none. Rust API only (same pattern as #7179): RabitDistCalculator::new gains
an ex_query parameter and RabitRawQueryContext gains an ex_query field;
RabitRawQueryContext::ex_dist_table is now empty for the non-FastScan widths. Python/Java
bindings are unaffected.

🤖 Generated with Claude Code

Replace the per-candidate table-gather ex distance (dim * 2^ex_bits f32 LUT) with direct f32-query x packed-code FMA kernels for all ex_bits 1..=8, with scalar/AVX2/AVX-512/NEON variants. ex_bits {1,2,4,8} consume the sequential packed rows as stored; {3,5,6,7} are repacked once at load into bit-plane layouts. On-disk index format is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

codecov · 2026-06-10T14:46:40Z

Codecov Report

❌ Patch coverage is 88.19227% with 113 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-index/src/vector/bq/ex_dot.rs	82.13%	78 Missing and 4 partials ⚠️
rust/lance-index/src/vector/bq/storage.rs	93.42%	20 Missing and 10 partials ⚠️
...lance-index/src/vector/distributed/index_merger.rs	93.75%	0 Missing and 1 partial ⚠️

wjones127

This is marked as a breaking change. I didn't see an API change in the Lance crate? Am I missing something?

wjones127 · 2026-06-10T17:57:18Z

+        if std::arch::is_x86_feature_detected!("avx512f") {
+            return x86::avx512_kernel(ex_bits);
+        }
+        if std::arch::is_x86_feature_detected!("avx2") && std::arch::is_x86_feature_detected!("fma")
+        {
+            return x86::avx2_kernel(ex_bits);
+        }


praise: I love the dynamic dispatch 😍
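
For context, the dispatch pattern in the quoted hunk can be sketched in a self-contained form as below. This is an illustration, not the PR's code: ExDotFn, select_kernel and DistCalculator are hypothetical names, and every tier points at the same scalar stand-in so that the sketch compiles and runs on any target.

```rust
/// Signature shared by all kernel variants in this sketch.
type ExDotFn = fn(&[f32], &[u8]) -> f32;

/// Scalar stand-in used for every tier here; a real build would return
/// a different function per instruction set.
fn scalar_kernel(query: &[f32], codes: &[u8]) -> f32 {
    query
        .iter()
        .zip(codes)
        .fold(0.0f32, |acc, (&q, &c)| q.mul_add(c as f32, acc))
}

/// Probe CPU features at runtime and pick the widest available tier.
fn select_kernel() -> (&'static str, ExDotFn) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return ("avx512", scalar_kernel);
        }
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            return ("avx2", scalar_kernel);
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return ("neon", scalar_kernel);
        }
    }
    ("scalar", scalar_kernel)
}

/// Hypothetical calculator that resolves the kernel once at construction,
/// so the per-candidate call is a plain function-pointer call.
struct DistCalculator {
    kernel_name: &'static str,
    kernel: ExDotFn,
}

impl DistCalculator {
    fn new() -> Self {
        let (kernel_name, kernel) = select_kernel();
        Self { kernel_name, kernel }
    }

    fn ex_dot(&self, query: &[f32], codes: &[u8]) -> f32 {
        (self.kernel)(query, codes)
    }
}
```

Resolving the function pointer once keeps the feature probes out of the per-candidate loop.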

BubbleCal · 2026-06-11T04:44:56Z

> This is marked as a breaking change. I didn't see an API change in the Lance crate? Am I missing something?

Oops, I have one more commit that changes the ex-bits layout, but it won't affect existing num_bits=1 IVF_RQ indices.

Replace the sequential LSB-first ex-code layout with a blocked layout (64-dim blocks, bit-interleaved so the SIMD unpack emits natural dim order) written under a new column (__blocked_ex_codes):

- The ex-dot kernels read the rotated query directly (no per-query permutation) and consume rows as stored (no load-time repack, no second resident copy).
- Legacy indexes remain readable: sequential rows are repacked at load and the batch is normalized, so rewrites (remap, optimize merges) emit the blocked format. Older lance versions fail loudly on new multi-bit indexes (missing-column error) instead of misreading them.
- num_bits=1 indexes carry no ex codes and are unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The FastScan ex bulk path is only reachable when lower-bound gating is disabled (legacy indexes without error factors); gated indexes rerank per candidate with the ex-dot kernels. Build its artifacts accordingly:

- Compute the u8 ex LUT directly from the rotated query (the per-dim table is the pure multiplication q[d] * code), removing the dim * 2^ex_bits f32 table from the query context, the calculator, and the scratch layout entirely. This also speeds the bypass path itself up by 3-6%.
- Skip the FastScan transpose (packed_ex_codes) when error factors are present, saving one resident copy of the ex codes and the per-load transpose for every fresh index. Bulk scoring on gated indexes falls through to the exact per-row kernels.

The LUT bulk path stays for legacy indexes: at dim=1536 it scores 2.5x (ex_bits=4) to 7x (ex_bits=2) faster per row than the kernel loop.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions Bot added A-index Vector index, linalg, tokenizer performance breaking-change labels Jun 10, 2026

wjones127 approved these changes Jun 10, 2026

BubbleCal and others added 2 commits June 11, 2026 14:24

BubbleCal wants to merge 3 commits into main from yang/rq-ex-dot-simd-kernels
