perf(vector)!: add dedicated SIMD kernels for RaBitQ ex-code reranking #7205
Open
BubbleCal wants to merge 3 commits into
Replace the per-candidate table-gather ex distance (dim * 2^ex_bits f32 LUT)
with direct f32-query x packed-code FMA kernels for all ex_bits 1..=8, with
scalar/AVX2/AVX-512/NEON variants. ex_bits {1,2,4,8} consume the sequential
packed rows as stored; {3,5,6,7} are repacked once at load into bit-plane
layouts. On-disk index format is unchanged.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
wjones127
approved these changes
Jun 10, 2026
wjones127
left a comment
Contributor
This is marked as a breaking change. I didn't see an API change in the Lance crate? Am I missing something?
Comment on lines +299 to +305
    if std::arch::is_x86_feature_detected!("avx512f") {
        return x86::avx512_kernel(ex_bits);
    }
    if std::arch::is_x86_feature_detected!("avx2") && std::arch::is_x86_feature_detected!("fma")
    {
        return x86::avx2_kernel(ex_bits);
    }
Contributor
praise: I love the dynamic dispatch 😍
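For readers unfamiliar with the pattern praised here, the following is a minimal sketch of runtime CPU-feature dispatch resolved once into a function pointer. It is not the PR's code: `DotFn`, `dot_scalar`, and `select_kernel` are hypothetical names, the kernel takes already-unpacked codes for brevity, and every branch returns the scalar stand-in where a real build would return a `#[target_feature]` SIMD variant.

```rust
/// Hypothetical kernel signature: f32 query against already-unpacked u8 codes.
/// (The kernels in the PR read packed rows; this is simplified for illustration.)
type DotFn = fn(&[f32], &[u8]) -> f32;

/// Portable fallback, also used here as a stand-in for the SIMD variants.
fn dot_scalar(query: &[f32], codes: &[u8]) -> f32 {
    query.iter().zip(codes).map(|(&q, &c)| q * f32::from(c)).sum()
}

/// Probe CPU features once and return the chosen kernel with a label.
/// Callers keep the returned pointer and reuse it for every candidate.
fn select_kernel() -> (&'static str, DotFn) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            // Assumption: a real build would return an AVX-512 kernel here.
            return ("avx512", dot_scalar as DotFn);
        }
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            // Assumption: a real build would return an AVX2+FMA kernel here.
            return ("avx2", dot_scalar as DotFn);
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // Assumption: a real build would return a NEON kernel here.
            return ("neon", dot_scalar as DotFn);
        }
    }
    ("scalar", dot_scalar as DotFn)
}
```

Resolving the pointer once keeps the feature probes out of the per-candidate loop; the widest instruction set is checked first so that it wins when several are available.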
Contributor
Author
Oops, I have one more commit to change the |
Replace the sequential LSB-first ex-code layout with a blocked layout (64-dim blocks, bit-interleaved so the SIMD unpack emits natural dim order) written under a new column (__blocked_ex_codes):
- The ex-dot kernels read the rotated query directly (no per-query permutation) and consume rows as stored (no load-time repack, no second resident copy).
- Legacy indexes remain readable: sequential rows are repacked at load and the batch is normalized, so rewrites (remap, optimize merges) emit the blocked format. Older lance versions fail loudly on new multi-bit indexes (missing-column error) instead of misreading them.
- num_bits=1 indexes carry no ex codes and are unaffected.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The FastScan ex bulk path is only reachable when lower-bound gating is disabled (legacy indexes without error factors); gated indexes rerank per candidate with the ex-dot kernels. Build its artifacts accordingly:
- Compute the u8 ex LUT directly from the rotated query (the per-dim table is the pure multiplication q[d] * code), removing the dim * 2^ex_bits f32 table from the query context, the calculator, and the scratch layout entirely. This also speeds up the bypass path itself by 3-6%.
- Skip the FastScan transpose (packed_ex_codes) when error factors are present, saving one resident copy of the ex codes and the per-load transpose for every fresh index. Bulk scoring on gated indexes falls through to the exact per-row kernels.
The LUT bulk path stays for legacy indexes: at dim=1536 it scores 2.5x (ex_bits=4) to 7x (ex_bits=2) faster per row than the kernel loop.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
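The observation that the per-dim table is the pure multiplication q[d] * code is what makes the f32 table redundant. The sketch below illustrates that point only, under simplifying assumptions: codes are already unpacked to one u8 per dimension, no scaling or offset is applied, and all function names are hypothetical rather than taken from the Lance codebase.

```rust
/// Build the per-dimension f32 table: entry [d][code] = query[d] * code.
fn build_ex_table(query: &[f32], ex_bits: u32) -> Vec<f32> {
    let levels = 1usize << ex_bits;
    let mut table = Vec::with_capacity(query.len() * levels);
    for &q in query {
        for code in 0..levels {
            table.push(q * code as f32);
        }
    }
    table
}

/// Table-gather form: one lookup per dimension into the dim * 2^ex_bits table.
fn dot_via_table(table: &[f32], codes: &[u8], ex_bits: u32) -> f32 {
    let levels = 1usize << ex_bits;
    codes
        .iter()
        .enumerate()
        .map(|(d, &c)| table[d * levels + c as usize])
        .sum()
}

/// Direct form: the same products, computed without any table.
fn dot_direct(query: &[f32], codes: &[u8]) -> f32 {
    query.iter().zip(codes).map(|(&q, &c)| q * f32::from(c)).sum()
}

/// Size in bytes of the f32 table for a given dimension and code width.
fn table_bytes(dim: usize, ex_bits: u32) -> usize {
    dim * (1usize << ex_bits) * std::mem::size_of::<f32>()
}
```

With these definitions, `table_bytes(1024, 8)` is 1 MiB and `table_bytes(1536, 7)` is 768 KiB, matching the figures quoted in the PR description.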
Problem
Multi-bit RaBitQ (num_bits >= 2) reranking computes sum_d query[d] * ex_code[d] per candidate.
Before this change it had two slow paths:
- compute_single_rq_ex_distance per candidate: scalar bit extraction per dimension plus a gather from a dim * 2^ex_bits f32 table (1 MiB at ex_bits=8, dim=1024, which is cache-hostile).
- The FastScan bulk path, which only covers ex_bits in {2, 4, 8} and re-quantizes the ex table to u8, adding quantization error. For ex_bits in {1, 3, 5, 6, 7} everything went through the scalar path.

Change
Following the kernel design of the RaBitQ reference library (VectorDB-NTU/RaBitQ-Library, Apache-2.0), this adds dedicated inner-product kernels (vector/bq/ex_dot.rs) that unpack packed ex codes with shifts/masks and FMA them against the f32 query directly, with no LUT and no extra quantization error:
- ex_bits 1..=8, with scalar, AVX2, AVX-512, and NEON variants (runtime dispatch, resolved once per dist calculator).
- ex_bits in {1, 2, 4, 8}: the sequential LSB-first rows are byte-aligned and consumed as stored.
- ex_bits in {3, 5, 6, 7}: rows are repacked once at load time into bit-plane layouts (3 = 2+1, 5 = 4+1, 6 = 6|2, 7 = 6|2+1 per 64-dim group) that unpack with byte-wise shifts.
- (build_ex_query_into) and zero-padded to a multiple of 64, so kernels read both sides sequentially with no per-candidate shuffles. Arbitrary dims are supported.
- widths skip the table entirely (e.g. saves a 768 KiB table build per query at ex_bits=7, dim=1536).

The on-disk index format is unchanged. The bit-plane repack is an in-memory derived copy (mirroring the existing packed_ex_codes precedent); old indexes load as-is and new writes are byte-identical to before.
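As a point of reference for the kernels described above, here is a minimal scalar sketch of the per-candidate quantity sum_d query[d] * ex_code[d] over a sequential LSB-first packed row. It is an assumption-laden illustration, not the code in vector/bq/ex_dot.rs: the function names are hypothetical, the layout is taken to be codes laid end to end with the lowest bit first, and no bit-plane repacking or query padding is modeled.

```rust
/// Pack `ex_bits`-wide codes end to end, lowest bit first (assumed layout).
fn pack_codes(codes: &[u8], ex_bits: u32) -> Vec<u8> {
    let mut out = vec![0u8; (codes.len() * ex_bits as usize + 7) / 8];
    for (i, &c) in codes.iter().enumerate() {
        debug_assert!((c as u32) < (1u32 << ex_bits));
        let bit = i * ex_bits as usize;
        let v = (c as u16) << (bit % 8);
        out[bit / 8] |= v as u8; // low byte of the shifted code
        if v >> 8 != 0 {
            out[bit / 8 + 1] |= (v >> 8) as u8; // spill into the next byte
        }
    }
    out
}

/// Extract the `i`-th code with a shift and a mask.
fn unpack_code(packed: &[u8], i: usize, ex_bits: u32) -> u32 {
    let bit = i * ex_bits as usize;
    let byte = bit / 8;
    let shift = (bit % 8) as u32;
    // Read two bytes so codes that straddle a byte boundary are covered;
    // shift + ex_bits is at most 15 for ex_bits in 1..=8.
    let lo = packed[byte] as u32;
    let hi = packed.get(byte + 1).copied().unwrap_or(0) as u32;
    ((lo | (hi << 8)) >> shift) & ((1u32 << ex_bits) - 1)
}

/// Scalar reference: multiply each unpacked code by the f32 query entry.
fn ex_dot_scalar(query: &[f32], packed: &[u8], ex_bits: u32) -> f32 {
    query
        .iter()
        .enumerate()
        .map(|(d, &q)| q * unpack_code(packed, d, ex_bits) as f32)
        .sum()
}
```

A reference of this shape is mainly useful as a test oracle: a SIMD variant can be compared against it across code widths and dimensions, including dimensions that are not a multiple of the block size.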
Benchmarks
Per-candidate ex rerank, 1024 rows, cargo bench --bench rq -- ex_dot (kernel vs the previous table-gather on identical data).
AVX-512 (GCP c3-standard-8, Sapphire Rapids), dim=1024:
AVX-512, production embedding dims (bit-plane widths):
NEON (Apple M-series), dim=1024:
The gather baseline runs with a hot LUT here; on real queries the large tables are cold, so the gap is larger in practice.
Testing
- ex_bits 1..=8 x dims {7, 16, 60, 64, 100, 128, 1024, 1536, 2048} against an f64 reference, plus a dense dim sweep (1..=160 and 255/256/1000/1536/2048) for the bit-plane widths 3 and 5, plus pack/unpack round-trips. Run on both aarch64 (NEON) and Sapphire Rapids (AVX2 + AVX-512 paths exercised).
- num_bits 2..=9) at dims 72 and 1536, covering the exact per-candidate path and the bulk path.
- remap test extended to the bit-plane widths (num_bits 4/6/8) with a remap == fresh-reload equivalence assertion; end-to-end IVF_RQ build+search+recall tests extended to num_bits 4 and 6 (no-FastScan widths).
- lance-index (487) and lance IVF suites pass on both architectures; the bq suite also passes under AddressSanitizer on x86_64 (no OOB in the unsafe SIMD paths).
- cargo fmt --all and cargo clippy --all --tests --benches -- -D warnings are clean.

Breaking changes
On-disk format: none. Rust API only (same pattern as #7179): RabitDistCalculator::new gains an ex_query parameter and RabitRawQueryContext gains an ex_query field; RabitRawQueryContext::ex_dist_table is now empty for the non-FastScan widths. Python/Java bindings are unaffected.
🤖 Generated with Claude Code