
perf(vector)!: add dedicated SIMD kernels for RaBitQ ex-code reranking #7205

Open
BubbleCal wants to merge 3 commits into main from yang/rq-ex-dot-simd-kernels

@BubbleCal

Contributor

Problem

Multi-bit RaBitQ (num_bits >= 2) reranking computes sum_d query[d] * ex_code[d] per candidate.
Before this change it had two slow paths:

  • The top-k heap pruning path called compute_single_rq_ex_distance per candidate: scalar
    bit extraction per dimension plus a gather from a dim * 2^ex_bits f32 table (1 MiB at
    ex_bits=8, dim=1024 — cache hostile).
  • FastScan covered only ex_bits in {2, 4, 8} and re-quantized the ex table to u8, adding
    quantization error. For ex_bits in {1, 3, 5, 6, 7}, everything went through the scalar path.
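
For readers unfamiliar with the old path, here is a minimal sketch of a table-gather ex distance, assuming a sequential LSB-first packed row. The function names (`ex_table_bytes`, `extract_code`, `build_table`, `gather_dot`) are illustrative only and are not the actual `compute_single_rq_ex_distance` implementation.

```rust
/// Size in bytes of the per-query f32 table: dim * 2^ex_bits entries.
fn ex_table_bytes(dim: usize, ex_bits: u32) -> usize {
    dim * (1usize << ex_bits) * std::mem::size_of::<f32>()
}

/// Extract the `ex_bits`-wide code of dimension `d` from a row packed
/// sequentially, LSB first (assumed layout for this sketch).
fn extract_code(packed: &[u8], d: usize, ex_bits: u32) -> u32 {
    let bit = d * ex_bits as usize;
    let byte = bit / 8;
    let shift = (bit % 8) as u32;
    // A code of at most 8 bits spans at most two bytes.
    let lo = packed[byte] as u32;
    let hi = packed.get(byte + 1).copied().unwrap_or(0) as u32;
    ((lo | (hi << 8)) >> shift) & ((1u32 << ex_bits) - 1)
}

/// Build table[d * 2^ex_bits + c] = query[d] * c for every possible code c.
fn build_table(query: &[f32], ex_bits: u32) -> Vec<f32> {
    let width = 1usize << ex_bits;
    let mut table = Vec::with_capacity(query.len() * width);
    for &q in query {
        for c in 0..width {
            table.push(q * c as f32);
        }
    }
    table
}

/// Per-candidate distance: one scalar bit extraction and one gather per dim.
fn gather_dot(table: &[f32], packed: &[u8], dim: usize, ex_bits: u32) -> f32 {
    let width = 1usize << ex_bits;
    (0..dim)
        .map(|d| table[d * width + extract_code(packed, d, ex_bits) as usize])
        .sum()
}
```

With `dim=1024` and `ex_bits=8`, `ex_table_bytes` gives 1 MiB, which matches the figure quoted above; every lookup in `gather_dot` lands in a different table row, which is why the access pattern is hard on the cache.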

Change

Following the kernel design of the RaBitQ reference library
(VectorDB-NTU/RaBitQ-Library, Apache-2.0),
this adds dedicated inner-product kernels (vector/bq/ex_dot.rs) that unpack packed ex codes
with shifts/masks and FMA them against the f32 query directly — no LUT, no extra quantization
error:

  • All ex_bits 1..=8, with scalar, AVX2, AVX-512, and NEON variants (runtime dispatch,
    resolved once per dist calculator).
  • ex_bits in {1, 2, 4, 8}: the sequential LSB-first rows are byte-aligned and consumed
    as stored.
  • ex_bits in {3, 5, 6, 7}: rows are repacked once at load time into bit-plane layouts
    (3 = 2+1, 5 = 4+1, 6 = 6|2, 7 = 6|2+1 per 64-dim group) that unpack with byte-wise shifts.
  • The rotated query is permuted once per search into the kernels' unpack order
    (build_ex_query_into) and zero-padded to a multiple of 64, so kernels read both sides
    sequentially with no per-candidate shuffles. Arbitrary dims are supported.
  • The quantized ex dist table is now only built for the FastScan widths {2, 4, 8}; the other
    widths skip the table entirely (e.g. saves a 768 KiB table build per query at ex_bits=7,
    dim=1536).
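
As a rough scalar illustration of the byte-aligned case (ex_bits in {1, 2, 4, 8}), the sketch below unpacks codes with shifts and masks and multiply-adds them against the f32 query, with the query zero-padded to a multiple of 64. `pad_query` and `ex_dot_aligned` are hypothetical names; the real kernels in ex_dot.rs are SIMD and also permute the query, which this sketch does not attempt.

```rust
/// Zero-pad a query to a multiple of 64 dims so a kernel can always
/// process whole 64-dim groups (padding contributes 0 to the sum).
fn pad_query(query: &[f32]) -> Vec<f32> {
    let padded_len = query.len().div_ceil(64) * 64;
    let mut out = query.to_vec();
    out.resize(padded_len, 0.0);
    out
}

/// Scalar dot product for byte-aligned widths, reading a sequential
/// LSB-first row as stored: no lookup table, no re-quantization.
fn ex_dot_aligned(query: &[f32], packed: &[u8], ex_bits: u32) -> f32 {
    assert!(matches!(ex_bits, 1 | 2 | 4 | 8));
    let per_byte = (8 / ex_bits) as usize;
    let mask = (1u32 << ex_bits) - 1;
    let mut acc = 0.0f32;
    // Each byte holds `per_byte` consecutive codes.
    for (&byte, qs) in packed.iter().zip(query.chunks(per_byte)) {
        for (i, &q) in qs.iter().enumerate() {
            let code = ((byte as u32) >> (i as u32 * ex_bits)) & mask;
            // Fused multiply-add: acc = q * code + acc.
            acc = q.mul_add(code as f32, acc);
        }
    }
    acc
}
```

Because both the packed row and the query are read front to back, a vectorized version of the inner loop needs no gathers, which is the property the kernels rely on.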

The on-disk index format is unchanged. The bit-plane repack is an in-memory derived copy
(mirroring the existing packed_ex_codes precedent); old indexes load as-is and new writes
are byte-identical to before.
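
To make the "3 = 2+1" idea concrete, here is one plausible bit-plane arrangement for 3-bit codes: the low two bits go into a 2-bit plane and the high bit into a 1-bit plane. This is an assumption for illustration; the actual plane order and 64-dim grouping used in ex_dot.rs are not shown in this description, and `repack_3bit`, `unpack_3bit`, and `ex_dot_3bit` are hypothetical names.

```rust
/// Split 3-bit codes into a 2-bit plane (4 codes per byte) and a
/// 1-bit plane (8 codes per byte). Both planes are byte-aligned.
fn repack_3bit(codes: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let mut low = vec![0u8; codes.len().div_ceil(4)];
    let mut high = vec![0u8; codes.len().div_ceil(8)];
    for (d, &c) in codes.iter().enumerate() {
        low[d / 4] |= (c & 0b11) << ((d % 4) * 2);
        high[d / 8] |= ((c >> 2) & 1) << (d % 8);
    }
    (low, high)
}

/// Recombine the two planes into the original 3-bit code of dim `d`.
fn unpack_3bit(low: &[u8], high: &[u8], d: usize) -> u8 {
    let lo = (low[d / 4] >> ((d % 4) * 2)) & 0b11;
    let hi = (high[d / 8] >> (d % 8)) & 1;
    lo | (hi << 2)
}

/// Scalar dot product over the repacked planes.
fn ex_dot_3bit(query: &[f32], low: &[u8], high: &[u8]) -> f32 {
    (0..query.len())
        .map(|d| query[d] * unpack_3bit(low, high, d) as f32)
        .sum()
}
```

The point of the split is that each plane unpacks with plain byte-wise shifts and masks, whereas a sequential 3-bit stream has codes that straddle byte boundaries.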

Benchmarks

Per-candidate ex rerank, 1024 rows, cargo bench --bench rq -- ex_dot (kernel vs the previous
table-gather on identical data).

AVX-512 (GCP c3-standard-8, Sapphire Rapids), dim=1024:

| ex_bits | kernel | table-gather | speedup |
|---|---|---|---|
| 1 | 110.9 µs | 1.543 ms | 13.9x |
| 2 | 60.5 µs | 1.556 ms | 25.7x |
| 3 | 91.5 µs | 1.554 ms | 17.0x |
| 4 | 74.1 µs | 1.639 ms | 22.1x |
| 5 | 87.8 µs | 1.651 ms | 18.8x |
| 6 | 72.6 µs | 1.654 ms | 22.8x |
| 7 | 101.6 µs | 1.708 ms | 16.8x |
| 8 | 60.9 µs | 1.757 ms | 28.9x |

AVX-512, production embedding dims (bit-plane widths):

| ex_bits | dim | kernel | table-gather | speedup |
|---|---|---|---|---|
| 3 | 1536 | 172.6 µs | 2.464 ms | 14.3x |
| 5 | 1536 | 187.9 µs | 2.534 ms | 13.5x |
| 3 | 2048 | 234.8 µs | 3.264 ms | 13.9x |
| 5 | 2048 | 232.6 µs | 3.377 ms | 14.5x |

NEON (Apple M-series), dim=1024:

| ex_bits | kernel | table-gather | speedup |
|---|---|---|---|
| 1 | 90.0 µs | 787 µs | 8.7x |
| 2 | 70.5 µs | 775 µs | 11.0x |
| 3 | 81.1 µs | 777 µs | 9.6x |
| 4 | 63.9 µs | 784 µs | 12.3x |
| 5 | 78.3 µs | 802 µs | 10.2x |
| 6 | 71.9 µs | 880 µs | 12.2x |
| 7 | 85.4 µs | 887 µs | 10.4x |
| 8 | 62.2 µs | 888 µs | 14.3x |

The gather baseline runs with a hot LUT here; on real queries the large tables are cold, so the
gap is larger in practice.

Testing

  • Differential tests: every kernel variant (scalar/AVX2/AVX-512/NEON) x ex_bits 1..=8 x dims
    {7, 16, 60, 64, 100, 128, 1024, 1536, 2048} against an f64 reference, plus a dense dim sweep
    (1..=160 and 255/256/1000/1536/2048) for the bit-plane widths 3 and 5, plus pack/unpack
    round-trips. Run on both aarch64 (NEON) and Sapphire Rapids (AVX2 + AVX-512 paths exercised).
  • Storage-level reference test across all widths (num_bits 2..=9) at dims 72 and 1536,
    covering the exact per-candidate path and the bulk path.
  • remap test extended to the bit-plane widths (num_bits 4/6/8) with a remap == fresh-reload
    equivalence assertion; end-to-end IVF_RQ build+search+recall tests extended to num_bits 4
    and 6 (no-FastScan widths).
  • Full lance-index (487) and lance IVF suites pass on both architectures; the bq suite also
    passes under AddressSanitizer on x86_64 (no OOB in the unsafe SIMD paths).
  • cargo fmt --all and cargo clippy --all --tests --benches -- -D warnings are clean.
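
The differential-test idea can be sketched as follows: run a kernel on deterministic pseudo-random data and compare it against a reference accumulated in f64. This is only a shape sketch under stated assumptions: `kernel_f32` is a scalar stand-in for the real SIMD variants, codes are kept one per byte rather than packed, and `Lcg` is a throwaway generator so the example needs no external crates.

```rust
/// f64 reference: sum_d query[d] * code[d], accumulated in double precision.
fn reference_dot(query: &[f32], codes: &[u8]) -> f64 {
    query.iter().zip(codes).map(|(&q, &c)| q as f64 * c as f64).sum()
}

/// Stand-in for a kernel under test (hypothetical; real variants are SIMD).
fn kernel_f32(query: &[f32], codes: &[u8]) -> f32 {
    query
        .iter()
        .zip(codes)
        .fold(0.0f32, |acc, (&q, &c)| q.mul_add(c as f32, acc))
}

/// Tiny deterministic linear congruential generator.
struct Lcg(u64);

impl Lcg {
    fn next_u32(&mut self) -> u32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 32) as u32
    }
}

/// Largest error across dims x ex_bits 1..=8, relative to the sum of
/// absolute terms (floored at 1.0 to avoid dividing by tiny values).
fn max_relative_error(dims: &[usize]) -> f64 {
    let mut rng = Lcg(42);
    let mut worst = 0.0f64;
    for &dim in dims {
        for ex_bits in 1..=8u32 {
            let query: Vec<f32> = (0..dim)
                .map(|_| rng.next_u32() as f32 / u32::MAX as f32 - 0.5)
                .collect();
            // Keep the top `ex_bits` bits so every code is < 2^ex_bits.
            let codes: Vec<u8> = (0..dim)
                .map(|_| (rng.next_u32() >> (32 - ex_bits)) as u8)
                .collect();
            let want = reference_dot(&query, &codes);
            let got = kernel_f32(&query, &codes) as f64;
            let scale = query
                .iter()
                .zip(&codes)
                .map(|(&q, &c)| (q as f64 * c as f64).abs())
                .sum::<f64>()
                .max(1.0);
            worst = worst.max((got - want).abs() / scale);
        }
    }
    worst
}
```

Odd dims such as 7, 60, and 100 matter here because they exercise the tail handling that a 64-dim-grouped kernel has to get right.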

Breaking changes

On-disk format: none. Rust API only (same pattern as #7179): RabitDistCalculator::new gains
an ex_query parameter and RabitRawQueryContext gains an ex_query field;
RabitRawQueryContext::ex_dist_table is now empty for the non-FastScan widths. Python/Java
bindings are unaffected.

🤖 Generated with Claude Code

Replace the per-candidate table-gather ex distance (dim * 2^ex_bits f32 LUT)
with direct f32-query x packed-code FMA kernels for all ex_bits 1..=8, with
scalar/AVX2/AVX-512/NEON variants. ex_bits {1,2,4,8} consume the sequential
packed rows as stored; {3,5,6,7} are repacked once at load into bit-plane
layouts. On-disk index format is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer performance breaking-change labels Jun 10, 2026
@codecov

codecov Bot commented Jun 10, 2026


Codecov Report

❌ Patch coverage is 88.19227% with 113 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/vector/bq/ex_dot.rs | 82.13% | 78 Missing and 4 partials ⚠️ |
| rust/lance-index/src/vector/bq/storage.rs | 93.42% | 20 Missing and 10 partials ⚠️ |
| ...lance-index/src/vector/distributed/index_merger.rs | 93.75% | 0 Missing and 1 partial ⚠️ |

@wjones127 wjones127 left a comment

Contributor

This is marked as a breaking change. I didn't see an API change in the Lance crate? Am I missing something?

Comment on lines +299 to +305
    if std::arch::is_x86_feature_detected!("avx512f") {
        return x86::avx512_kernel(ex_bits);
    }
    if std::arch::is_x86_feature_detected!("avx2") && std::arch::is_x86_feature_detected!("fma")
    {
        return x86::avx2_kernel(ex_bits);
    }

Contributor

praise: I love the dynamic dispatch 😍
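
For context on the pattern being praised, here is a minimal sketch of runtime dispatch resolved once and stored as a function pointer. The types and names (`ExDotFn`, `Kernel`, `select_kernel`, `DistCalculator`) are hypothetical, and every branch returns the same scalar kernel as a placeholder; in the PR the branches return distinct AVX-512, AVX2, and NEON kernels.

```rust
/// Signature shared by all kernel variants in this sketch (hypothetical).
type ExDotFn = fn(&[f32], &[u8]) -> f32;

/// A selected kernel plus a label, useful for logging and tests.
struct Kernel {
    name: &'static str,
    func: ExDotFn,
}

fn scalar_kernel(query: &[f32], codes: &[u8]) -> f32 {
    query.iter().zip(codes).map(|(&q, &c)| q * c as f32).sum()
}

/// Probe CPU features once and pick the widest available variant.
/// Placeholder: each branch would return a different SIMD kernel.
fn select_kernel() -> Kernel {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return Kernel { name: "avx512", func: scalar_kernel };
        }
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            return Kernel { name: "avx2", func: scalar_kernel };
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Kernel { name: "neon", func: scalar_kernel };
        }
    }
    Kernel { name: "scalar", func: scalar_kernel }
}

/// The calculator resolves the kernel in `new`, so the per-candidate
/// hot loop pays only an indirect call, not a feature check.
struct DistCalculator {
    kernel: Kernel,
}

impl DistCalculator {
    fn new() -> Self {
        Self { kernel: select_kernel() }
    }

    fn ex_dot(&self, query: &[f32], codes: &[u8]) -> f32 {
        (self.kernel.func)(query, codes)
    }
}
```

Resolving the pointer at construction time matches the "resolved once per dist calculator" note in the description and keeps feature detection out of the rerank loop.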

@BubbleCal

Contributor Author

> This is marked as a breaking change. I didn't see an API change in the Lance crate? Am I missing something?

Oops, I have one more commit that changes the ex-bits layout, but it won't affect existing num_bits=1 IVF_RQ indices.

BubbleCal and others added 2 commits June 11, 2026 14:24
Replace the sequential LSB-first ex-code layout with a blocked layout
(64-dim blocks, bit-interleaved so the SIMD unpack emits natural dim
order) written under a new column (__blocked_ex_codes):

- The ex-dot kernels read the rotated query directly (no per-query
  permutation) and consume rows as stored (no load-time repack, no
  second resident copy).
- Legacy indexes remain readable: sequential rows are repacked at load
  and the batch is normalized, so rewrites (remap, optimize merges)
  emit the blocked format. Older lance versions fail loudly on new
  multi-bit indexes (missing-column error) instead of misreading them.
- num_bits=1 indexes carry no ex codes and are unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The FastScan ex bulk path is only reachable when lower-bound gating is
disabled (legacy indexes without error factors); gated indexes rerank
per candidate with the ex-dot kernels. Build its artifacts accordingly:

- Compute the u8 ex LUT directly from the rotated query (the per-dim
  table is the pure multiplication q[d] * code), removing the
  dim * 2^ex_bits f32 table from the query context, the calculator,
  and the scratch layout entirely. This also speeds the bypass path
  itself up by 3-6%.
- Skip the FastScan transpose (packed_ex_codes) when error factors are
  present, saving one resident copy of the ex codes and the per-load
  transpose for every fresh index. Bulk scoring on gated indexes falls
  through to the exact per-row kernels.

The LUT bulk path stays for legacy indexes: at dim=1536 it scores 2.5x
(ex_bits=4) to 7x (ex_bits=2) faster per row than the kernel loop.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>