research(nightly): PDX columnar vector layout with dimension-pruning scan #441
Draft
…sion-pruning scan

Implements PDX (Kuffo, Krippner, Boncz; SIGMOD 2025, arXiv:2503.04422): transpose vector storage from row-major to columnar within each partition block so LLVM auto-vectorises the distance kernel without hand-written intrinsics.

Three backends behind the AnnIndex trait:
- RowMajorIndex: row-major brute-force baseline (100% recall)
- PdxFlatIndex: PDX columnar layout, no pruning (2.16–3.42× faster)
- PdxPruneIndex: PDX + exponential lower-bound pruning (2.01–2.75× faster)

Measured results (x86_64 --release, 200 queries, 100% recall all variants):
- n=10K D=96: RowMajor 2,023 QPS → PdxFlat 4,726 QPS (+2.34×)
- n=10K D=384: RowMajor 400 QPS → PdxFlat 1,148 QPS (+2.87×)
- n=50K D=128: RowMajor 283 QPS → PdxFlat 610 QPS (+2.16×)
- n=50K D=384: RowMajor 59 QPS → PdxFlat 202 QPS (+3.42×)

12 integration tests, zero mocks, zero unsafe. First Rust implementation of PDX.

https://claude.ai/code/session_018oQ9jHA4QPFk5h15nEw61T
…g scan

Records the decision to add ruvector-pdx as a new crate implementing the SIGMOD 2025 PDX data layout. Documents speedup measurements, integration path into ruvector-cluster, and alternatives considered (AVX2 intrinsics, simsimd, MRL, Product Quantization).

https://claude.ai/code/session_018oQ9jHA4QPFk5h15nEw61T
Comprehensive research document covering:
- SOTA survey: PDX vs FAISS/Qdrant/Milvus/LanceDB layout strategies
- How-it-works walkthrough (blog-readable)
- Real benchmark numbers from cargo run --release -p ruvector-pdx
- Practical failure modes (small blocks, uniform data, NUMA)
- Roadmap: block_size=256, ruvector-cluster integration, ADSampling χ² bound
- Production crate layout proposal

https://claude.ai/code/session_018oQ9jHA4QPFk5h15nEw61T
Registers crates/ruvector-pdx in the workspace so cargo build --workspace and cargo test --workspace include the new PDX crate automatically. https://claude.ai/code/session_018oQ9jHA4QPFk5h15nEw61T
Summary
- docs/adr/ADR-193-pdx-columnar-scan.md
- docs/research/nightly/2026-05-08-pdx-columnar-scan/README.md
- crates/ruvector-pdx/: first Rust implementation of PDX

What PDX is
Traditional vector stores use row-major layout: each vector occupies a contiguous
row of D floats. Accessing dimension d across N vectors requires stride-D loads,
which are not SIMD-friendly. PDX transposes within each partition block so
dimension d is stored as a contiguous column of N floats, making the inner loop
stride-1 and enabling LLVM to auto-vectorise with AVX2/AVX-512.
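
The transposition described above can be sketched in a few lines of Rust. This is a minimal illustration, not the code in crates/ruvector-pdx: the `ColumnarBlock` type, its fields, and the function names are hypothetical, and squared L2 is assumed as the distance metric.

```rust
/// Hypothetical block of vectors stored dimension-major (PDX-style).
/// `data[d * n + i]` holds dimension `d` of vector `i`.
pub struct ColumnarBlock {
    pub n: usize,
    pub dims: usize,
    pub data: Vec<f32>,
}

impl ColumnarBlock {
    /// Transpose row-major vectors (each `dims` long) into columnar storage.
    pub fn from_rows(rows: &[Vec<f32>]) -> Self {
        let n = rows.len();
        let dims = rows.first().map_or(0, |r| r.len());
        let mut data = vec![0.0f32; n * dims];
        for (i, row) in rows.iter().enumerate() {
            assert_eq!(row.len(), dims, "all rows must share one dimensionality");
            for (d, &x) in row.iter().enumerate() {
                data[d * n + i] = x;
            }
        }
        Self { n, dims, data }
    }

    /// Squared L2 distance from `query` to every vector in the block.
    pub fn l2_sq(&self, query: &[f32]) -> Vec<f32> {
        assert_eq!(query.len(), self.dims);
        let mut acc = vec![0.0f32; self.n];
        for (d, &q) in query.iter().enumerate() {
            // One contiguous column: dimension `d` of all `n` vectors.
            let column = &self.data[d * self.n..(d + 1) * self.n];
            // Stride-1 inner loop over the column and the accumulators.
            for (a, &x) in acc.iter_mut().zip(column) {
                let diff = x - q;
                *a += diff * diff;
            }
        }
        acc
    }
}

/// Row-major baseline for comparison: one full pass per vector.
pub fn l2_sq_row_major(rows: &[Vec<f32>], query: &[f32]) -> Vec<f32> {
    rows.iter()
        .map(|r| r.iter().zip(query).map(|(x, q)| (x - q) * (x - q)).sum::<f32>())
        .collect()
}
```

Both functions return the same distances; only the memory access pattern differs. In the columnar version the inner loop walks two contiguous slices with no cross-iteration dependency, which is the shape a compiler can typically vectorise without intrinsics.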
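Real benchmark results (x86_64 --release, 100% recall all variants)

| n | D | RowMajor QPS | PdxFlat QPS | Speedup |
|---|---|---|---|---|
| 10K | 96 | 2,023 | 4,726 | 2.34× |
| 10K | 384 | 400 | 1,148 | 2.87× |
| 50K | 128 | 283 | 610 | 2.16× |
| 50K | 384 | 59 | 202 | 3.42× |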
Speedup grows with D, so the impact should be highest on high-dimensional text embeddings such as 384D and 1536D.
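
The pruning variant (PdxPruneIndex) also reports 100% recall. The PR text does not spell out its bound, so the sketch below is an assumption about how such a scan can stay exact: the partial squared L2 distance over the first dimensions never exceeds the full distance, so a vector whose partial sum already reaches the current threshold can be dropped safely. The checkpoints are spaced at exponentially growing dimension counts. The function name, signature, and initial step of 2 are hypothetical.

```rust
/// Nearest neighbour in one columnar block with partial-distance pruning.
/// `data[d * n + i]` holds dimension `d` of vector `i` (assumed layout).
/// `threshold` is the best squared distance found so far, e.g. from earlier
/// blocks. Returns the closest vector with squared distance < `threshold`.
pub fn scan_block_pruned(
    data: &[f32],
    n: usize,
    dims: usize,
    query: &[f32],
    threshold: f32,
) -> Option<(usize, f32)> {
    assert_eq!(data.len(), n * dims);
    assert_eq!(query.len(), dims);
    let mut alive: Vec<usize> = (0..n).collect();
    let mut partial = vec![0.0f32; n];
    let mut start = 0;
    let mut step = 2; // assumed initial step; doubles after every checkpoint
    while start < dims && !alive.is_empty() {
        let end = (start + step).min(dims);
        for d in start..end {
            let column = &data[d * n..(d + 1) * n];
            let q = query[d];
            // Only surviving candidates accumulate further dimensions.
            for &i in &alive {
                let diff = column[i] - q;
                partial[i] += diff * diff;
            }
        }
        // The partial sum is a lower bound on the full distance, so anything
        // at or above the threshold cannot become the new best.
        alive.retain(|&i| partial[i] < threshold);
        start = end;
        step *= 2;
    }
    // Survivors have accumulated every dimension: `partial` is now exact.
    alive
        .into_iter()
        .filter(|&i| partial[i] < threshold)
        .map(|i| (i, partial[i]))
        .min_by(|a, b| a.1.total_cmp(&b.1))
}
```

Because pruning only discards vectors that provably cannot beat the threshold, the result matches a full scan. The trade-off is that the surviving candidates are accessed through an index list, which is less vectorisation-friendly than the flat scan; that would be consistent with the pruning backend not outperforming PdxFlat in the numbers reported by this PR.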
Deliverables
- crates/ruvector-pdx/: new crate with 3 backends + 12 integration tests + benchmark binary
- cargo build --release -p ruvector-pdx ✅
- cargo test -p ruvector-pdx ✅ (12/12 passed, no mocks)
- cargo run --release -p ruvector-pdx ✅ (real benchmark numbers above)
- docs/adr/ADR-193-pdx-columnar-scan.md ✅
- docs/research/nightly/2026-05-08-pdx-columnar-scan/README.md ✅

Test plan
- crates/ruvector-pdx/src/index.rs: three backends behind AnnIndex trait
- cargo test -p ruvector-pdx passes (all 12 tests)

https://claude.ai/code/session_018oQ9jHA4QPFk5h15nEw61T