Use Panama Vector API to SIMD-evaluate fixed-cardinality sorted numeric range queries in rangeIntoBitSet() #16283

Open
costin wants to merge 2 commits into apache:main from costin:lucene/sorted-numeric-fixed-simd
@costin costin commented Jun 22, 2026

Contributor

When the stored values have fixed cardinality and no encoding transforms (no gcd, delta, table, or block compression), the vectorization provider loads N values into a SIMD vector, performs a broadcast range check (>= min AND <= max), collapses per-lane results into a per-doc mask, and OR-writes matching docs into the bitset in one operation.

Falls back to the scalar path when vectorLen % cardinality != 0 (e.g. vpd=8, i.e. 8 values per doc, on AVX2 with 4-lane vectors).
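The steps described above can be modeled in plain Java as a rough sketch. This is not the PR's implementation and does not use the Vector API: the inner loop over `lanes` stands in for one vector compare, and the class name, the `lanes` parameter, and the `long[]` bitset are all hypothetical. It assumes a doc matches when any of its values falls in [min, max], that values for doc `d` start at index `d * cardinality`, and that bit `doc - offset` is the one to set.

```java
/** Scalar model of the block-wise check; plain Java arrays, not the Panama Vector API. */
final class FixedCardinalityRangeSketch {

  /**
   * Sets bit (doc - offset) for each doc in [fromDoc, toDoc) with at least one value in
   * [min, max]. `lanes` models the vector length in longs (hypothetical parameter).
   */
  static void rangeIntoBitSet(long[] values, int fromDoc, int toDoc, int cardinality,
      long min, long max, long[] bits, int offset, int lanes) {
    int doc = fromDoc;
    if (lanes <= 64 && lanes % cardinality == 0) {
      int docsPerVector = lanes / cardinality;
      // Each iteration models one vector covering docsPerVector consecutive docs.
      for (; doc + docsPerVector <= toDoc; doc += docsPerVector) {
        int base = doc * cardinality;
        long docMask = 0;
        for (int lane = 0; lane < lanes; lane++) {
          long v = values[base + lane];
          // Per-lane range test, collapsed into the bit of the doc that owns the lane.
          if (v >= min && v <= max) {
            docMask |= 1L << (lane / cardinality);
          }
        }
        // OR the per-doc mask into the bitset, handling a 64-bit word boundary.
        int bit = doc - offset;
        int word = bit >>> 6;
        int shift = bit & 63;
        bits[word] |= docMask << shift;
        long spill = shift == 0 ? 0L : docMask >>> (64 - shift);
        if (spill != 0) {
          bits[word + 1] |= spill;
        }
      }
    }
    // Scalar fallback, also used for the tail that does not fill a whole vector.
    for (; doc < toDoc; doc++) {
      int base = doc * cardinality;
      for (int i = 0; i < cardinality; i++) {
        long v = values[base + i];
        if (v >= min && v <= max) {
          int bit = doc - offset;
          bits[bit >>> 6] |= 1L << bit;
          break;
        }
      }
    }
  }

  public static void main(String[] args) {
    // 4 docs, 2 values each; docs 1 and 2 have a value in [10, 30].
    long[] values = {1, 2, 10, 20, 30, 40, 5, 50};
    long[] bits = new long[1];
    rangeIntoBitSet(values, 0, 4, 2, 10, 30, bits, 0, 4);
    System.out.println(Long.toBinaryString(bits[0])); // prints 110
  }
}
```

Passing a `lanes` value that is not a multiple of `cardinality` (for example 3 lanes with cardinality 2) sends every doc through the scalar loop, which should produce the same bits.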

Benchmark

SortedNumericDocValuesRangeQueryBenchmark, 1M docs, cardinality=fixed, density=dense, queryShape=plain. Branch vs main, JDK 25.0.3.

AMD EPYC 7R32 (c5a.2xlarge) — AVX2, 256-bit (4 longs)

SIMD: vpd=2 (2 docs/vec), vpd=4 (1 doc/vec). vpd=8 falls back to scalar.

| vpd | pattern | selectivity | baseline (ops/s) | candidate (ops/s) | ratio |
|---|---|---|---|---|---|
| 2 | clustered | 0.01 | 15426.5 | 15565.0 | 1.01x |
| 2 | clustered | 0.1 | 15331.1 | 15431.7 | 1.01x |
| 2 | clustered | 0.5 | 19083.7 | 19168.8 | 1.00x |
| 2 | random | 0.01 | 65.2 | 134.1 | 2.06x |
| 2 | random | 0.1 | 58.7 | 95.8 | 1.63x |
| 2 | random | 0.5 | 65.1 | 86.7 | 1.33x |
| 4 | clustered | 0.01 | 9830.6 | 9843.3 | 1.00x |
| 4 | clustered | 0.1 | 9676.3 | 9852.7 | 1.02x |
| 4 | clustered | 0.5 | 18409.4 | 18256.0 | 0.99x |
| 4 | random | 0.01 | 52.3 | 70.2 | 1.34x |
| 4 | random | 0.1 | 46.3 | 49.7 | 1.07x |
| 4 | random | 0.5 | 53.2 | 64.3 | 1.21x |
| 8 | clustered | 0.01 | 5925.1 | 5891.1 | 0.99x |
| 8 | clustered | 0.1 | 5830.6 | 5845.9 | 1.00x |
| 8 | clustered | 0.5 | 17669.9 | 17599.9 | 1.00x |
| 8 | random | 0.01 | 42.1 | 43.7 | 1.04x |
| 8 | random | 0.1 | 38.6 | 40.9 | 1.06x |
| 8 | random | 0.5 | 44.2 | 50.2 | 1.14x |

Intel Xeon 8375C (c6i.2xlarge) — AVX-512, 512-bit (8 longs)

SIMD: vpd=2 (4 docs/vec), vpd=4 (2 docs/vec), vpd=8 (1 doc/vec).

| vpd | pattern | selectivity | baseline (ops/s) | candidate (ops/s) | ratio |
|---|---|---|---|---|---|
| 2 | clustered | 0.01 | 19255.8 | 19439.4 | 1.01x |
| 2 | clustered | 0.1 | 18689.0 | 19133.6 | 1.02x |
| 2 | clustered | 0.5 | 22700.0 | 22745.1 | 1.00x |
| 2 | random | 0.01 | 83.3 | 208.4 | 2.50x |
| 2 | random | 0.1 | 67.6 | 133.4 | 1.97x |
| 2 | random | 0.5 | 65.7 | 127.8 | 1.94x |
| 4 | clustered | 0.01 | 11715.4 | 11658.7 | 1.00x |
| 4 | clustered | 0.1 | 11791.0 | 11727.0 | 0.99x |
| 4 | clustered | 0.5 | 20998.0 | 21080.8 | 1.00x |
| 4 | random | 0.01 | 63.5 | 104.1 | 1.64x |
| 4 | random | 0.1 | 50.9 | 65.9 | 1.29x |
| 4 | random | 0.5 | 61.5 | 94.7 | 1.54x |
| 8 | clustered | 0.01 | 7133.1 | 7202.5 | 1.01x |
| 8 | clustered | 0.1 | 6956.8 | 6995.9 | 1.01x |
| 8 | clustered | 0.5 | 20338.1 | 20369.2 | 1.00x |
| 8 | random | 0.01 | 47.2 | 53.5 | 1.13x |
| 8 | random | 0.1 | 43.2 | 38.1 | 0.88x |
| 8 | random | 0.5 | 51.6 | 54.1 | 1.05x |

Clustered data shows no change since sequential access is already at L1/L2 cache speed; comparison cost is negligible. Wins appear on random data where per-doc cache misses dominate and SIMD batching amortizes comparison overhead.

Gains scale with docsPerVector: vpd=2 on AVX-512 processes 4 docs per vector (best), vpd=8 on AVX2 falls back to scalar (no gain).
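The docs-per-vector figures quoted for both machines follow from simple arithmetic, which the sketch below works through. The class and method names are hypothetical, and returning 0 to signal the scalar fallback is a convention chosen only for this illustration.

```java
/** Works out docs per vector for the lane counts and vpd values used in the benchmark. */
final class DocsPerVectorSketch {

  /** Docs covered by one vector of `lanes` longs, or 0 when the scalar fallback applies. */
  static int docsPerVector(int lanes, int valuesPerDoc) {
    return lanes % valuesPerDoc == 0 ? lanes / valuesPerDoc : 0;
  }

  public static void main(String[] args) {
    int[] laneCounts = {4, 8}; // 256-bit and 512-bit vectors of 64-bit longs
    int[] vpds = {2, 4, 8};
    for (int lanes : laneCounts) {
      for (int vpd : vpds) {
        int dpv = docsPerVector(lanes, vpd);
        String outcome = dpv == 0 ? "scalar fallback" : dpv + " docs/vec";
        System.out.println("lanes=" + lanes + " vpd=" + vpd + " -> " + outcome);
      }
    }
  }
}
```

With 4 lanes this gives 2 docs/vec, 1 doc/vec, and the scalar fallback for vpd=2, 4, and 8; with 8 lanes it gives 4, 2, and 1 docs/vec, matching the per-machine notes above.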

Dense fixed-cardinality sorted numeric values can evaluate range blocks
with the vectorization provider when the flattened value layout is raw
and contiguous. Keep the optimization gated to layouts that benchmark
well and retain scalar fallback behavior for other encodings.
*
* @lucene.internal
*/
public interface SortedNumericDocValuesRangeSupport {

Contributor

I don't think creating a separate interface for this specific use case looks right.
It differs from the existing DocValuesRangeSupport by only a single parameter, i.e. cardinality, so it doesn't justify a new abstraction layer. We could probably just add another method to the existing DocValuesRangeSupport instead. Something like:

default void rangeIntoBitSet(LongValues values,
      int fromDoc,
      int toDoc,
      int cardinality,
      long minValue,
      long maxValue,
      FixedBitSet bitSet,
      int offset) {
      // default to scalar approach.
}

Let me know what you think.
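As a rough illustration of this suggestion, the default method could carry the scalar loop so that only a vectorized provider needs to override it. The sketch below uses hypothetical stand-ins: `ValueLookup` in place of Lucene's LongValues, `java.util.BitSet` in place of FixedBitSet, and `RangeSupportSketch` in place of DocValuesRangeSupport. It assumes a doc matches when any of its values is in range and that bit `doc - offset` is set; neither detail is confirmed by the thread.

```java
import java.util.BitSet;

/** Hypothetical stand-in for Lucene's LongValues: lookup by flattened value index. */
interface ValueLookup {
  long get(long index);
}

/** Hypothetical stand-in for the existing range-support interface. */
interface RangeSupportSketch {

  /** Scalar default; a SIMD-capable implementation would override this method. */
  default void rangeIntoBitSet(ValueLookup values, int fromDoc, int toDoc, int cardinality,
      long minValue, long maxValue, BitSet bitSet, int offset) {
    for (int doc = fromDoc; doc < toDoc; doc++) {
      long base = (long) doc * cardinality;
      for (int i = 0; i < cardinality; i++) {
        long v = values.get(base + i);
        if (v >= minValue && v <= maxValue) {
          bitSet.set(doc - offset); // one matching value is enough for the doc
          break;
        }
      }
    }
  }
}

final class RangeSupportDemo {
  public static void main(String[] args) {
    // 3 docs with 3 values each, stored flattened and contiguous.
    long[] flat = {1, 2, 3, 7, 8, 9, 20, 30, 40};
    RangeSupportSketch scalarOnly = new RangeSupportSketch() {};
    BitSet hits = new BitSet();
    scalarOnly.rangeIntoBitSet(i -> flat[(int) i], 0, 3, 3, 8, 25, hits, 0);
    System.out.println(hits); // prints {1, 2}
  }
}
```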

Contributor Author

Makes sense. I've removed the interface in favor of a new method, which has the nice benefit of reducing the PR size.

@costin costin force-pushed the lucene/sorted-numeric-fixed-simd branch from 27af116 to 015f370 Compare June 23, 2026 12:59
@github-actions github-actions Bot added this to the 10.6.0 milestone Jun 23, 2026

@sgup432 sgup432 left a comment

Contributor

I have a few minor comments. Also, I think we should add unit tests that cover different cardinality values: cardinality > 1, vectorLen % cardinality != 0, and other cases, assuming these are not already covered by existing tests.

costin commented Jun 25, 2026

Contributor Author

> I have a few minor comments. Also, I think we should add unit tests that cover different cardinality values: cardinality > 1, vectorLen % cardinality != 0, and other cases, assuming these are not already covered by existing tests.

See my other comments. I've parameterized testSortedNumericRangeIntoBitSetVaryingCardinality to exercise the other cardinalities {2, 3, 4, 5, 7, 8}, which checks both the SIMD path and the scalar fallback path.

@sgup432 sgup432 left a comment

Contributor

LGTM!
