Use LongHashSet for sparse ordinal sets in DocValuesRangeIterator by costin · Pull Request #16297 · apache/lucene

costin · 2026-06-25T21:21:37Z

Replace LongBitSet with LongHashSet in buildOrdinalSet() so memory scales with the number of matching terms rather than the total ordinal count of the segment.

This matters because DocValuesRangeIterator.buildOrdinalSet() builds the set of matching ordinals for MultiTermQuery rewrites, which run for prefix, wildcard, and regexp queries on doc-values-only fields.

The current implementation allocates, per query, a LongBitSet(ordCount) where ordCount is the total number of unique values in the segment. This means a wildcard query matching 10 terms against a 10M-cardinality field allocates 1.2 MB per segment of mostly-zero bits.
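As a rough check of those figures, the sketch below computes the footprint of a word-backed bit set sized to ordCount. The class and method names are hypothetical; the only assumption is that the bit set stores one bit per ordinal, rounded up to whole 64-bit words, and that object headers are ignored.

```java
final class BitSetFootprint {
  // Number of 64-bit words needed to hold numBits bits, rounded up.
  static long wordsFor(long numBits) {
    return numBits == 0 ? 0 : ((numBits - 1) >> 6) + 1;
  }

  // Bytes used by the backing long[] alone (object headers are ignored).
  static long bytesFor(long ordCount) {
    return wordsFor(ordCount) * Long.BYTES;
  }

  public static void main(String[] args) {
    for (long ordCount : new long[] {1_000L, 100_000L, 10_000_000L}) {
      System.out.println(ordCount + " ords -> " + bytesFor(ordCount) + " bytes");
    }
  }
}
```

The cost depends only on ordCount, never on how many terms the query actually matched, which is the behaviour this PR targets.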

This PR replaces LongBitSet with LongHashSet, which allocates proportionally to matching terms (~32 bytes/entry). Contiguous ordinal sets skip the hash set allocation entirely: ords are collected into a temporary long[] and contiguity is detected from the sorted stream.

| Scenario | Old | New |
| --- | --- | --- |
| Contiguous (range query) | LongBitSet(ordCount) allocated, populated, discarded | long[matchCount] temp array only; no set allocated |
| Non-contiguous (wildcard) | LongBitSet(ordCount) allocated, kept for per-doc lookup | long[matchCount] temp + LongHashSet(matchCount) |
| Empty (no matching terms) | returns null | returns null (unchanged) |
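The three scenarios can be sketched roughly as follows. This is not the PR's code: `OrdinalSetSketch` and `build` are hypothetical names, `java.util.HashSet<Long>` stands in for Lucene's LongHashSet, and the input is assumed to be the already-collected ords in strictly increasing order, as a sorted terms iteration would produce them.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongPredicate;

final class OrdinalSetSketch {
  /**
   * Returns null for no matches, a range check for contiguous ords,
   * and hash-set membership otherwise.
   * Assumes sortedOrds is strictly increasing (no duplicates).
   */
  static LongPredicate build(long[] sortedOrds) {
    int n = sortedOrds.length;
    if (n == 0) {
      return null; // empty: nothing to match
    }
    long min = sortedOrds[0];
    long max = sortedOrds[n - 1];
    // With distinct, sorted ords the set is contiguous exactly when
    // the span between first and last equals the count minus one.
    if (max - min == n - 1) {
      return ord -> ord >= min && ord <= max; // no set allocated
    }
    // Non-contiguous: only now pay for a set, sized by match count.
    Set<Long> set = new HashSet<>(Math.max(16, n * 2));
    for (long ord : sortedOrds) {
      set.add(ord);
    }
    return ord -> set.contains(ord);
  }
}
```

Deferring the set construction until after the contiguity check is what lets the contiguous case get away with only the temporary array.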

Memory per query instance per segment

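| matchCount | ordCount | Old (`LongBitSet`) | New (`LongHashSet`) | Saved (ratio; negative = extra memory) | Extra time per 1M docs |
| --- | --- | --- | --- | --- | --- |
| 10 | 1K | 128 B | 320 B | -192 B | +1.4 ms |
| 10 | 100K | 12.2 KB | 320 B | 38x | +1.4 ms |
| 10 | 10M | 1.2 MB | 320 B | 3,906x | +2.1 ms |
| 100 | 1K | 128 B | 2.5 KB | -2.4 KB | +1.2 ms |
| 100 | 100K | 12.2 KB | 2.5 KB | 5x | +1.2 ms |
| 100 | 10M | 1.2 MB | 2.5 KB | 500x | +1.1 ms |
| 1000 | 1K | 128 B | 25 KB | -24.9 KB | +0.5 ms |
| 1000 | 100K | 12.2 KB | 25 KB | -12.8 KB | +1.5 ms |
| 1000 | 10M | 1.2 MB | 25 KB | 50x | +1.6 ms |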
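
A rough break-even point follows from the two cost models, taking the ~32 bytes/entry figure at face value and treating the bit set as one bit per ordinal (both are approximations, and the symbols $m$ for matchCount and $N$ for ordCount are introduced here only for illustration):

```latex
\underbrace{32\,m}_{\text{hash set bytes}} \;<\; \underbrace{\tfrac{N}{8}}_{\text{bit set bytes}}
\quad\Longleftrightarrow\quad
m \;<\; \frac{N}{256}
```

Under that estimate the hash set is smaller whenever fewer than about 1 in 256 ordinals match, which is consistent with the rows above: 1000 matches against 100K ordinals costs more memory, while 100 matches against the same field costs less.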

Replace LongBitSet with LongHashSet in buildOrdinalSet() so memory scales with the number of matching terms rather than the total ordinal count of the segment. Defer the LongHashSet allocation until after the contiguity check so contiguous ordinal sets (which route to forOrdinalRange) never allocate a set at all. Collect ords into a temporary long[] during TermsEnum iteration, detect contiguity from the sorted ord stream, then build the LongHashSet only for non-contiguous sets. The internal OrdinalSet record now holds a LongPredicate for membership testing, decoupling it from the concrete set type. The public forOrdinalSet(SortedSetDocValues, ..., LongBitSet) overload is unchanged.
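
To illustrate the decoupling described in that commit message, here is a minimal sketch of a record that holds a LongPredicate instead of a concrete set. The record's fields and the two factory methods are hypothetical, not the PR's actual shape, and `java.util.BitSet` / `HashSet<Long>` stand in for Lucene's LongBitSet and LongHashSet.

```java
import java.util.BitSet;
import java.util.Set;
import java.util.function.LongPredicate;

final class OrdinalSetRecordSketch {
  // Hypothetical shape: bounds plus an opaque membership test.
  record OrdinalSet(long minOrd, long maxOrd, LongPredicate contains) {
    boolean matches(long ord) {
      // Cheap bounds check first, then the backing structure.
      return ord >= minOrd && ord <= maxOrd && contains.test(ord);
    }
  }

  // Dense backing: assumes a non-empty bit set with int-sized ords.
  static OrdinalSet fromBitSet(BitSet bits) {
    return new OrdinalSet(
        bits.nextSetBit(0), bits.length() - 1, ord -> bits.get((int) ord));
  }

  // Sparse backing: assumes a non-empty set of ords.
  static OrdinalSet fromHashSet(Set<Long> ords) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long ord : ords) {
      min = Math.min(min, ord);
      max = Math.max(max, ord);
    }
    return new OrdinalSet(min, max, ord -> ords.contains(ord));
  }
}
```

Because callers only see `matches`, a caller-supplied bit set and an internally built hash set can share the same per-document lookup path.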

github-actions Bot added the module:core/search label Jun 25, 2026

Update CHANGES.txt for apache#16297

9ef7fbd

github-actions Bot added this to the 10.6.0 milestone Jun 26, 2026

costin added 2 commits June 26, 2026 10:41

Fix google-java-format in benchmark Javadoc

c7efe65

Fix unused lambda parameter in contiguous OrdinalSet

4528b21

costin wants to merge 4 commits into apache:main from costin:lucene/sparse-ordinal-set
