Use LongHashSet for sparse ordinal sets in DocValuesRangeIterator #16297

Open
costin wants to merge 4 commits into apache:main from costin:lucene/sparse-ordinal-set

@costin costin commented Jun 25, 2026

Replace LongBitSet with LongHashSet in buildOrdinalSet() so memory scales with the number of matching terms rather than the total ordinal count of the segment.

This matters because DocValuesRangeIterator.buildOrdinalSet() builds the set of matching ordinals for MultiTermQuery rewrites, a path that runs for prefix, wildcard, and regexp queries on doc-values-only fields.

The current implementation allocates, per query, a LongBitSet(ordCount), where ordCount is the total number of unique values in the segment. A wildcard query matching 10 terms against a 10M-cardinality field therefore allocates about 1.2 MB of mostly-zero bits per segment.
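The bit set figures can be reproduced with a little arithmetic. The sketch below is illustrative only: `BitSetFootprint` and `bitSetBytes` are hypothetical names, and the calculation counts just the backing `long[]` words (one 64-bit word per 64 ordinals), ignoring object and array headers.

```java
final class BitSetFootprint {
    /**
     * Approximate bytes used by the backing long[] of a bit set that
     * covers numBits ordinals. Headers are deliberately ignored.
     */
    static long bitSetBytes(long numBits) {
        long words = (numBits + 63) / 64; // round up to whole 64-bit words
        return words * Long.BYTES;
    }

    public static void main(String[] args) {
        // The three segment cardinalities used in the PR description.
        for (long ordCount : new long[] {1_000L, 100_000L, 10_000_000L}) {
            System.out.println(ordCount + " ordinals -> " + bitSetBytes(ordCount) + " bytes");
        }
    }
}
```

For 10M ordinals this gives 1,250,000 bytes, in line with the roughly 1.2 MB quoted above, and the cost does not depend on how many terms actually match.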

This PR replaces LongBitSet with LongHashSet, which allocates proportionally to the number of matching terms (~32 bytes/entry). Contiguous ordinal sets skip the hash set allocation entirely: ords are collected into a temporary long[], and contiguity is detected from the sorted stream.
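A minimal sketch of the collect-then-check idea, not the code in this PR. It assumes ordinals arrive in strictly ascending order with no duplicates, and it uses `java.util.HashSet<Long>` as a stand-in for Lucene's LongHashSet so the example runs without Lucene on the classpath. The class and method names (`SparseOrdinalSketch`, `collect`, `isContiguous`, `buildSetIfSparse`) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.PrimitiveIterator;
import java.util.Set;

final class SparseOrdinalSketch {
    /** Copies an ascending stream of ordinals into a trimmed temporary long[]. */
    static long[] collect(PrimitiveIterator.OfLong sortedOrds) {
        long[] buf = new long[8];
        int count = 0;
        while (sortedOrds.hasNext()) {
            if (count == buf.length) {
                buf = Arrays.copyOf(buf, buf.length * 2); // grow geometrically
            }
            buf[count++] = sortedOrds.nextLong();
        }
        return Arrays.copyOf(buf, count);
    }

    /**
     * For strictly ascending, duplicate-free ordinals, the run is gap-free
     * exactly when (last - first) equals (count - 1).
     */
    static boolean isContiguous(long[] ords) {
        return ords.length > 0 && ords[ords.length - 1] - ords[0] == ords.length - 1;
    }

    /**
     * Allocates a set only for non-contiguous input. Returns null for empty
     * input and for contiguous input, where a range check would suffice.
     */
    static Set<Long> buildSetIfSparse(long[] ords) {
        if (ords.length == 0 || isContiguous(ords)) {
            return null;
        }
        Set<Long> set = new HashSet<>(); // stand-in for LongHashSet (assumption)
        for (long ord : ords) {
            set.add(ord);
        }
        return set;
    }
}
```

Under these assumptions, a range-like match such as ordinals 5 through 9 never reaches the set allocation, while a scattered match such as 2, 7, 40 does.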

| Scenario | Old | New |
| --- | --- | --- |
| Contiguous (range query) | LongBitSet(ordCount) allocated, populated, discarded | long[matchCount] temp array only; no set allocated |
| Non-contiguous (wildcard) | LongBitSet(ordCount) allocated, kept for per-doc lookup | long[matchCount] temp + LongHashSet(matchCount) |
| Empty (no matching terms) | returns null | returns null (unchanged) |

Memory per query instance per segment

| matchCount | ordCount | Old (LongBitSet) | New (LongHashSet) | Saved | Extra time per 1M docs |
| --- | --- | --- | --- | --- | --- |
| 10 | 1K | 128 B | 320 B | -192 B | +1.4 ms |
| 10 | 100K | 12.2 KB | 320 B | 38x | +1.4 ms |
| 10 | 10M | 1.2 MB | 320 B | 3,906x | +2.1 ms |
| 100 | 1K | 128 B | 2.5 KB | -2.4 KB | +1.2 ms |
| 100 | 100K | 12.2 KB | 2.5 KB | 5x | +1.2 ms |
| 100 | 10M | 1.2 MB | 2.5 KB | 500x | +1.1 ms |
| 1000 | 1K | 128 B | 25 KB | -24.9 KB | +0.5 ms |
| 1000 | 100K | 12.2 KB | 25 KB | -12.8 KB | +1.5 ms |
| 1000 | 10M | 1.2 MB | 25 KB | 50x | +1.6 ms |

Replace LongBitSet with LongHashSet in buildOrdinalSet() so
memory scales with the number of matching terms rather than
the total ordinal count of the segment.

Defer the LongHashSet allocation until after the contiguity
check so contiguous ordinal sets (which route to
forOrdinalRange) never allocate a set at all. Collect ords
into a temporary long[] during TermsEnum iteration, detect
contiguity from the sorted ord stream, then build the
LongHashSet only for non-contiguous sets.

The internal OrdinalSet record now holds a LongPredicate for
membership testing, decoupling it from the concrete set type.
The public forOrdinalSet(SortedSetDocValues, ..., LongBitSet)
overload is unchanged.
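
To illustrate the decoupling described in the commit message, here is a hedged sketch of a record that holds only a LongPredicate. The record's real fields are not shown in this PR text, so the single `contains` component and the factory methods `fromHashSet` and `fromBitSet` are assumptions, and `java.util.HashSet` and `java.util.BitSet` stand in for Lucene's LongHashSet and LongBitSet.

```java
import java.util.BitSet;
import java.util.Set;
import java.util.function.LongPredicate;

final class OrdinalSetSketch {
    /** Hypothetical shape: membership is exposed only as a predicate. */
    record OrdinalSet(LongPredicate contains) {}

    /** Backs the predicate with a hash set (stand-in for LongHashSet). */
    static OrdinalSet fromHashSet(Set<Long> ords) {
        return new OrdinalSet(ord -> ords.contains(ord));
    }

    /** Backs the predicate with a bit set (stand-in for LongBitSet). */
    static OrdinalSet fromBitSet(BitSet bits) {
        // java.util.BitSet is int-indexed, so guard the narrowing cast.
        return new OrdinalSet(
            ord -> ord >= 0 && ord <= Integer.MAX_VALUE && bits.get((int) ord));
    }
}
```

A caller that tests `ordinalSet.contains().test(ord)` per document does not need to know which backing structure was chosen, which is what allows an existing LongBitSet-based overload to keep working alongside a hash-set-based path.
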
@github-actions github-actions Bot added this to the 10.6.0 milestone Jun 26, 2026