Skip to content

Add maxValueCount to DocValuesSkipper metadata#15993

Merged
romseygeek merged 2 commits into
apache:mainfrom
iprithv:dv-skipper-max-count
May 20, 2026
Merged

Add maxValueCount to DocValuesSkipper metadata#15993
romseygeek merged 2 commits into
apache:mainfrom
iprithv:dv-skipper-max-count

Conversation

@iprithv

@iprithv iprithv commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Description

resolves #15794

  • Add DocValuesSkipper#maxValueCount() so consumers can determine whether a doc values field is definitely single-valued without unwrapping iterators.
  • Persist exact global maxValueCount metadata in Lucene90 doc values skipper metadata for new segments.
  • Add validation and test coverage for empty, single-valued, and multi-valued skipper fields.

Compatibility

Older Lucene90 segments do not have this metadata, so maxValueCount() returns a safe upper bound: 0 for empty fields and Integer.MAX_VALUE otherwise. Exact values become available once segments are rewritten or merged with the new format.

@jainankitk jainankitk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if this information needs to stored as part of DocValuesSkipper as this should be single value per field per segment right? Also, this PR - #15737 might be related.

@iprithv

iprithv commented May 1, 2026

Copy link
Copy Markdown
Contributor Author

I am wondering if this information needs to stored as part of DocValuesSkipper as this should be single value per field per segment right? Also, this PR - #15737 might be related.

Thanks for taking a look.

Yes, maxValueCount is a single global value per field per segment. This PR stores it that way: it is part of the Lucene90 DocValuesSkipperEntry field metadata, alongside the existing global minValue, maxValue, docCount, and maxDocId. It is not stored per skip interval/block.

I put it on DocValuesSkipper because this is where Lucene already exposes doc-values-derived segment-level metadata for fields that have a skip index. Putting it in FieldInfo would make it look like schema/static field metadata, but it is actually a computed per-segment doc values statistic and may be unavailable when no skipper is present.

@iprithv iprithv requested a review from jainankitk May 1, 2026 17:42
@iprithv

iprithv commented May 9, 2026

Copy link
Copy Markdown
Contributor Author

@romseygeek would you be able to take a look at this? Thanks!

@msfroh msfroh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me.

My only concern is whether it addresses @romseygeek's sentence on #15794:

This is fragile, and not necessarily supported by Codecs.

This still depends on the individual DocValuesFormat implementation storing and returning this data.

I don't know how we can reasonably implement it in a DocValuesSkipper if we don't add it at the DocValuesFormat layer, though. (That might be a gap in my understanding, though.)

@iprithv

iprithv commented May 14, 2026

Copy link
Copy Markdown
Contributor Author

@msfroh yes right, this does lean on the DocValuesFormat to actually store the number. But docCount, minValue, and maxValue on the skipper already have that same dependency. The default maxValueCount() returns Integer.MAX_VALUE when the format doesn't know the exact count, so it's a safe fallback. It's just a conservative upper bound.

I thought about FieldInfo, but that feels like schema metadata and this is really a per-segment stat. Sticking it on the iterator felt wrong too. That's for walking docs, not for segment-level stats. And computing it on demand is exactly the fragile thing @romseygeek was pointing out in #15794.

Yeah, fields without a skipper are out of luck here. But those are usually not the numeric/ordinal fields the issue is about. For fields that do have a skipper, this kills the weird hack where we unwrap SortedNumericDocValues to see if it's secretly a singleton. Thanks!

@msfroh

msfroh commented May 14, 2026

Copy link
Copy Markdown
Contributor

For fields that do have a skipper, this kills the weird hack where we unwrap SortedNumericDocValues to see if it's secretly a singleton.

Right! From an understandability standpoint, I think this is a big win. (Also, I'm imagining some potential efficiency gains that one could eke out when they know that max value count is "low".)

@iprithv

iprithv commented May 18, 2026

Copy link
Copy Markdown
Contributor Author

@jpountz would you be able to take a look at this? thanks!

if (version >= Lucene90DocValuesFormat.VERSION_SKIPPER_MAX_VALUE_COUNT) {
maxValueCount = meta.readInt();
} else {
maxValueCount = docCount == 0 ? 0 : Integer.MAX_VALUE;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can be a bit smarter here? eg, for Numeric/Sorted entries we know this value is 1, and for SortedNumeric/SortedSet we can check the relevant XXEntry and see if numValues == numDocsWithField and return 1 for those cases?

I think I'd prefer returning -1 to returning Integer.MAX_VALUE for cases where we don't know the exact value to make it clear that the data isn't available.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, sure. done. Thanks!

@iprithv iprithv force-pushed the dv-skipper-max-count branch 3 times, most recently from cc3340c to 5e76ffa Compare May 18, 2026 10:52
@iprithv iprithv requested a review from romseygeek May 18, 2026 10:52

readFields(in, state.fieldInfos);

// For older segments without maxValueCount metadata, infer it from entry data

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's pull this into a separate method and gate it behind a version check

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks!

@iprithv iprithv force-pushed the dv-skipper-max-count branch from 5e76ffa to e86f86e Compare May 18, 2026 16:27
@iprithv iprithv requested a review from romseygeek May 18, 2026 16:29
@romseygeek

Copy link
Copy Markdown
Contributor

Thanks for all the iterations @iprithv, this is looking great. There's a lint failure in there, can you fix that up?

@iprithv iprithv force-pushed the dv-skipper-max-count branch from e86f86e to 634c4d9 Compare May 20, 2026 00:18
@iprithv

iprithv commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for all the iterations @iprithv, this is looking great. There's a lint failure in there, can you fix that up?

sure, done. Thanks for reviewing this!

@romseygeek romseygeek merged commit 696f4e7 into apache:main May 20, 2026
13 checks passed
@romseygeek romseygeek modified the milestones: 11.0.0, 10.5.0 May 20, 2026
romseygeek added a commit that referenced this pull request May 20, 2026
romseygeek pushed a commit that referenced this pull request May 20, 2026
Returns the maximum document cardinality for docs in a segment.  For
older indexes where this information is not available, returns -1.  Singleton
instances will return 1 even on older indexes.
@romseygeek

Copy link
Copy Markdown
Contributor

Thanks @iprithv - I adjusted the CHANGES entry and milestone to be for 10.5 as I don't think this needs to wait for a major release.

salvatore-campagna added a commit to salvatore-campagna/elasticsearch that referenced this pull request May 26, 2026
Port inferMaxValueCounts from apache/lucene#15993 to the TSDB codec so
segments written before VERSION_SKIPPER_MAX_VALUE_COUNT return 1 rather
than -1 when the field type or per field metadata proves the field is
single valued. Applied in AbstractTSDBDocValuesProducer for ES95 and
ES819, and in ES87TSDBDocValuesProducer for the legacy codec.
salvatore-campagna added a commit to salvatore-campagna/elasticsearch that referenced this pull request May 26, 2026
Port the consumer and producer changes from apache/lucene#15993 to the
TSDB doc values codec so the new DocValuesSkipper.maxValueCount() API
returns accurate values rather than the unknown sentinel. The newer
ES95 and ES819 codecs persist and read maxValueCount via the shared
abstract base classes, with a TSDB format version bump to
VERSION_SKIPPER_MAX_VALUE_COUNT to gate adoption. The legacy ES87
codec, which throws on writes in production but is still exercised by
tests through a writable consumer in the test source tree, receives
parallel changes scoped to its own files and its own format version.

Closes elastic#149715
salvatore-campagna added a commit to salvatore-campagna/elasticsearch that referenced this pull request May 26, 2026
Port inferMaxValueCounts from apache/lucene#15993 to the TSDB codec so
segments written before VERSION_SKIPPER_MAX_VALUE_COUNT return 1 rather
than -1 when the field type or per field metadata proves the field is
single valued. Applied in AbstractTSDBDocValuesProducer for ES95 and
ES819, and in ES87TSDBDocValuesProducer for the legacy codec.
salvatore-campagna added a commit to elastic/elasticsearch that referenced this pull request May 26, 2026
…9778)

* fix(tsdb): wire DocValuesSkipper.maxValueCount()

Port the consumer and producer changes from apache/lucene#15993 to the
TSDB doc values codec so the new DocValuesSkipper.maxValueCount() API
returns accurate values rather than the unknown sentinel. The newer
ES95 and ES819 codecs persist and read maxValueCount via the shared
abstract base classes, with a TSDB format version bump to
VERSION_SKIPPER_MAX_VALUE_COUNT to gate adoption. The legacy ES87
codec, which throws on writes in production but is still exercised by
tests through a writable consumer in the test source tree, receives
parallel changes scoped to its own files and its own format version.

Closes #149715

* fix(tsdb): infer maxValueCount for old segments

Port inferMaxValueCounts from apache/lucene#15993 to the TSDB codec so
segments written before VERSION_SKIPPER_MAX_VALUE_COUNT return 1 rather
than -1 when the field type or per field metadata proves the field is
single valued. Applied in AbstractTSDBDocValuesProducer for ES95 and
ES819, and in ES87TSDBDocValuesProducer for the legacy codec.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add maxValueCount to DocValuesSkipper metadata

4 participants