Add maxValueCount to DocValuesSkipper metadata#15993
Conversation
jainankitk
left a comment
There was a problem hiding this comment.
I am wondering if this information needs to stored as part of DocValuesSkipper as this should be single value per field per segment right? Also, this PR - #15737 might be related.
Thanks for taking a look. Yes, I put it on |
|
@romseygeek would you be able to take a look at this? Thanks! |
msfroh
left a comment
There was a problem hiding this comment.
This looks reasonable to me.
My only concern is whether it addresses @romseygeek's sentence on #15794:
This is fragile, and not necessarily supported by Codecs.
This still depends on the individual DocValuesFormat implementation storing and returning this data.
I don't know how we can reasonably implement it in a DocValuesSkipper if we don't add it at the DocValuesFormat layer, though. (That might be a gap in my understanding, though.)
|
@msfroh yes right, this does lean on the DocValuesFormat to actually store the number. But docCount, minValue, and maxValue on the skipper already have that same dependency. The default maxValueCount() returns Integer.MAX_VALUE when the format doesn't know the exact count, so it's a safe fallback. It's just a conservative upper bound. I thought about FieldInfo, but that feels like schema metadata and this is really a per-segment stat. Sticking it on the iterator felt wrong too. That's for walking docs, not for segment-level stats. And computing it on demand is exactly the fragile thing @romseygeek was pointing out in #15794. Yeah, fields without a skipper are out of luck here. But those are usually not the numeric/ordinal fields the issue is about. For fields that do have a skipper, this kills the weird hack where we unwrap SortedNumericDocValues to see if it's secretly a singleton. Thanks! |
Right! From an understandability standpoint, I think this is a big win. (Also, I'm imagining some potential efficiency gains that one could eke out when they know that max value count is "low".) |
|
@jpountz would you be able to take a look at this? thanks! |
| if (version >= Lucene90DocValuesFormat.VERSION_SKIPPER_MAX_VALUE_COUNT) { | ||
| maxValueCount = meta.readInt(); | ||
| } else { | ||
| maxValueCount = docCount == 0 ? 0 : Integer.MAX_VALUE; |
There was a problem hiding this comment.
I think we can be a bit smarter here? eg, for Numeric/Sorted entries we know this value is 1, and for SortedNumeric/SortedSet we can check the relevant XXEntry and see if numValues == numDocsWithField and return 1 for those cases?
I think I'd prefer returning -1 to returning Integer.MAX_VALUE for cases where we don't know the exact value to make it clear that the data isn't available.
There was a problem hiding this comment.
makes sense, sure. done. Thanks!
cc3340c to
5e76ffa
Compare
|
|
||
| readFields(in, state.fieldInfos); | ||
|
|
||
| // For older segments without maxValueCount metadata, infer it from entry data |
There was a problem hiding this comment.
Let's pull this into a separate method and gate it behind a version check
5e76ffa to
e86f86e
Compare
|
Thanks for all the iterations @iprithv, this is looking great. There's a lint failure in there, can you fix that up? |
e86f86e to
634c4d9
Compare
sure, done. Thanks for reviewing this! |
Returns the maximum document cardinality for docs in a segment. For older indexes where this information is not available, returns -1. Singleton instances will return 1 even on older indexes.
|
Thanks @iprithv - I adjusted the CHANGES entry and milestone to be for 10.5 as I don't think this needs to wait for a major release. |
Port inferMaxValueCounts from apache/lucene#15993 to the TSDB codec so segments written before VERSION_SKIPPER_MAX_VALUE_COUNT return 1 rather than -1 when the field type or per field metadata proves the field is single valued. Applied in AbstractTSDBDocValuesProducer for ES95 and ES819, and in ES87TSDBDocValuesProducer for the legacy codec.
Port the consumer and producer changes from apache/lucene#15993 to the TSDB doc values codec so the new DocValuesSkipper.maxValueCount() API returns accurate values rather than the unknown sentinel. The newer ES95 and ES819 codecs persist and read maxValueCount via the shared abstract base classes, with a TSDB format version bump to VERSION_SKIPPER_MAX_VALUE_COUNT to gate adoption. The legacy ES87 codec, which throws on writes in production but is still exercised by tests through a writable consumer in the test source tree, receives parallel changes scoped to its own files and its own format version. Closes elastic#149715
Port inferMaxValueCounts from apache/lucene#15993 to the TSDB codec so segments written before VERSION_SKIPPER_MAX_VALUE_COUNT return 1 rather than -1 when the field type or per field metadata proves the field is single valued. Applied in AbstractTSDBDocValuesProducer for ES95 and ES819, and in ES87TSDBDocValuesProducer for the legacy codec.
…9778) * fix(tsdb): wire DocValuesSkipper.maxValueCount() Port the consumer and producer changes from apache/lucene#15993 to the TSDB doc values codec so the new DocValuesSkipper.maxValueCount() API returns accurate values rather than the unknown sentinel. The newer ES95 and ES819 codecs persist and read maxValueCount via the shared abstract base classes, with a TSDB format version bump to VERSION_SKIPPER_MAX_VALUE_COUNT to gate adoption. The legacy ES87 codec, which throws on writes in production but is still exercised by tests through a writable consumer in the test source tree, receives parallel changes scoped to its own files and its own format version. Closes #149715 * fix(tsdb): infer maxValueCount for old segments Port inferMaxValueCounts from apache/lucene#15993 to the TSDB codec so segments written before VERSION_SKIPPER_MAX_VALUE_COUNT return 1 rather than -1 when the field type or per field metadata proves the field is single valued. Applied in AbstractTSDBDocValuesProducer for ES95 and ES819, and in ES87TSDBDocValuesProducer for the legacy codec.
Description
resolves #15794
DocValuesSkipper#maxValueCount()so consumers can determine whether a doc values field is definitely single-valued without unwrapping iterators.maxValueCountmetadata in Lucene90 doc values skipper metadata for new segments.Compatibility
Older Lucene90 segments do not have this metadata, so
maxValueCount()returns a safe upper bound:0for empty fields andInteger.MAX_VALUEotherwise. Exact values become available once segments are rewritten or merged with the new format.