Use the doc-values skip index to skip per-doc value lookups in LongRangeFacetCutter#16268
Use the doc-values skip index to skip per-doc value lookups in LongRangeFacetCutter#16268slow-J wants to merge 3 commits into
Conversation
03d7d2a to
066c419
Compare
|
I reran benchmarks, this time correctly using localrun, and updated the results in #16268 (comment) |
2e7144b to
0c72d5f
Compare
epotyom
left a comment
There was a problem hiding this comment.
Nice change! One suggestion below
| for (int level = 0; level < skipper.numLevels(); ++level) { | ||
| int totalDocsAtLevel = skipper.maxDocID(level) - skipper.minDocID(level) + 1; | ||
| if (skipper.docCount(level) != totalDocsAtLevel) { | ||
| // Some docs at this level have no value, so we can't resolve the whole block at once. |
There was a problem hiding this comment.
I think skipper can stil improve performance for this case, as we can still cache ordinal, it is just that in this case we have to always call longValues.advanceExact(doc). If it returns true - we return cached ordinal (and avoid reading long value as well as binary search elementary interval), otherwise return false
There was a problem hiding this comment.
Thanks for the review Egor! Good point!
I'll try to address these 2 comments and run new benchmarks.
There was a problem hiding this comment.
Took me some time to publish the changes as I managed to initially introduce a regression into range facets without skipper, I done a small refactor while fixing that.
I am getting much better performance after implementing your suggestions, I will update the benchmark results in the top level comment.
Edit: updated benchmark results.
There was a problem hiding this comment.
Hmm looking at the new benchmark results, there is an improvement in the ID tasks, which do not have a doc-values skip index. This is due to a change in the latest commit.
id is a single-valued field, but in main, fromLongField never unwraps to a single-valued source, always picking the multi-valued leaf cutter even when the field is single-valued.
We now route single-valued segments to the single-valued cutter instead. The new create(String field, …) keeps the field name, which lets createLeafCutter inspect each segment during search and pick the right cutter.
I have kept this in this PR but its slightly increasing the scope.
@epotyom let me know what you think about this.
There was a problem hiding this comment.
Interesting, nice catch!
This is due to a change in the latest commit.
Does the latest commit also include the interval-tracking rewind change? If so, could you please run benchmarks for the unwrapping change only?
Unwrapping adds a little bit of complexity, but if it improves performance, I think we should keep it.
There was a problem hiding this comment.
Does the latest commit also include the interval-tracking rewind change?
Yes, it has all three changes: non-dense fast path, rewind reuse, and unwrapping.
Unwrapping adds a little bit of complexity, but if it improves performance, I think we should keep it.
I'll setup and run a benchmark now just to see the perf diff due to the unwrapping.
There was a problem hiding this comment.
Ran: python3 -u src/python/localrun.py -s rangeFacetsWikimediumAll -b lucene_baseline -c lucene_candidate -iterations 30 -warmups 20 2>&1 | tee "$BASE/run9-onlyunwrapping-timing.txt"
Heres the result.
Latency (ms) — aggregated across all iterations
| Task | P50 B | P50 C | Diff | P90 B | P90 C | Diff | P99 B | P99 C | Diff | P999 B | P999 C | Diff | P100 B | P100 C | Diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BrowseIDOvlpRangeFacets | 946.928 | 820.170 | -13.4% | 1279.722 | 1178.839 | -7.9% | 1509.599 | 3489.425 | +131.1% | 1579.256 | 10588.392 | +570.5% | 1647.232 | 10662.196 | +547.3% |
| BrowseIDRangeFacets | 410.635 | 296.512 | -27.8% | 630.174 | 670.131 | +6.3% | 869.641 | 866.369 | -0.4% | 931.130 | 8681.360 | +832.3% | 1008.106 | 9240.540 | +816.6% |
| BrowseLastModOvlpRangeFacets | 381.116 | 365.262 | -4.2% | 577.681 | 639.843 | +10.8% | 866.191 | 2046.696 | +136.3% | 1028.218 | 10214.331 | +893.4% | 1029.600 | 10234.017 | +894.0% |
| BrowseLastModRangeFacets | 324.172 | 317.631 | -2.0% | 526.130 | 634.818 | +20.7% | 808.836 | 884.852 | +9.4% | 851.734 | 9413.995 | +1005.3% | 887.375 | 9912.493 | +1017.1% |
| MedTermIDOvlpRangeFacets | 213.936 | 181.524 | -15.1% | 433.905 | 437.942 | +0.9% | 603.100 | 702.917 | +16.6% | 679.574 | 838.447 | +23.4% | 761.983 | 848.642 | +11.4% |
| MedTermIDRangeFacets | 211.334 | 171.214 | -19.0% | 519.706 | 555.351 | +6.9% | 735.139 | 713.797 | -2.9% | 834.724 | 836.369 | +0.2% | 840.486 | 843.231 | +0.3% |
| MedTermLastModOvlpRangeFacets | 176.353 | 173.086 | -1.9% | 486.851 | 537.847 | +10.5% | 712.630 | 719.706 | +1.0% | 830.068 | 832.742 | +0.3% | 842.260 | 840.165 | -0.2% |
| MedTermLastModRangeFacets | 188.926 | 182.942 | -3.2% | 487.783 | 553.020 | +13.4% | 731.373 | 718.764 | -1.7% | 830.363 | 832.185 | +0.2% | 841.503 | 834.076 | -0.9% |
QPS
| Task | QPS baseline | StdDev | QPS modified | StdDev | Pct diff | p-value |
|---|---|---|---|---|---|---|
| BrowseLastModRangeFacets | 3.26 | (5.8%) | 3.34 | (8.1%) | 2.6% (-10% - 17%) | 0.155 |
| BrowseLastModOvlpRangeFacets | 2.75 | (5.1%) | 2.85 | (6.2%) | 3.6% (-7% - 15%) | 0.015 |
| MedTermLastModRangeFacets | 5.58 | (7.9%) | 5.80 | (8.6%) | 3.8% (-11% - 22%) | 0.072 |
| MedTermLastModOvlpRangeFacets | 5.89 | (6.9%) | 6.13 | (8.2%) | 4.1% (-10% - 20%) | 0.035 |
| MedTermIDOvlpRangeFacets | 4.94 | (6.3%) | 5.68 | (6.6%) | 15.0% (1% - 29%) | 0.000 |
| MedTermIDRangeFacets | 5.19 | (9.9%) | 6.17 | (7.1%) | 18.7% (1% - 39%) | 0.000 |
| BrowseIDOvlpRangeFacets | 1.12 | (7.4%) | 1.33 | (11.9%) | 19.1% (0% - 41%) | 0.000 |
| BrowseIDRangeFacets | 2.52 | (4.5%) | 3.60 | (12.0%) | 42.7% (25% - 61%) | 0.000 |
Mainly impacts the ID tasks which do not have a skipper.
But it does seem to cause a worrying latency regression at high percentile latency (p90 and onward).
Hmm, @epotyom what do you think? Since it is not related to the skipper change, I am partial towards removing the unwrapping and retesting performance.
1065433 to
7db2833
Compare
Resolves #16249
Implementation heavily inspired by HistogramCollector.java.
Range faceting (in the sandbox module -
LongRangeFacetCutter) currently reads the doc-values value for every matching document and binary-searches it into an elementary interval. When the faceted field is single-valued, we can use a doc-values skip index. For a dense skip block whose min and max values fall into the same elementary interval, every document in that block maps to that interval, allowing us to skip the per-doc value lookup and binary search.Limitation - applies to single-valued, long fields only.
Benchmark (luceneutil)
I used my branch of https://github.com/slow-J/luceneutil/tree/github-16249-range-facet-bench which cherry picked 2 of @epotyom 's commits (mainly mikemccand/luceneutil#582 which adds range-facet support)
Setup:
runlocal.py,wikimediumall(33.3M docs), index-sorted bylastMod_skipperwithaddDVSkippers=true. baseline = main, candidate = this change, both DURING_COLLECTION, sothe only difference is this optimization. 30 JVM iterations.
Command:
python3 -u src/python/localrun.py -s rangeFacetsWikimediumAll -b lucene_baseline -c lucene_candidate -iterations 30 -warmups 20 2>&1 | tee "$BASE/run-timing7.txt"Edit: benchmark re-ran after the changes for Egors first 2 comments.
QPS
Caveat1: ID tasks have no skip index, so they are effectively noise. (See benchmark skipper setup (I did not fork anything here - LineFileDocs.java#L420)
Latency (ms) — aggregated across all iterations
Note: the id tasks speed up despite having no skip index, this due to a routing change which is part of this PR. single-valued segments now use the single-valued cutter instead of always falling to the multi-valued one.