refactor: Simplify approx_distinct (-200 LoC) #22921
Open
2010YOUY01 wants to merge 4 commits into
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).
Which issue does this PR close?
Rationale for this change
Attempt to simplify the `approx_distinct` implementation. The existing complexity comes from the lack of a generic API to calculate the hash of a single element `array.elem(i)`, so the implementation had to be specialized for many different types (primitive, string, string view), which bloated the code.
This PR uses the existing `create_hashes` for batched hashing, which applies to all array types, and removes 261 lines of code from `approx_distinct.rs`.
Performance
Cargo bench result
cargo bench -p datafusion-functions-aggregate \
  --bench approx_distinct \
  -- --baseline main
I think it is still a good idea to accept the regression and simplify the code, for the following reasons:
Amdahl's Law
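Amdahl's law gives the end-to-end speedup when only a fraction of the runtime gets faster. The following is a minimal sketch of the arithmetic behind this section. It assumes that "20% faster" means a local speedup factor of 1.2 and that the function accounts for 1% of query time; the name `amdahl_speedup` is illustrative and not part of DataFusion.

```rust
/// Overall speedup when a fraction `p` of the total runtime
/// is made `s` times faster (Amdahl's law).
/// Hypothetical helper for illustration only.
fn amdahl_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

/// End-to-end gain for the scenario discussed in this section:
/// 1% of query time, local speedup factor 1.2 (an assumed reading of "20% faster").
fn example_gain() -> f64 {
    amdahl_speedup(0.01, 1.2)
}
```

Under these assumptions the overall speedup is about 1.0017, so the end-to-end gain stays below 0.2%.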
If we make function X 20% faster, but function X takes only 1% of query time, then the added complexity may not be worth the performance win. Specifically, the microbenchmark measures only the `update_batch()` function. This code is highly vectorizable, and it is very unlikely to be significant in real queries.
I tried to construct a query that is very heavy on `update_batch`, and still could not observe an end-to-end difference:
LLVM Optimization
For the microbenchmark with the largest regression, I think the root cause is that LLVM can optimize the manually specialized code more easily.
The existing implementation has the following fast path:
datafusion/datafusion/functions-aggregate/src/approx_distinct.rs
Lines 254 to 261 in b8998c7
The same optimization also exists in the common, simpler API `create_hashes`:
datafusion/datafusion/common/src/hash_utils.rs
Line 352 in b8998c7
The existing implementation is still faster, likely because the code is manually specialized, while `create_hashes` is more branchy. The specialization makes it easier for LLVM to figure out how to optimize the code, and brings a 20% speedup.
However, this kind of optimization can be applied endlessly and would introduce complexity everywhere, so I do not think it is worth preserving here.
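To illustrate the trade-off described above, here is a minimal, std-only sketch. Everything in it is a hypothetical stand-in: `Column`, `hash_column`, `hash_i64_no_nulls`, and the `mix` hash are not DataFusion's `create_hashes` or Arrow's array types. The generic path handles every column type and null masks through one batched entry point, at the cost of extra branches. The specialized path is a tight loop for one type with no nulls, which a compiler can typically optimize more easily.

```rust
/// Hypothetical column type standing in for an Arrow array.
enum Column {
    Int64 { values: Vec<i64>, valid: Option<Vec<bool>> },
    Utf8 { values: Vec<String>, valid: Option<Vec<bool>> },
}

/// splitmix64-style mixer, used here only as a stand-in hash function.
fn mix(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// FNV-1a over the bytes, followed by `mix`.
fn hash_str(s: &str) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in s.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    mix(h)
}

/// A missing mask means every row is valid.
fn is_valid(valid: &Option<Vec<bool>>, i: usize) -> bool {
    valid.as_ref().map_or(true, |mask| mask[i])
}

/// Generic, batched path: one entry point for every column type.
/// Writes one hash per row into `out`; null rows become `None`.
fn hash_column(col: &Column, out: &mut Vec<Option<u64>>) {
    out.clear();
    match col {
        Column::Int64 { values, valid } => {
            for (i, v) in values.iter().enumerate() {
                out.push(is_valid(valid, i).then(|| mix(*v as u64)));
            }
        }
        Column::Utf8 { values, valid } => {
            for (i, v) in values.iter().enumerate() {
                out.push(is_valid(valid, i).then(|| hash_str(v)));
            }
        }
    }
}

/// Manually specialized path: a single type, no null checks, no dispatch.
fn hash_i64_no_nulls(values: &[i64], out: &mut Vec<u64>) {
    out.clear();
    out.extend(values.iter().map(|&v| mix(v as u64)));
}
```

Both paths produce the same hashes for a non-null `Int64` column. The difference is only in how much branching sits inside the loop, and the specialized version has to be repeated once per supported type.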
What changes are included in this PR?
- `create_hashes` with a hash state that is optimized for statistical quality
- `approx_distinct` with `create_hashes`
Are these changes tested?
Are there any user-facing changes?