refactor: Simplify `approx_distinct` (-200 LoC) by 2010YOUY01 · Pull Request #22921 · apache/datafusion

2010YOUY01 · 2026-06-12T03:30:28Z

Which issue does this PR close?

Closes #.

Rationale for this change

This PR attempts to simplify the approx_distinct implementation. The existing complexity comes from the lack of a generic API to calculate the hash of a single array.elem(i), so we had to implement specializations for many different types like primitive/string/stringview, which bloated the code size.

This PR uses the existing create_hashes for batched hashing, which is applicable to all array types, and it removes 261 lines of code from approx_distinct.rs.
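To illustrate the idea, here is a minimal std-only sketch of why batched hashing removes the need for per-type code. It is not the DataFusion implementation: `hash_rows`, `DistinctSketch`, and `SketchState` are hypothetical names, a `HashSet` stands in for the HyperLogLog sketch, and `Option<T>` slices stand in for Arrow arrays.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{BuildHasher, BuildHasherDefault, Hash};

/// Deterministic hash state, used only for this sketch.
type SketchState = BuildHasherDefault<DefaultHasher>;

/// Hypothetical batched hasher: hashes every non-null row of a column
/// into a reusable buffer, independent of the element type.
fn hash_rows<T: Hash>(rows: &[Option<T>], state: &SketchState, buf: &mut Vec<u64>) {
    buf.clear();
    buf.extend(rows.iter().flatten().map(|v| state.hash_one(v)));
}

/// Toy accumulator: a HashSet of hashes stands in for the HLL sketch.
struct DistinctSketch {
    state: SketchState,
    seen: HashSet<u64>,
    buf: Vec<u64>,
}

impl DistinctSketch {
    fn new() -> Self {
        Self {
            state: SketchState::default(),
            seen: HashSet::new(),
            buf: Vec::new(),
        }
    }

    /// One generic code path instead of one specialization per type.
    fn update_batch<T: Hash>(&mut self, rows: &[Option<T>]) {
        hash_rows(rows, &self.state, &mut self.buf);
        self.seen.extend(self.buf.iter().copied());
    }

    /// Number of distinct hashes seen so far.
    fn estimate(&self) -> usize {
        self.seen.len()
    }
}
```

The accumulator only ever consumes a buffer of `u64` hashes, so the same `update_batch` serves integer and string columns alike. In this sketch, that is the property that lets the type-specific accumulators go away.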

Performance

Cargo bench result

cargo bench -p datafusion-functions-aggregate \
    --bench approx_distinct \
    -- --baseline main

Gnuplot not found, using plotters backend
Benchmarking approx_distinct i64 80% distinct: Collecting 100 samples in estimated 5.0135 s (884k ite
approx_distinct i64 80% distinct
                        time:   [5.6406 µs 5.6477 µs 5.6550 µs]
                        change: [−0.9680% −0.7111% −0.4639%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

Benchmarking approx_distinct utf8 short 80% distinct: Collecting 100 samples in estimated 5.0360 s (3
approx_distinct utf8 short 80% distinct
                        time:   [12.970 µs 12.977 µs 12.985 µs]
                        change: [+15.898% +16.116% +16.339%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high severe

Benchmarking approx_distinct utf8view short 80% distinct: Collecting 100 samples in estimated 5.0171
approx_distinct utf8view short 80% distinct
                        time:   [8.7402 µs 8.7455 µs 8.7511 µs]
                        change: [+22.516% +22.703% +22.893%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  5 (5.00%) high mild
  3 (3.00%) high severe

Benchmarking approx_distinct utf8 long 80% distinct: Collecting 100 samples in estimated 5.0120 s (26
approx_distinct utf8 long 80% distinct
                        time:   [19.060 µs 19.085 µs 19.108 µs]
                        change: [+9.9923% +10.224% +10.429%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmarking approx_distinct utf8view long 80% distinct: Collecting 100 samples in estimated 5.0800 s
approx_distinct utf8view long 80% distinct
                        time:   [21.281 µs 21.306 µs 21.335 µs]
                        change: [+1.8930% +2.0965% +2.3087%] (p = 0.00 < 0.05)
                        Performance has regressed.

Benchmarking approx_distinct i64 99% distinct: Collecting 100 samples in estimated 5.0037 s (884k ite
approx_distinct i64 99% distinct
                        time:   [5.6507 µs 5.6645 µs 5.6805 µs]
                        change: [+0.1104% +0.3620% +0.6143%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking approx_distinct utf8 short 99% distinct: Collecting 100 samples in estimated 5.0298 s (3
approx_distinct utf8 short 99% distinct
                        time:   [12.956 µs 12.968 µs 12.979 µs]
                        change: [+15.776% +16.044% +16.296%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) low mild
  2 (2.00%) high severe

Benchmarking approx_distinct utf8view short 99% distinct: Collecting 100 samples in estimated 5.0295
approx_distinct utf8view short 99% distinct
                        time:   [8.8005 µs 8.8054 µs 8.8106 µs]
                        change: [+22.620% +22.909% +23.202%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high severe

Benchmarking approx_distinct utf8 long 99% distinct: Collecting 100 samples in estimated 5.0467 s (26
approx_distinct utf8 long 99% distinct
                        time:   [19.134 µs 19.197 µs 19.309 µs]
                        change: [+9.0293% +9.4204% +9.8109%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high severe

Benchmarking approx_distinct utf8view long 99% distinct: Collecting 100 samples in estimated 5.0953 s
approx_distinct utf8view long 99% distinct
                        time:   [21.295 µs 21.332 µs 21.384 µs]
                        change: [+1.9350% +2.2141% +2.4823%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Benchmarking approx_distinct u8 bitmap: Collecting 100 samples in estimated 5.0022 s (4.2M iterations
approx_distinct u8 bitmap
                        time:   [1.1961 µs 1.1976 µs 1.1994 µs]
                        change: [−3.8254% −3.1554% −2.5920%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild

Benchmarking approx_distinct i8 bitmap: Collecting 100 samples in estimated 5.0058 s (4.2M iterations
approx_distinct i8 bitmap
                        time:   [1.2043 µs 1.2058 µs 1.2075 µs]
                        change: [−0.6865% −0.3102% −0.0076%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmarking approx_distinct u16 bitmap: Collecting 100 samples in estimated 5.0052 s (1.1M iteration
approx_distinct u16 bitmap
                        time:   [4.3272 µs 4.3392 µs 4.3521 µs]
                        change: [−1.5999% −1.1667% −0.7366%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild

Benchmarking approx_distinct i16 bitmap: Collecting 100 samples in estimated 5.0171 s (1.1M iteration
approx_distinct i16 bitmap
                        time:   [4.4383 µs 4.4431 µs 4.4479 µs]
                        change: [+2.4499% +2.8213% +3.1689%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild

Benchmarking approx_distinct_grouped/Int64 50000 groups: Collecting 10 samples in estimated 5.4914 s
approx_distinct_grouped/Int64 50000 groups
                        time:   [9.8851 ms 9.9005 ms 9.9293 ms]
                        change: [−3.5210% −2.9384% −2.4270%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking approx_distinct_grouped/Utf8 50000 groups: Collecting 10 samples in estimated 5.0165 s (
approx_distinct_grouped/Utf8 50000 groups
                        time:   [10.116 ms 10.148 ms 10.186 ms]
                        change: [−4.9237% −4.6468% −4.3582%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking approx_distinct_grouped/Utf8View 50000 groups: Collecting 10 samples in estimated 5.4867
approx_distinct_grouped/Utf8View 50000 groups
                        time:   [9.9450 ms 9.9498 ms 9.9556 ms]
                        change: [−3.4418% −3.0161% −2.5685%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

The results show that some cases get up to about 5% faster due to batched hashing, while some utf8 cases get slower (the worst one is about 22% slower).

I think it's still a good idea to accept the regression and simplify the code, for two reasons:

Amdahl's Law

If we make function X 20% faster, but function X only takes 1% of query time, then the complexity needed to win that performance might not be worth it. Specifically, the microbenchmark only measures the update_batch() function. This piece of code is highly vectorizable, and it is very unlikely to be significant in real queries.
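As a rough worked example of the argument above, the standard Amdahl's Law formula can be written as a small function. The inputs below (a 1% share of query time and a 1.2x local speedup) are the illustrative numbers from the paragraph, not measurements.

```rust
/// Amdahl's Law: overall speedup when a fraction `p` of the total
/// runtime is made `s` times faster and the rest is unchanged.
fn amdahl_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}
```

With `p = 0.01` and `s = 1.2`, `amdahl_speedup` returns about 1.0017, which is an end-to-end gain of roughly 0.17%.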

I tried to construct a query that is very heavy on update_batch, and still couldn't observe an end-to-end difference:

> select approx_distinct(v1)
from (
select arrow_cast(v1, 'Utf8View')
from generate_series(100000000)
as t1(v1)) as t_string(v1);

+------------------------------+
| approx_distinct(t_string.v1) |
+------------------------------+
| 99201889                     |
+------------------------------+
1 row(s) fetched.
Elapsed 0.139 seconds.

-- Runtime is almost the same on the PR vs. main

LLVM Optimization

For the slowest microbenchmark, I think the root cause is that LLVM can optimize the manually specialized code more easily.

The existing implementation has the following fast path:

datafusion/datafusion/functions-aggregate/src/approx_distinct.rs

Lines 254 to 261 in b8998c7

    
    if array.data_buffers().is_empty() {
        // Fast path: with no data buffers every value is inline, so they all
        // take the u128 path — no need to check the length per row.
        for (i, &view) in array.views().iter().enumerate() {
            if !array.is_null(i) {
                self.hll.add_hashed(HLL_HASH_STATE.hash_one(view));
            }
        }

The same optimization also exists in the common, simpler API create_hashes:

datafusion/datafusion/common/src/hash_utils.rs

Line 352 in b8998c7

if !HAS_BUFFERS || view_len <= 12 {

The existing implementation is likely still faster because its code is manually specialized, while create_hashes is more branchy. The specialized form makes it easier for LLVM to figure out how to optimize the loop, which accounts for the roughly 20% speedup.
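The difference in loop shape can be sketched with std-only Rust. This is an illustration under stated assumptions, not the code in approx_distinct.rs or hash_utils.rs: `hash_inline_only`, `hash_generic`, `inline_view`, and `SketchState` are hypothetical names, a single byte slice stands in for the data buffers, and whether LLVM actually optimizes the two shapes differently is the hypothesis above, not something this sketch measures.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{BuildHasher, BuildHasherDefault};

/// Deterministic hash state, used only for this sketch.
type SketchState = BuildHasherDefault<DefaultHasher>;

/// The low 32 bits of a 128-bit string view hold the value length.
fn view_len(view: u128) -> u32 {
    view as u32
}

/// Build an inline view (length <= 12): 4 length bytes, then the data.
fn inline_view(s: &str) -> u128 {
    assert!(s.len() <= 12);
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    bytes[4..4 + s.len()].copy_from_slice(s.as_bytes());
    u128::from_le_bytes(bytes)
}

/// Specialized shape: the caller has already established that every
/// value is inline, so the loop body has no per-row length check.
fn hash_inline_only(views: &[u128], state: &SketchState, out: &mut Vec<u64>) {
    out.clear();
    out.extend(views.iter().map(|&v| state.hash_one(v)));
}

/// Generic shape: one loop serves both cases and checks each row.
/// Long values are read from `buffer` at the offset stored in the
/// top 32 bits of the view (single-buffer simplification).
fn hash_generic<const HAS_BUFFERS: bool>(
    views: &[u128],
    buffer: &[u8],
    state: &SketchState,
    out: &mut Vec<u64>,
) {
    out.clear();
    for &v in views {
        let len = view_len(v) as usize;
        if !HAS_BUFFERS || len <= 12 {
            out.push(state.hash_one(v));
        } else {
            let offset = (v >> 96) as usize;
            out.push(state.hash_one(&buffer[offset..offset + len]));
        }
    }
}
```

When every value is inline, both functions produce the same hashes. The only difference is the extra branch inside the generic loop, which is the kind of overhead the microbenchmark would pick up.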

However, this kind of optimization can be applied endlessly and would introduce complexity everywhere, so I do not think it is worth preserving here.

What changes are included in this PR?

Extend create_hashes with a hash state that is optimized for statistical quality
Simplify approx_distinct with create_hashes

Are these changes tested?

Are there any user-facing changes?

github-actions · 2026-06-12T03:35:45Z

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details

     Cloning apache/main
    Building datafusion-common v54.0.0 (current)
       Built [  32.301s] (current)
     Parsing datafusion-common v54.0.0 (current)
      Parsed [   0.058s] (current)
    Building datafusion-common v54.0.0 (baseline)
       Built [  32.213s] (baseline)
     Parsing datafusion-common v54.0.0 (baseline)
      Parsed [   0.059s] (baseline)
    Checking datafusion-common v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.711s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure trait_method_requires_different_generic_type_params: trait method now requires a different number of generic type parameters ---

Description:
A trait method now requires a different number of generic type parameters than it used to. Calls or implementations of this trait method using the previous number of generic types will be broken.
        ref: https://doc.rust-lang.org/reference/items/generics.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/trait_method_requires_different_generic_type_params.ron

Failed in:
  HashValue::hash_one (0 -> 1 generic types) in /home/runner/work/datafusion/datafusion/datafusion/common/src/hash_utils.rs:208

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  66.624s] datafusion-common
    Building datafusion-functions-aggregate v54.0.0 (current)
       Built [  29.396s] (current)
     Parsing datafusion-functions-aggregate v54.0.0 (current)
      Parsed [   0.043s] (current)
    Building datafusion-functions-aggregate v54.0.0 (baseline)
       Built [  29.215s] (baseline)
     Parsing datafusion-functions-aggregate v54.0.0 (baseline)
      Parsed [   0.046s] (baseline)
    Checking datafusion-functions-aggregate v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.198s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  59.911s] datafusion-functions-aggregate

2010YOUY01 added 3 commits June 12, 2026 10:58

refactor hll hash state

cfb50f8

simplify approx_distinct accumulator

f3f742a

fix format

daed520

github-actions Bot added the common (Related to common crate) and functions (Changes to functions implementation) labels Jun 12, 2026

github-actions Bot added the auto detected api change (Auto detected API change) label Jun 12, 2026

fix ci

4475574

2010YOUY01 wants to merge 4 commits into apache:main from 2010YOUY01:cleanup-approx-distinct
