[FEA] Support for approx_count_distinct #10652

@andygrove

Is your feature request related to a problem? Please describe.
I would like to be able to implement a GPU version of Spark's approx_count_distinct function, which uses the HyperLogLog++ cardinality estimation algorithm.

cuDF does not appear to provide any features today that would allow me to do this.

Describe the solution you'd like
I would like cuDF to implement this capability and expose an API, likely similar to approx_percentile, with methods for both computing and merging the underlying data structure, whether that structure is based on HyperLogLog++ or some other algorithm.
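
To make the "compute" and "merge" halves of such an API concrete, here is a minimal CPU sketch in Python. It is an illustration only: it implements plain HyperLogLog with a linear-counting correction for small cardinalities, not the full HyperLogLog++ bias correction that Spark uses, and the names `HllSketch`, `add`, `merge`, and `estimate` are hypothetical rather than existing cuDF or Spark APIs. The hash function (BLAKE2b truncated to 64 bits) is likewise an assumption chosen for convenience.

```python
import hashlib
import math


class HllSketch:
    """Plain HyperLogLog sketch (hypothetical name, not a cuDF or Spark class)."""

    def __init__(self, precision=12):
        # The alpha constant used in estimate() is only valid for m >= 128.
        if not 7 <= precision <= 18:
            raise ValueError("precision must be between 7 and 18")
        self.p = precision
        self.m = 1 << precision          # number of registers
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(value):
        # Assumption: any well-mixed 64-bit hash works; BLAKE2b is used here.
        digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, value):
        """Compute step: fold one value into the sketch."""
        h = self._hash64(value)
        width = 64 - self.p
        idx = h >> width                 # top p bits select a register
        rest = h & ((1 << width) - 1)    # remaining bits feed the rank
        # Rank is the 1-based position of the leftmost 1 bit in `rest`.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        """Merge step: element-wise maximum of the two register arrays."""
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        merged = HllSketch(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        """Return the approximate number of distinct values added so far."""
        m = self.m
        alpha = 0.7213 / (1.0 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


# Two partial sketches over overlapping inputs, merged into one.
left, right = HllSketch(), HllSketch()
for v in range(0, 6000):
    left.add(v)
for v in range(4000, 10000):
    right.add(v)
combined = left.merge(right)
print(round(combined.estimate()))  # an estimate near 10000, not an exact count
```

Because merging is an element-wise maximum over fixed-size register arrays, partial sketches can be built independently per partition or per group and combined afterwards, which is the property that makes a separate merge method useful for a distributed engine.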

Describe alternatives you've considered
None

Additional context
None

Labels

Spark: Functionality that helps Spark RAPIDS
feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code.
