Is your feature request related to a problem? Please describe.
In Spark, if we are doing a distributed partial aggregation, we technically don't need to complete the aggregation within the task to get the correct answer; the shuffle and final aggregation will still produce it. The partial aggregation really is a performance optimization to try and reduce the size of the data before shuffling it. But there are cases where there is no point in trying to combine all of the data together, because it will not reduce the size of the shuffled data: enough of the keys within a task are unique that any combining we do is just wasted computation.
#10950 is an attempt to address this, but it has a serious flaw: it assumes that keys within a task that can be combined will appear close to each other in the task's data. I think a lot of the time this is true, but not all of the time. Unless the window we are looking at covers the entire task, we have no way to guarantee that combining more data would or would not be worthwhile. If we know the total size of the data we can make some statistical assumptions, but we don't always know how large the data will be until we have seen it. And if we wait until the end of processing before we make a decision, we will have already done a lot of processing and might have spilled a lot of data too.
#5199, rapidsai/cudf#10652, and NVIDIA/cuCollections#429 are issues to add a HyperLogLog++ implementation on the GPU, which would give us a very fast and efficient approximate count distinct operation. With that we should be able to make an informed decision at any point in time about whether the batches of data we have seen so far would combine into something smaller if we finished aggregating them.
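As a rough illustration of how such sketches could drive the decision, here is a minimal Scala sketch. `KeySketch`, `merge`, and `estimate` are illustrative stand-ins (a plain set, not a real HyperLogLog++ implementation); the GPU sketches from the issues above would take their place. The point is only that per-batch sketches can be merged cheaply to estimate the distinct key count of the combined data.

```scala
// Stand-in for a per-batch HyperLogLog++ sketch. A plain Set keeps the example
// runnable; the real implementation would merge HLL++ register buffers on the GPU.
final case class KeySketch(keys: Set[Long]) {
  def merge(other: KeySketch): KeySketch = KeySketch(keys ++ other.keys)
  def estimate: Long = keys.size.toLong
}

// Ratio of estimated distinct keys to pending rows across the batches seen so
// far. A value near 1.0 means combining would barely shrink the shuffle data;
// a small value means a merge aggregation should pay off.
def estimatedReductionRatio(perBatch: Seq[KeySketch], pendingRows: Long): Double = {
  val combined = perBatch.reduce(_ merge _)
  combined.estimate.toDouble / pendingRows.toDouble
}
```

For example, three 1,000-row intermediate batches whose merged sketch estimates roughly 1,200 distinct keys give a ratio of about 0.4, so combining them should cut the shuffled rows by roughly 60%.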
This does not solve all of the issues though.
In the common case we will only see a single input batch, or a small number of batches that likely combine together efficiently. We don't want to do extra computation unless we need to.
We also don't want to have to do a full first pass through all of the data before making some kind of decision. For very large inputs with little to no chance of combining results, we might generate a lot of memory pressure trying to keep everything in memory, which can lead to a lot of spilling.
And finally we need to think about the optimization that was put in to avoid memory pressure when there are more output columns from an aggregation than input columns to it. In that case we may sort the input data proactively instead of taking the risk of having to sort the intermediate data, which looks like it would be larger.
Describe the solution you'd like
I propose that we keep some things the same as today. If we know that the input data is sorted, we start to do a first pass aggregation like we do today and release the batches as they are produced.
We also keep the sort fallback and the heuristic for pre-sorting the data. The heuristic would be updated to do an approximate count distinct instead of a full count distinct though. I think those can/should be addressed separately as a way to possibly reduce memory pressure and spill.
With the current code we do a full pass through all of the input data and put the results for each aggregation into a queue. Once that pass is done, if there were multiple input batches, then we will try and combine them together through some merge aggregations. We will take groups of intermediate batches that add up to a size we know will fit in our GPU memory budget. We will concat them into a single batch, and then do the merge aggregation. This will happen in a loop until we either have a single batch or no two consecutive batches could be combined together. They have to be consecutive batches for first/last to work properly. If there are still multiple batches after this second merge phase, then we will sort the data by the grouping keys and finally do a merge aggregation pass before outputting the results. This can result in three full passes through all of the input data, all of which is kept in memory, and might be spilled.
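For reference, a simplified model of that second merge phase could look like the following sketch. `AggBatch` and `mergeAgg` are illustrative placeholders rather than the actual plugin classes, and the real code also has to deal with spillable batches and retries.

```scala
// Simplified model of the merge phase: repeatedly concat + merge-aggregate runs
// of *consecutive* intermediate batches that fit in the GPU memory budget
// (consecutive so that first/last aggregations stay correct), until either one
// batch remains or no two consecutive batches can be combined.
final case class AggBatch(sizeBytes: Long)

def mergePhase(input: Seq[AggBatch],
               budgetBytes: Long,
               mergeAgg: Seq[AggBatch] => AggBatch): Seq[AggBatch] = {
  var batches = input
  var madeProgress = true
  while (batches.size > 1 && madeProgress) {
    madeProgress = false
    val out = scala.collection.mutable.ArrayBuffer.empty[AggBatch]
    var group = List.empty[AggBatch]
    var groupBytes = 0L
    def flush(): Unit = {
      if (group.size > 1) { out += mergeAgg(group.reverse); madeProgress = true }
      else out ++= group
      group = Nil
      groupBytes = 0L
    }
    for (b <- batches) {
      if (group.nonEmpty && groupBytes + b.sizeBytes > budgetBytes) flush()
      group = b :: group
      groupBytes += b.sizeBytes
    }
    flush()
    batches = out.toSeq
  }
  batches // if more than one batch remains, fall back to sort + a final merge pass
}
```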
Instead I propose that we start the initial pass through the data like we do today, but instead of stopping only after the first pass is complete, we stop once we have enough data that it looks like it is worth trying to combine it. If that happens, then we combine the results and calculate a HyperLogLog++ sketch on the result. We also start to calculate a HyperLogLog++ sketch for each subsequent batch we see. With these we can quickly estimate how much of a size reduction we would see if we combined multiple batches together, by merging the sketches and estimating the unique count. If it looks like we can combine things and stay under our limit, then we keep trying to combine them and update the HyperLogLog++ buffers for the newly produced batches. If it looks like we cannot combine things and stay under our budget, then things start to be difficult, as we have a few choices (a rough sketch of the proposed flow follows the list of options below):
1. We could keep doing what we do today: sort/re-partition the data as needed and finish combining it, so the task produces fully aggregated output.
2. We could just output the intermediate data we have seen so far with no further processing, and then start over with aggregating the rest of the data.
3. We could finish doing an initial pass through all of the data, plus combining as needed, before we make a choice. That choice would be made using the HyperLogLog++ estimates, and then we would pick option 1 or 2.
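A rough sketch of the proposed first pass, reusing the hypothetical `KeySketch` from the earlier example. `PartialAgg`, `mergePending`, and the trigger/budget values are all made up for illustration; none of them are existing plugin APIs or configs.

```scala
// Hypothetical outline of the proposed flow. Each first-pass result carries its
// row count and a sketch of its grouping keys.
final case class PartialAgg(rows: Long, sketch: KeySketch)

def proposedFirstPass(input: Iterator[PartialAgg],           // already first-pass aggregated
                      mergePending: Seq[PartialAgg] => PartialAgg,
                      triggerRows: Long,                      // when to consider combining
                      budgetRows: Long): Seq[PartialAgg] = {  // max rows for one merged batch
  var pending = Vector.empty[PartialAgg]
  for (p <- input) {
    pending :+= p
    val pendingRows = pending.map(_.rows).sum
    if (pendingRows >= triggerRows) {
      // Merge the per-batch sketches to estimate how many rows a combined batch
      // would have, and only combine if it both shrinks and fits the budget.
      val estimatedRows = pending.map(_.sketch).reduce(_ merge _).estimate
      if (estimatedRows < pendingRows && estimatedRows <= budgetRows) {
        pending = Vector(mergePending(pending))
      }
      // Otherwise leave the batches as they are; the final choice between
      // options 1 and 2 is made once all of the input has been seen.
    }
  }
  pending
}
```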
The upside of option 1 is that we produce a full output like Spark on the CPU would. The downside is that this is mostly what we do today, so there was little point in calculating the HyperLogLog++ sketches if we go this route. It just might reduce the amount of memory pressure we are under in a few cases by more aggressively combining batches early instead of waiting until the first pass is complete.
The upside of option 2 is that we don't do a sort at all, which can be really expensive in terms of memory pressure and computation. Even with #10370, not doing something is going to be faster than doing something more efficiently. The downside is that there is an unknown probability that the size of the shuffle data will be larger.
The upside of option 3 is that we have complete, or nearly complete, information before we make a choice. The downside is that we have to have done a complete pass through the data, which can lead to increased memory pressure.
To balance these I would like to see us start with option 3, as the memory pressure is no worse than it is today, and it might even improve if we start to combine things more aggressively. If we know that there are no first or last aggregations, we could even try to combine batches out of order if the sketches indicate that would be worthwhile.
If the data indicates that we have a lot of things that could be combined together, based off of the ratio of approx_count_distinct vs the actual number of rows we have pending, then we sort/re-partition to combine. If the data indicates that there is little, if anything, to combine, then we just start to release the partial aggregations that we have done so far.
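A hedged sketch of that final decision, again using the illustrative `PartialAgg`/`KeySketch` helpers from above; the ratio threshold is a made-up tunable, not an existing config.

```scala
// Decide what to do with the pending first-pass batches once all of the input
// has been seen: if the approximate distinct-key count is a small enough
// fraction of the pending rows, sort/re-partition and finish combining
// (option 1); otherwise just release the partial aggregations as-is (option 2).
def finishTask(pending: Seq[PartialAgg],
               sortAndCombine: Seq[PartialAgg] => Seq[PartialAgg],
               ratioThreshold: Double = 0.75): Seq[PartialAgg] = {
  if (pending.isEmpty) return pending
  val pendingRows = pending.map(_.rows).sum.max(1L)
  val estimatedDistinct = pending.map(_.sketch).reduce(_ merge _).estimate
  if (estimatedDistinct.toDouble / pendingRows <= ratioThreshold) {
    sortAndCombine(pending) // combining looks like it will shrink the shuffle data
  } else {
    pending                 // mostly unique keys, so further combining is wasted work
  }
}
```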