Support ScalaUDAF run on GPU[databricks] #13932
base: main
Conversation
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
greptile-apps bot: 6 files reviewed, no comments
greptile-apps bot: 8 files reviewed, 1 comment
Resolved review thread (outdated) on sql-plugin/src/main/scala/org/apache/spark/sql/rapids/aggregate/udaf.scala
greptile-apps bot: 8 files reviewed, no comments
revans2 left a comment:
I did a pass through some of the code, but I am concerned about all of the special-case handling that we are adding for this. RapidsSimpleGroupByAggregation is very close to a CudfAggregate, and that was on purpose. There would be much less code if we could wrap the pre and post steps in expressions and make RapidsSimpleGroupByAggregation look and act a lot like a CudfAggregate. I am mostly curious whether there was a reason we didn't do it that way.
Review thread on this code:

withResource(aggTbl) { _ =>
  GpuColumnVector.from(aggTbl, postStepDataTypes.toArray)
// The output types of UDAF aggs can not be predicated, so need to infer
How does Spark do this for a Scala UDAF? Why are we not doing the same?
In our design, we have a postStep that allows the reduce/aggregate to produce data with an arbitrary schema. But Spark has no postStep, so the output of its aggregate must follow the aggregate buffer schema, just like the output of our postStep does.
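To make the postStep distinction concrete, here is a toy model in plain Scala (no cuDF; all names here are hypothetical, for illustration only): the raw aggregate output may have a narrower type than the declared aggregation buffer schema, and the postStep normalizes it.

```scala
// Toy model of why a postStep lets the raw aggregate output differ from
// the declared aggregation buffer schema. Plain Scala, no cuDF.
object PostStepSketch {
  // Declared buffer schema for an "average" UDAF: (sum: Double, count: Long).
  // The raw aggregate, however, may produce an Int sum for an Int input column.
  def rawAggregate(values: Seq[Int]): (Int, Long) =
    (values.sum, values.size.toLong)

  // postStep normalizes the raw output to the buffer schema, so downstream
  // stages (merge, final eval) always see the expected types.
  def postStep(raw: (Int, Long)): (Double, Long) =
    (raw._1.toDouble, raw._2)

  def main(args: Array[String]): Unit = {
    println(postStep(rawAggregate(Seq(1, 2, 3)))) // prints (6.0,3)
  }
}
```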
Review thread on this code:

// UDAF arguments
override final lazy val inputProjection: Seq[Expression] = children

override final lazy val initialValues: Seq[Expression] = {
Why are we not trying to wrap defaultValues as an expression? It looks like we are putting in a lot of code to special-case GpuAggregateFunction when we could reuse a lot of existing code if we could just wrap the user-supplied code in expressions.
I did, but greptile found a potential use-after-close issue.
Here is the old version.
override final lazy val initialValues: Seq[Expression] = {
  closeOnExcept(function.getDefaultValue) { udafDefValues =>
    require(udafDefValues.length == aggBufferTypes.length,
      s"The default values number (${udafDefValues.length}) is NOT equal to " +
      s"the aggregation buffers number (${aggBufferTypes.length})")
    udafDefValues.zip(aggBufferTypes).map { case (scalar, dt) =>
      GpuLiteral(scalar, dt)
    }
  }
}
Comment from greptile (greptile-apps bot, 18 hours ago):
> logic: Storing cuDF Scalar objects directly in GpuLiteral may cause a use-after-free issue. When generateEmptyReductionBatch calls GpuScalar(litVal, dt) where litVal is a cuDF Scalar, it wraps without incrementing the reference count, then closes it via withResource. The next access to initialValues would use an already-closed Scalar.
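The hazard greptile flagged can be sketched with a toy reference-counted scalar (plain Scala; `Scalar` below is only a stand-in for the cuDF class, not its real API): wrapping a scalar without bumping its reference count means the first `close()` invalidates every cached holder.

```scala
// Toy sketch of the use-after-close hazard. Plain Scala; this Scalar is a
// hypothetical stand-in for ai.rapids.cudf.Scalar's refcount semantics.
final class Scalar(val value: Int) {
  private var refCount = 1
  var closed = false
  def incRefCount(): Scalar = { refCount += 1; this }
  def close(): Unit = { refCount -= 1; if (refCount == 0) closed = true }
}

object UseAfterCloseSketch {
  def main(args: Array[String]): Unit = {
    val s = new Scalar(42)
    // Wrapping without incRefCount: wrapper and cached literal share one reference.
    val wrapped = s
    wrapped.close()          // e.g. withResource closes the shared reference
    assert(s.closed)         // cached initialValues now hold a dead Scalar

    // The safe pattern: bump the refcount before handing ownership away.
    val s2 = new Scalar(7)
    val wrapped2 = s2.incRefCount()
    wrapped2.close()
    assert(!s2.closed)       // the cached copy is still alive
  }
}
```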
Thanks for the review; this is a good question. The main reason is the number of arguments. The current aggregate framework in GpuAggregateExec is designed around a fixed-length input and output for each aggregate stage, which is fine for all the built-in aggregates since we know every detail of each of them in advance. One alternative would be to refactor the current aggregate framework to allow multiple inputs/outputs for each CudfAggregate, but I think that would be a big change. So instead I chose the solution in this PR, "appending" the UDAF support to the existing system. Anyway, I am happy to hear about any better solution that can reduce the change in GpuAggregateExec.
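A minimal sketch of the arity problem described above (plain Scala; names are hypothetical): built-in aggregates contribute a fixed number of columns per stage, while each UDAF contributes a run of columns whose length is only known from the user's buffer schema, so the helper has to slice the aggregated columns by cached per-aggregate lengths (in the spirit of the PR's cached `udafAggArgLens`).

```scala
// Toy model of variable-arity aggregate outputs. Plain Scala, no cuDF.
object VariableAritySketch {
  // Each aggregate owns a run of adjacent output columns; built-ins always
  // one, UDAFs a count taken from the user-defined buffer schema.
  def sliceByLens[T](cols: Seq[T], lens: Seq[Int]): Seq[Seq[T]] = {
    var rest = cols
    lens.map { n =>
      val (head, tail) = rest.splitAt(n)
      rest = tail
      head
    }
  }

  def main(args: Array[String]): Unit = {
    // c0: built-in sum (1 col), c1+c2: average UDAF buffer (sum, count),
    // c3: built-in count (1 col)
    val groups = sliceByLens(Seq("c0", "c1", "c2", "c3"), Seq(1, 2, 1))
    println(groups) // prints List(List(c0), List(c1, c2), List(c3))
  }
}
```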
@firestarman I'd like to take a look at this. I would love to find ways of reducing changes to the aggregate, unless we absolutely need them.
@abellina can you take a look at this? I tried my best to reduce the change in GPU Aggregate; I'd love to hear any suggestions to reduce it further.
Greptile Overview
Greptile Summary
This PR adds GPU acceleration support for Scala UDAFs (User-Defined Aggregate Functions) by introducing three new classes and integrating them into the existing aggregation pipeline.
Key Changes
- New UDAF framework classes in udaf.scala: UDAFAggregate wraps RapidsUDAFGroupByAggregation for GPU interaction, GpuUserDefinedAggregateFunction provides the common GPU UDAF implementation, and GpuScalaUDAF is the GPU version of ScalaUDAF
- GpuAggregateExec refactoring: splits aggregation processing into built-in and UDAF paths throughout the entire pipeline (preProcess, reduce/aggregate, postProcess), handles variable-length outputs from UDAF operations, and reorders columns to match Spark's expectations
- Test coverage: added a comprehensive test suite with an IntAverageUDAF example covering groupby, reduction, and empty dataset scenarios
- Documentation: updated supported operations and configuration documentation
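As a rough CPU-side model of what an IntAverageUDAF like the one in the new test suite computes (names below are illustrative; the real test runs through cuDF), an int average keeps a (sum, count) buffer with init/update/merge/eval stages, and the empty-dataset case falls out of a zero count:

```scala
// CPU-only model of an "int average" UDAF's lifecycle. Plain Scala.
object IntAverageModel {
  final case class Buf(sum: Long, count: Long)        // the (sum, count) agg buffer
  val init: Buf = Buf(0L, 0L)                          // initial values
  def update(b: Buf, v: Int): Buf = Buf(b.sum + v, b.count + 1)      // per-row update
  def merge(a: Buf, b: Buf): Buf = Buf(a.sum + b.sum, a.count + b.count) // partial merge
  def eval(b: Buf): Option[Double] =                   // final result; empty input -> None
    if (b.count == 0) None else Some(b.sum.toDouble / b.count)

  def main(args: Array[String]): Unit = {
    val part1 = Seq(1, 2).foldLeft(init)(update)       // one partition
    val part2 = Seq(3, 4).foldLeft(init)(update)       // another partition
    println(eval(merge(part1, part2)))                 // prints Some(2.5)
    println(eval(init))                                // prints None (empty dataset)
  }
}
```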
Performance Impact
No regression detected in NDS benchmarks. Performance improvement for UDAF operations shows ~6x speedup (110s CPU vs 18.8s GPU) compared to CPU execution.
Confidence Score: 4/5
- This PR is safe to merge with some minor considerations
- The implementation is well-structured with proper resource management (closeOnExcept patterns), comprehensive test coverage, and no performance regression. The code handles variable-length UDAF outputs correctly and maintains column ordering. Score reduced by one point due to the complex nature of the changes affecting critical aggregation paths and the relatively new UDAF framework integration that would benefit from additional testing in production-like scenarios.
- Pay close attention to GpuAggregateExec.scala: the UDAF integration touches critical aggregation logic with complex column reordering
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/aggregate/udaf.scala | 4/5 | New file introducing UDAFAggregate, GpuUserDefinedAggregateFunction, and GpuScalaUDAF classes for GPU UDAF support with type inference utilities |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuAggregateExec.scala | 4/5 | Major refactoring to integrate UDAF support throughout aggregation pipeline (preProcess, reduce, aggregate, postProcess), handles variable-length outputs |
| tests/src/test/scala/com/nvidia/spark/rapids/ScalaUDAFSuite.scala | 5/5 | New test suite with IntAverageUDAF implementation covering groupby, reduction, and empty dataset scenarios |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client as Spark Query
    participant GpuAggExec as GpuAggregateExec
    participant AggHelper as AggHelper
    participant UDAFAgg as UDAFAggregate
    participant UDAF as GpuScalaUDAF/RapidsUDAF
    participant cuDF as cuDF Native
    Note over Client,cuDF: Aggregation Pipeline with UDAF Support
    Client->>GpuAggExec: Execute aggregate query
    GpuAggExec->>AggHelper: preProcess(inputBatch)
    alt Has UDAF aggregates
        AggHelper->>AggHelper: preStepBound.project (built-in aggs)
        AggHelper->>UDAFAgg: preStepAndClose(args)
        UDAFAgg->>UDAF: preStep(numRows, args)
        UDAF->>cuDF: Transform columns (e.g., cast INT to LONG)
        cuDF-->>UDAF: Transformed columns
        UDAF-->>UDAFAgg: Array[ColumnVector]
        UDAFAgg-->>AggHelper: Array[GpuColumnVector]
        Note over AggHelper: Cache udafAggArgLens for next step
        AggHelper-->>GpuAggExec: Combined batch (built-in + UDAF cols)
    else No UDAF aggregates
        AggHelper->>AggHelper: preStepBound.project only
        AggHelper-->>GpuAggExec: Preprocessed batch
    end
    GpuAggExec->>AggHelper: aggregate/reduce(preprocessed)
    alt Group-by aggregation
        AggHelper->>cuDF: groupBy().aggregate(built-in aggs)
        loop For each UDAF
            AggHelper->>UDAFAgg: aggregate(inputIndices)
            UDAFAgg->>UDAF: aggregate(inputIndices)
            UDAF-->>UDAFAgg: GroupByAggregationOnColumn[]
        end
        AggHelper->>cuDF: groupBy().aggregate(all aggs)
        cuDF-->>AggHelper: Aggregated table
        Note over AggHelper: Cache udafPostStepArgLens
    else Reduction
        AggHelper->>cuDF: reductionAggregate(built-in aggs)
        loop For each UDAF
            AggHelper->>UDAFAgg: reduce(numRows, args)
            UDAFAgg->>UDAF: reduce(numRows, args)
            UDAF->>cuDF: sum(), count(), etc.
            cuDF-->>UDAF: Scalars
            UDAF-->>UDAFAgg: Array[Scalar]
            UDAFAgg-->>AggHelper: Array[GpuScalar]
        end
        Note over AggHelper: Cache udafPostStepArgLens
        AggHelper-->>GpuAggExec: Reduced batch
    end
    GpuAggExec->>AggHelper: postProcess(aggregated)
    alt Has UDAF aggregates
        AggHelper->>AggHelper: postStepBound.project (built-in)
        loop For each UDAF
            AggHelper->>UDAFAgg: postStepAndClose(args)
            UDAFAgg->>UDAF: postStep(numRows, args)
            UDAF-->>UDAFAgg: Array[ColumnVector]
            UDAFAgg-->>AggHelper: Array[GpuColumnVector]
        end
        AggHelper->>AggHelper: mergeWithOriginalOrderAndClose
        Note over AggHelper: Reorder columns to match Spark expectation
        AggHelper-->>GpuAggExec: Post-processed batch
    else No UDAF aggregates
        AggHelper->>AggHelper: postStepBound.project only
        AggHelper-->>GpuAggExec: Post-processed batch
    end
    alt Final/Complete mode
        GpuAggExec->>UDAF: resultEvalAndClose(args)
        UDAF->>cuDF: getResult (e.g., sum/count)
        cuDF-->>UDAF: Final result column
        UDAF-->>GpuAggExec: Result column
    end
    GpuAggExec-->>Client: Final result
```
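The column-reorder step noted in the diagram (mergeWithOriginalOrderAndClose) can be sketched in plain Scala; the function name and shape below are illustrative, not the plugin's actual signature. Built-in results and UDAF results are computed in separate runs and then interleaved back into declaration order:

```scala
// Toy model of reordering aggregate output columns back into the order
// Spark declared them. Plain Scala; names are hypothetical.
object ReorderSketch {
  // Built-in results were computed first and UDAF results appended after,
  // but Spark expects the output columns in declaration order.
  def mergeWithOriginalOrder[T](builtIn: Seq[T], udaf: Seq[T],
      isUdafAt: Seq[Boolean]): Seq[T] = {
    val bi = builtIn.iterator
    val ud = udaf.iterator
    isUdafAt.map(fromUdaf => if (fromUdaf) ud.next() else bi.next())
  }

  def main(args: Array[String]): Unit = {
    // Declaration order: sum (built-in), myUdaf, count (built-in)
    val out = mergeWithOriginalOrder(Seq("sum", "count"), Seq("myUdaf"),
      Seq(false, true, false))
    println(out) // prints List(sum, myUdaf, count)
  }
}
```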
Contributes to #13412
Rapids UDAF is designed to support executing a UDAF (User Defined Aggregate Function) in a columnar way so it can be accelerated by the GPU.
Complete support of RapidsUDAF covers too many things, and a single PR (#13450) is too large to review, so it is better to add the support piece by piece.
This PR is the second one (the first one is #13870); it adds the basic support to let ScalaUDAF run on GPU. It covers two main changes:
- New classes: UDAFAggregate to connect the RapidsUDAFGroupByAggregation with the GPU aggregate framework, GpuUserDefinedAggregateFunction to connect the RapidsUDAF with the GPU aggregate framework, and GpuScalaUDAF, the GPU version of ScalaUDAF.
- Refactoring of GpuAggregateExec to support UDAF handling across the whole aggregation process.

Perf Regression
No obvious perf regression is found according to the NDS runs as below: (in seconds)
Perf Improvement
The following performance numbers come from a local test (in seconds).
Test dataset schema:
Test dataset size:
More details: