support withRetry with split for shuffle exchange exec base by zpuller · Pull Request #13975 · NVIDIA/spark-rapids

zpuller · 2025-12-08T17:29:15Z

Fixes #13951 (along with #14010)

Description

We convert the internal state of prepareBatchShuffleDependency in rddWithPartitionIds into a spliterator to support split retries. This happens one level above the call into the partitioner, which does the GPU Kudo serialization.
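As a rough illustration of the split-retry idea described here, the following is a minimal toy sketch. It does not use the plugin's actual `withRetry` API: `SplitAndRetryOOM`, `splitInHalfByRows`, and `withRetrySplit` are hypothetical stand-ins, and a "batch" is modeled as a plain `Vector` of rows.

```scala
// Toy model only: none of these names come from spark-rapids.
object SplitRetrySketch {
  // Stand-in for an out-of-memory signal that asks the caller to split the input.
  final class SplitAndRetryOOM extends RuntimeException("split and retry")

  // Split a "batch" (here just a Vector of rows) in half by row count.
  def splitInHalfByRows[A](batch: Vector[A]): Seq[Vector[A]] = {
    if (batch.size < 2) {
      throw new IllegalStateException("cannot split a batch with fewer than 2 rows")
    }
    val (left, right) = batch.splitAt(batch.size / 2)
    Seq(left, right)
  }

  // Run `work` on the input; on SplitAndRetryOOM, split the input and
  // retry each piece in order, so one input may yield several results.
  def withRetrySplit[A, B](input: Vector[A])(work: Vector[A] => B): Vector[B] = {
    try {
      Vector(work(input))
    } catch {
      case _: SplitAndRetryOOM =>
        splitInHalfByRows(input).toVector.flatMap(piece => withRetrySplit(piece)(work))
    }
  }

  def main(args: Array[String]): Unit = {
    // Pretend the "GPU" can only handle 2 rows at a time.
    val results = withRetrySplit(Vector(1, 2, 3, 4, 5)) { rows =>
      if (rows.size > 2) throw new SplitAndRetryOOM
      rows.sum
    }
    println(results) // Vector(3, 3, 9)
  }
}
```

The point of the sketch is that a single input batch can produce more than one result once a split happens, which is why the state one level above the partitioner has to be able to hold several results per input.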

Performance testing so far shows no significant change for low memory scenarios in NDS. I'll continue with perf testing while the PR is under review.

Checklists

This PR has added documentation for new or modified features or behaviors.
This PR has added new tests or modified existing tests to cover new code paths.
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller · 2025-12-08T18:00:51Z

build

zpuller · 2025-12-08T22:30:07Z

build

Signed-off-by: Zach Puller <zpuller@nvidia.com>

Signed-off-by: Zach Puller <zpuller@nvidia.com>

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller · 2025-12-10T17:00:50Z

build

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller · 2025-12-10T17:44:08Z

build

zpuller · 2025-12-10T18:19:00Z

build

Signed-off-by: Zach Puller <zpuller@nvidia.com>

sameerz · 2025-12-11T00:48:49Z

build

zpuller · 2025-12-11T17:21:13Z

build

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller · 2025-12-15T23:12:46Z

build

greptile-apps · 2025-12-15T23:20:32Z

Greptile Summary

Refactors GPU Kudo shuffle operations to support split retries at the shuffle exchange level for better memory management in low-memory GPU environments
Removes withRetryNoSplit retry mechanisms from individual partitioners (GpuRangePartitioner, GpuHashPartitioningBase, GpuRoundRobinPartitioning) and centralizes retry logic in GpuShuffleExchangeExecBase
Updates shuffle exchange iterator state management to handle multiple partition result arrays when batch splitting occurs, enabling OOM recovery through half-batch splitting
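The iterator state change summarized above can be pictured with a small toy sketch. It assumes each partitioning pass returns an array of (batch, partitionId) pairs and that a split produces several such arrays for one input. `FlatteningIter` and `PartitionedResult` are hypothetical names, not the ones used in GpuShuffleExchangeExecBase, and batches are modeled as strings.

```scala
// Toy model only: PartitionedResult and FlatteningIter are hypothetical names.
object PartitionedIterSketch {
  // One partitioning pass yields an array of (batch, partitionId) pairs.
  type PartitionedResult[B] = Array[(B, Int)]

  // Walks several result arrays (one per split piece) and emits
  // (partitionId, batch) pairs, the shape a shuffle writer consumes.
  final class FlatteningIter[B](results: Iterator[PartitionedResult[B]])
      extends Iterator[(Int, B)] {
    private var current: Iterator[(B, Int)] = Iterator.empty

    override def hasNext: Boolean = {
      // Advance to the next non-empty result array, if any.
      while (!current.hasNext && results.hasNext) {
        current = results.next().iterator
      }
      current.hasNext
    }

    override def next(): (Int, B) = {
      if (!hasNext) throw new NoSuchElementException("no more partitioned batches")
      val (batch, partitionId) = current.next()
      (partitionId, batch)
    }
  }

  def main(args: Array[String]): Unit = {
    // Without a split: one result array for the input batch.
    val whole = Iterator(Array(("b0-p0", 0), ("b0-p1", 1)))
    // With a split: two result arrays for the same input batch.
    val split = Iterator(
      Array(("b0a-p0", 0), ("b0a-p1", 1)),
      Array(("b0b-p0", 0), ("b0b-p1", 1)))
    println(new FlatteningIter(whole).toList)
    println(new FlatteningIter(split).toList)
  }
}
```

In this model a split doubles the number of pairs emitted for the affected input batch while leaving the partition ids unchanged.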

Important Files Changed

| Filename | Overview |
|---|---|
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuShuffleExchangeExecBase.scala | Refactored iterator state management to support split retries with `partitionedIter` for handling multiple result arrays from split operations |
| tests/src/test/scala/com/nvidia/spark/rapids/GpuKudoWritePartitioningSuite.scala | New comprehensive test suite to validate split retry functionality with OOM injection and data integrity verification |

Confidence score: 4/5

This PR is mostly safe to merge with some complexity concerns around state management changes
Score reflects thorough testing and solid architectural improvements, with minor deduction for complex iterator logic changes that require careful review
Pay close attention to the iterator state management changes in GpuShuffleExchangeExecBase.scala and the new test suite validation logic

Sequence Diagram

sequenceDiagram
    participant User
    participant GpuShuffleExchangeExecBase as "GpuShuffleExchangeExecBase"
    participant prepareBatchShuffleDependency as "prepareBatchShuffleDependency"
    participant RddWithPartitionIds as "rddWithPartitionIds"
    participant withRetry as "withRetry"
    participant GpuPartitioning as "GpuPartitioning"
    participant RmmSpark as "RmmSpark"

    User->>GpuShuffleExchangeExecBase: "execute shuffle"
    GpuShuffleExchangeExecBase->>prepareBatchShuffleDependency: "create shuffle dependency"
    prepareBatchShuffleDependency->>RddWithPartitionIds: "create partitioned RDD"
    
    RddWithPartitionIds->>RddWithPartitionIds: "iterate batches"
    RddWithPartitionIds->>withRetry: "withRetry(spillableBatch, splitSpillableInHalfByRows)"
    
    withRetry->>GpuPartitioning: "columnarEvalAny(batch)"
    
    alt Success Case
        GpuPartitioning-->>withRetry: "partitioned data"
        withRetry-->>RddWithPartitionIds: "Array[(ColumnarBatch, Int)]"
    else OOM Exception
        GpuPartitioning->>RmmSpark: "throw GpuSplitAndRetryOOM"
        RmmSpark-->>withRetry: "OOM exception"
        withRetry->>withRetry: "splitSpillableInHalfByRows(spillableBatch)"
        withRetry->>GpuPartitioning: "columnarEvalAny(splitBatch1)"
        GpuPartitioning-->>withRetry: "partitioned data 1"
        withRetry->>GpuPartitioning: "columnarEvalAny(splitBatch2)" 
        GpuPartitioning-->>withRetry: "partitioned data 2"
        withRetry-->>RddWithPartitionIds: "Array[(ColumnarBatch, Int)] (split results)"
    end
    
    RddWithPartitionIds-->>prepareBatchShuffleDependency: "Product2[Int, ColumnarBatch]"
    prepareBatchShuffleDependency-->>GpuShuffleExchangeExecBase: "ShuffleDependency"
    GpuShuffleExchangeExecBase-->>User: "shuffled data"

greptile-apps

6 files reviewed, no comments

abellina

I'd like us to add a bit to the test, unless I've missed it.

abellina · 2025-12-17T15:09:19Z

+        s"Expected at least one split retry, but saw $retryCount retries")
+
+      // Verify batch contents match expected data (even after split retry)
+      verifyBatchContents(allPartitionedBatches, totalRowsSeen, serializer,


I think we should verify that the number of partitioned arrays seen with the split is double the number seen without the split.

To me, trying to verify this behavior in an integration test that is not overloading the memory system is not going to work. We have to have a unit test of some kind, and we can use the RMM injection code to verify that we are doing the right thing.

Added a comment here at the site where we could be doing verification, without looking at the metrics: https://github.com/NVIDIA/spark-rapids/pull/13975/files#r2631405357

Signed-off-by: Zach Puller <zpuller@nvidia.com>

abellina

I don't really think we need a new metric: NUM_PARTITIONED_ARRAYS

We already have NUM_OUTPUT_BATCHES https://github.com/NVIDIA/spark-rapids/pull/13975/files#diff-2519047533f3504238e111c11b9d2a55903af1dfd89ea01be31233f15315dc01R445.

In this case we'll output more batches, which would be a way to check in the tests.

Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps

Additional Comments (1)

tests/src/test/scala/com/nvidia/spark/rapids/GpuKudoWritePartitioningSuite.scala, line 314 (link)

style: Redundant cast - intVal is already of type Integer

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

6 files reviewed, 1 comment

Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps

Additional Comments (1)

tests/src/test/scala/com/nvidia/spark/rapids/GpuKudoWritePartitioningSuite.scala, line 467-470 (link)

style: Verify that the test actually triggers the split retry path as intended. The test injects OOM on the first next() call and validates numNextCalls == 3 (one batch splits into 2, plus 1 unsplit batch). Check that the retry count assertion on line 476 consistently passes across different environments.
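The counting argument in this comment (one batch splits into 2, plus 1 unsplit batch, giving 3) can be checked with a toy sketch. It is not the suite's code and does not use the RMM injection API: `InjectedSplitOOM` and `run` are hypothetical, and the one-shot failure flag only imitates injecting an OOM on the first attempt.

```scala
// Toy model only: these names are illustrative and are not the suite's helpers.
object NextCallCountSketch {
  final class InjectedSplitOOM extends RuntimeException("injected")

  // Processes each input batch. When injection is enabled, the very first
  // attempt fails once and that batch is retried as two halves.
  def run(batches: Seq[Vector[Int]], injectOomOnFirst: Boolean): Seq[Vector[Int]] = {
    var pendingInjection = injectOomOnFirst
    def attempt(b: Vector[Int]): Vector[Int] = {
      if (pendingInjection) {
        pendingInjection = false
        throw new InjectedSplitOOM
      }
      b
    }
    batches.flatMap { b =>
      try Seq(attempt(b))
      catch {
        case _: InjectedSplitOOM =>
          val (left, right) = b.splitAt(b.size / 2)
          Seq(attempt(left), attempt(right))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(Vector(1, 2, 3, 4), Vector(5, 6))
    val noSplit = run(input, injectOomOnFirst = false)
    val withSplit = run(input, injectOomOnFirst = true)
    assert(noSplit.size == 2)
    // One batch splits into 2, plus 1 unsplit batch.
    assert(withSplit.size == 3)
    // The split must not lose or reorder rows.
    assert(withSplit.flatten == input.flatten)
    println(s"outputs without split: ${noSplit.size}, with split: ${withSplit.size}")
  }
}
```

Checking both the output count and the concatenated rows mirrors the two things the review asks for: evidence that the split path ran, and evidence that the data survived it.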

6 files reviewed, 1 comment

abellina

I would like us to improve the suite further

Signed-off-by: Zach Puller <zpuller@nvidia.com>

abellina

Thanks @zpuller

zpuller · 2025-12-19T20:08:19Z

build

zpuller · 2025-12-21T21:16:28Z

build

zpuller · 2025-12-22T05:23:21Z

build

Fixes #13951 (along with #13975)

### Description

Use `AutoCloseableTargetSize` to create a split policy which would halve the target size of an incoming batch when doing a split retry.

Performance testing so far shows no significant change for low memory scenarios in NDS. I'll continue with perf testing while the PR is under review.

### Checklists

- [ ] This PR has added documentation for new or modified features or behaviors.
- [x] This PR has added new tests or modified existing tests to cover new code paths.
  (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
- [x] Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

---------

Signed-off-by: Zach Puller <zpuller@nvidia.com>
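A split policy that halves a target size could look roughly like the toy sketch below. `TargetSize` is a hypothetical stand-in and not the real `AutoCloseableTargetSize`; the lower bound `minBytes` is an assumption added so the halving has a stopping point.

```scala
// Toy model only: TargetSize is hypothetical, not the plugin's AutoCloseableTargetSize.
object TargetSizeSplitSketch {
  // Target size in bytes, with an assumed lower bound below which we refuse to split.
  final case class TargetSize(targetBytes: Long, minBytes: Long)

  // Split policy: halve the target, failing once the result would drop below the minimum.
  def halve(t: TargetSize): TargetSize = {
    val next = t.targetBytes / 2
    if (next < t.minBytes) {
      throw new IllegalStateException(s"cannot split below ${t.minBytes} bytes")
    }
    t.copy(targetBytes = next)
  }

  def main(args: Array[String]): Unit = {
    // Two successive split retries starting from a 1024-byte target.
    val sizes = Iterator.iterate(TargetSize(1024L, 256L))(halve).take(3).map(_.targetBytes).toList
    println(sizes) // List(1024, 512, 256)
  }
}
```

Halving the target rather than the batch itself lets the retry shrink how much work is attempted at once without needing to know the batch's row layout.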

zpuller added 2 commits December 7, 2025 19:40

support withRetry with split for shuffle exchange exec base

bb27acc

Signed-off-by: Zach Puller <zpuller@nvidia.com>

format

4e26447

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller added 4 commits December 8, 2025 15:31

add basic gpu kudo write partitioning unit test

6f8eb29

Signed-off-by: Zach Puller <zpuller@nvidia.com>

exercise prepareBatchShuffleDependency in partitioning test

2425e3e

Signed-off-by: Zach Puller <zpuller@nvidia.com>

add split retry test

42cf146

Signed-off-by: Zach Puller <zpuller@nvidia.com>

fix test

abe1405

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller force-pushed the kudo_split branch from 51c32b6 to abe1405 Compare December 9, 2025 19:25

cleaning

504c3d8

Signed-off-by: Zach Puller <zpuller@nvidia.com>

remove nested withRetry

8165520

Signed-off-by: Zach Puller <zpuller@nvidia.com>

rm ShufflePartitionerRetrySuite since that behavior is removed

04a5181

Signed-off-by: Zach Puller <zpuller@nvidia.com>

Merge branch 'main' into kudo_split

95806d0

abellina reviewed Dec 11, 2025

Comment thread ...plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuShuffleExchangeExecBase.scala Outdated

zpuller mentioned this pull request Dec 15, 2025

support withRetry with split for GPU shuffle coalesce #14010

Merged

3 tasks

zpuller changed the title ~~[DRAFT] support withRetry with split for shuffle exchange exec base~~ support withRetry with split for shuffle exchange exec base Dec 15, 2025

zpuller marked this pull request as ready for review December 15, 2025 23:08

pr comment

ff9f44e

Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps Bot reviewed Dec 15, 2025

zpuller requested review from a team and abellina December 15, 2025 23:22

revans2 previously approved these changes Dec 17, 2025

abellina reviewed Dec 17, 2025

add extra validation on partition count to tests

7fcee85

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller dismissed revans2’s stale review via 7fcee85 December 17, 2025 18:52

zpuller requested a review from abellina December 17, 2025 18:52

abellina reviewed Dec 17, 2025

use num batches instead for verification, revert adding new metric

4bc0d97

Signed-off-by: Zach Puller <zpuller@nvidia.com>

abellina reviewed Dec 18, 2025

Comment thread tests/src/test/scala/com/nvidia/spark/rapids/GpuKudoWritePartitioningSuite.scala

use next calls instead of metric for validation

b27437d

Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps Bot reviewed Dec 18, 2025

add split batch verification logic

10092a8

Signed-off-by: Zach Puller <zpuller@nvidia.com>

greptile-apps Bot reviewed Dec 18, 2025

abellina reviewed Dec 19, 2025

Comment thread tests/src/test/scala/com/nvidia/spark/rapids/GpuKudoWritePartitioningSuite.scala Outdated

abellina reviewed Dec 19, 2025

Comment thread tests/src/test/scala/com/nvidia/spark/rapids/GpuKudoWritePartitioningSuite.scala Outdated

abellina requested changes Dec 19, 2025

zpuller added 4 commits December 19, 2025 10:15

generalize verifySplitRetryStructure

c84ae50

Signed-off-by: Zach Puller <zpuller@nvidia.com>

test for 2 partitions

42650b0

Signed-off-by: Zach Puller <zpuller@nvidia.com>

generalize test for different numPartitions

d7e243d

Signed-off-by: Zach Puller <zpuller@nvidia.com>

rm comment

fcaaa08

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller requested a review from abellina December 19, 2025 19:28

abellina approved these changes Dec 19, 2025

zpuller merged commit 1ab486e into NVIDIA:main Dec 22, 2025
44 checks passed

sameerz added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Jan 3, 2026

zpuller deleted the kudo_split branch January 5, 2026 17:49

zpuller mentioned this pull request Jan 12, 2026

[BUG] gpu kudo does not make its inputs spillable #13954

Closed
