
Add withRetry to GpuGenerateExec and GpuTextBasedPartitionReader#13996

Merged
firestarman merged 26 commits into NVIDIA:main from thirtiseven:device_retry_oom_cases
Jan 27, 2026

Conversation

@thirtiseven (Collaborator)

@thirtiseven thirtiseven commented Dec 11, 2025

Contributes to #13672

Description

This PR covers 2 cases of device memory allocation with the retry framework, to prevent potential GPU OOM. They are the top cases found by running the integration tests with #13995 enabled, and cover 50% of the cases found.

Also added test cases for them.

They should be easy ones, so I combined them into one PR, but I'm happy to split them.
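The split-and-retry behavior this PR relies on can be sketched as follows. This is a simplified, hypothetical stand-in for the spark-rapids retry framework, not the real API: the real `withRetry` operates on `SpillableColumnarBatch` and returns a lazy iterator, while `Batch`, `FakeOOM`, and `splitInHalf` below are invented names for illustration.

```scala
// Hypothetical sketch, not the real spark-rapids API: on a simulated
// OOM the input is split in half and each piece is retried, so an
// oversized batch is processed as several smaller ones instead of
// failing the task.
final class FakeOOM extends RuntimeException("simulated gpu oom")

// Stand-in for a spillable batch: only the row count matters here.
final case class Batch(rows: Int)

object RetrySketch {
  // Mirror of the split step: halve the batch (fails if unsplittable).
  def splitInHalf(b: Batch): Seq[Batch] = {
    require(b.rows > 1, "cannot split a single-row batch")
    Seq(Batch(b.rows / 2), Batch(b.rows - b.rows / 2))
  }

  // Simplified retry loop: run the work; on OOM, split and recurse.
  // The real withRetry yields its results lazily through an iterator.
  def withRetry[T](input: Batch)(work: Batch => T): Seq[T] =
    try Seq(work(input))
    catch {
      case _: FakeOOM => splitInHalf(input).flatMap(b => withRetry(b)(work))
    }
}
```

For example, a worker that can only handle 100 rows at a time ends up processing a 300-row batch as four pieces of 75 rows each.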

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven self-assigned this Dec 11, 2025
@thirtiseven thirtiseven marked this pull request as ready for review December 11, 2025 10:02
@thirtiseven thirtiseven requested a review from Copilot December 11, 2025 10:03
@thirtiseven thirtiseven added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Dec 11, 2025
@greptile-apps (Contributor)

greptile-apps Bot commented Dec 11, 2025

Greptile Summary

This PR adds OOM retry protection to two GPU memory allocation hotspots: GpuGenerateExec and GpuTextBasedPartitionReader. The changes wrap memory allocation operations with the existing withRetry and withRetryNoSplit frameworks to handle GPU OOM conditions gracefully.

Key changes:

  • Refactored GpuGenerateExec.getSplits into GpuGenerateUtils.getSplitsWithRetryAndClose with withRetry protection that splits batches when OOM occurs
  • Changed GpuGenerateIterator to accept Iterator[Array[SpillableColumnarBatch]] instead of Seq[SpillableColumnarBatch] to support lazy evaluation from the retry iterator
  • Added GpuTextBasedPartitionReader.castToOutputTypesWithRetryAndClose wrapping castTableToDesiredTypes with withRetryNoSplit
  • Added GpuTextBasedPartitionReader.infer method to infer Spark types from cuDF columns for schema construction before retry
  • Both changes include comprehensive unit tests that inject OOM conditions to verify retry behavior

These changes contribute to issue #13672's goal of covering all memory allocation points with retry protection.
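The difference between the two wrappers named in the summary can be sketched like this: `withRetryNoSplit` retries the same input whole, which fits operations such as the output-type cast that cannot be split. This is a hypothetical stand-in, not the real API; in the plugin, the framework also frees or spills device memory between attempts.

```scala
// Stand-in for the no-split retry variant: the whole body is retried a
// bounded number of times on a simulated OOM. Names here are invented
// for illustration, not the spark-rapids classes.
final class RetryOOM extends RuntimeException("simulated gpu oom")

object NoSplitSketch {
  // Retry the body up to maxRetries extra times on OOM; the real
  // implementation would also trigger a device spill between attempts.
  def withRetryNoSplit[T](maxRetries: Int)(body: => T): T =
    try body
    catch {
      case _: RetryOOM if maxRetries > 0 =>
        withRetryNoSplit(maxRetries - 1)(body)
    }
}
```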

Confidence Score: 4/5

  • This PR is safe to merge with low risk
  • The implementation correctly applies the retry framework to protect against OOM conditions. The refactoring is well-structured and includes comprehensive tests. Previous thread comments about nullability have been addressed. Minor concern about the iterator lifecycle management, but the safeIteratorFromSeq pattern is used correctly.
  • No files require special attention

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Refactored getSplits to use withRetry framework for OOM protection, moved logic to GpuGenerateUtils for testability, and changed GpuGenerateIterator to handle iterator of arrays
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTextBasedPartitionReader.scala Added infer method for schema inference from cuDF columns and wrapped castTableToDesiredTypes with withRetryNoSplit for OOM protection
tests/src/test/scala/com/nvidia/spark/rapids/GpuGenerateSuite.scala Updated existing tests to handle new iterator structure and added new test for getSplitsWithRetry OOM handling
tests/src/test/scala/com/nvidia/spark/rapids/CsvScanRetrySuite.scala Added test for castToOutputTypesWithRetryAndClose OOM handling using OOM injection

Sequence Diagram

sequenceDiagram
    participant Caller
    participant GpuGenerateExec
    participant GpuGenerateUtils
    participant withRetry
    participant GpuGenerateIterator
    participant Generator

    Caller->>GpuGenerateExec: doGenerateAndClose(input)
    GpuGenerateExec->>GpuGenerateExec: projectAndCloseWithRetrySingleBatch()
    GpuGenerateExec->>GpuGenerateUtils: getSplitsWithRetryAndClose(projectedInput)
    GpuGenerateUtils->>GpuGenerateUtils: Create SpillableColumnarBatch
    GpuGenerateUtils->>withRetry: withRetry(batch, splitFunction)
    
    alt OOM occurs
        withRetry->>withRetry: Split batch in half
        withRetry->>GpuGenerateUtils: Retry with smaller batch
    end
    
    withRetry->>Generator: inputSplitIndices()
    Generator-->>withRetry: splitIndices
    withRetry->>GpuGenerateUtils: makeSplits(batch, indices)
    GpuGenerateUtils-->>GpuGenerateExec: Iterator[Array[SpillableColumnarBatch]]
    
    GpuGenerateExec->>GpuGenerateIterator: new(splits, generator)
    GpuGenerateIterator->>Caller: Iterator[ColumnarBatch]
    
    loop For each output batch
        Caller->>GpuGenerateIterator: hasNext/next()
        
        alt generateIter is empty
            GpuGenerateIterator->>GpuGenerateIterator: Get next bundle from inputs
            GpuGenerateIterator->>Generator: generate(safeIteratorFromSeq(bundle))
            Generator-->>GpuGenerateIterator: Iterator[ColumnarBatch]
        end
        
        GpuGenerateIterator->>Generator: Call generateIter.next()
        Generator-->>GpuGenerateIterator: ColumnarBatch
        GpuGenerateIterator-->>Caller: ColumnarBatch
    end


@greptile-apps greptile-apps Bot left a comment


6 files reviewed, no comments



Copilot AI left a comment


Pull request overview

This PR adds retry framework support to three uncovered cases of GPU device memory allocation to prevent potential GPU OOM errors: batched bounded window computation, CSV cast operations, and generate split calculations.

Key changes:

  • Wrapped batched bounded window computation with withRetryNoSplit to handle OOM during window operations
  • Added retry logic to CSV cast table operations with withRetryNoSplit
  • Enhanced generate getSplits with withRetry using adaptive target size splitting on OOM

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
sql-plugin/src/main/scala/com/nvidia/spark/rapids/window/GpuBatchedBoundedWindowExec.scala Wraps bounded window computation with withRetryNoSplit to handle OOM errors
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTextBasedPartitionReader.scala Adds retry logic to castTableToDesiredTypes for CSV parsing
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Implements retry with adaptive target size splitting for getSplits operation
tests/src/test/scala/com/nvidia/spark/rapids/WindowRetrySuite.scala Adds test cases for bounded window retry on GpuRetryOOM and GpuSplitAndRetryOOM
tests/src/test/scala/com/nvidia/spark/rapids/GpuGenerateSuite.scala Adds test for generate split-and-retry OOM handling
tests/src/test/scala/com/nvidia/spark/rapids/CsvScanRetrySuite.scala Adds test for CSV cast table retry on OOM


Comment thread tests/src/test/scala/com/nvidia/spark/rapids/WindowRetrySuite.scala Outdated
Comment thread tests/src/test/scala/com/nvidia/spark/rapids/WindowRetrySuite.scala Outdated
Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Outdated
@thirtiseven (Collaborator, Author)

build

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven (Collaborator, Author)

@greptile full review


@greptile-apps greptile-apps Bot left a comment


6 files reviewed, no comments



Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.



Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Outdated
Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@greptile-apps greptile-apps Bot left a comment


6 files reviewed, no comments


Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

@greptile-apps greptile-apps Bot left a comment


6 files reviewed, no comments


@thirtiseven thirtiseven requested a review from a team December 16, 2025 06:48
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.



Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Outdated
…xec.scala

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@thirtiseven thirtiseven requested a review from revans2 January 4, 2026 00:35
@firestarman (Collaborator)

build


@greptile-apps greptile-apps Bot left a comment


5 files reviewed, 1 comment


Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>

@greptile-apps greptile-apps Bot left a comment


5 files reviewed, 1 comment


…PartitionReader.scala

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

@greptile-apps greptile-apps Bot left a comment


5 files reviewed, 1 comment


@firestarman (Collaborator)

build

Comment thread tests/src/test/scala/com/nvidia/spark/rapids/GpuGenerateSuite.scala Outdated
@res-life (Collaborator)

LGTM, just nits.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman (Collaborator)

build

res-life
res-life previously approved these changes Jan 20, 2026

@res-life res-life left a comment


LGTM

@binmahone (Collaborator)

LGTM

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Outdated
* A INT32 column in cuDF may be from either YearMonthIntervalType or IntegerType
* in Spark.
*/
def infer(col: ColumnView): DataType = col.getType match {
Collaborator


I would very much prefer it if we had a SpillableTable or SpillableCudfColumnArray instead of a SpillableColumanrBatch for this. The only reason we do a lot of SpillableColumnarBatch is because we know the Spark types almost everywhere. But internally they are only ever kept in CPU memory so we can recreate the same object as before. There is no reason to make up bogus Spark types so we can cache them in memory just so that we can pick them apart and then put them back together again afterwards. Or worse we end up using those fake types in places we should not.
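The ambiguity this comment points at can be shown with a toy mapping. `DType` and `SparkType` below are simplified stand-ins, not the real `ai.rapids.cudf` or Catalyst types: the point is that several Spark types share one cuDF storage type, so any inference has to guess, and that guess is the "bogus type" concern.

```scala
// Toy model of the lossy inference. These enums are stand-ins invented
// for illustration, not the real cuDF/Catalyst classes.
sealed trait DType
case object INT32 extends DType
case object FLOAT64 extends DType

sealed trait SparkType
case object IntegerType extends SparkType
case object YearMonthIntervalType extends SparkType // also stored as INT32
case object DoubleType extends SparkType

object InferSketch {
  // Picks IntegerType for INT32, silently dropping the possibility that
  // the column came from a YearMonthIntervalType -- which is why caching
  // these made-up types anywhere visible is risky.
  def infer(t: DType): SparkType = t match {
    case INT32   => IntegerType
    case FLOAT64 => DoubleType
  }
}
```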

Collaborator

@firestarman firestarman Jan 22, 2026


This requires more work than I thought. It takes some time. I am going to split it into a separate PR.

Collaborator


I agree that this is a lot of work. I am fine if we keep this as is for now and do the work in a follow on issue, so long as you make it very clear from the comments that this should not be used the way it is today anywhere else. That or you make a wrapper class to have a SpillableTable, that hides this as private methods with very clear comments about why you are doing it and what is the issue that is filed to clean it up/fix it.

Collaborator


It is ok, I am adding in the SpillableTable with the AI help.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Collaborator


This is now a problem. A kind of minor problem, because we want to get rid of the old legacy op time, but the NVTX range does not actually cover all of the cases for the split, because it can be computed lazily when the iterator calls next.

Collaborator

@firestarman firestarman Jan 23, 2026


Thx, and updated to calculate the op time at two separate places where project and split are really executed.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

@greptile-apps greptile-apps Bot left a comment


4 files reviewed, 1 comment


Comment thread tests/src/test/scala/com/nvidia/spark/rapids/CsvScanRetrySuite.scala Outdated
….scala

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

@revans2 revans2 left a comment


Just a nit at this point

getSplits(projectedInput, othersProjectList, new RapidsConf(conf).gpuTargetBatchSizeBytes)
}
val splits = GpuGenerateUtils.getSplitsWithRetryAndClose(projectedInput, generator,
othersProjectList.length, outer, new RapidsConf(conf).gpuTargetBatchSizeBytes, opTime)
Collaborator


new RapidsConf(conf) is not a cheap operation. At a minimum can we cache the result so that we don't take the hit for all batches?

Collaborator


It is always good to make the code better even if it is not related to the main goal of the current PR.
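The nit about `new RapidsConf(conf)` can be illustrated with a minimal sketch. The classes below are stand-ins (not the plugin's real `RapidsConf` or exec node): hoisting the expensive construction into a cached field means it runs once per operator instance instead of once per batch.

```scala
// Stand-ins for the plugin classes: RapidsConf construction is treated
// as expensive, so we count how many times it actually runs.
object RapidsConf { var constructions = 0 }

final class RapidsConf(raw: Map[String, String]) {
  RapidsConf.constructions += 1 // imagine heavy parsing/validation here
  val gpuTargetBatchSizeBytes: Long =
    raw.getOrElse("spark.rapids.sql.batchSizeBytes", "1073741824").toLong
}

final class GenerateExecSketch(raw: Map[String, String]) {
  // Cached once per operator instance, instead of constructing a new
  // RapidsConf inside the per-batch code path.
  private lazy val targetBatchSize: Long =
    new RapidsConf(raw).gpuTargetBatchSizeBytes

  def processBatch(): Long = targetBatchSize // consulted for every batch
}
```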

@nvauto (Collaborator)

nvauto commented Jan 26, 2026

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.

@firestarman (Collaborator)

build

@firestarman firestarman merged commit 1b8f203 into NVIDIA:main Jan 27, 2026
43 of 44 checks passed
firestarman added a commit that referenced this pull request Jan 28, 2026
This PR is filed to address some follow-ups for two recent PRs,
including:

- add in the `SpillableTable` to support spilling a cudf `Table`
directly, avoiding converting it to a `ColumnarBatch`. For now this is
designed for the text-based read when it does the schema casting, where
we do not need a `ColumnarBatch`. `SpillableTableHandle` has almost the
same logic as the `SpillableColumnarBatchHandle`; the only difference is
that the internal `dev` is a `Table`, not a `ColumnarBatch`. (comment:
#13996 (comment))
- add the comments to the `ceilDiv` in pre-split (comment:
#14190 (comment))
- change to get the target batch size once in `GpuGenerateExec`.
(comment:
#13996 (comment))

Other changes are in tests.
- Move all the common methods to a new trait named `SpillUnitTestBase`
to share with its children.
- Add unit tests for the new SpillableTable
- Update `SpillFrameworkSuite` to extend from the new
`SpillUnitTestBase`.

---------

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
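The `SpillableTable` idea from that follow-up can be sketched roughly as below. All names are simplified stand-ins for the real spill framework: the real handle moves device buffers to host memory under pressure and re-materializes a cudf `Table` on demand, without ever inventing Spark types for it.

```scala
// Stand-in sketch of a spillable table handle, invented for
// illustration: hold a "device" table, allow it to be spilled to host
// memory, and rebuild it when the caller asks for it again.
final case class Table(data: Vector[Int]) // stand-in for a cudf Table

final class SpillableTableSketch(private var dev: Option[Table]) {
  private var host: Option[Vector[Int]] = None

  // Move the device copy to host memory under memory pressure.
  def spill(): Unit = dev.foreach { t =>
    host = Some(t.data)
    dev = None
  }

  // Get the table back, re-materializing from host if it was spilled.
  def getTable: Table = dev.getOrElse {
    val t = Table(host.getOrElse(sys.error("table was closed")))
    dev = Some(t)
    t
  }
}
```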
