
Add withRetry to GpuGenerateExec and GpuTextBasedPartitionReader#13996

Merged
firestarman merged 26 commits into NVIDIA:main from thirtiseven:device_retry_oom_cases
Jan 27, 2026

Conversation

@thirtiseven (Collaborator)

@thirtiseven thirtiseven commented Dec 11, 2025

Contributes to #13672

Description

This PR covers 2 cases of device memory allocation with the retry framework, to prevent potential GPU OOM. They are the top cases found by running the integration tests with #13995 enabled, and cover 50% of the cases found.

Also added test cases for them.

They should be easy ones, so I combined them into one PR, but I'm happy to split them.
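The split-and-retry behavior this PR relies on can be sketched as follows. This is a simplified, hypothetical stand-in for the spark-rapids retry framework, not the real API: the real `withRetry` operates on `SpillableColumnarBatch` and returns a lazy iterator, while `Batch`, `FakeOOM`, and `splitInHalf` below are invented names for illustration.

```scala
// Hypothetical sketch, not the real spark-rapids API: on a simulated
// OOM the input is split in half and each piece is retried, so an
// oversized batch is processed as several smaller ones instead of
// failing the task.
final class FakeOOM extends RuntimeException("simulated gpu oom")

// Stand-in for a spillable batch: only the row count matters here.
final case class Batch(rows: Int)

object RetrySketch {
  // Mirror of the split step: halve the batch (fails if unsplittable).
  def splitInHalf(b: Batch): Seq[Batch] = {
    require(b.rows > 1, "cannot split a single-row batch")
    Seq(Batch(b.rows / 2), Batch(b.rows - b.rows / 2))
  }

  // Simplified retry loop: run the work; on OOM, split and recurse.
  // The real withRetry yields its results lazily through an iterator.
  def withRetry[T](input: Batch)(work: Batch => T): Seq[T] =
    try Seq(work(input))
    catch {
      case _: FakeOOM => splitInHalf(input).flatMap(b => withRetry(b)(work))
    }
}
```

For example, a worker that can only handle 100 rows at a time ends up processing a 300-row batch as four pieces of 75 rows each.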

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven self-assigned this Dec 11, 2025
@thirtiseven thirtiseven marked this pull request as ready for review December 11, 2025 10:02
@thirtiseven thirtiseven requested a review from Copilot December 11, 2025 10:03
@thirtiseven thirtiseven added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Dec 11, 2025
@greptile-apps (Contributor)

greptile-apps Bot commented Dec 11, 2025

Greptile Summary

This PR adds OOM retry protection to two GPU memory allocation hotspots: GpuGenerateExec and GpuTextBasedPartitionReader. The changes wrap memory allocation operations with the existing withRetry and withRetryNoSplit frameworks to handle GPU OOM conditions gracefully.

Key changes:

  • Refactored GpuGenerateExec.getSplits into GpuGenerateUtils.getSplitsWithRetryAndClose with withRetry protection that splits batches when OOM occurs
  • Changed GpuGenerateIterator to accept Iterator[Array[SpillableColumnarBatch]] instead of Seq[SpillableColumnarBatch] to support lazy evaluation from the retry iterator
  • Added GpuTextBasedPartitionReader.castToOutputTypesWithRetryAndClose wrapping castTableToDesiredTypes with withRetryNoSplit
  • Added GpuTextBasedPartitionReader.infer method to infer Spark types from cuDF columns for schema construction before retry
  • Both changes include comprehensive unit tests that inject OOM conditions to verify retry behavior

These changes contribute to issue #13672's goal of covering all memory allocation points with retry protection.
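The difference between the two wrappers named in the summary can be sketched like this: `withRetryNoSplit` retries the same input whole, which fits operations such as the output-type cast that cannot be split. This is a hypothetical stand-in, not the real API; in the plugin, the framework also frees or spills device memory between attempts.

```scala
// Stand-in for the no-split retry variant: the whole body is retried a
// bounded number of times on a simulated OOM. Names here are invented
// for illustration, not the spark-rapids classes.
final class RetryOOM extends RuntimeException("simulated gpu oom")

object NoSplitSketch {
  // Retry the body up to maxRetries extra times on OOM; the real
  // implementation would also trigger a device spill between attempts.
  def withRetryNoSplit[T](maxRetries: Int)(body: => T): T =
    try body
    catch {
      case _: RetryOOM if maxRetries > 0 =>
        withRetryNoSplit(maxRetries - 1)(body)
    }
}
```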

Confidence Score: 4/5

  • This PR is safe to merge with low risk
  • The implementation correctly applies the retry framework to protect against OOM conditions. The refactoring is well-structured and includes comprehensive tests. Previous thread comments about nullability have been addressed. Minor concern about the iterator lifecycle management, but the safeIteratorFromSeq pattern is used correctly.
  • No files require special attention

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Refactored getSplits to use withRetry framework for OOM protection, moved logic to GpuGenerateUtils for testability, and changed GpuGenerateIterator to handle iterator of arrays
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTextBasedPartitionReader.scala Added infer method for schema inference from cuDF columns and wrapped castTableToDesiredTypes with withRetryNoSplit for OOM protection
tests/src/test/scala/com/nvidia/spark/rapids/GpuGenerateSuite.scala Updated existing tests to handle new iterator structure and added new test for getSplitsWithRetry OOM handling
tests/src/test/scala/com/nvidia/spark/rapids/CsvScanRetrySuite.scala Added test for castToOutputTypesWithRetryAndClose OOM handling using OOM injection

Sequence Diagram

sequenceDiagram
    participant Caller
    participant GpuGenerateExec
    participant GpuGenerateUtils
    participant withRetry
    participant GpuGenerateIterator
    participant Generator

    Caller->>GpuGenerateExec: doGenerateAndClose(input)
    GpuGenerateExec->>GpuGenerateExec: projectAndCloseWithRetrySingleBatch()
    GpuGenerateExec->>GpuGenerateUtils: getSplitsWithRetryAndClose(projectedInput)
    GpuGenerateUtils->>GpuGenerateUtils: Create SpillableColumnarBatch
    GpuGenerateUtils->>withRetry: withRetry(batch, splitFunction)
    
    alt OOM occurs
        withRetry->>withRetry: Split batch in half
        withRetry->>GpuGenerateUtils: Retry with smaller batch
    end
    
    withRetry->>Generator: inputSplitIndices()
    Generator-->>withRetry: splitIndices
    withRetry->>GpuGenerateUtils: makeSplits(batch, indices)
    GpuGenerateUtils-->>GpuGenerateExec: Iterator[Array[SpillableColumnarBatch]]
    
    GpuGenerateExec->>GpuGenerateIterator: new(splits, generator)
    GpuGenerateIterator->>Caller: Iterator[ColumnarBatch]
    
    loop For each output batch
        Caller->>GpuGenerateIterator: hasNext/next()
        
        alt generateIter is empty
            GpuGenerateIterator->>GpuGenerateIterator: Get next bundle from inputs
            GpuGenerateIterator->>Generator: generate(safeIteratorFromSeq(bundle))
            Generator-->>GpuGenerateIterator: Iterator[ColumnarBatch]
        end
        
        GpuGenerateIterator->>Generator: Call generateIter.next()
        Generator-->>GpuGenerateIterator: ColumnarBatch
        GpuGenerateIterator-->>Caller: ColumnarBatch
    end


@greptile-apps greptile-apps Bot left a comment


6 files reviewed, no comments



Copilot AI left a comment


Pull request overview

This PR adds retry framework support to three uncovered cases of GPU device memory allocation to prevent potential GPU OOM errors: batched bounded window computation, CSV cast operations, and generate split calculations.

Key changes:

  • Wrapped batched bounded window computation with withRetryNoSplit to handle OOM during window operations
  • Added retry logic to CSV cast table operations with withRetryNoSplit
  • Enhanced generate getSplits with withRetry using adaptive target size splitting on OOM

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
sql-plugin/src/main/scala/com/nvidia/spark/rapids/window/GpuBatchedBoundedWindowExec.scala Wraps bounded window computation with withRetryNoSplit to handle OOM errors
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTextBasedPartitionReader.scala Adds retry logic to castTableToDesiredTypes for CSV parsing
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Implements retry with adaptive target size splitting for getSplits operation
tests/src/test/scala/com/nvidia/spark/rapids/WindowRetrySuite.scala Adds test cases for bounded window retry on GpuRetryOOM and GpuSplitAndRetryOOM
tests/src/test/scala/com/nvidia/spark/rapids/GpuGenerateSuite.scala Adds test for generate split-and-retry OOM handling
tests/src/test/scala/com/nvidia/spark/rapids/CsvScanRetrySuite.scala Adds test for CSV cast table retry on OOM


Comment thread tests/src/test/scala/com/nvidia/spark/rapids/WindowRetrySuite.scala Outdated
Comment thread tests/src/test/scala/com/nvidia/spark/rapids/WindowRetrySuite.scala Outdated
Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Outdated
@thirtiseven (Collaborator, Author)

build

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven (Collaborator, Author)

@greptile full review


@greptile-apps greptile-apps Bot left a comment


6 files reviewed, no comments



Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.



Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Outdated
Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@greptile-apps greptile-apps Bot left a comment


6 files reviewed, no comments


Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

@greptile-apps greptile-apps Bot left a comment


6 files reviewed, no comments


@thirtiseven thirtiseven requested a review from a team December 16, 2025 06:48
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.



Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Outdated
…xec.scala

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@thirtiseven thirtiseven requested a review from revans2 January 4, 2026 00:35
@firestarman (Collaborator)

build


@greptile-apps greptile-apps Bot left a comment


5 files reviewed, 1 comment


Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>

@greptile-apps greptile-apps Bot left a comment


5 files reviewed, 1 comment


…PartitionReader.scala

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

@greptile-apps greptile-apps Bot left a comment


5 files reviewed, 1 comment


@firestarman (Collaborator)

build

Comment thread tests/src/test/scala/com/nvidia/spark/rapids/GpuGenerateSuite.scala Outdated
@res-life (Collaborator)

LGTM, just nits.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman (Collaborator)

build

res-life
res-life previously approved these changes Jan 20, 2026

@res-life res-life left a comment


LGTM

@binmahone (Collaborator)

LGTM

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGenerateExec.scala Outdated
* A INT32 column in cuDF may be from either YearMonthIntervalType or IntegerType
* in Spark.
*/
def infer(col: ColumnView): DataType = col.getType match {
Collaborator


I would very much prefer it if we had a SpillableTable or SpillableCudfColumnArray instead of a SpillableColumanrBatch for this. The only reason we do a lot of SpillableColumnarBatch is because we know the Spark types almost everywhere. But internally they are only ever kept in CPU memory so we can recreate the same object as before. There is no reason to make up bogus Spark types so we can cache them in memory just so that we can pick them apart and then put them back together again afterwards. Or worse we end up using those fake types in places we should not.
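The ambiguity this comment points at can be shown with a toy mapping. `DType` and `SparkType` below are simplified stand-ins, not the real `ai.rapids.cudf` or Catalyst types: the point is that several Spark types share one cuDF storage type, so any inference has to guess, and that guess is the "bogus type" concern.

```scala
// Toy model of the lossy inference. These enums are stand-ins invented
// for illustration, not the real cuDF/Catalyst classes.
sealed trait DType
case object INT32 extends DType
case object FLOAT64 extends DType

sealed trait SparkType
case object IntegerType extends SparkType
case object YearMonthIntervalType extends SparkType // also stored as INT32
case object DoubleType extends SparkType

object InferSketch {
  // Picks IntegerType for INT32, silently dropping the possibility that
  // the column came from a YearMonthIntervalType -- which is why caching
  // these made-up types anywhere visible is risky.
  def infer(t: DType): SparkType = t match {
    case INT32   => IntegerType
    case FLOAT64 => DoubleType
  }
}
```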

Collaborator

@firestarman firestarman Jan 22, 2026


This requires more work than I thought. It takes some time. I am going to split it into a separate PR.

Collaborator


I agree that this is a lot of work. I am fine if we keep this as is for now and do the work in a follow on issue, so long as you make it very clear from the comments that this should not be used the way it is today anywhere else. That or you make a wrapper class to have a SpillableTable, that hides this as private methods with very clear comments about why you are doing it and what is the issue that is filed to clean it up/fix it.

Collaborator


It is ok, I am adding in the SpillableTable with the AI help.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Collaborator


This is now a problem. A kind of minor problem, because we want to get rid of the old legacy op time, but the NVTX range does not actually cover all of the cases for the split, because it can be computed lazily when the iterator calls next.

Collaborator

@firestarman firestarman Jan 23, 2026


Thx, and updated to calculate the op time at two separate places where project and split are really executed.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

@greptile-apps greptile-apps Bot left a comment


4 files reviewed, 1 comment


Comment thread tests/src/test/scala/com/nvidia/spark/rapids/CsvScanRetrySuite.scala Outdated
….scala

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

@revans2 revans2 left a comment


Just a nit at this point

getSplits(projectedInput, othersProjectList, new RapidsConf(conf).gpuTargetBatchSizeBytes)
}
val splits = GpuGenerateUtils.getSplitsWithRetryAndClose(projectedInput, generator,
othersProjectList.length, outer, new RapidsConf(conf).gpuTargetBatchSizeBytes, opTime)
Collaborator


new RapidsConf(conf) is not a cheap operation. At a minimum can we cache the result so that we don't take the hit for all batches?

Collaborator


It is always good to make the code better even if it is not related to the main goal of the current PR.
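The nit about `new RapidsConf(conf)` can be illustrated with a minimal sketch. The classes below are stand-ins (not the plugin's real `RapidsConf` or exec node): hoisting the expensive construction into a cached field means it runs once per operator instance instead of once per batch.

```scala
// Stand-ins for the plugin classes: RapidsConf construction is treated
// as expensive, so we count how many times it actually runs.
object RapidsConf { var constructions = 0 }

final class RapidsConf(raw: Map[String, String]) {
  RapidsConf.constructions += 1 // imagine heavy parsing/validation here
  val gpuTargetBatchSizeBytes: Long =
    raw.getOrElse("spark.rapids.sql.batchSizeBytes", "1073741824").toLong
}

final class GenerateExecSketch(raw: Map[String, String]) {
  // Cached once per operator instance, instead of constructing a new
  // RapidsConf inside the per-batch code path.
  private lazy val targetBatchSize: Long =
    new RapidsConf(raw).gpuTargetBatchSizeBytes

  def processBatch(): Long = targetBatchSize // consulted for every batch
}
```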

@nvauto (Collaborator)

nvauto commented Jan 26, 2026

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.

@firestarman (Collaborator)

build

@firestarman firestarman merged commit 1b8f203 into NVIDIA:main Jan 27, 2026
43 of 44 checks passed
firestarman added a commit that referenced this pull request Jan 28, 2026
This PR is filed to address some follow-ups for two recent PRs,
including:

- add in the `SpillableTable` to support spilling a cudf `Table`
directly, avoiding converting it to a `ColumnarBatch`. For now this is
designed for the text-based read when it does the schema casting, where
we do not need a `ColumnarBatch`. `SpillableTableHandle` has almost the
same logic as the `SpillableColumnarBatchHandle`; the only difference is
that the internal `dev` is a `Table`, not a `ColumnarBatch`. (comment:
#13996 (comment))
- add the comments to the `ceilDiv` in pre-split (comment:
#14190 (comment))
- change to get the target batch size once in `GpuGenerateExec`.
(comment:
#13996 (comment))

Other changes are in tests.
- Move all the common methods to a new trait named `SpillUnitTestBase`
to share with its children.
- Add unit tests for the new SpillableTable
- Update `SpillFrameworkSuite` to extend from the new
`SpillUnitTestBase`.

---------

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
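The `SpillableTable` idea from that follow-up can be sketched roughly as below. All names are simplified stand-ins for the real spill framework: the real handle moves device buffers to host memory under pressure and re-materializes a cudf `Table` on demand, without ever inventing Spark types for it.

```scala
// Stand-in sketch of a spillable table handle, invented for
// illustration: hold a "device" table, allow it to be spilled to host
// memory, and rebuild it when the caller asks for it again.
final case class Table(data: Vector[Int]) // stand-in for a cudf Table

final class SpillableTableSketch(private var dev: Option[Table]) {
  private var host: Option[Vector[Int]] = None

  // Move the device copy to host memory under memory pressure.
  def spill(): Unit = dev.foreach { t =>
    host = Some(t.data)
    dev = None
  }

  // Get the table back, re-materializing from host if it was spilled.
  def getTable: Table = dev.getOrElse {
    val t = Table(host.getOrElse(sys.error("table was closed")))
    dev = Some(t)
    t
  }
}
```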
