
Conversation

@thirtiseven (Collaborator) commented Dec 31, 2025

Fixes #14018

Description

#13842 added retry to the R2C conversion to prevent host OOM. Following on from that, this PR adds split-and-retry for GPU OOM when copying the converted results to the GPU.

This PR:

  • Introduced HostColumnarBatchWithRowRange, a wrapper for host columns that supports logical slicing without copying underlying host memory. This allows splitting a large host batch into smaller chunks for transfer.
  • Updated GpuRowToColumnarExec to use split and retry when a GPU OOM occurs during transfer.

Note that when we split a batch into two halves, the underlying host columns cannot be freed until both halves have been processed. So the first half only borrows the data, while ownership of the host columns passes to the second half. This ensures the host memory stays alive exactly as long as needed and is freed when the last split is closed.
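
To make this concrete, here is a minimal sketch of the idea using reference counting (the approach suggested later in review) instead of explicit ownership flags. The class name mirrors the PR, but the fields and bodies are illustrative, assume cudf's HostColumnVector.incRefCount()/close() semantics, and are not the actual implementation.

```scala
// Illustrative sketch only; the real HostColumnarBatchWithRowRange differs in detail.
import ai.rapids.cudf.HostColumnVector

class HostColumnarBatchWithRowRange(
    val hostColumns: Array[HostColumnVector],
    val startRow: Int,
    val numRows: Int) extends AutoCloseable {

  // Each wrapper shares the same host columns; bumping the ref count keeps the
  // backing host memory alive until the last wrapper is closed.
  hostColumns.foreach(_.incRefCount())

  override def close(): Unit = hostColumns.foreach(_.close())
}

object HostColumnarBatchWithRowRange {
  // Split into two logical halves without copying any host memory.
  def splitInHalf(b: HostColumnarBatchWithRowRange): Seq[HostColumnarBatchWithRowRange] = {
    require(b.numRows > 1, s"cannot split a batch with ${b.numRows} row(s)")
    val half = b.numRows / 2
    Seq(
      new HostColumnarBatchWithRowRange(b.hostColumns, b.startRow, half),
      new HostColumnarBatchWithRowRange(b.hostColumns, b.startRow + half, b.numRows - half))
  }
}
```

With this shape, closing the original wrapper after a split only drops one reference, so the shared host buffers are released exactly when the last half is closed.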

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven (Collaborator Author)

@greptile full review

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds split-retry functionality to Row-to-Columnar (R2C) conversion to prevent GPU out-of-memory (OOM) errors: when a GPU OOM occurs during the host-to-GPU data transfer, the batch is split in half and retried.

Key Changes

  • Introduced a new HostColumnarBatchWithRowRange class that wraps host columns with row range tracking and supports splitting
  • Modified RowToColumnarIterator to use split retry logic for GPU OOM scenarios, allowing single input batches to produce multiple output batches
  • Added buildHostColumnsWithoutOwnership() method to transfer ownership of host columns to the retry framework

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

  • sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostColumnarBatchWithRowRange.scala: New class that wraps host columns with row range support, implements splitting logic for GPU OOM retry, and handles slicing of various column types (LIST, STRUCT, STRING, fixed-width)
  • sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuRowToColumnarExec.scala: Updated RowToColumnarIterator to build host columns separately, use split retry for TargetSize goals, and maintain a pending batch iterator for split outputs
  • sql-plugin/src/main/java/com/nvidia/spark/rapids/GpuColumnVector.java: Added buildHostColumnsWithoutOwnership() method to transfer host column ownership to the caller
  • tests/src/test/scala/com/nvidia/spark/rapids/RowToColumnarIteratorRetrySuite.scala: Added comprehensive test coverage for GPU OOM split retry scenarios, including single batch requirement enforcement, multiple batch production, multiple consecutive splits, and the single row edge case


greptile-apps bot (Contributor) commented Dec 31, 2025

Greptile Summary

This PR adds GPU OOM handling with split-and-retry to Row-to-Columnar conversion, building on previous work that addressed Host OOM. The implementation introduces HostColumnarBatchWithRowRange, a wrapper that enables logical slicing of host column data without copying underlying memory, allowing large batches to be split into smaller chunks for GPU transfer when OOM occurs.

Key changes:

  • New HostColumnarBatchWithRowRange class supports logical row-range slicing with reference counting to manage memory lifecycle across splits
  • Modified RowToColumnarIterator to handle split outputs via pendingBatchIter, ensuring only one split transfers to GPU at a time
  • Integrated with existing RmmRapidsRetryIterator framework for split-and-retry logic
  • RequireSingleBatch mode correctly throws when splits are needed (cannot satisfy constraint)
  • Comprehensive test coverage including single/multiple splits, edge cases, and empty schemas

Implementation highlights:

  • Reference counting ensures host columns remain valid until all splits are processed
  • Complex slicing logic handles LIST, STRUCT, STRING, and fixed-width types with proper offset normalization
  • Split policy divides batches in half recursively until GPU transfer succeeds or reaches single-row limit
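
A hedged sketch of how these pieces might fit together (not the actual code in GpuRowToColumnarExec), using the RmmRapidsRetryIterator.withRetry and withRetryNoSplit entry points referenced in the sequence diagram below; the copy function is passed in as a parameter because its real implementation lives in the PR.

```scala
// Sketch only; the real wiring also handles metrics, batch-size goals, and the
// pending-batch iterator that hands out later splits on subsequent next() calls.
import com.nvidia.spark.rapids.RmmRapidsRetryIterator.{withRetry, withRetryNoSplit}
import org.apache.spark.sql.vectorized.ColumnarBatch

def copyToGpuWithRetry(
    hostBatch: HostColumnarBatchWithRowRange,
    requireSingleBatch: Boolean)
    (copyToGpu: HostColumnarBatchWithRowRange => ColumnarBatch): Iterator[ColumnarBatch] = {
  if (requireSingleBatch) {
    // Splitting would violate the single-batch requirement, so only retry in
    // place; a split-and-retry OOM surfaces as an error to the caller.
    Iterator.single(withRetryNoSplit(hostBatch)(copyToGpu))
  } else {
    // On a split-and-retry OOM the framework calls splitInHalf and retries each
    // half, returning the resulting batches lazily through the iterator.
    withRetry(hostBatch, HostColumnarBatchWithRowRange.splitInHalf _)(copyToGpu)
  }
}
```

The RequireSingleBatch branch can still retry once memory is freed, but a split-and-retry OOM is rethrown rather than split, matching the constraint described above.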

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk
  • The implementation properly handles GPU OOM with split-and-retry logic, uses correct reference counting for memory management, includes comprehensive test coverage for edge cases, and follows existing retry framework patterns. The code is well-structured with clear separation of concerns. Minor deduction for the complexity of manual reference counting, which requires careful review to ensure correctness.
  • Pay close attention to HostColumnarBatchWithRowRange.scala for reference counting correctness and sliceHostColumn implementation for nested types (LIST, STRUCT)

Important Files Changed

  • sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostColumnarBatchWithRowRange.scala: New class for logical slicing of host columns with row ranges; implements complex slicing logic for LIST, STRUCT, STRING, and fixed-width types; uses reference counting for memory management across splits
  • sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuRowToColumnarExec.scala: Modified RowToColumnarIterator to support GPU OOM split-and-retry; added pendingBatchIter to handle multiple output batches from splits; integrated with HostColumnarBatchWithRowRange for splitting host-to-GPU transfers
  • tests/src/test/scala/com/nvidia/spark/rapids/RowToColumnarIteratorRetrySuite.scala: Added comprehensive test coverage for GPU OOM split-and-retry scenarios including single split, multiple splits, single-row edge case, empty schema, and RequireSingleBatch constraint behavior

Sequence Diagram

sequenceDiagram
    participant RI as RowToColumnarIterator
    participant Builder as GpuColumnarBatchBuilder
    participant HB as HostColumnarBatchWithRowRange
    participant Retry as RmmRapidsRetryIterator
    participant GPU as GPU Memory
    
    Note over RI: next() called
    RI->>Builder: buildHostColumns()
    Builder-->>RI: hostColumns (refCount=1)
    RI->>HB: new (incRefCount)
    Note over HB: refCount=2
    RI->>RI: withResource closes hostColumns
    Note over HB: refCount=1
    
    alt RequireSingleBatch
        RI->>Retry: withRetryNoSplit(hostBatch)
        Retry->>HB: copyToGpu()
        HB->>GPU: Allocate & Transfer
        alt GPU OOM
            GPU-->>HB: GpuSplitAndRetryOOM
            Note over Retry: Cannot split with RequireSingleBatch
            Retry-->>RI: Throw exception
        else Success
            GPU-->>HB: ColumnarBatch
            HB-->>Retry: ColumnarBatch
            Retry-->>RI: ColumnarBatch
        end
    else TargetSize (allows splits)
        RI->>Retry: withRetry(hostBatch, splitPolicy)
        Retry->>HB: copyToGpu()
        HB->>GPU: Allocate & Transfer
        alt GPU OOM - Split Required
            GPU-->>HB: GpuSplitAndRetryOOM
            Retry->>HB: splitInHalf(batch)
            HB->>HB: withResource closes input batch
            Note over HB: refCount: 1 -> 0 (for old) + 2 (for splits) -> 2
            HB-->>Retry: Seq[split1, split2]
            Note over Retry: Process split1 first
            Retry->>HB: split1.copyToGpu()
            HB->>GPU: Allocate smaller chunk
            GPU-->>HB: ColumnarBatch1
            HB-->>Retry: ColumnarBatch1
            Retry-->>RI: Iterator with ColumnarBatch1
            RI->>RI: Store remaining splits in pendingBatchIter
            Note over RI: Next next() will process split2
        else Success
            GPU-->>HB: ColumnarBatch
            HB-->>Retry: ColumnarBatch
            Retry-->>RI: Iterator with ColumnarBatch
        end
    end
    
    RI->>RI: recordOutput(batch)
    RI-->>RI: Update metrics & refine targetRows

greptile-apps bot (Contributor) left a comment

Additional Comments (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostColumnarBatchWithRowRange.scala, line 176-184 (link)

    logic: Workaround condition may not cover all cases. If dataLen == 0 but nullCount > 0 (empty strings with some null values), the code goes to the else branch and creates a 0-length slice, which could trigger the same copyToDevice issue mentioned in the comment.

4 files reviewed, 1 comment


Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven (Collaborator Author)

build

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven (Collaborator Author)

build

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.



@sameerz added the bug (Something isn't working) label on Jan 4, 2026
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven self-assigned this on Jan 5, 2026
@thirtiseven marked this pull request as ready for review on January 5, 2026 at 07:28
greptile-apps bot (Contributor) left a comment

Additional Comments (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostColumnarBatchWithRowRange.scala, line 178-188 (link)

    style: Check that this empty string workaround handles all edge cases correctly. The 1-byte allocation when dataLen == 0 && nullCount == 0 works around copyToDevice behavior, but verify it handles: (1) all strings in slice are empty (not null), (2) mixed empty strings at different positions. This edge case could cause issues with certain data distributions.

4 files reviewed, 1 comment


@wjxiz1992 changed the title from "Prevent potiential GPU OOM in R2C with split retry" to "Prevent potential GPU OOM in R2C with split retry" on Jan 5, 2026
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
greptile-apps bot (Contributor) left a comment

Additional Comments (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostColumnarBatchWithRowRange.scala, line 180 (link)

    syntax: typo: "Existsing" should be "Existing"

4 files reviewed, 1 comment


Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
greptile-apps bot (Contributor) left a comment

Additional Comments (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostColumnarBatchWithRowRange.scala, line 178-186 (link)

    style: Consider if allocating a 1-byte dummy buffer for empty strings when dataLen == 0 but numRows > nullCount is necessary, or if cudf would accept null here to save allocations.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

4 files reviewed, 1 comment


* Build host columns and transfer ownership to the caller. The builder will not close
* the returned host columns.
*/
public HostColumnVector[] buildHostColumnsWithoutOwnership() {
Collaborator

It seems buildHostColumns already has the same behavior, so is this duplicated?

Collaborator Author

Yes this can be removed if we use the refCount approach.

hostColumns = buildHostColumns();
}
HostColumnVector[] ret = hostColumns;
hostColumns = null;
Collaborator

Setting hostColumns to null will trigger the call to "buildHostColumns" again when this "buildHostColumnsWithoutOwnership" API is called again, and then an exception will be thrown.

Collaborator

You can increase the refCount for buffer lifecycle management; then there is no need for things like WithoutOwnership.

val startRow: Int,
val numRows: Int,
val dataTypes: Array[DataType],
private var ownsHostColumns: Boolean) extends AutoCloseable {
Collaborator

This is a little confusing. I would suggest leveraging refCount instead of this boolean for buffer lifecycle management.

Collaborator Author

👍 Good idea, thanks. Updated to refCount.

* the last returned split. After calling this method, this instance no longer owns
* the columns and closing it will not free them.
*/
def splitInHalf(): Seq[HostColumnarBatchWithRowRange] = {
@firestarman (Collaborator) commented Jan 5, 2026

This would be better as a static method, i.e. moving it to the object HostColumnarBatchWithRowRange. Along with refCount support, you can then do the split like:

val hcb = XXX // hcb is a HostColumnarBatchWithRowRange (maybe we can give it a shorter name like "SlicedHostColumnarBatch")
withResource(hcb) { _ =>
  HostColumnarBatchWithRowRange.splitInHalf(hcb)
}

or even more simply:

withResource(hcb)(HostColumnarBatchWithRowRange.splitInHalf)

Collaborator Author

updated.

@firestarman (Collaborator)

@greptileai can you review it again?

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
greptile-apps bot (Contributor) left a comment

Additional Comments (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostColumnarBatchWithRowRange.scala, line 180-188 (link)

    style: Edge case handling for empty strings is correct: when dataLen == 0 but there are non-null rows (numRows > nullCount), allocates 1-byte buffer to satisfy cuDF requirements. However, consider validating that this 1-byte allocation doesn't cause issues if we're in a tight memory situation (though this is unlikely given the small size).

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

4 files reviewed, 1 comment



Labels

bug (Something isn't working)


Development

Successfully merging this pull request may close these issues.

[Follow up] Fix potiential GPU OOM in R2C

3 participants