
Conversation

Collaborator

@thirtiseven thirtiseven commented Dec 11, 2025

Contributes to #13672
Depends on NVIDIA/spark-rapids-jni#4061

Description

This PR adds a debug option to check whether memory allocations are covered by the retry framework, so we can find code that performs a host/device memory allocation that is not covered by the retry framework, which can potentially cause OOM.

The option is off by default and is intended only for developers. We can run it from time to time to find uncovered cases, or set up a CI job to run it periodically.

Here is the Python script to convert the CSV to a flamegraph: https://gist.github.com/thirtiseven/f72e14c1a860dc40a75f1351c3616a68

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
…framework

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven added the reliability Features to improve reliability or bugs that severely impact the reliability of the plugin label Dec 11, 2025
@thirtiseven thirtiseven marked this pull request as draft December 11, 2025 08:44
Contributor

greptile-apps bot commented Dec 11, 2025

Greptile Summary

This PR introduces a debug-only feature to track memory allocations not covered by the retry framework. When enabled via SPARK_RAPIDS_RETRY_COVERAGE_TRACKING=true, the system monitors both host and device memory allocations and logs uncovered cases (allocations occurring outside withRetry/withRetryNoSplit blocks) to a CSV file for analysis.

Key changes:

  • Added AllocationRetryCoverageTracker - new utility that captures stack traces for uncovered allocations
  • Modified RetryStateTracker to track retry block depth using thread-local counters
  • Hooked into allocation paths: HostAlloc.tryAllocInternal for host memory and DeviceMemoryEventHandler.onAllocated for device memory
  • Enabled RMM debug mode conditionally when tracking is active to receive allocation callbacks
  • Added test script support for the new environment variable

The feature is disabled by default and only activates when the environment variable is explicitly set, making it suitable for periodic CI checks or developer debugging without impacting production performance.
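To make the mechanism concrete, below is a minimal Scala sketch of the kind of check described above. The environment variable and the covered/uncovered distinction come from this summary; the object and method names, the thread-local depth stand-in, and the CSV details are illustrative assumptions rather than the PR's exact code.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}

// Illustrative sketch only; the PR's AllocationRetryCoverageTracker may differ.
object CoverageCheckSketch {
  private val enabled: Boolean =
    sys.env.get("SPARK_RAPIDS_RETRY_COVERAGE_TRACKING").contains("true")

  // Hypothetical stand-in for RetryStateTracker's depth query.
  private val retryDepth = new ThreadLocal[Int] {
    override def initialValue(): Int = 0
  }
  def isInRetryBlock: Boolean = retryDepth.get() > 0

  // Called from an allocation path with kind = "host" or "device".
  def checkAllocation(kind: String): Unit = {
    if (enabled && !isInRetryBlock) {
      // Capture the current stack and append one CSV row: kind,call_stack
      val stack = Thread.currentThread().getStackTrace
        .map(e => s"${e.getClassName}.${e.getMethodName}")
        .mkString(";")
      val out = Paths.get(System.getProperty("java.io.tmpdir"), "uncovered_allocations.csv")
      Files.write(out, s"$kind,$stack\n".getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND)
    }
  }
}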

Confidence Score: 4/5

  • This PR is safe to merge with low risk - it's a debug-only feature that's disabled by default
  • The implementation is well-designed with proper thread safety, opt-in activation, and minimal performance impact when disabled. The code correctly handles nested retry blocks with depth tracking and uses appropriate synchronization for file I/O. The only concern is the lack of tests for the new functionality, though this is acceptable for a debug-only tool.
  • No files require special attention - all changes are straightforward and well-documented

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala New debug-only utility for tracking memory allocations not covered by retry framework
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala Added retry block depth tracking and moved cleanup to finally block

Sequence Diagram

sequenceDiagram
    participant App as Application Code
    participant Retry as RmmRapidsRetryIterator
    participant Tracker as RetryStateTracker
    participant Host as HostAlloc
    participant Device as DeviceMemoryEventHandler
    participant Coverage as AllocationRetryCoverageTracker
    participant File as CSV Output
    
    Note over App,File: When SPARK_RAPIDS_RETRY_COVERAGE_TRACKING=true
    
    App->>Retry: withRetry { ... }
    activate Retry
    Retry->>Tracker: enterRetryBlock()
    Tracker->>Tracker: increment depth counter
    
    Retry->>App: execute user code
    activate App
    
    alt Host Memory Allocation
        App->>Host: allocate host memory
        Host->>Coverage: checkHostAllocation()
        Coverage->>Tracker: isInRetryBlock?
        alt Covered (depth > 0)
            Tracker-->>Coverage: true
            Coverage-->>Host: OK (covered)
        else Uncovered (depth == 0)
            Tracker-->>Coverage: false
            Coverage->>Coverage: capture stack trace
            Coverage->>File: write to CSV
            Coverage->>Coverage: log warning
        end
    else Device Memory Allocation
        App->>Device: allocate device memory
        Device->>Coverage: checkDeviceAllocation()
        Coverage->>Tracker: isInRetryBlock?
        alt Covered (depth > 0)
            Tracker-->>Coverage: true
            Coverage-->>Device: OK (covered)
        else Uncovered (depth == 0)
            Tracker-->>Coverage: false
            Coverage->>Coverage: capture stack trace
            Coverage->>File: write to CSV
            Coverage->>Coverage: log warning
        end
    end
    
    deactivate App
    Retry->>Tracker: exitRetryBlock()
    Tracker->>Tracker: decrement depth counter
    deactivate Retry
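The enter/exit bracketing in the diagram amounts to a per-thread depth counter. Below is a minimal Scala sketch under that assumption; the names mirror the diagram rather than the PR's actual RetryStateTracker, and the try/finally shows why the counter stays balanced even when user code throws.

// Illustrative sketch of the enter/exit protocol shown in the diagram; the PR's
// actual RetryStateTracker may differ in details.
object RetryDepthSketch {
  private val depth = new ThreadLocal[Int] {
    override def initialValue(): Int = 0
  }

  def enterRetryBlock(): Unit = depth.set(depth.get() + 1)
  def exitRetryBlock(): Unit = depth.set(depth.get() - 1)
  def isInRetryBlock: Boolean = depth.get() > 0

  // How a retry wrapper can bracket user code so nested blocks stay balanced:
  // the finally clause guarantees the counter is decremented even on exceptions.
  def withRetrySketch[T](body: => T): T = {
    enterRetryBlock()
    try {
      body
    } finally {
      exitRetryBlock()
    }
  }
}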

Contributor

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven changed the title [Do not review] Add a debug option to check if memory allocation is covered by retry framework Add a debug option to check if memory allocation is covered by retry framework Dec 12, 2025

@thirtiseven thirtiseven self-assigned this Dec 12, 2025
thirtiseven added a commit to NVIDIA/spark-rapids-jni that referenced this pull request Dec 16, 2025
Contributes to NVIDIA/spark-rapids#13672

For debugging, I'd like to expose an `enableDebug` parameter in
`RmmSpark.setEventHandler` to check if every memory allocation is
covered by withRetry in spark-rapids. See
NVIDIA/spark-rapids#13995


Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven marked this pull request as ready for review December 16, 2025 09:52
Contributor

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@thirtiseven thirtiseven requested a review from a team December 17, 2025 10:12
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala, line 122-127 (link)

    logic: the detection logic will incorrectly flag allocations inside OOMRetryState class methods as covered, even though OOMRetryState isn't a retry framework method - it's just a helper class that happens to have "Retry" in its name

  2. sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala, line 146-153 (link)

    logic: check-then-act race condition - between checking !loggedStacks.contains(stackKey) on line 146 and calling loggedStacks.add(stackKey) on line 147, another thread could add the same key, causing both threads to log duplicate entries to the file. is duplicate logging acceptable in this debugging tool, or should we ensure strictly unique entries?

5 files reviewed, 2 comments


@thirtiseven
Collaborator Author

Additional Comments (2)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala, line 122-127 (link)
    logic: the detection logic will incorrectly flag allocations inside OOMRetryState class methods as covered, even though OOMRetryState isn't a retry framework method - it's just a helper class that happens to have "Retry" in its name

Should be ok.

  2. sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala, line 146-153 (link)
    logic: check-then-act race condition - between checking !loggedStacks.contains(stackKey) on line 146 and calling loggedStacks.add(stackKey) on line 147, another thread could add the same key, causing both threads to log duplicate entries to the file. is duplicate logging acceptable in this debugging tool, or should we ensure strictly unique entries?

It's acceptable.
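For reference, if strictly unique entries were ever wanted, the check-then-act race could be closed by relying on the atomic add of a concurrent set instead of a separate contains check. A minimal sketch, with loggedStacks and writeToFile as stand-ins for what the bot references:

import java.util.concurrent.ConcurrentHashMap

object DedupSketch {
  // Stand-in for the loggedStacks field the comment refers to.
  private val loggedStacks: java.util.Set[String] = ConcurrentHashMap.newKeySet[String]()

  private def writeToFile(row: String): Unit = println(row) // placeholder writer

  def logOnce(stackKey: String, row: String): Unit = {
    // Set.add returns true only for the thread that actually inserted the key,
    // so exactly one thread wins the race and writes the row.
    if (loggedStacks.add(stackKey)) {
      writeToFile(row)
    }
  }
}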

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
)

// Default output file path for uncovered allocations - use /tmp for easy access
private val DEFAULT_OUTPUT_PATH = "/tmp/uncovered_allocations.csv"
Collaborator

It is recommended to use Files.createTempFile("prefix", ".suffix") or System.getProperty("java.io.tmpdir") instead of hardcoding this path.
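Both suggestions are one-liners from java.nio.file and System properties; a small sketch of what this would look like (file names illustrative):

import java.nio.file.{Files, Path, Paths}

object TempPathSketch {
  // Option 1: let the JVM pick a unique file under the platform temp directory.
  val tmpFile: Path = Files.createTempFile("uncovered_allocations", ".csv")

  // Option 2: keep a fixed file name, but resolve it against java.io.tmpdir.
  val outputPath: Path =
    Paths.get(System.getProperty("java.io.tmpdir"), "uncovered_allocations.csv")
}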

Collaborator Author

Thanks, done.

import AllocationKind._

// Package prefixes for spark-rapids code where we look for "Retry" in method names
private val SPARK_RAPIDS_PACKAGE_PREFIXES = Seq(
Collaborator

We also have packages like
org.apache.spark.rapids,
org.apache.spark.shuffle.rapids,
org.apache.spark.sql.catalyst.rapids,
...

Try some regex expressions to do this filtering if possible.
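One way to express this filtering, sketched with an assumed pattern (the regex below is illustrative, not necessarily the one the PR adopted):

object PackageFilterSketch {
  // Matches com.nvidia.spark.rapids plus the org.apache.spark(.x.y).rapids variants.
  private val sparkRapidsPackage =
    """^(com\.nvidia\.spark\.rapids|org\.apache\.spark(\.\w+)*\.rapids)\..*""".r

  def isSparkRapidsFrame(className: String): Boolean =
    sparkRapidsPackage.pattern.matcher(className).matches()

  // e.g. isSparkRapidsFrame("org.apache.spark.shuffle.rapids.Foo") == true
}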

Collaborator Author

done.

writeLock.synchronized {
if (!headerWritten) {
logWarning(s"Retry coverage tracking ACTIVE. Output: $DEFAULT_OUTPUT_PATH")
writeToFileInternal("kind,call_stack", append = false)
Collaborator

What will happen if the file already exists?

Collaborator Author

The records will be appended by design because we may want to run different things (integration tests, workloads) in a batch and collect all the uncovered cases in one file. A log has been added to warn that the file exists and that records will be appended.
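A minimal sketch of that append-by-design behavior, with hypothetical names (ensureHeader is not from the PR):

import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}

object AppendHeaderSketch {
  def ensureHeader(path: String): Unit = {
    if (new File(path).exists()) {
      // Keep the old records: a batch of runs accumulates into one file.
      println(s"WARN: $path already exists; new records will be appended")
    } else {
      Files.write(Paths.get(path),
        "kind,call_stack\n".getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE)
    }
  }
}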

// We look for any class or method containing "Retry" in its name within spark-rapids packages.
// This catches:
// - Core retry methods: withRetry, withRetryNoSplit, withRestoreOnRetry
// - Wrapper functions: concatenateAndMergeWithRetry, projectAndCloseWithRetrySingleBatch, etc.
Collaborator

Not sure if these are duplicates, since they already have a withRetryNoSplit call inside.

Collaborator Author

Updated to check just the retry methods and RmmRapidsRetryIterator.

Collaborator

@abellina abellina left a comment

Note we need to upgrade copyrights (2026)

I like the idea, had one comment.

// Note: Scala often generates methods like `$anonfun$withRetryNoSplit$1`, so we match by
// substring rather than exact method name.
// Also exclude AllocationRetryCoverageTracker itself since it's always in the stack.
val hasCoverage = stackTrace.exists { element =>
Collaborator

should we, instead of pattern matching, use the RetryStateTracker? It presently tracks if the current thread is retrying, but could it also track when it is in a retry block? We could also gate that on whether we have enabled the feature.

Collaborator Author

Yes that's a better solution, thanks. Updated.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a debugging feature to track memory allocations that are not covered by the retry framework (withRetry/withRetryNoSplit blocks), helping developers identify code paths that could potentially cause OOM errors without proper retry handling.

Key changes:

  • Adds AllocationRetryCoverageTracker to detect and log uncovered host and device memory allocations
  • Enhances RetryStateTracker with depth-based tracking to detect if code is executing inside a retry block
  • Integrates tracking hooks into HostAlloc for host memory and DeviceMemoryEventHandler for device memory

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala New utility class that tracks memory allocations not covered by retry framework, logs findings to CSV file
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala Adds retry block depth tracking to RetryStateTracker and wraps iterator next() with enter/exit calls
sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostAlloc.scala Integrates coverage tracker check after successful host memory allocation
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala Enables RMM debug mode when retry coverage tracking is enabled to get allocation callbacks
sql-plugin/src/main/scala/com/nvidia/spark/rapids/DeviceMemoryEventHandler.scala Implements onAllocated callback to check device memory allocation coverage
integration_tests/run_pyspark_from_build.sh Propagates SPARK_RAPIDS_RETRY_COVERAGE_TRACKING environment variable to test executors


Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala, line 585-587 (link)

    logic: clearCurThreadRetrying() called here may conflict with the same call at line 755 in parent class RmmRapidsRetryIterator.next(). The parent's next() already handles clearing retry state after completing all retry attempts. Calling it here in the finally block means it will execute even on successful paths, potentially clearing state prematurely or redundantly.

6 files reviewed, 1 comment


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala, line 586-587 (link)

    logic: clearCurThreadRetrying() should be called before exitRetryBlock() only on exception path. On success path, the parent's next() already manages setCurThreadRetrying state. Calling clear here may incorrectly reset retry state for nested retry blocks.

6 files reviewed, 1 comment


Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.



Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

Labels

reliability Features to improve reliability or bugs that severely impact the reliability of the plugin
