RFC for adding memory monitoring to debug OOM#2772

Closed
dezhiAmd wants to merge 3 commits into ROCm:main from dezhiAmd:RFC_memory_log

Conversation


@dezhiAmd dezhiAmd commented Jan 5, 2026

Motivation

Self-hosted GitHub runners executing TheRock builds have been experiencing out-of-memory errors, causing build failures and CI instability. Without detailed memory usage tracking across different build phases, it is difficult to:

  1. Identify the root cause of OOM failures
  2. Determine which build phases consume the most memory
  3. Optimize resource allocation and parallel job configurations
  4. Proactively detect memory pressure before failures occur
  5. Analyze historical trends in memory consumption

Technical Details

Test Plan

Test Result

Submission Checklist

@dezhiAmd dezhiAmd changed the title Rfc memory log RFC for adding memory monitoring to debug OOM Jan 5, 2026
@dezhiAmd dezhiAmd marked this pull request as ready for review January 5, 2026 22:17
@geomin12 geomin12 left a comment

Overall looks great! Just a few housekeeping comments.

@dezhiAmd dezhiAmd requested a review from geomin12 January 7, 2026 18:08
@geomin12 geomin12 left a comment

LGTM! Let's get Shiraz to do a pass on this as well.

@amd-shiraz

@dezhiAmd are we generating any report after this operation? If yes, can you please share an example? On a first pass through the RFC I have some more questions; I will set up a meeting tomorrow to go over and understand it, which I believe will be better. Thanks!

Comment on lines +43 to +51
The solution consists of four main components:

```
┌─────────────────────────────────────────────────────────────┐
│ GitHub Actions Workflow                                     │
│ (ci.yml, ci_linux.yml, ci_windows.yml, build_*.yml)         │
└───────────────────┬─────────────────────────────────────────┘
                    ├─ Start: start_memory_monitor.sh/.ps1
```
Member

As mentioned multiple times during code reviews, these workflows are not an exhaustive list and we need a solution that scales to multiple build and release workflows, perhaps across multiple repositories. This current design requires deep changes to many files and substantial plumbing. Let's spend more time exploring what we can do at the machine/runner/cloud project level.

@dezhiAmd dezhiAmd Jan 13, 2026

I believe this solution is scalable.
Please refer to the link where the component test is enabled for memory monitoring.
Even when leveraging an existing tool, some modifications to the current workflow are still required to capture the necessary metrics.
To support multiple repositories, the monitoring script can be added to each repository, followed by updates to the respective workflow files.
Key advantage: This approach provides a unified solution with full control, regardless of whether the runner is on-premises or hosted on cloud VMs.
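To make the "monitoring script added to each repository" idea concrete, here is a minimal sketch of what such a sampler could look like. This is an illustration only, not the script in this PR; the file names, CSV layout, and sampling interval are all hypothetical.

```python
import csv
import time
from datetime import datetime, timezone

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])  # value in kB
    return values

def sample(meminfo_path="/proc/meminfo"):
    """Take one sample: (UTC timestamp, MemTotal kB, MemAvailable kB)."""
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    ts = datetime.now(timezone.utc).isoformat()
    return ts, info.get("MemTotal", 0), info.get("MemAvailable", 0)

def monitor(out_csv="memory_log.csv", interval_s=5.0, iterations=None):
    """Append samples to a CSV until stopped (or for `iterations` samples)."""
    with open(out_csv, "a", newline="") as f:
        writer = csv.writer(f)
        n = 0
        while iterations is None or n < iterations:
            writer.writerow(sample())
            f.flush()
            n += 1
            time.sleep(interval_s)
```

A start_memory_monitor.sh wrapper would then only need to launch this in the background and record its PID so a matching stop script can terminate it.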

CC @amd-shiraz @geomin12 @amd-justchen

Member

Workflows are moving towards this style: https://github.com/ROCm/TheRock/blob/main/.github/workflows/multi_arch_build_portable_linux.yml, with 7+ setup/build/report jobs (per platform). Any monitoring will need to handle multiple build stages and will need to integrate with those sorts of workflows with minimal plumbing.

Some ideas:

  • Workflow-level environment variable loaded from inputs, read that env var during a script like configure_stage.py
  • Instrument on the runners themselves
  • Reusable workflow for "setup ccache", "runner health status", "enable monitoring", etc. (similar to what pytorch does for workflow init)
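The first idea above, a workflow-level environment variable read during a script like configure_stage.py, might look roughly like this. The variable name THEROCK_ENABLE_MEMORY_MONITOR and the script name are hypothetical placeholders, not part of this PR.

```python
import os
import subprocess

def memory_monitoring_enabled() -> bool:
    """Check a workflow-level env var (set once from workflow inputs).

    Hypothetical variable name; a workflow would set it via a top-level
    `env:` block so every job and stage inherits it without per-job plumbing.
    """
    value = os.environ.get("THEROCK_ENABLE_MEMORY_MONITOR", "0")
    return value.lower() in ("1", "true", "yes")

def maybe_start_monitor():
    """Called from a configure-stage script; starts the monitor if enabled."""
    if not memory_monitoring_enabled():
        return None
    # Launch the (hypothetical) monitor script in the background.
    return subprocess.Popen(["./start_memory_monitor.sh"])
```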

@stellaraccident stellaraccident removed their request for review January 10, 2026 03:48
@amd-shiraz

@dezhiAmd I have spent some time thinking more on this one, and as discussed offline last week, here is my update and request.

  1. Real-time tracking of this metric is fine, but integrating it into the build is perhaps not.
  2. Taking a step back and thinking it through: if we are tracking overall runner health the way it should be tracked, then we don't need to introduce something as part of the build to track it as well.
  3. This tool itself will not prevent the build from failing due to, say, OOM issues, but it will help debug and triage those failures after the fact.
  4. Mainly because of point 3, I would encourage us to use this as an external debug tool we run as and when needed, versus integrating it as part of the build itself. E.g., when the build failed due to an OOM issue and we want to find out which step of the build caused it, we run this tool separately, making use of the build logs, timestamps, etc. to check.
  5. Nit: for RFCs like these that are more related to the underlying infra, I would encourage us to put them in the TheRock-Infra repo: https://github.com/ROCm/TheRock-Infra/tree/main/docs/RFCs

Hope that sounds helpful and we can make some modifications accordingly. Overall good stuff.
cc @ScottTodd @geomin12
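The after-the-fact triage described in point 4, correlating build logs and timestamps against a memory log, could be sketched as follows. The CSV layout (timestamp, total kB, available kB) and the window size are assumptions for illustration, not something this PR defines.

```python
import csv
from datetime import datetime, timedelta

def peak_before_failure(samples, failure_time, window_minutes=10):
    """Given (timestamp, available_kb) samples, return the sample with the
    lowest available memory in the window leading up to the failure; lowest
    MemAvailable corresponds to peak memory pressure."""
    start = failure_time - timedelta(minutes=window_minutes)
    in_window = [(ts, kb) for ts, kb in samples if start <= ts <= failure_time]
    if not in_window:
        return None
    return min(in_window, key=lambda s: s[1])

def load_samples(path):
    """Load (timestamp, available_kb) rows from the monitor's CSV."""
    with open(path, newline="") as f:
        return [(datetime.fromisoformat(ts), int(avail))
                for ts, _total, avail in csv.reader(f)]
```

Given a build-failure timestamp pulled from CI logs, this pinpoints which build step was running when memory pressure peaked.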

@marbre

marbre commented Jan 13, 2026

We had an issue with a history rewrite which confused the GH UI and the commits it is showing on your PR. Please make sure to update your branch, e.g. via the GH UI.

Afterwards you need to update your local branch via git pull.

Alternative git rebase solution

Alternatively, you can manually update your branch locally by following the instructions below:

```
git checkout <yourbranch>
git fetch origin main
git rebase origin/main
git push --force-with-lease origin <yourbranch>
```

Signed-off-by: dezhliao <dezhliao@amd.com>
@ScottTodd ScottTodd left a comment

Marking as 'reviewed' to drop this from my queue (need other reviewers to be on top of PRs like this too, I can't be the only one reviewing after a batch of comments)

@geomin12

I am taking over this PR once I get some cycles.

@geomin12

Closing for #4050

@geomin12 geomin12 closed this Mar 18, 2026
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Mar 18, 2026