RFC for adding memory monitoring to debug OOM#2772

Closed
dezhiAmd wants to merge 3 commits into ROCm:main from dezhiAmd:RFC_memory_log

Conversation


@dezhiAmd dezhiAmd commented Jan 5, 2026

Motivation

Self-hosted GitHub runners executing TheRock builds have been experiencing out-of-memory errors, causing build failures and CI instability. Without detailed memory usage tracking across different build phases, it is difficult to:

  1. Identify the root cause of OOM failures
  2. Determine which build phases consume the most memory
  3. Optimize resource allocation and parallel job configurations
  4. Proactively detect memory pressure before failures occur
  5. Analyze historical trends in memory consumption

Technical Details

Test Plan

Test Result

Submission Checklist

@dezhiAmd dezhiAmd changed the title Rfc memory log RFC for adding memory monitoring to debug OOM Jan 5, 2026
@dezhiAmd dezhiAmd marked this pull request as ready for review January 5, 2026 22:17
@geomin12 geomin12 left a comment

Overall looks great! Just a few housekeeping comments.

@dezhiAmd dezhiAmd requested a review from geomin12 January 7, 2026 18:08
@geomin12 geomin12 left a comment

LGTM! Let's get Shiraz to do a pass on this as well.

@amd-shiraz

@dezhiAmd are we generating any report after this operation? If yes, can you please share an example? On a first pass through the RFC I have some more questions; I will set up a meeting tomorrow to go over and understand it, which I believe will be better. Thanks!

Comment on lines +43 to +51
The solution consists of four main components:

```
┌─────────────────────────────────────────────────────────────┐
│ GitHub Actions Workflow                                     │
│ (ci.yml, ci_linux.yml, ci_windows.yml, build_*.yml)         │
└───────────────────┬─────────────────────────────────────────┘
                    ├─ Start: start_memory_monitor.sh/.ps1
```
Member

As mentioned multiple times during code reviews, these workflows are not an exhaustive list and we need a solution that scales to multiple build and release workflows, perhaps across multiple repositories. This current design requires deep changes to many files and substantial plumbing. Let's spend more time exploring what we can do at the machine/runner/cloud project level.

@dezhiAmd dezhiAmd Jan 13, 2026

I believe this solution is scalable.
Please refer to the link where the component test is enabled for memory monitoring.
Even when leveraging an existing tool, some modifications to the current workflow are still required to capture the necessary metrics.
To support multiple repositories, the monitoring script can be added to each repository, followed by updates to the respective workflow files.
Key advantage: This approach provides a unified solution with full control, regardless of whether the runner is on-premises or hosted on cloud VMs.
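To make the "monitoring script added to each repository" idea concrete, here is a minimal sketch of what such a sampler could look like. This is an illustration only, not the script in this PR; the file names, CSV layout, and sampling interval are all hypothetical.

```python
import csv
import time
from datetime import datetime, timezone

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])  # value in kB
    return values

def sample(meminfo_path="/proc/meminfo"):
    """Take one sample: (UTC timestamp, MemTotal kB, MemAvailable kB)."""
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    ts = datetime.now(timezone.utc).isoformat()
    return ts, info.get("MemTotal", 0), info.get("MemAvailable", 0)

def monitor(out_csv="memory_log.csv", interval_s=5.0, iterations=None):
    """Append samples to a CSV until stopped (or for `iterations` samples)."""
    with open(out_csv, "a", newline="") as f:
        writer = csv.writer(f)
        n = 0
        while iterations is None or n < iterations:
            writer.writerow(sample())
            f.flush()
            n += 1
            time.sleep(interval_s)
```

A start_memory_monitor.sh wrapper would then only need to launch this in the background and record its PID so a matching stop script can terminate it.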

CC @amd-shiraz @geomin12 @amd-justchen

Member

Workflows are moving towards this style: https://github.com/ROCm/TheRock/blob/main/.github/workflows/multi_arch_build_portable_linux.yml, with 7+ setup/build/report jobs (per platform). Any monitoring will need to handle multiple build stages and will need to integrate with those sorts of workflows with minimal plumbing.

Some ideas:

  • Workflow-level environment variable loaded from inputs, read that env var during a script like configure_stage.py
  • Instrument on the runners themselves
  • Reusable workflow for "setup ccache", "runner health status", "enable monitoring", etc. (similar to what pytorch does for workflow init)
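The first idea above, a workflow-level environment variable read during a script like configure_stage.py, might look roughly like this. The variable name THEROCK_ENABLE_MEMORY_MONITOR and the script name are hypothetical placeholders, not part of this PR.

```python
import os
import subprocess

def memory_monitoring_enabled() -> bool:
    """Check a workflow-level env var (set once from workflow inputs).

    Hypothetical variable name; a workflow would set it via a top-level
    `env:` block so every job and stage inherits it without per-job plumbing.
    """
    value = os.environ.get("THEROCK_ENABLE_MEMORY_MONITOR", "0")
    return value.lower() in ("1", "true", "yes")

def maybe_start_monitor():
    """Called from a configure-stage script; starts the monitor if enabled."""
    if not memory_monitoring_enabled():
        return None
    # Launch the (hypothetical) monitor script in the background.
    return subprocess.Popen(["./start_memory_monitor.sh"])
```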

@stellaraccident stellaraccident removed their request for review January 10, 2026 03:48
@amd-shiraz

@dezhiAmd I have spent some time thinking more on this one, and as discussed offline last week, here is my update and request.

  1. Real-time tracking of this metric is fine, but integrating it into the build is perhaps not.
  2. Taking a step back and thinking it through: if we are tracking overall runner health the way it should be tracked, then we don't need to introduce something as part of the build to track it as well.
  3. This tool itself will not prevent the build from failing due to, say, OOM issues, but it will help debug and triage those failures after the fact.
  4. Mainly because of point 3, I would encourage us to use this as an external debug tool we run as and when needed, versus integrating it as part of the build itself. E.g., when the build failed due to an OOM issue and we want to find out which step of the build caused it, we run this tool separately, making use of the build logs, timestamps, etc. to check.
  5. Nit: for RFCs like these that are more related to the underlying infra, I would encourage us to put them in the TheRock-Infra repo: https://github.com/ROCm/TheRock-Infra/tree/main/docs/RFCs

Hope that sounds helpful and we can make some modifications accordingly. Overall good stuff.
cc @ScottTodd @geomin12
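The after-the-fact triage described in point 4, correlating build logs and timestamps against a memory log, could be sketched as follows. The CSV layout (timestamp, total kB, available kB) and the window size are assumptions for illustration, not something this PR defines.

```python
import csv
from datetime import datetime, timedelta

def peak_before_failure(samples, failure_time, window_minutes=10):
    """Given (timestamp, available_kb) samples, return the sample with the
    lowest available memory in the window leading up to the failure; lowest
    MemAvailable corresponds to peak memory pressure."""
    start = failure_time - timedelta(minutes=window_minutes)
    in_window = [(ts, kb) for ts, kb in samples if start <= ts <= failure_time]
    if not in_window:
        return None
    return min(in_window, key=lambda s: s[1])

def load_samples(path):
    """Load (timestamp, available_kb) rows from the monitor's CSV."""
    with open(path, newline="") as f:
        return [(datetime.fromisoformat(ts), int(avail))
                for ts, _total, avail in csv.reader(f)]
```

Given a build-failure timestamp pulled from CI logs, this pinpoints which build step was running when memory pressure peaked.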

@marbre

marbre commented Jan 13, 2026

We had an issue with a history rewrite which confused the GH UI and the commits it is showing on your PR. Please make sure to update your branch, e.g. via the GH UI.

Afterwards you need to update your local branch via git pull.

Alternative git rebase solution

Alternatively, you can manually update your branch locally by following the instructions below:

```
git checkout <yourbranch>
git fetch origin main
git rebase origin/main
git push --force-with-lease origin <yourbranch>
```

Signed-off-by: dezhliao <dezhliao@amd.com>
@ScottTodd ScottTodd left a comment

Marking as 'reviewed' to drop this from my queue (need other reviewers to be on top of PRs like this too, I can't be the only one reviewing after a batch of comments)

@geomin12

I am taking over this PR once I get some cycles.

@geomin12

Closing for #4050

@geomin12 geomin12 closed this Mar 18, 2026
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Mar 18, 2026