RFC for adding memory monitoring to debug OOM #2772
Conversation
geomin12 left a comment
overall looks great! just a few housekeeping comments
geomin12 left a comment
lgtm! let's get Shiraz to do a pass on this as well
@dezhiAmd are we generating any report after this operation? If yes, can you please share an example? On a first pass through the RFC I have some more questions; I will set up a meeting tomorrow to go over and understand them. That will be better, I believe. Thanks!
The solution consists of four main components:

```
┌─────────────────────────────────────────────────────────────┐
│                   GitHub Actions Workflow                   │
│     (ci.yml, ci_linux.yml, ci_windows.yml, build_*.yml)     │
└───────────────────┬─────────────────────────────────────────┘
                    │
                    ├─ Start: start_memory_monitor.sh/.ps1
```
As mentioned multiple times during code reviews, these workflows are not an exhaustive list and we need a solution that scales to multiple build and release workflows, perhaps across multiple repositories. This current design requires deep changes to many files and substantial plumbing. Let's spend more time exploring what we can do at the machine/runner/cloud project level.
I believe this solution is scalable.
Please refer to the link where the component test is enabled for memory monitoring.
Even when leveraging an existing tool, some modifications to the current workflow are still required to capture the necessary metrics.
To support multiple repositories, the monitoring script can be added to each repository, followed by updates to the respective workflow files.
Key advantage: This approach provides a unified solution with full control, regardless of whether the runner is on-premises or hosted on cloud VMs.
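
For illustration only, here is a minimal sketch of the kind of per-repository monitoring script described above. The file name, CSV columns, sampling interval, and use of psutil are assumptions for this sketch, not the actual script in this PR:

```python
#!/usr/bin/env python3
"""Hypothetical memory monitor sketch (not the script in this PR).

Assumes psutil is available on the runner; samples system memory at a
fixed interval and appends rows to a CSV so a later reporting step can
correlate spikes with build phases.
"""
import argparse
import csv
import os
import time

import psutil


def main() -> None:
    parser = argparse.ArgumentParser(description="Sample system memory usage.")
    parser.add_argument("--interval", type=float, default=5.0,
                        help="Seconds between samples (assumed default).")
    parser.add_argument("--output", default="memory_usage.csv",
                        help="CSV file to append samples to.")
    parser.add_argument("--label", default="",
                        help="Optional build-phase label for this run.")
    args = parser.parse_args()

    write_header = (not os.path.exists(args.output)
                    or os.path.getsize(args.output) == 0)
    with open(args.output, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["timestamp", "label", "used_mb",
                             "available_mb", "percent"])
        # Runs until the workflow's stop/report step terminates the process.
        while True:
            mem = psutil.virtual_memory()
            writer.writerow([
                time.time(),
                args.label,
                mem.used // (1024 * 1024),
                mem.available // (1024 * 1024),
                mem.percent,
            ])
            f.flush()
            time.sleep(args.interval)


if __name__ == "__main__":
    main()
```

A workflow step would start something like this in the background when the job begins and stop it (and upload the CSV as an artifact) in a final reporting step; minimizing that per-workflow plumbing is the concern raised in the comments above.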
Workflows are moving towards this style: https://github.com/ROCm/TheRock/blob/main/.github/workflows/multi_arch_build_portable_linux.yml, with 7+ setup/build/report jobs (per platform). Any monitoring will need to handle multiple build stages and will need to integrate with those sorts of workflows with minimal plumbing.
Some ideas:
- Workflow-level environment variable loaded from inputs, read that env var during a script like configure_stage.py (a rough sketch of this idea follows below)
- Instrument on the runners themselves
- Reusable workflow for "setup ccache", "runner health status", "enable monitoring", etc. (similar to what pytorch does for workflow init)
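
As a rough sketch of the first idea, a stage script such as configure_stage.py could read a single workflow-level env var and own the monitor's lifecycle itself, keeping per-workflow plumbing to one variable. The names `THEROCK_ENABLE_MEMORY_MONITOR`, `memory_monitor.py`, and the helper below are hypothetical, not existing code in the repository:

```python
import os
import subprocess
import sys
from typing import Optional


def maybe_start_memory_monitor() -> Optional[subprocess.Popen]:
    """Start the background memory monitor if the workflow opted in.

    THEROCK_ENABLE_MEMORY_MONITOR and memory_monitor.py are hypothetical
    names used only for this sketch.
    """
    if os.environ.get("THEROCK_ENABLE_MEMORY_MONITOR", "0") != "1":
        return None
    # Launch the sampler in the background; the stage script keeps the handle
    # so it can stop the monitor when the stage finishes.
    return subprocess.Popen(
        [sys.executable, "memory_monitor.py",
         "--label", os.environ.get("GITHUB_JOB", "unknown")]
    )


def run_stage() -> None:
    monitor = maybe_start_memory_monitor()
    try:
        # ... existing configure/build logic for the stage would go here ...
        pass
    finally:
        if monitor is not None:
            monitor.terminate()
            monitor.wait(timeout=10)
```

This keeps the decision in one place: workflows (or a reusable init job) only set the variable, and the shared stage script handles starting and stopping the monitor around each build stage.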
@dezhiAmd I have spent some time thinking more on this one, and as discussed offline last week, here is my update and request.
7e9ba03 to 3077ce8
Signed-off-by: dezhliao <dezhliao@amd.com>
d881145 to 455d0dd
Signed-off-by: dezhliao <dezhliao@amd.com>
ScottTodd left a comment
Marking as 'reviewed' to drop this from my queue (need other reviewers to be on top of PRs like this too, I can't be the only one reviewing after a batch of comments)
I am taking over this PR once I get some cycles.

Closing for #4050

Motivation
Self-hosted GitHub runners executing TheRock builds have been experiencing out-of-memory errors, causing build failures and CI instability. Without detailed memory usage tracking across different build phases, it is difficult to pinpoint which build phase or process is responsible for the memory exhaustion, or to decide how to remediate it.
Technical Details
Test Plan
Test Result
Submission Checklist