fix: enable psutil fallback for memory monitoring when macmon is missing#1478
fix: enable psutil fallback for memory monitoring when macmon is missing#1478AlexCheema merged 5 commits intomainfrom
Conversation
…ing on macOS On Darwin, the psutil memory poller was disabled (memory_poll_rate=None), relying entirely on macmon. When macmon is not installed, no memory data was reported, causing nodes to show zero memory in the cluster state and blocking shard placement. Now falls back to psutil-based memory polling when macmon is not found. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… formatting The start_distributed_test.py script calls sys.exit() at module level, crashing pytest collection. Add --ignore to pytest addopts to skip it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Review summary (CI: all passing) Minimal fix (+7/-1) for a real production issue:
Clean, targeted fix. Good to merge. |
AlexCheema
left a comment
There was a problem hiding this comment.
Code Review: PR #1478 — psutil fallback for memory monitoring when macmon is missing
Overall Assessment
Tiny but important resilience fix. When macmon is not installed on macOS (e.g., in EXO.app bundles that don't ship macmon, or fresh installs), memory monitoring silently stopped working. This fix falls back to psutil polling.
Analysis
Before:
if IS_DARWIN:
if (macmon_path := shutil.which("macmon")) is not None:
tg.start_soon(self._monitor_macmon, macmon_path)
# else: nothing happens — no memory monitoring on macOS without macmonAfter:
if IS_DARWIN:
if (macmon_path := shutil.which("macmon")) is not None:
tg.start_soon(self._monitor_macmon, macmon_path)
else:
logger.warning("macmon not found, falling back to psutil for memory monitoring")
self.memory_poll_rate = 1The _monitor_memory_usage method already handles psutil-based polling and is always started via tg.start_soon(self._monitor_memory_usage) on line ~395. The gate is self.memory_poll_rate — when None (the default on Darwin), the method returns immediately. Setting it to 1 activates the psutil fallback at 1-second intervals.
Strengths
-
Minimal and correct: No new code paths needed — just enables an existing fallback. The
_monitor_memory_usagealready callsMemoryUsage.from_psutil()and handles exceptions gracefully. -
Good warning log: Makes it obvious in logs when the fallback is active, aiding debugging.
-
1-second poll rate matches Linux default:
memory_poll_rate: float | None = None if IS_DARWIN else 1— Linux already uses 1-second psutil polling. Consistent.
Observations
-
Race condition on
self.memory_poll_rate: The assignmentself.memory_poll_rate = 1happens duringrun(), but_monitor_memory_usageis started viatg.start_soon()on line ~395 which comes AFTER the macmon check. Sincestart_soonschedules the coroutine but doesn't run it immediately, and thememory_poll_rateassignment happens synchronously before anyawait, the_monitor_memory_usagecoroutine will see the updated value. No race condition — correct as written. -
psutil vs macmon data quality:
MemoryUsage.from_psutil()may report different values than macmon (psutil reports OS-level RSS, macmon reports Apple-specific memory pressure). Worth noting in comments that the fallback provides functional but potentially less accurate data. Not a blocker. -
pyproject.tomlchange: Same as PR #1466 —--ignore=tests/start_distributed_test.pyseems unrelated. Should be a separate commit.
Verdict
Approve. Clean one-line fix that prevents silent memory monitoring failure on macOS without macmon.
🤖 Generated with Claude Code
Code Review: PR #1478 — fix: enable psutil fallback for memory monitoring when macmon is missingAuthor: AlexCheema OverviewOn macOS, memory monitoring relied exclusively on macmon. When macmon is not CorrectnessPASS. The fix is correct and safe. Root cause confirmed:
The fix adds an else branch (lines 391-395) that sets self.memory_poll_rate = 1 Race Condition AnalysisNO RACE CONDITION. The assignment self.memory_poll_rate = 1 happens Execution order:
psutil on macOSPASS. MemoryUsage.from_psutil() (src/exo/shared/types/profiling.py:31-40) uses What's lost without macmon: GPU/CPU temperature, usage percentages, and power pyproject.toml ChangeNOTE: The diff includes an unrelated change adding --ignore=tests/start_distributed_test.py Test CoverageWARNING. No tests exist for the macmon-missing fallback path. The existing tests Edge Cases
Nits
VerdictLGTM. Clean, minimal, correct fix for a critical issue (nodes invisible to the |
Code Review — PR #1478: enable psutil fallback for memory monitoring when macmon is missingCI: ALL PASSING (aarch64-darwin, x86_64-linux, aarch64-linux) OverviewMinimal fix — +7/-1 across 2 files. The bug: On macOS, The fix: When AssessmentBoth changes are correct:
NitThe pytest change is unrelated to the memory fix. Ideally separate commits, but not blocking. VerdictLGTM. Clean fix for a real production issue. No concerns. |
Code Review: PR #1478 — fix: enable psutil fallback for memory monitoring when macmon is missingSummaryWhen ReviewRoot cause is clear:
Fix is minimal: if (macmon_path := shutil.which("macmon")) is not None:
tg.start_soon(self._monitor_macmon, macmon_path)
else:
logger.warning("macmon not found, falling back to psutil for memory monitoring")
self.memory_poll_rate = 1Sets Issues1. This adds 2. VerdictGood fix for a real issue. Clean and minimal. The unrelated LGTM. |
…ing (exo-explore#1478) ## Summary - On macOS, memory monitoring relied exclusively on `macmon` — the psutil fallback was explicitly disabled (`memory_poll_rate = None`) - When `macmon` is not installed (e.g., mac-mini-2 through mac-mini-4 in our cluster), **no memory data was reported**, causing nodes to show 0GB memory in the cluster state - This blocked the scheduler from placing shards on those nodes since it had no memory data to work with - Fix: when `macmon` is not found on Darwin, fall back to psutil-based memory polling (`memory_poll_rate = 1`) ## Root cause `InfoGatherer` has two memory monitoring paths: 1. `macmon` (Darwin-only): provides memory + GPU/CPU/power stats 2. `psutil` (non-Darwin fallback): provides memory via `MemoryUsage.from_psutil()` Line 378 disabled psutil on Darwin: `memory_poll_rate = None if IS_DARWIN else 1` Line 389 only starts macmon if the binary exists: `if shutil.which("macmon") is not None` If macmon is missing on Darwin, **neither path runs** — zero memory reported. ## Test plan - [ ] Verify `uv run basedpyright` passes (0 errors confirmed) - [ ] Verify `uv run ruff check` passes (confirmed) - [ ] Verify `uv run pytest src/exo/utils/info_gatherer/` passes (2/2 confirmed) - [ ] Deploy to cluster nodes without macmon and verify memory appears in `/state` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: rltakashige <rl.takashige@gmail.com>
Summary
macmon— the psutil fallback was explicitly disabled (memory_poll_rate = None)macmonis not installed (e.g., mac-mini-2 through mac-mini-4 in our cluster), no memory data was reported, causing nodes to show 0GB memory in the cluster statemacmonis not found on Darwin, fall back to psutil-based memory polling (memory_poll_rate = 1)Root cause
InfoGathererhas two memory monitoring paths:macmon(Darwin-only): provides memory + GPU/CPU/power statspsutil(non-Darwin fallback): provides memory viaMemoryUsage.from_psutil()Line 378 disabled psutil on Darwin:
memory_poll_rate = None if IS_DARWIN else 1Line 389 only starts macmon if the binary exists:
if shutil.which("macmon") is not NoneIf macmon is missing on Darwin, neither path runs — zero memory reported.
Test plan
uv run basedpyrightpasses (0 errors confirmed)uv run ruff checkpasses (confirmed)uv run pytest src/exo/utils/info_gatherer/passes (2/2 confirmed)/state🤖 Generated with Claude Code