
fix: enable psutil fallback for memory monitoring when macmon is missing #1478

Merged
AlexCheema merged 5 commits into main from alexcheema/fix-memory-reporting
Feb 20, 2026

@AlexCheema
Contributor

Summary

  • On macOS, memory monitoring relied exclusively on macmon — the psutil fallback was explicitly disabled (memory_poll_rate = None)
  • When macmon is not installed (e.g., mac-mini-2 through mac-mini-4 in our cluster), no memory data was reported, causing nodes to show 0GB memory in the cluster state
  • This blocked the scheduler from placing shards on those nodes since it had no memory data to work with
  • Fix: when macmon is not found on Darwin, fall back to psutil-based memory polling (memory_poll_rate = 1)

Root cause

InfoGatherer has two memory monitoring paths:

  1. macmon (Darwin-only): provides memory + GPU/CPU/power stats
  2. psutil (non-Darwin fallback): provides memory via MemoryUsage.from_psutil()

Line 378 disabled psutil on Darwin: memory_poll_rate = None if IS_DARWIN else 1
Line 389 only starts macmon if the binary exists: if shutil.which("macmon") is not None

If macmon is missing on Darwin, neither path runs — zero memory reported.
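To make the gap concrete, here is a small, hypothetical model of the startup decision. The function name and flags are illustrative, not exo's actual code; it only reproduces the two conditions quoted above and reports which memory monitors would run for each platform/macmon combination, before and after the fix.

```python
from __future__ import annotations


def memory_paths(is_darwin: bool, macmon_found: bool, with_fix: bool) -> set[str]:
    """Hypothetical model: which memory monitors would run at startup."""
    paths: set[str] = set()
    # Default from the original code: psutil disabled on Darwin, 1 s elsewhere.
    memory_poll_rate: float | None = None if is_darwin else 1
    if is_darwin:
        if macmon_found:
            paths.add("macmon")
        elif with_fix:
            # The PR's change: enable psutil polling when macmon is missing.
            memory_poll_rate = 1
    # The psutil monitor always starts but returns at once if the rate is None.
    if memory_poll_rate is not None:
        paths.add("psutil")
    return paths


if __name__ == "__main__":
    for is_darwin in (True, False):
        for macmon_found in (True, False):
            before = sorted(memory_paths(is_darwin, macmon_found, with_fix=False))
            after = sorted(memory_paths(is_darwin, macmon_found, with_fix=True))
            print(f"darwin={is_darwin} macmon={macmon_found} before={before} after={after}")
```

For darwin=True, macmon=False the model gives an empty list before the fix and ['psutil'] after it, which is the 0GB case described above.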

Test plan

  • Verify uv run basedpyright passes (0 errors confirmed)
  • Verify uv run ruff check passes (confirmed)
  • Verify uv run pytest src/exo/utils/info_gatherer/ passes (2/2 confirmed)
  • Deploy to cluster nodes without macmon and verify memory appears in /state

🤖 Generated with Claude Code

AlexCheema and others added 3 commits February 15, 2026 08:34
…ing on macOS

On Darwin, the psutil memory poller was disabled (memory_poll_rate=None),
relying entirely on macmon. When macmon is not installed, no memory data
was reported, causing nodes to show zero memory in the cluster state and
blocking shard placement.

Now falls back to psutil-based memory polling when macmon is not found.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… formatting

The start_distributed_test.py script calls sys.exit() at module level,
crashing pytest collection. Add --ignore to pytest addopts to skip it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexCheema AlexCheema enabled auto-merge (squash) February 16, 2026 13:34
@AlexCheema
Contributor Author

Review summary (CI: all passing)

Minimal fix (+7/-1) for a real production issue:

  • On macOS, InfoGatherer disabled psutil memory polling (memory_poll_rate = None) assuming macmon would handle it
  • When macmon is not installed (e.g., fresh Mac Minis), neither memory path runs → nodes report 0GB → scheduler can't place shards
  • Fix: when macmon is not found on Darwin, set memory_poll_rate = 1 to fall back to psutil
  • Also excludes tests/start_distributed_test.py from pytest collection (calls sys.exit() at import)

Clean, targeted fix. Good to merge.

Contributor Author

@AlexCheema AlexCheema left a comment


Code Review: PR #1478 — psutil fallback for memory monitoring when macmon is missing

Overall Assessment

Tiny but important resilience fix. When macmon is not installed on macOS (e.g., in EXO.app bundles that don't ship macmon, or fresh installs), memory monitoring silently stopped working. This fix falls back to psutil polling.

Analysis

Before:

if IS_DARWIN:
    if (macmon_path := shutil.which("macmon")) is not None:
        tg.start_soon(self._monitor_macmon, macmon_path)
    # else: nothing happens — no memory monitoring on macOS without macmon

After:

if IS_DARWIN:
    if (macmon_path := shutil.which("macmon")) is not None:
        tg.start_soon(self._monitor_macmon, macmon_path)
    else:
        logger.warning("macmon not found, falling back to psutil for memory monitoring")
        self.memory_poll_rate = 1

The _monitor_memory_usage method already handles psutil-based polling and is always started via tg.start_soon(self._monitor_memory_usage) on line ~395. The gate is self.memory_poll_rate — when None (the default on Darwin), the method returns immediately. Setting it to 1 activates the psutil fallback at 1-second intervals.
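A minimal asyncio sketch of that gate, assuming a simplified stand-in class: GatedMonitor and read_memory are hypothetical, and read_memory returns a fixed placeholder instead of calling MemoryUsage.from_psutil(). The loop is bounded only so the example terminates.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field


@dataclass
class GatedMonitor:
    """Hypothetical stand-in for InfoGatherer; models only the poll-rate gate."""

    memory_poll_rate: float | None = None
    max_readings: int = 3
    readings: list[int] = field(default_factory=list)

    def read_memory(self) -> int:
        # Placeholder for MemoryUsage.from_psutil(); returns a fixed fake value.
        return 42

    async def monitor_memory_usage(self) -> None:
        # Gate: with no poll rate (the old Darwin default) return immediately.
        if self.memory_poll_rate is None:
            return
        # Otherwise poll at the configured interval (bounded here for the demo).
        while len(self.readings) < self.max_readings:
            self.readings.append(self.read_memory())
            await asyncio.sleep(self.memory_poll_rate)


async def main() -> None:
    disabled = GatedMonitor(memory_poll_rate=None)
    await disabled.monitor_memory_usage()
    print(disabled.readings)  # → []

    enabled = GatedMonitor(memory_poll_rate=0.01)
    await enabled.monitor_memory_usage()
    print(enabled.readings)  # → [42, 42, 42]


if __name__ == "__main__":
    asyncio.run(main())
```

Because the monitor task is always started, flipping memory_poll_rate from None to a number before it runs is enough to turn polling on; no new task wiring is needed.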

Strengths

  1. Minimal and correct: No new code paths are needed; the fix just enables an existing fallback. _monitor_memory_usage already calls MemoryUsage.from_psutil() and handles exceptions gracefully.

  2. Good warning log: Makes it obvious in logs when the fallback is active, aiding debugging.

  3. 1-second poll rate matches Linux default: memory_poll_rate: float | None = None if IS_DARWIN else 1 — Linux already uses 1-second psutil polling. Consistent.

Observations

  1. Possible race condition on self.memory_poll_rate: The assignment self.memory_poll_rate = 1 happens during run(), and _monitor_memory_usage is started via tg.start_soon() on line ~395, after the macmon check. Because start_soon schedules the coroutine without running it immediately, and the assignment happens synchronously before any await, _monitor_memory_usage will see the updated value. No race condition; correct as written.

  2. psutil vs macmon data quality: MemoryUsage.from_psutil() may report different values than macmon (psutil reports OS-level, system-wide figures such as total and available RAM; macmon reports Apple-specific memory pressure). Worth noting in a comment that the fallback provides functional but potentially less accurate data. Not a blocker.

  3. pyproject.toml change: Same as PR #1466: --ignore=tests/start_distributed_test.py seems unrelated. It should be a separate commit.

Verdict

Approve. Clean, minimal fix that prevents silent memory monitoring failure on macOS without macmon.

🤖 Generated with Claude Code

@AlexCheema
Contributor Author

Code Review: PR #1478 — fix: enable psutil fallback for memory monitoring when macmon is missing

Author: AlexCheema
Status: OPEN (auto-merge enabled, squash)
Changes: +7 / -1 across 2 files

Overview

On macOS, memory monitoring relied exclusively on macmon: memory_poll_rate
defaulted to None on Darwin (disabling psutil polling). When macmon is not
installed, neither monitoring path ran, so nodes reported 0GB memory, blocking
shard placement. The fix: when macmon is not found on Darwin, set
memory_poll_rate = 1 to enable psutil-based polling.

Correctness

PASS. The fix is correct and safe.

Root cause confirmed:

  • Line 378: memory_poll_rate = None if IS_DARWIN else 1
  • Line 389: macmon only starts if shutil.which("macmon") is not None
  • If macmon missing on Darwin → neither path runs → zero memory reported

The fix adds an else branch (lines 391-395) that sets self.memory_poll_rate = 1
when macmon is not found, enabling the psutil polling path.

Race Condition Analysis

NO RACE CONDITION. The assignment self.memory_poll_rate = 1 happens
synchronously inside run() BEFORE any tg.start_soon() task bodies execute.
anyio's start_soon() schedules tasks but does not run them inline. By the time
_monitor_memory_usage() checks memory_poll_rate, it is already set to 1.

Execution order:

  1. run() enters async with self._tg as tg
  2. macmon check fails → self.memory_poll_rate = 1 (synchronous)
  3. tg.start_soon(self._monitor_memory_usage) schedules the task
  4. _monitor_memory_usage() starts, sees memory_poll_rate = 1, enters polling loop
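The ordering argument can be checked with a small asyncio sketch. It uses asyncio.create_task as a stand-in for anyio's tg.start_soon (both schedule a coroutine without running its body inline); the Gatherer class and trace strings are hypothetical. The schedule_first flag also covers the stricter case where the task is scheduled before the assignment: as long as no await intervenes, the task still sees the new value.

```python
from __future__ import annotations

import asyncio


class Gatherer:
    """Hypothetical stand-in that records the order of events in run()."""

    def __init__(self) -> None:
        self.memory_poll_rate: float | None = None  # Darwin default
        self.trace: list[str] = []

    async def monitor_memory_usage(self) -> None:
        self.trace.append(f"monitor sees poll_rate={self.memory_poll_rate}")

    async def run(self, macmon_found: bool, schedule_first: bool = False) -> None:
        task: asyncio.Task[None] | None = None
        if schedule_first:
            # create_task only schedules; no part of the body runs yet.
            task = asyncio.create_task(self.monitor_memory_usage())
            self.trace.append("task scheduled")
        if not macmon_found:
            # Synchronous assignment: no await between scheduling and here.
            self.memory_poll_rate = 1
            self.trace.append("poll_rate set to 1")
        if task is None:
            task = asyncio.create_task(self.monitor_memory_usage())
            self.trace.append("task scheduled")
        await task  # first suspension point: the task body runs now


if __name__ == "__main__":
    for schedule_first in (False, True):
        g = Gatherer()
        asyncio.run(g.run(macmon_found=False, schedule_first=schedule_first))
        print(g.trace)
```

Both orderings end with "monitor sees poll_rate=1". The value would only be at risk if an await were placed between scheduling the task and the assignment.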

psutil on macOS

PASS. MemoryUsage.from_psutil() (src/exo/shared/types/profiling.py:31-40) uses
psutil.virtual_memory() and psutil.swap_memory(), both of which work correctly
on macOS. Returns proper RAM total/available and swap total/available.
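A sketch of the kind of conversion from_psutil performs, assuming a hypothetical MemorySnapshot type whose field names are illustrative rather than exo's actual MemoryUsage fields. psutil.virtual_memory() exposes total and available, and psutil.swap_memory() exposes total and free; the classmethod accepts any objects with those attributes so it can be exercised without psutil installed.

```python
from __future__ import annotations

from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any


@dataclass(frozen=True)
class MemorySnapshot:
    """Hypothetical memory record; field names are illustrative, not exo's."""

    ram_total: int
    ram_available: int
    swap_total: int
    swap_available: int

    @classmethod
    def from_psutil_stats(cls, vm: Any, sm: Any) -> MemorySnapshot:
        # vm: psutil.virtual_memory() result (has .total, .available)
        # sm: psutil.swap_memory() result (has .total, .free)
        return cls(
            ram_total=vm.total,
            ram_available=vm.available,
            swap_total=sm.total,
            swap_available=sm.free,
        )


if __name__ == "__main__":
    try:
        import psutil

        vm, sm = psutil.virtual_memory(), psutil.swap_memory()
    except ImportError:
        # Fake values so the sketch still runs without psutil installed.
        vm = SimpleNamespace(total=16 * 2**30, available=8 * 2**30)
        sm = SimpleNamespace(total=2 * 2**30, free=2**30)
    print(MemorySnapshot.from_psutil_stats(vm, sm))
```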

What's lost without macmon: GPU/CPU temperature, usage percentages, and power
metrics (SystemPerformanceProfile). Memory data is fully covered by psutil.

pyproject.toml Change

NOTE: The diff includes an unrelated change adding --ignore=tests/start_distributed_test.py
to pytest addopts. This is a separate concern (excluding a distributed test that
causes pytest collection issues). Not harmful but ideally would be a separate commit.

Test Coverage

WARNING. No tests exist for the macmon-missing fallback path. The existing tests
in src/exo/utils/info_gatherer/tests/ only cover Thunderbolt parsing. A unit
test mocking shutil.which("macmon") to return None and verifying memory_poll_rate
gets set to 1 would add confidence, but is not blocking for this fix.
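One possible shape for that test, sketched against a hypothetical FakeInfoGatherer that copies only the macmon branch from the diff. The real InfoGatherer does this inside the async run(), so a real test would target that instead. Patching "shutil.which" works here because the code looks up shutil.which at call time.

```python
from __future__ import annotations

import shutil
from unittest import mock


class FakeInfoGatherer:
    """Hypothetical stand-in reproducing only the macmon check from the PR."""

    def __init__(self, is_darwin: bool = True) -> None:
        self.is_darwin = is_darwin
        self.memory_poll_rate: float | None = None if is_darwin else 1
        self.macmon_path: str | None = None

    def configure_monitors(self) -> None:
        # Mirrors the diff: prefer macmon, else enable psutil polling.
        if self.is_darwin:
            if (macmon_path := shutil.which("macmon")) is not None:
                self.macmon_path = macmon_path
            else:
                self.memory_poll_rate = 1


def test_falls_back_to_psutil_when_macmon_missing() -> None:
    with mock.patch("shutil.which", return_value=None):
        gatherer = FakeInfoGatherer(is_darwin=True)
        gatherer.configure_monitors()
    assert gatherer.memory_poll_rate == 1
    assert gatherer.macmon_path is None


def test_keeps_macmon_when_present() -> None:
    with mock.patch("shutil.which", return_value="/opt/homebrew/bin/macmon"):
        gatherer = FakeInfoGatherer(is_darwin=True)
        gatherer.configure_monitors()
    assert gatherer.memory_poll_rate is None
    assert gatherer.macmon_path == "/opt/homebrew/bin/macmon"


if __name__ == "__main__":
    test_falls_back_to_psutil_when_macmon_missing()
    test_keeps_macmon_when_present()
    print("ok")
```

The second test guards the "macmon installed: unaffected" edge case, so a regression that always enabled psutil would also be caught.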

Edge Cases

  • macmon installed but broken/crashing: NOT addressed by this PR (separate issue,
    macmon path would still be taken)
  • Non-Darwin platforms: Unaffected — they already use psutil (memory_poll_rate = 1)
  • macmon installed: Unaffected — the if branch is taken, else never executes

Nits

  • The pyproject.toml change is unrelated and should ideally be a separate commit.

Verdict

LGTM. Clean, minimal, correct fix for a critical issue (nodes invisible to the
scheduler due to zero memory). No race conditions, psutil works on macOS, and
the fallback preserves the priority of macmon when available.

@AlexCheema
Contributor Author

Code Review — PR #1478: enable psutil fallback for memory monitoring when macmon is missing

CI: ALL PASSING (aarch64-darwin, x86_64-linux, aarch64-linux)

Overview

Minimal fix — +7/-1 across 2 files.

The bug: On macOS, InfoGatherer defaults memory_poll_rate = None, assuming macmon handles memory reporting. When macmon is not installed (fresh Mac Minis, non-Homebrew setups), neither memory path runs — nodes report no memory, placement fails silently.

The fix: When macmon is not found, set self.memory_poll_rate = 1 so _monitor_memory_usage (psutil-based) kicks in. Also adds --ignore=tests/start_distributed_test.py to pytest addopts since that file is a standalone CLI script that calls sys.exit() at module level.

Assessment

Both changes are correct:

  1. psutil fallback: _monitor_memory_usage() returns immediately when memory_poll_rate is None (line 463). Setting it to 1 enables the existing MemoryUsage.from_psutil() path. No race condition: the assignment happens synchronously in run() before any start_soon task bodies execute.

  2. pytest ignore: start_distributed_test.py parses sys.argv and calls sys.exit() at module level. It is not a pytest test, so it is correctly excluded.
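To illustrate why a file like start_distributed_test.py breaks collection, this sketch imports a throwaway module from source (the module name standalone_script is hypothetical) and reports whether the import raised SystemExit. It also shows that an if __name__ == "__main__" guard avoids the exit on import; the PR chose --ignore instead, which needs no change to the script.

```python
from __future__ import annotations

import importlib.util
import tempfile
from pathlib import Path

# Module-level exit, like a standalone CLI script.
UNGUARDED = "import sys\nsys.exit(2)\n"
# Same exit, but only when the file is executed directly.
GUARDED = 'import sys\nif __name__ == "__main__":\n    sys.exit(2)\n'


def raises_system_exit_on_import(source: str) -> bool:
    """Import `source` as a module and report whether it raised SystemExit."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "standalone_script.py"
        path.write_text(source)
        spec = importlib.util.spec_from_file_location("standalone_script", path)
        assert spec is not None and spec.loader is not None
        module = importlib.util.module_from_spec(spec)
        try:
            # Importing runs module-level code, just as test collection does.
            spec.loader.exec_module(module)
        except SystemExit:
            return True
    return False


if __name__ == "__main__":
    print(raises_system_exit_on_import(UNGUARDED))  # → True
    print(raises_system_exit_on_import(GUARDED))  # → False
```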

Nit

The pytest change is unrelated to the memory fix. Ideally separate commits, but not blocking.

Verdict

LGTM. Clean fix for a real production issue. No concerns.

@AlexCheema
Contributor Author

Code Review: PR #1478 — fix: enable psutil fallback for memory monitoring when macmon is missing

Summary

When macmon isn't installed on macOS, no memory data was reported because psutil polling was explicitly disabled on Darwin. Fix: fall back to psutil when macmon is missing.

Review

Root cause is clear:

  • memory_poll_rate = None if IS_DARWIN else 1 disabled psutil on Darwin
  • macmon is only started if its binary exists
  • Result: no memory monitoring if macmon is missing on macOS

Fix is minimal:

if (macmon_path := shutil.which("macmon")) is not None:
    tg.start_soon(self._monitor_macmon, macmon_path)
else:
    logger.warning("macmon not found, falling back to psutil for memory monitoring")
    self.memory_poll_rate = 1

Sets memory_poll_rate = 1 when macmon is missing, enabling the psutil fallback. ✅

Issues

1. pyproject.toml change is unrelated

-addopts = "-m 'not slow'"
+addopts = "-m 'not slow' --ignore=tests/start_distributed_test.py"

This adds --ignore=tests/start_distributed_test.py to pytest defaults. This is unrelated to the macmon fix and should be in a separate PR. What's the reason for ignoring this test file?

2. self.memory_poll_rate is set after construction
The memory_poll_rate was set in __init__ to None for Darwin. This fix mutates it at runtime in run(). It works because the psutil polling loop checks self.memory_poll_rate dynamically, but it's a somewhat fragile pattern — the timing matters.
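If the timing dependence is a concern, one alternative (not what this PR does) is to resolve the rate once, before any task is started, and pass it to the monitor explicitly, e.g. tg.start_soon(self._monitor_memory_usage, rate) with a hypothetical rate parameter. A sketch of the resolver, with illustrative names:

```python
from __future__ import annotations


def resolve_memory_poll_rate(
    default_rate: float | None, is_darwin: bool, macmon_path: str | None
) -> float | None:
    """Hypothetical helper: decide the psutil poll rate once, before tasks start."""
    if is_darwin and macmon_path is None and default_rate is None:
        # macmon missing on macOS: fall back to 1-second psutil polling.
        return 1.0
    # macmon available (Darwin) or non-Darwin: keep the configured default.
    return default_rate
```

The result could be computed right after the shutil.which("macmon") call in run(), so the monitor's behavior no longer depends on when it reads self.memory_poll_rate.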

Verdict

Good fix for a real issue. Clean and minimal. The unrelated pyproject.toml change should be split out.

LGTM.

@AlexCheema AlexCheema merged commit e32b649 into main Feb 20, 2026
6 checks passed
@AlexCheema AlexCheema deleted the alexcheema/fix-memory-reporting branch February 20, 2026 13:12
adurham pushed a commit to adurham/exo that referenced this pull request Feb 27, 2026
…ing (exo-explore#1478)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>