Background
macOS manages unified memory dynamically and distinguishes between pageable and wired (pinned) memory. Metal GPU buffers used by `llama-server` are wired memory. Under memory pressure, macOS's wired collector kernel mechanism can evict GPU buffers, causing severe performance degradation or process crashes, with no warning to the Metal agent.
The agent currently has no visibility into memory pressure and cannot react to it. A process that was healthy at startup can silently degrade as other applications compete for unified memory.
This issue was informed by research on vllm-metal's memory allocation strategy, which highlights the wired collector problem on Apple Silicon.
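To make "visibility into memory pressure" a little more concrete, here is a minimal sketch of how wired memory could be read from `vm_stat` output. The function name `parseVMStat` and the sample text are hypothetical and not part of the agent today. The sketch assumes the usual `vm_stat` layout (a header carrying the page size, then `Pages ...:` counters), and the real implementation may well use `host_statistics64` instead.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// pageSizeRe extracts the page size from the vm_stat header line.
var pageSizeRe = regexp.MustCompile(`page size of (\d+) bytes`)

// parseVMStat (hypothetical helper) converts raw vm_stat output into byte
// counts keyed by counter label, for example "Pages wired down".
// Only "Pages ..." counters are kept, since other lines are event counts.
func parseVMStat(out string) (map[string]uint64, error) {
	m := pageSizeRe.FindStringSubmatch(out)
	if m == nil {
		return nil, fmt.Errorf("vm_stat: page size not found in header")
	}
	pageSize, err := strconv.ParseUint(m[1], 10, 64)
	if err != nil {
		return nil, err
	}

	stats := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		label, value, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		label = strings.TrimSpace(label)
		if !strings.HasPrefix(label, "Pages ") {
			continue // header line or a non-page counter
		}
		// Values are right-aligned and end with a period, e.g. "150000."
		value = strings.TrimSuffix(strings.TrimSpace(value), ".")
		pages, err := strconv.ParseUint(value, 10, 64)
		if err != nil {
			continue
		}
		stats[label] = pages * pageSize
	}
	return stats, sc.Err()
}

// sample is illustrative input, not captured from a real machine.
const sample = `Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                               10000.
Pages active:                            200000.
Pages wired down:                        150000.
`

func main() {
	stats, err := parseVMStat(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("wired: %d MiB\n", stats["Pages wired down"]>>20)
}
```

In the agent, the input string would presumably come from running `vm_stat` on a timer. Keeping the parser a pure function makes it testable on non-macOS CI runners.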
Implementation Plan
This is being implemented in multiple PRs to keep reviews manageable:
PR A: Memory pressure detection + metrics (Phase 1)
- `MemoryProvider` with `WiredMemory()` and `ProcessRSS()`
- `MemoryPressureLevel` type (Normal/Warning/Critical)
- `MemoryWatchdog` goroutine (configurable interval, thresholds)
- … (`--memory-watchdog-interval`, `--memory-pressure-warning`, `--memory-pressure-critical`, `--eviction-enabled`)
PR B: Proactive eviction + status reporting (Phase 2)
- … `onPressure` callback (lowest-priority process evicted first)
- … `MemoryPressure` condition to InferenceService status
- … `/memstats` endpoint to health server
Current State
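- No monitoring of system memory pressure or Metal buffer eviction
- `Healthy` flag is set once at startup and never re-evaluated (tracked in Metal agent: health checks, backpressure, and observability #171)
- No graceful degradation: if macOS kills a `llama-server` process, the agent doesn't know until the next health check (which doesn't exist continuously today)
- No memory-related status information surfaced to Kubernetes
Proposed Work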
Memory Pressure Monitoring
- Periodically query system memory stats (`vm_stat` / `host_statistics64`) during agent runtime
- … `llama-server` processes are consuming more memory than expected
Proactive Protection
- … `MemoryPressure` condition
Status Reporting
- … `/metrics` endpoint (ties into Metal agent: health checks, backpressure, and observability #171)
Documentation
- … `sysctl` settings)
References
- `pkg/agent/agent.go`: agent main loop
- `pkg/agent/executor.go`: process management
- `pkg/agent/watcher.go`: 5-second polling loop
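As a rough illustration of how the Phase 1 pressure levels and the Phase 2 "lowest-priority process evicted first" rule from the implementation plan might fit together, here is a small sketch. Only the `MemoryPressureLevel` name and its three levels come from the plan. The `classify` and `evictionOrder` functions, the `proc` type, and the reading of the thresholds as fractions of total memory in use are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// MemoryPressureLevel mirrors the Normal/Warning/Critical levels in the plan.
type MemoryPressureLevel int

const (
	Normal MemoryPressureLevel = iota
	Warning
	Critical
)

// String makes levels print by name in logs.
func (l MemoryPressureLevel) String() string {
	return [...]string{"Normal", "Warning", "Critical"}[l]
}

// classify maps a used-memory fraction to a pressure level.
// Assumption: both thresholds are fractions in [0, 1] with warning <= critical.
func classify(usedFraction, warning, critical float64) MemoryPressureLevel {
	switch {
	case usedFraction >= critical:
		return Critical
	case usedFraction >= warning:
		return Warning
	default:
		return Normal
	}
}

// proc is a hypothetical record for a managed llama-server process.
type proc struct {
	Name     string
	Priority int // assumption: a lower value means lower priority
}

// evictionOrder returns a sorted copy with the lowest-priority process first,
// leaving the caller's slice untouched.
func evictionOrder(procs []proc) []proc {
	out := append([]proc(nil), procs...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Priority < out[j].Priority
	})
	return out
}

func main() {
	procs := []proc{{"chat", 10}, {"batch", 1}, {"embed", 5}}

	level := classify(0.93, 0.80, 0.90)
	fmt.Println("level:", level)

	// Evict only at Critical; a Warning would just be reported.
	if level == Critical {
		fmt.Println("evict first:", evictionOrder(procs)[0].Name)
	}
}
```

Keeping classification and ordering as pure functions separates policy from the watchdog's timer loop, so the thresholds can be unit tested without real memory pressure.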