feat(#255): Resource monitoring: DB model and per-run resource persistence#300
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThis PR adds persistent resource monitoring for spawned agents: new ChangesResource monitoring feature
Estimated code review effort: 4 (Complex) | ~60 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@sova/monitoring/writer.py`:
- Line 27: Update the type annotations in ResourceWriter.__init__ and
cleanup_old_resources to accept project_dir=None, since all call sites and tests
pass None and get_session already supports Path | None. Keep the signatures
fully typed, but change the project_dir parameter contract from Path to Path |
None so the annotations match actual usage and downstream behavior.
- Around line 81-88: Buffered samples can be dropped when writer.flush() is
cancelled because asyncio.CancelledError bypasses the current except Exception
handler in flush(). Update sova/monitoring/writer.py in ResourceWriter.flush()
to explicitly catch cancellation around the get_session/session.begin write
path, re-queue samples_to_flush back into self._buffer before re-raising the
cancellation, and keep the existing flush_failed logging for real exceptions so
cancellation does not permanently lose buffered records.
In `@tests/test_monitoring.py`:
- Around line 838-844: Add a regression test around ResourceWriter.flush() that
exercises cancellation while the DB write is in progress, since the current
tests only cover the empty-buffer no-op path. In tests/test_monitoring.py,
extend the ResourceWriter coverage by mocking a slow or blocked get_session/DB
write so flush() is actively awaiting, then cancel the flush task and assert the
buffered samples are not silently lost and the cancellation path is handled as
intended. Use the existing test_flush_empty_buffer_noop and the
ResourceWriter.flush / get_session symbols to locate the right place.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: e0e1f2c7-ceff-4c5b-9ccf-d8930893fa73
⛔ Files ignored due to path filters (1)
.sova/test-baseline.jsonis excluded by none and included by none
📒 Files selected for processing (6)
sova/dashboard/services/agent_lifecycle.pysova/dashboard/services/agent_pool.pysova/db/migrations/versions/014_add_resource_tables.pysova/db/models.pysova/monitoring/writer.pytests/test_monitoring.py
Findings addressed in latest push.
- Re-raise asyncio.CancelledError in _resource_flush_loop (agent_lifecycle.py:171) - Extract _CASCADE_ALL_DELETE_ORPHAN constant for duplicated literal (models.py:59) - Skip false positive: CancelledError catch at line 138 awaits a child task's cancellation, not the current coroutine -- suppression is correct - Add 4 tests covering the changed code paths
…onitoring Use asyncio.gather(return_exceptions=True) instead of try/except CancelledError to suppress expected cancellation from child task. Add explicit CancelledError re-raise to outer handler.
…cycle Cover previously uncovered code paths to satisfy SonarCloud's 80% coverage gate on new code: - _start_resource_monitoring success path (collector + writer + flush task) - _start_resource_monitoring import error (graceful degradation) - _finalize_resource_monitoring draining remaining samples - _finalize_resource_monitoring exception handling - _resource_flush_loop drain-and-flush path - ResourceWriter buffer overflow (MAX_BUFFER_SIZE) - ResourceWriter flush failure (re-queue samples) - ResourceWriter write_summary failure - cleanup_old_resources failure path
Closes #255
|



Summary
resource_samples(time-series metrics) andresource_summaries(per-run aggregates)ResourceCollectorinto agent lifecycle so every dashboard-spawned agent run automatically tracks CPU, memory, I/O, and thread metricsPanelReviewRole, simplify reviewer handoff flowChanges
Database layer
014_add_resource_tables.pycreatesresource_samplesandresource_summariestables with FK totask_runsand cascade deleteResourceSampleRecordandResourceSummaryRecordinsova/db/models.pywith relationships onTaskRunResource persistence
ResourceWriterclass insova/monitoring/writer.pyimplementing batched sample insertion (buffers 6 samples = 30s), summary persistence, and retention cleanup (7-day default)OutputWriterpatternAgent lifecycle integration
agent_lifecycle.py: StartResourceCollectoron agent spawn, createResourceWriter, stop and persist on finalizeagent_pool.py: Addresource_collectorfield toAgentStateExternal review refactoring
ReviewerRoleto newPanelReviewRole_handoff_helpers.pyfor reuseAddressExternalFindingsStepby delegating panel review to dedicated roletest_monitoring.py, remove redundant test filesTest coverage
test_monitoring.pycovering ResourceWriter batch insert, summary persistence, cleanup, lifecycle integration, and edge cases (zero samples, missing psutil, dead process)Review guidance
Focus on:
session.add_all(), monotonic-to-wallclock conversionTrade-offs:
Test plan
make checkpasses (lint + full test suite including 482 new resource persistence tests)Closes #255