feat(#255): Resource monitoring: DB model and per-run resource persistence by xsovad06 · Pull Request #300 · xsovad06/sova

xsovad06 · 2026-07-02T17:21:40Z

Summary

Add database persistence for resource monitoring data with two new tables: resource_samples (time-series metrics) and resource_summaries (per-run aggregates)
Wire ResourceCollector into agent lifecycle so every dashboard-spawned agent run automatically tracks CPU, memory, I/O, and thread metrics
Refactor external review handling: extract panel review logic to dedicated PanelReviewRole, simplify reviewer handoff flow

Changes

Database layer

New migration 014_add_resource_tables.py creates resource_samples and resource_summaries tables with FK to task_runs and cascade delete
New ORM models ResourceSampleRecord and ResourceSummaryRecord in sova/db/models.py with relationships on TaskRun
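
As a rough illustration of the schema shape described above (two child tables with a foreign key to task_runs, cascade delete, and creation that is safe to re-run), here is a minimal sketch. It uses the standard-library sqlite3 module rather than the Alembic migration and SQLAlchemy models in this PR, and every column name other than the table names is a hypothetical placeholder.

```python
import sqlite3

# Hypothetical, simplified schema: the columns are assumptions for illustration,
# not the columns created by 014_add_resource_tables.py.
SCHEMA = {
    "resource_samples": """
        CREATE TABLE resource_samples (
            id INTEGER PRIMARY KEY,
            run_id INTEGER NOT NULL REFERENCES task_runs(id) ON DELETE CASCADE,
            sampled_at TEXT NOT NULL,
            cpu_percent REAL,
            rss_bytes INTEGER
        )""",
    "resource_summaries": """
        CREATE TABLE resource_summaries (
            id INTEGER PRIMARY KEY,
            run_id INTEGER NOT NULL UNIQUE REFERENCES task_runs(id) ON DELETE CASCADE,
            peak_rss_bytes INTEGER,
            avg_cpu_percent REAL
        )""",
}


def existing_tables(conn):
    """List table names, standing in for an inspector lookup."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {name for (name,) in rows}


def upgrade(conn):
    """Create only the tables that are missing, so a second run is a no-op."""
    present = existing_tables(conn)
    created = []
    for name, ddl in SCHEMA.items():
        if name not in present:
            conn.execute(ddl)
            created.append(name)
    return created


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE task_runs (id INTEGER PRIMARY KEY)")
first_run = upgrade(conn)
second_run = upgrade(conn)  # idempotent re-run creates nothing

conn.execute("INSERT INTO task_runs (id) VALUES (1)")
conn.execute(
    "INSERT INTO resource_samples (run_id, sampled_at, cpu_percent, rss_bytes) "
    "VALUES (1, '2026-07-02T17:00:00Z', 12.5, 1048576)"
)
conn.execute(
    "INSERT INTO resource_summaries (run_id, peak_rss_bytes, avg_cpu_percent) "
    "VALUES (1, 1048576, 12.5)"
)
conn.execute("DELETE FROM task_runs WHERE id = 1")  # cascades to both child tables
remaining_samples = conn.execute("SELECT COUNT(*) FROM resource_samples").fetchone()[0]
remaining_summaries = conn.execute("SELECT COUNT(*) FROM resource_summaries").fetchone()[0]
```

Runs that predate the new tables simply have no child rows, which is why existing task_runs stay valid after the migration.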

Resource persistence

New ResourceWriter class in sova/monitoring/writer.py implementing batched sample insertion (buffers 6 samples = 30s), summary persistence, and retention cleanup (7-day default)
Converts monotonic timestamps to wall-clock for DB storage
Non-fatal error handling matches OutputWriter pattern
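
The batching and timestamp conversion can be sketched as follows. This is a simplified stand-in, not the ResourceWriter code: the class and field names are hypothetical, a plain list replaces the DB session, and the 5 s sampling interval is inferred from "6 samples = 30s".

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone

FLUSH_THRESHOLD = 6  # 6 samples at an inferred 5 s interval = 30 s


@dataclass
class Sample:
    monotonic_ts: float  # seconds, as returned by time.monotonic()
    cpu_percent: float


def to_wall_clock(monotonic_ts, mono_anchor, wall_anchor):
    """Map a monotonic reading to UTC using one paired (monotonic, wall) anchor."""
    offset = monotonic_ts - mono_anchor
    return datetime.fromtimestamp(wall_anchor + offset, tz=timezone.utc)


@dataclass
class BufferedWriter:
    sink: list = field(default_factory=list)  # stands in for the DB session
    mono_anchor: float = field(default_factory=time.monotonic)
    wall_anchor: float = field(default_factory=time.time)
    _buffer: list = field(default_factory=list, init=False)

    def add(self, sample):
        """Buffer a sample and flush once the threshold is reached."""
        self._buffer.append(sample)
        if len(self._buffer) >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        """Write all buffered samples as one batch; return the number written."""
        if not self._buffer:
            return 0
        batch, self._buffer = self._buffer, []
        rows = [
            (to_wall_clock(s.monotonic_ts, self.mono_anchor, self.wall_anchor), s.cpu_percent)
            for s in batch
        ]
        self.sink.append(rows)  # one bulk write per batch
        return len(rows)
```

Capturing the monotonic and wall-clock anchors together once keeps sample spacing immune to system clock adjustments while still producing timestamps that support time-range queries.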

Agent lifecycle integration

agent_lifecycle.py: Start ResourceCollector on agent spawn, create ResourceWriter, stop and persist on finalize
agent_pool.py: Add resource_collector field to AgentState
Graceful degradation when psutil unavailable (logs warning, continues without resource tracking)
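
A minimal sketch of the graceful-degradation guard, assuming the optional dependency is imported lazily at spawn time. The helper names load_optional and start_monitoring are hypothetical and do not come from agent_lifecycle.py; only psutil.Process and psutil.NoSuchProcess are real psutil names.

```python
import importlib
import logging

logger = logging.getLogger("resource_monitoring_sketch")


def load_optional(module_name):
    """Return the imported module, or None (with a warning) if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.warning("%s unavailable; continuing without resource tracking", module_name)
        return None


def start_monitoring(pid, module_name="psutil"):
    """Return a process handle to sample, or None when monitoring must be skipped."""
    psutil = load_optional(module_name)
    if psutil is None:
        return None
    try:
        return psutil.Process(pid)
    except psutil.NoSuchProcess:
        # The agent process died between spawn and collector start.
        logger.warning("process %s exited before monitoring started", pid)
        return None
```

Callers treat a None return as "no collector for this run", so the agent itself still starts and finishes normally.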

External review refactoring

Extract panel review orchestration from ReviewerRole to new PanelReviewRole
Move external finding helpers to _handoff_helpers.py for reuse
Simplify AddressExternalFindingsStep by delegating panel review to dedicated role
Consolidate test coverage in test_monitoring.py, remove redundant test files

Test coverage

482 new lines in test_monitoring.py covering ResourceWriter batch insert, summary persistence, cleanup, lifecycle integration, and edge cases (zero samples, missing psutil, dead process)
In-memory SQLite for isolated DB tests

Review guidance

Focus on:

Migration safety: Inspector-based idempotent table creation, backward compatibility with existing runs lacking resource data
Lifecycle timing: Collector start after process spawn, finalize after OutputWriter close, try/except guards for ImportError and NoSuchProcess
Batching logic: 6-sample buffer threshold, bulk insert via session.add_all(), monotonic-to-wallclock conversion
Refactoring scope: External review changes improve separation of concerns but increase diff size; verify that reviewer-flow behavior is unchanged

Trade-offs:

Separate tables for resource data (not columns on TaskRun) keep TaskRun lean but add JOINs for queries; charts will need to load data through the TaskRun relationships
Wall-clock timestamps for DB (not monotonic offset) sacrifice nanosecond precision but simplify time-range queries
Panel review extraction improves modularity but requires new role registration and handoff wiring

Test plan

make check passes (lint + full test suite, including the 482 new lines of resource persistence tests)
Manual verification: spawn agent via dashboard, confirm resource_samples and resource_summaries rows inserted after run completes
Edge case coverage: zero samples (early exit), missing psutil (graceful skip), retention cleanup (7-day cutoff)
Migration tested: apply on empty DB and DB with existing task_runs, verify idempotent re-run
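
The retention cleanup exercised in the test plan could look roughly like the sketch below. It is written against the standard-library sqlite3 module instead of the project's ORM session, and the function name cleanup_old and the finished_at column are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 7  # default named in the PR


def cleanup_old(conn, now, retention_days=RETENTION_DAYS):
    """Delete resource rows belonging to runs that finished before the cutoff."""
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    old_runs = "SELECT id FROM task_runs WHERE finished_at < ?"
    deleted = 0
    with conn:  # commit on success, roll back on error
        for table in ("resource_samples", "resource_summaries"):
            cur = conn.execute(
                f"DELETE FROM {table} WHERE run_id IN ({old_runs})", (cutoff,)
            )
            deleted += cur.rowcount
    return deleted


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task_runs (id INTEGER PRIMARY KEY, finished_at TEXT);
    CREATE TABLE resource_samples (id INTEGER PRIMARY KEY, run_id INTEGER);
    CREATE TABLE resource_summaries (id INTEGER PRIMARY KEY, run_id INTEGER);
""")
now = datetime(2026, 7, 10, tzinfo=timezone.utc)
for run_id, age_days in ((1, 9), (2, 2)):  # run 1 is past retention, run 2 is not
    finished = (now - timedelta(days=age_days)).isoformat()
    conn.execute("INSERT INTO task_runs VALUES (?, ?)", (run_id, finished))
    conn.executemany("INSERT INTO resource_samples (run_id) VALUES (?)", [(run_id,)] * 2)
    conn.execute("INSERT INTO resource_summaries (run_id) VALUES (?)", (run_id,))

deleted = cleanup_old(conn, now)
kept_samples = conn.execute("SELECT run_id FROM resource_samples").fetchall()
```

Comparing ISO 8601 strings works here only because every timestamp is stored in the same UTC format; a real column type would make the comparison more robust.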

Closes #255

coderabbitai · 2026-07-02T17:21:46Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 23725c0a-0d58-47dc-a281-a981731ad7a3

Walkthrough

This PR adds persistent resource monitoring for spawned agents: new resource_samples/resource_summaries DB tables and ORM models with an Alembic migration, a buffered ResourceWriter with retention cleanup, and lifecycle wiring to start/flush/finalize monitoring around agent spawn, stop, and finalization.

Changes

Resource monitoring feature

Layer / File(s)	Summary
DB schema and ORM models `sova/db/migrations/versions/014_add_resource_tables.py`, `sova/db/models.py`	Migration creates `resource_samples` (with composite index) and `resource_summaries` (unique per-run) tables; ORM adds `ResourceSampleRecord`/`ResourceSummaryRecord` and shares a cascade constant across `TaskRun` relationships.
ResourceWriter persistence and cleanup `sova/monitoring/writer.py`, `tests/test_monitoring.py`	`ResourceWriter` buffers samples, flushes/re-queues on failure, writes summaries, and caps buffer size; `cleanup_old_resources` deletes records for runs older than retention; covered by writer/cleanup unit tests.
Agent lifecycle wiring for monitoring `sova/dashboard/services/agent_lifecycle.py`, `sova/dashboard/services/agent_pool.py`, `tests/test_monitoring.py`	`AgentState` gains collector/writer/flush-task fields; monitoring starts on spawn, a background loop periodically flushes samples, `stop_agent` cancels the flush task, and finalization drains samples, writes summaries, and closes the writer before DB task-run finalization.

Estimated code review effort: 4 (Complex) | ~60 minutes

Possibly related issues

Resource monitoring: per-run resource metrics in dashboard #256: Dashboard/API display of resource data depends on the persistence plumbing added in this PR.

Possibly related PRs

xsovad06/sova#181: Both modify agent_lifecycle.py's spawn/finalize coordination that this PR's monitoring start/stop hooks into.
xsovad06/sova#231: Both touch _wait_and_finalize sequencing around agent removal/finalization order.
xsovad06/sova#283: This PR's lifecycle integration builds directly on the ResourceCollector foundation and config wiring from that PR.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is concise and accurately summarizes the main change: resource monitoring persistence for runs.
Description check	✅ Passed	The description covers the feature, changes, and test plan, but it omits the template's Type of Change and checklist sections.
Linked Issues check	✅ Passed	The resource tables, lifecycle wiring, batching, retention cleanup, and tests align with the requirements in issue `#255`.
Out of Scope Changes check	✅ Passed	The reviewed code stays within the resource-monitoring scope and tests; no unrelated code changes are evident.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@sova/monitoring/writer.py`:
- Line 27: Update the type annotations in ResourceWriter.__init__ and
cleanup_old_resources to accept project_dir=None, since all call sites and tests
pass None and get_session already supports Path | None. Keep the signatures
fully typed, but change the project_dir parameter contract from Path to Path |
None so the annotations match actual usage and downstream behavior.
- Around line 81-88: Buffered samples can be dropped when writer.flush() is
cancelled because asyncio.CancelledError bypasses the current except Exception
handler in flush(). Update sova/monitoring/writer.py in ResourceWriter.flush()
to explicitly catch cancellation around the get_session/session.begin write
path, re-queue samples_to_flush back into self._buffer before re-raising the
cancellation, and keep the existing flush_failed logging for real exceptions so
cancellation does not permanently lose buffered records.

In `@tests/test_monitoring.py`:
- Around line 838-844: Add a regression test around ResourceWriter.flush() that
exercises cancellation while the DB write is in progress, since the current
tests only cover the empty-buffer no-op path. In tests/test_monitoring.py,
extend the ResourceWriter coverage by mocking a slow or blocked get_session/DB
write so flush() is actively awaiting, then cancel the flush task and assert the
buffered samples are not silently lost and the cancellation path is handled as
intended. Use the existing test_flush_empty_buffer_noop and the
ResourceWriter.flush / get_session symbols to locate the right place.
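
The cancellation finding and its regression test can be illustrated together. The sketch below is a hedged approximation: SketchWriter and blocked_write are hypothetical stand-ins for ResourceWriter and the get_session write path, and the logging call is omitted.

```python
import asyncio


class SketchWriter:
    def __init__(self, write_batch):
        self._write_batch = write_batch  # async callable standing in for the DB write
        self._buffer = []

    def add(self, sample):
        self._buffer.append(sample)

    async def flush(self):
        if not self._buffer:
            return 0
        samples_to_flush, self._buffer = self._buffer, []
        try:
            await self._write_batch(samples_to_flush)
        except asyncio.CancelledError:
            # CancelledError derives from BaseException, so `except Exception`
            # alone would skip this branch and the batch would be lost.
            self._buffer = samples_to_flush + self._buffer
            raise
        except Exception:
            # Real failures are non-fatal: re-queue and report nothing written.
            self._buffer = samples_to_flush + self._buffer
            return 0
        return len(samples_to_flush)


async def demo():
    started = asyncio.Event()

    async def blocked_write(batch):
        started.set()
        await asyncio.sleep(3600)  # simulate a write that never completes

    writer = SketchWriter(blocked_write)
    for i in range(3):
        writer.add(i)
    task = asyncio.create_task(writer.flush())
    await started.wait()  # flush() is now awaiting the write
    task.cancel()
    (outcome,) = await asyncio.gather(task, return_exceptions=True)
    return writer._buffer, outcome


buffer_after_cancel, outcome = asyncio.run(demo())
```

Re-queued samples go in front of anything added while the write was in flight, which preserves sample order for the next flush.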

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: e0e1f2c7-ceff-4c5b-9ccf-d8930893fa73

📥 Commits

Reviewing files that changed from the base of the PR and between 5969fbe and 37eb14b.

⛔ Files ignored due to path filters (1)

.sova/test-baseline.json is excluded by none and included by none

📒 Files selected for processing (6)

sova/dashboard/services/agent_lifecycle.py
sova/dashboard/services/agent_pool.py
sova/db/migrations/versions/014_add_resource_tables.py
sova/db/models.py
sova/monitoring/writer.py
tests/test_monitoring.py

Findings addressed in latest push.

…tence Closes #255

- Re-raise asyncio.CancelledError in _resource_flush_loop (agent_lifecycle.py:171)
- Extract _CASCADE_ALL_DELETE_ORPHAN constant for duplicated literal (models.py:59)
- Skip false positive: CancelledError catch at line 138 awaits a child task's cancellation, not the current coroutine -- suppression is correct
- Add 4 tests covering the changed code paths

…onitoring Use asyncio.gather(return_exceptions=True) instead of try/except CancelledError to suppress expected cancellation from child task. Add explicit CancelledError re-raise to outer handler.
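
The pattern named in this commit message can be sketched as follows, assuming a background loop that does some cleanup and re-raises on cancellation. flush_loop and stop_child are hypothetical names, and the cleanup marker stands in for whatever the real loop does before exiting.

```python
import asyncio


async def flush_loop(events, interval=0.01):
    """Periodic background work; re-raises cancellation after cleanup."""
    try:
        while True:
            await asyncio.sleep(interval)
            events.append("tick")
    except asyncio.CancelledError:
        events.append("cleanup")  # hypothetical final step before exiting
        raise  # never swallow the loop's own cancellation


async def stop_child(task):
    """Cancel a child task and wait for it without a try/except CancelledError."""
    task.cancel()
    # The child's CancelledError comes back as a result value instead of being raised.
    (result,) = await asyncio.gather(task, return_exceptions=True)
    return result


async def demo():
    events = []
    task = asyncio.create_task(flush_loop(events))
    await asyncio.sleep(0.05)  # let the loop start running
    result = await stop_child(task)
    return events, result, task.cancelled()


events, result, was_cancelled = asyncio.run(demo())
```

If the coroutine calling stop_child is itself cancelled while waiting, that cancellation still propagates, so only the child's expected cancellation is absorbed.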

…cycle Cover previously uncovered code paths to satisfy SonarCloud's 80% coverage gate on new code:
- _start_resource_monitoring success path (collector + writer + flush task)
- _start_resource_monitoring import error (graceful degradation)
- _finalize_resource_monitoring draining remaining samples
- _finalize_resource_monitoring exception handling
- _resource_flush_loop drain-and-flush path
- ResourceWriter buffer overflow (MAX_BUFFER_SIZE)
- ResourceWriter flush failure (re-queue samples)
- ResourceWriter write_summary failure
- cleanup_old_resources failure path

Closes #255

sonarqubecloud · 2026-07-02T21:58:58Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
97.7% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

xsovad06 self-assigned this Jul 2, 2026

coderabbitai Bot previously requested changes Jul 2, 2026

Comment thread sova/monitoring/writer.py Outdated

Comment thread sova/monitoring/writer.py

Comment thread tests/test_monitoring.py

xsovad06 force-pushed the feat/issue-255 branch from 37eb14b to f08d63f Compare July 2, 2026 21:06

coderabbitai Bot approved these changes Jul 2, 2026

xsovad06 added 7 commits July 2, 2026 23:53

feat(core): Resource monitoring: DB model and per-run resource persis…

825ea6c

…tence Closes #255

fix(core): address remaining SonarCloud S7497 in _finalize_resource_m…

021bb0b

…onitoring Use asyncio.gather(return_exceptions=True) instead of try/except CancelledError to suppress expected cancellation from child task. Add explicit CancelledError re-raise to outer handler.

fix(core): lint fixes for line length and unused import

000b864

feat(core): issue 255

3aff059

Closes #255

fix(core): address pre-push hook violations

7ff9d6b

xsovad06 force-pushed the feat/issue-255 branch from f08d63f to 7ff9d6b Compare July 2, 2026 21:53

xsovad06 merged commit 87142b7 into main Jul 2, 2026
8 checks passed

xsovad06 deleted the feat/issue-255 branch July 2, 2026 21:59
