feat(#255): Resource monitoring: DB model and per-run resource persistence#300

Merged: xsovad06 merged 7 commits into main from feat/issue-255 on Jul 2, 2026
xsovad06 (Owner) commented Jul 2, 2026

Summary

  • Add database persistence for resource monitoring data with two new tables: resource_samples (time-series metrics) and resource_summaries (per-run aggregates)
  • Wire ResourceCollector into agent lifecycle so every dashboard-spawned agent run automatically tracks CPU, memory, I/O, and thread metrics
  • Refactor external review handling: extract panel review logic to dedicated PanelReviewRole, simplify reviewer handoff flow

Changes

Database layer

  • New migration 014_add_resource_tables.py creates resource_samples and resource_summaries tables with FK to task_runs and cascade delete
  • New ORM models ResourceSampleRecord and ResourceSummaryRecord in sova/db/models.py with relationships on TaskRun
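
To illustrate the FK-with-cascade-delete behavior described above, here is a minimal sketch using the standard-library sqlite3 module rather than the project's ORM and migration. The column names are illustrative assumptions and are not taken from 014_add_resource_tables.py; only the table names and the cascade relationship to task_runs come from this PR.

```python
import sqlite3

# Simplified, hypothetical schema: column names are illustrative assumptions,
# not the columns defined in the actual migration.
SCHEMA = """
CREATE TABLE task_runs (id INTEGER PRIMARY KEY);

CREATE TABLE resource_samples (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES task_runs(id) ON DELETE CASCADE,
    sampled_at REAL NOT NULL,
    cpu_percent REAL,
    rss_bytes INTEGER
);

CREATE TABLE resource_summaries (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL UNIQUE REFERENCES task_runs(id) ON DELETE CASCADE,
    peak_rss_bytes INTEGER,
    avg_cpu_percent REAL
);
"""


def open_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys (and cascades) when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def demo_cascade() -> tuple[int, int]:
    """Return the sample count before and after deleting the parent run."""
    conn = open_demo_db()
    conn.execute("INSERT INTO task_runs (id) VALUES (1)")
    conn.executemany(
        "INSERT INTO resource_samples (run_id, sampled_at, cpu_percent, rss_bytes) "
        "VALUES (1, ?, ?, ?)",
        [(0.0, 12.5, 1024), (5.0, 17.5, 2048)],
    )
    conn.execute(
        "INSERT INTO resource_summaries (run_id, peak_rss_bytes, avg_cpu_percent) "
        "VALUES (1, 2048, 15.0)"
    )
    count = "SELECT COUNT(*) FROM resource_samples"
    before = conn.execute(count).fetchone()[0]
    conn.execute("DELETE FROM task_runs WHERE id = 1")  # cascades to child rows
    after = conn.execute(count).fetchone()[0]
    conn.close()
    return before, after
```

Calling demo_cascade() inserts two samples for one run, deletes the run, and returns the sample counts before and after, which is the behavior the cascade is meant to guarantee for deleted runs.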

Resource persistence

  • New ResourceWriter class in sova/monitoring/writer.py implementing batched sample insertion (buffers 6 samples = 30s), summary persistence, and retention cleanup (7-day default)
  • Converts monotonic timestamps to wall-clock for DB storage
  • Non-fatal error handling matches OutputWriter pattern
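
The batching and timestamp conversion described above can be sketched roughly as follows. This is a simplified stand-in, not the ResourceWriter from sova/monitoring/writer.py: the class name BufferedSampleWriter, the sink callable (standing in for the bulk insert), and the sample fields are assumptions made for illustration. The 6-sample threshold comes from this PR; at one sample every 5 seconds that corresponds to the stated 30s.

```python
import time
from datetime import datetime, timezone

BATCH_SIZE = 6  # from the PR description: 6 samples = 30s


class BufferedSampleWriter:
    """Hypothetical stand-in for ResourceWriter, for illustration only."""

    def __init__(self, sink, batch_size=BATCH_SIZE):
        self._sink = sink  # callable taking a list of rows (stands in for a bulk insert)
        self._batch_size = batch_size
        self._buffer = []
        # Read both clocks once so later monotonic readings can be mapped
        # onto wall-clock time.
        self._wall_anchor = time.time()
        self._mono_anchor = time.monotonic()

    def to_wall_clock(self, mono_ts: float) -> datetime:
        offset = mono_ts - self._mono_anchor
        return datetime.fromtimestamp(self._wall_anchor + offset, tz=timezone.utc)

    def add(self, mono_ts: float, cpu_percent: float) -> None:
        self._buffer.append(
            {"sampled_at": self.to_wall_clock(mono_ts), "cpu_percent": cpu_percent}
        )
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        try:
            self._sink(batch)
        except Exception:
            # Non-fatal: put the batch back so a later flush can retry it.
            self._buffer = batch + self._buffer
```

With a plain list's append method as the sink, five add() calls leave the sink empty and the sixth delivers one batch of six rows.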

Agent lifecycle integration

  • agent_lifecycle.py: Start ResourceCollector on agent spawn, create ResourceWriter, stop and persist on finalize
  • agent_pool.py: Add resource_collector field to AgentState
  • Graceful degradation when psutil unavailable (logs warning, continues without resource tracking)
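
A minimal sketch of the graceful-degradation guard described above, assuming a helper that either returns a process handle or None. The function name try_start_monitoring and its module_name parameter are hypothetical (the parameter exists only so the fallback path can be exercised without uninstalling psutil); psutil.Process and psutil.NoSuchProcess are real psutil names.

```python
import importlib
import logging

logger = logging.getLogger("resource_monitoring_sketch")


def try_start_monitoring(pid: int, module_name: str = "psutil"):
    """Return a process handle for pid, or None if monitoring cannot start.

    Hypothetical helper: the PR's real wiring lives in agent_lifecycle.py.
    """
    try:
        psutil = importlib.import_module(module_name)
    except ImportError:
        logger.warning(
            "%s unavailable; continuing without resource tracking", module_name
        )
        return None
    try:
        return psutil.Process(pid)
    except psutil.NoSuchProcess:
        # The agent process exited before monitoring could attach.
        logger.warning("process %s is gone; skipping resource tracking", pid)
        return None
```

The caller treats None as "no collector" and carries on, so a missing dependency or a dead process never blocks the agent run itself.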

External review refactoring

  • Extract panel review orchestration from ReviewerRole to new PanelReviewRole
  • Move external finding helpers to _handoff_helpers.py for reuse
  • Simplify AddressExternalFindingsStep by delegating panel review to dedicated role
  • Consolidate test coverage in test_monitoring.py, remove redundant test files

Test coverage

  • 482 new lines in test_monitoring.py covering ResourceWriter batch insert, summary persistence, cleanup, lifecycle integration, and edge cases (zero samples, missing psutil, dead process)
  • In-memory SQLite for isolated DB tests

Review guidance

Focus on:

  • Migration safety: Inspector-based idempotent table creation, backward compatibility with existing runs lacking resource data
  • Lifecycle timing: Collector start after process spawn, finalize after OutputWriter close, try/except guards for ImportError and NoSuchProcess
  • Batching logic: 6-sample buffer threshold, bulk insert via session.add_all(), monotonic-to-wallclock conversion
  • Refactoring scope: External review changes improve separation of concerns but increase diff size—verify no behavior changes in reviewer flow

Trade-offs:

  • Separate tables for resource data (not columns on TaskRun) keep TaskRun lean but add JOINs for queries, so charts will need to load the data through the TaskRun relationships
  • Wall-clock timestamps for DB (not monotonic offset) sacrifice nanosecond precision but simplify time-range queries
  • Panel review extraction improves modularity but requires new role registration and handoff wiring

Test plan

  • make check passes (lint + full test suite, including the 482 new lines of resource persistence tests)
  • Manual verification: spawn agent via dashboard, confirm resource_samples and resource_summaries rows inserted after run completes
  • Edge case coverage: zero samples (early exit), missing psutil (graceful skip), retention cleanup (7-day cutoff)
  • Migration tested: apply on empty DB and DB with existing task_runs, verify idempotent re-run

Closes #255

@xsovad06 xsovad06 self-assigned this Jul 2, 2026

coderabbitai Bot commented Jul 2, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 23725c0a-0d58-47dc-a281-a981731ad7a3

Walkthrough

This PR adds persistent resource monitoring for spawned agents: new resource_samples/resource_summaries DB tables and ORM models with an Alembic migration, a buffered ResourceWriter with retention cleanup, and lifecycle wiring to start/flush/finalize monitoring around agent spawn, stop, and finalization.

Changes

Resource monitoring feature

| Layer / File(s) | Summary |
| --- | --- |
| DB schema and ORM models: sova/db/migrations/versions/014_add_resource_tables.py, sova/db/models.py | Migration creates resource_samples (with composite index) and resource_summaries (unique per-run) tables; ORM adds ResourceSampleRecord/ResourceSummaryRecord and shares a cascade constant across TaskRun relationships. |
| ResourceWriter persistence and cleanup: sova/monitoring/writer.py, tests/test_monitoring.py | ResourceWriter buffers samples, flushes/re-queues on failure, writes summaries, and caps buffer size; cleanup_old_resources deletes records for runs older than retention; covered by writer/cleanup unit tests. |
| Agent lifecycle wiring for monitoring: sova/dashboard/services/agent_lifecycle.py, sova/dashboard/services/agent_pool.py, tests/test_monitoring.py | AgentState gains collector/writer/flush-task fields; monitoring starts on spawn, a background loop periodically flushes samples, stop_agent cancels the flush task, and finalization drains samples, writes summaries, and closes the writer before DB task-run finalization. |
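
As a rough illustration of the retention cleanup summarized above, the sketch below deletes resource rows for runs older than a cutoff using plain sqlite3. It is not the PR's cleanup_old_resources: the finished_at column, the use of Unix timestamps, and the function name are assumptions made for the example. The 7-day default is from the PR description.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 7  # default from the PR description


def cleanup_old_resources_sketch(conn, now, retention_days=RETENTION_DAYS):
    """Delete resource rows belonging to runs that finished before the cutoff.

    Returns the number of deleted rows, or 0 if the cleanup failed.
    """
    cutoff = (now - timedelta(days=retention_days)).timestamp()
    # finished_at is a hypothetical column holding Unix seconds.
    old_runs = "SELECT id FROM task_runs WHERE finished_at < ?"
    deleted = 0
    try:
        for table in ("resource_samples", "resource_summaries"):
            cur = conn.execute(
                f"DELETE FROM {table} WHERE run_id IN ({old_runs})", (cutoff,)
            )
            deleted += cur.rowcount
        conn.commit()
    except sqlite3.Error:
        conn.rollback()  # non-fatal: leave the rows for the next cleanup pass
        return 0
    return deleted


def demo_cleanup():
    """Build a tiny in-memory DB with one old and one recent run, then clean up."""
    now = datetime(2026, 7, 2, tzinfo=timezone.utc)
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE task_runs (id INTEGER PRIMARY KEY, finished_at REAL);
        CREATE TABLE resource_samples (id INTEGER PRIMARY KEY, run_id INTEGER);
        CREATE TABLE resource_summaries (id INTEGER PRIMARY KEY, run_id INTEGER);
        """
    )
    old = (now - timedelta(days=10)).timestamp()
    recent = (now - timedelta(days=1)).timestamp()
    conn.executemany("INSERT INTO task_runs VALUES (?, ?)", [(1, old), (2, recent)])
    conn.executemany(
        "INSERT INTO resource_samples (run_id) VALUES (?)", [(1,), (1,), (2,)]
    )
    conn.executemany(
        "INSERT INTO resource_summaries (run_id) VALUES (?)", [(1,), (2,)]
    )
    deleted = cleanup_old_resources_sketch(conn, now)
    remaining = conn.execute("SELECT COUNT(*) FROM resource_samples").fetchone()[0]
    conn.close()
    return deleted, remaining
```

In demo_cleanup() the 10-day-old run loses its two samples and one summary while the 1-day-old run keeps its data.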

Estimated code review effort: 4 (Complex) | ~60 minutes

Possibly related PRs

  • xsovad06/sova#181: Both modify agent_lifecycle.py's spawn/finalize coordination that this PR's monitoring start/stop hooks into.
  • xsovad06/sova#231: Both touch _wait_and_finalize sequencing around agent removal/finalization order.
  • xsovad06/sova#283: This PR's lifecycle integration builds directly on the ResourceCollector foundation and config wiring from that PR.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title is concise and accurately summarizes the main change: resource monitoring persistence for runs. |
| Description check | ✅ Passed | The description covers the feature, changes, and test plan, but it omits the template's Type of Change and checklist sections. |
| Linked Issues check | ✅ Passed | The resource tables, lifecycle wiring, batching, retention cleanup, and tests align with the requirements in issue #255. |
| Out of Scope Changes check | ✅ Passed | The reviewed code stays within the resource-monitoring scope and tests; no unrelated code changes are evident. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai Bot previously requested changes Jul 2, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@sova/monitoring/writer.py`:
- Line 27: Update the type annotations in ResourceWriter.__init__ and
cleanup_old_resources to accept project_dir=None, since all call sites and tests
pass None and get_session already supports Path | None. Keep the signatures
fully typed, but change the project_dir parameter contract from Path to Path |
None so the annotations match actual usage and downstream behavior.
- Around line 81-88: Buffered samples can be dropped when writer.flush() is
cancelled because asyncio.CancelledError bypasses the current except Exception
handler in flush(). Update sova/monitoring/writer.py in ResourceWriter.flush()
to explicitly catch cancellation around the get_session/session.begin write
path, re-queue samples_to_flush back into self._buffer before re-raising the
cancellation, and keep the existing flush_failed logging for real exceptions so
cancellation does not permanently lose buffered records.

In `@tests/test_monitoring.py`:
- Around line 838-844: Add a regression test around ResourceWriter.flush() that
exercises cancellation while the DB write is in progress, since the current
tests only cover the empty-buffer no-op path. In tests/test_monitoring.py,
extend the ResourceWriter coverage by mocking a slow or blocked get_session/DB
write so flush() is actively awaiting, then cancel the flush task and assert the
buffered samples are not silently lost and the cancellation path is handled as
intended. Use the existing test_flush_empty_buffer_noop and the
ResourceWriter.flush / get_session symbols to locate the right place.
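
The cancellation finding above can be illustrated with a small sketch. It assumes a simplified writer whose DB write is an injected async callable; CancellableFlushWriter, write_batch, and demo_cancelled_flush are hypothetical names, not the PR's code. The key point is that asyncio.CancelledError derives from BaseException (Python 3.8+), so an `except Exception` handler alone does not see it.

```python
import asyncio


class CancellableFlushWriter:
    """Hypothetical sketch of a flush() that survives cancellation."""

    def __init__(self, write_batch):
        self._write_batch = write_batch  # async callable standing in for the DB write
        self._buffer = []

    def add(self, sample) -> None:
        self._buffer.append(sample)

    async def flush(self) -> None:
        if not self._buffer:
            return
        samples_to_flush, self._buffer = self._buffer, []
        try:
            await self._write_batch(samples_to_flush)
        except asyncio.CancelledError:
            # Re-queue before propagating so the samples are not lost.
            self._buffer = samples_to_flush + self._buffer
            raise
        except Exception:
            # Real failure: re-queue and stay non-fatal (a real writer would log here).
            self._buffer = samples_to_flush + self._buffer


async def demo_cancelled_flush():
    """Cancel a flush while its write is blocked; report what remains buffered."""
    started = asyncio.Event()

    async def blocked_write(batch):
        started.set()
        await asyncio.sleep(3600)  # simulate a stalled DB write

    writer = CancellableFlushWriter(blocked_write)
    writer.add({"cpu_percent": 1.0})
    writer.add({"cpu_percent": 2.0})
    task = asyncio.create_task(writer.flush())
    await started.wait()  # make sure flush() is actively awaiting the write
    task.cancel()
    try:
        await task
        was_cancelled = False
    except asyncio.CancelledError:
        was_cancelled = True
    return list(writer._buffer), was_cancelled
```

A regression test along the lines requested could run demo_cancelled_flush() and check that both samples are still buffered and that the cancellation propagated.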

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: e0e1f2c7-ceff-4c5b-9ccf-d8930893fa73

📥 Commits

Reviewing files that changed from the base of the PR and between 5969fbe and 37eb14b.

⛔ Files ignored due to path filters (1)
  • .sova/test-baseline.json is excluded by none and included by none
📒 Files selected for processing (6)
  • sova/dashboard/services/agent_lifecycle.py
  • sova/dashboard/services/agent_pool.py
  • sova/db/migrations/versions/014_add_resource_tables.py
  • sova/db/models.py
  • sova/monitoring/writer.py
  • tests/test_monitoring.py

@xsovad06 xsovad06 dismissed coderabbitai[bot]’s stale review July 2, 2026 21:06

Findings addressed in latest push.

xsovad06 added 7 commits July 2, 2026 23:53
- Re-raise asyncio.CancelledError in _resource_flush_loop (agent_lifecycle.py:171)
- Extract _CASCADE_ALL_DELETE_ORPHAN constant for duplicated literal (models.py:59)
- Skip false positive: CancelledError catch at line 138 awaits a child task's
  cancellation, not the current coroutine -- suppression is correct
- Add 4 tests covering the changed code paths
…onitoring

Use asyncio.gather(return_exceptions=True) instead of try/except
CancelledError to suppress expected cancellation from child task.
Add explicit CancelledError re-raise to outer handler.
…cycle

Cover previously uncovered code paths to satisfy SonarCloud's 80% coverage
gate on new code:

- _start_resource_monitoring success path (collector + writer + flush task)
- _start_resource_monitoring import error (graceful degradation)
- _finalize_resource_monitoring draining remaining samples
- _finalize_resource_monitoring exception handling
- _resource_flush_loop drain-and-flush path
- ResourceWriter buffer overflow (MAX_BUFFER_SIZE)
- ResourceWriter flush failure (re-queue samples)
- ResourceWriter write_summary failure
- cleanup_old_resources failure path
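
The cancellation handling described in these commit messages can be sketched as below, under the assumption of a simple periodic loop. The names flush_loop, stop_flush_task, and demo_flush_loop are hypothetical and the real _resource_flush_loop may differ; asyncio.gather(..., return_exceptions=True) is the standard-library call named in the commit message.

```python
import asyncio


async def flush_loop(flush, interval: float) -> None:
    """Hypothetical periodic flush loop."""
    try:
        while True:
            await asyncio.sleep(interval)
            try:
                await flush()
            except Exception:
                pass  # non-fatal: the next tick retries (a real loop would log here)
    except asyncio.CancelledError:
        # Explicit re-raise so the task finishes in the cancelled state.
        raise


async def stop_flush_task(task: asyncio.Task) -> None:
    """Cancel the child task and wait for it without raising its CancelledError."""
    task.cancel()
    # With return_exceptions=True the child's CancelledError is returned as a
    # result instead of being raised into the caller.
    await asyncio.gather(task, return_exceptions=True)


async def demo_flush_loop():
    calls = []

    async def flush():
        calls.append(1)

    task = asyncio.create_task(flush_loop(flush, 0.01))
    await asyncio.sleep(0.1)  # let the loop tick a few times
    await stop_flush_task(task)
    return len(calls), task.cancelled()
```

One property of this shape is that only the child's expected cancellation is absorbed: if the coroutine calling stop_flush_task is itself cancelled while waiting, that cancellation still propagates.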

sonarqubecloud Bot commented Jul 2, 2026

@xsovad06 xsovad06 merged commit 87142b7 into main Jul 2, 2026
8 checks passed
@xsovad06 xsovad06 deleted the feat/issue-255 branch July 2, 2026 21:59