feat(experiments): Denormalize rollup metrics onto Experiment + debounced refresh#424
feat(experiments): Denormalize rollup metrics onto Experiment + debounced refresh#424shanaiabuggy wants to merge 4 commits into
Conversation
…nced refresh Signed-off-by: shanaiabuggy <59746633+shanaiabuggy@users.noreply.github.com>
Signed-off-by: shanaiabuggy <59746633+shanaiabuggy@users.noreply.github.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a background ChangesExperiment Rollup Background Refresh
Sequence Diagram(s)sequenceDiagram
participant Client
participant IngestEndpoint as ingest_atif /<br/> ingest_chat_completion
participant ExperimentRollupRefresher
participant ExperimentRollupRepository as ClickHouse
participant EntityClient as Entity Store
Client->>IngestEndpoint: POST /ingest (with experiment_context)
IngestEndpoint->>EntityClient: store span batch
IngestEndpoint->>ExperimentRollupRefresher: mark_dirty(workspace, experiment_id)
Note over ExperimentRollupRefresher: background _run loop fires after interval_seconds
ExperimentRollupRefresher->>ExperimentRollupRefresher: flush() drains _dirty, groups by workspace
ExperimentRollupRefresher->>ExperimentRollupRepository: get_rollups(workspace, [experiment_ids])
ExperimentRollupRepository-->>ExperimentRollupRefresher: dict[str, ExperimentRollup]
loop per rollup
ExperimentRollupRefresher->>EntityClient: get(Experiment)
ExperimentRollupRefresher->>ExperimentRollupRefresher: rollup_to_metrics(rollup, refreshed_at)
ExperimentRollupRefresher->>EntityClient: update(experiment.metrics = metrics_dict)
end
Client->>IngestEndpoint: GET /experiments
IngestEndpoint->>EntityClient: list Experiments
alt entity.metrics present
IngestEndpoint->>IngestEndpoint: _cached_rollup(entity.metrics) — no ClickHouse call
else entity.metrics absent
IngestEndpoint->>ExperimentRollupRepository: _hydrate_rollups
end
IngestEndpoint-->>Client: enriched experiment list
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
services/intake/src/nmp/intake/spans/experiment_rollup_refresher.py (1)
106-106: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse a concrete type for
metricsin_write_metrics.
metrics: dictis too broad; usedict[str, Any]for concrete typing.
As per coding guidelines, "**/*.py: Always prefer concrete type hints over string-based ones in Python code; do not import types under TYPE_CHECKING, instead import types as regular imports when possible."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/intake/src/nmp/intake/spans/experiment_rollup_refresher.py` at line 106, In the `_write_metrics` method signature, replace the `metrics: dict` parameter type annotation with `dict[str, Any]` to provide a concrete type hint. Import `Any` from the `typing` module as a regular import at the top of the file (not under TYPE_CHECKING) to support this type annotation.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@services/intake/src/nmp/intake/api/v2/experiments/endpoints.py`:
- Around line 374-378: The call to metrics_to_rollup and _apply_rollup on line
375 can throw an exception if entity.metrics is malformed or incompatible, which
causes the entire list_experiments response to fail for that single bad row.
Wrap the _apply_rollup call in a try-except block, and when an exception is
caught during decoding, treat it as a cache miss by appending the response to
the needs_live_hydrate list just like the else branch does, rather than letting
the exception propagate and fail the entire operation.
In `@services/intake/src/nmp/intake/spans/experiment_rollup_refresher.py`:
- Around line 57-67: The stop method can lose experiment updates if the
cancellation occurs while flush is executing. After batch is detached from
self._dirty at lines 83-84, a cancellation would leave those IDs unwritten and
un-requeued. Instead of immediately cancelling the task after setting the
stopping flag, allow the task loop to naturally exit by checking the
self._stopping flag (which is already set), or modify the flush method to be
cancellation-safe by catching asyncio.CancelledError and re-queuing the detached
batch back to self._dirty before the exception propagates. Either approach
ensures no queued updates are lost during shutdown.
In `@services/intake/src/nmp/intake/spans/experiment_rollup_repository.py`:
- Around line 72-75: The metrics_to_rollup function (lines 92-106) reads stored
metrics without validating the version field that rollup_to_metrics writes (line
80), creating forward/backward compatibility issues. Add a version check at the
beginning of metrics_to_rollup that compares the "version" field in the metrics
payload against METRICS_VERSION; if they do not match, route to a fallback
re-hydration/recompute path instead of proceeding with the blind parsing that
could cause misreads or int() conversion errors.
---
Nitpick comments:
In `@services/intake/src/nmp/intake/spans/experiment_rollup_refresher.py`:
- Line 106: In the `_write_metrics` method signature, replace the `metrics:
dict` parameter type annotation with `dict[str, Any]` to provide a concrete type
hint. Import `Any` from the `typing` module as a regular import at the top of
the file (not under TYPE_CHECKING) to support this type annotation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d4b0c327-18aa-4532-b112-bb8fde55fa11
⛔ Files ignored due to path filters (1)
web/packages/sdk/generated/agents/schema/DeploymentLogsResponse.tsis excluded by!**/generated/**
📒 Files selected for processing (11)
services/intake/src/nmp/intake/api/v2/experiments/endpoints.pyservices/intake/src/nmp/intake/config.pyservices/intake/src/nmp/intake/entities/experiments.pyservices/intake/src/nmp/intake/service.pyservices/intake/src/nmp/intake/spans/api/dependencies.pyservices/intake/src/nmp/intake/spans/experiment_rollup_refresher.pyservices/intake/src/nmp/intake/spans/experiment_rollup_repository.pyservices/intake/src/nmp/intake/spans/ingest/atif.pyservices/intake/src/nmp/intake/spans/ingest/chat_completions.pyservices/intake/tests/integration/spans/test_experiment_metrics_refresh.pyservices/intake/tests/test_experiment_rollup_refresher.py
|
See linear issue description for details: https://linear.app/nvidia/issue/ASE-319/denormalize-rollup-metrics-onto-experiment-debounced-refresh
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Performance Improvements
Tests