feat(forecast): add model zoo and backtesting comparison#303
Conversation
PRP-36 (Forecast Intelligence — Slice B). Promote the model layer from "a regression model + three baselines" to a disciplined model zoo with fair, leakage-safe comparison on top of the PRP-35 Feature Frame V2 contract. Models (under model_factory + _MODEL_FAMILY_MAP): - weighted_moving_average — linear or exponential weighting (always-on) - seasonal_average — average of last N seasonal cycles, optional trim (always-on) - trend_regression_baseline — Ridge over elapsed-day + dow/month one-hots (always-on) - random_forest — sklearn RandomForestRegressor, n_jobs=1, gated by forecast_enable_random_forest Backtesting metrics (additive — V1 fold path only): - aggregated_metrics gains rmse alongside mae/smape/wape/bias - FoldResult.horizon_bucket_metrics — per-bucket dict keyed by h_1_7 / h_8_14 / h_15_28 / h_29_plus (empty buckets dropped) - ModelBacktestResult.bucketed_aggregated_metrics — per-bucket means across folds - V2 backtesting dispatch remains DEFERRED to #299 Registry comparable-run rule: - RegistryService._find_duplicate now distinguishes V1 vs V2 (runs with different feature_frame_version are NOT duplicates) - New find_comparable_runs(grain, overlapping window, same V, status==SUCCESS) - RunCreate.runtime_info_extras lets callers pin feature_frame_version + feature_groups - RunResponse.feature_frame_version + feature_groups computed from runtime_info (legacy runs surface None) Ops staleness: - New StaleReason enum value FEATURE_FRAME_VERSION_MISMATCH — a V1 alias with a newer V2 comparable run reports this instead of NEWER_SUCCESS_RUN - AliasHealth and ModelHealthEntry expose alias_feature_frame_version + comparable_run_feature_frame_version Explainability: - New explainers: WeightedMovingAverageExplainer, SeasonalAverageExplainer, TrendRegressionBaselineExplainer - Factory + service plumb weight_strategy / decay / lookback_cycles / trim_outliers - HGBR keeps raising FeatureImportanceUnavailableError (422 path unchanged) Other: - examples/forecasting/model_zoo_compare.py — read-only diagnostic that backtests every available model on a single grain and prints aggregate + per-bucket WAPE - docs/_base/API_CONTRACTS.md, DOMAIN_MODEL.md, docs/optional-features/05 + 09 updated Validation: - ruff check / format clean - mypy --strict / pyright --strict clean (3 mypy + 8 pyright pre-existing xgboost/lightgbm errors only; CI runs --all-extras) - 1574 non-integration tests pass; load-bearing leakage specs unchanged - alembic check — NO new migration (all new state rides existing JSONB columns) Out of scope (deferred): - V2 backtesting fold dispatch — #299 - PRP-37 UI / dashboard — Slice C - /explain/forecast handler for random_forest — needs bundle reload, separate PRP
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Reviewer's GuideIntroduces a richer forecasting model zoo (three new target-only baselines plus an optional RandomForest forecaster), adds RMSE and horizon-bucket metrics to backtesting, and wires feature-frame-version awareness into registry and ops so runs and aliases can be compared fairly across model and feature-frame versions, with matching explainability, tests, example script, and docs updates. Sequence diagram for backtesting with bucketed metricssequenceDiagram
actor Client
participant BacktestingService as BacktestingService
participant MetricsCalculator as MetricsCalculator
participant BucketMetrics as compute_bucket_metrics
Client->>BacktestingService: POST /backtesting/run
loop per_fold
BacktestingService->>BacktestingService: model.predict(horizon, X_test)
BacktestingService->>MetricsCalculator: calculate_all(actuals, predictions)
MetricsCalculator-->>BacktestingService: metrics{mae,rmse,smape,wape,bias}
BacktestingService->>BucketMetrics: compute_bucket_metrics(actuals, predictions, horizon_offsets)
BucketMetrics-->>BacktestingService: horizon_bucket_metrics
end
BacktestingService->>MetricsCalculator: aggregate_fold_metrics(fold_metrics)
MetricsCalculator-->>BacktestingService: aggregated_metrics, metric_std
BacktestingService->>MetricsCalculator: aggregate_bucket_metrics(fold_bucket_metrics)
MetricsCalculator-->>BacktestingService: bucketed_aggregated_metrics
BacktestingService-->>Client: ModelBacktestResult(aggregated_metrics.rmse, fold_results.horizon_bucket_metrics, bucketed_aggregated_metrics)
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 4 issues, and left some high level feedback:
- The seasonal/weighted moving average explainers reimplement parts of the corresponding forecaster logic (e.g. horizon=1 sampling and weight construction); consider factoring these into shared helpers or methods on the forecasters to keep behaviour changes in one place and avoid future drift.
- TrendRegressionBaselineExplainer reconstructs the elapsed-day/DOW/month design row separately from TrendRegressionBaselineForecaster; it would be safer to reuse the forecaster’s design helpers (or a shared utility) so any future encoding changes remain consistent between training and explanation.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The seasonal/weighted moving average explainers reimplement parts of the corresponding forecaster logic (e.g. horizon=1 sampling and weight construction); consider factoring these into shared helpers or methods on the forecasters to keep behaviour changes in one place and avoid future drift.
- TrendRegressionBaselineExplainer reconstructs the elapsed-day/DOW/month design row separately from TrendRegressionBaselineForecaster; it would be safer to reuse the forecaster’s design helpers (or a shared utility) so any future encoding changes remain consistent between training and explanation.
## Individual Comments
### Comment 1
<location path="app/features/ops/service.py" line_range="138" />
<code_context>
return "stable", delta
+def _run_feature_frame_version(run: ModelRun) -> int | None:
+ """Read ``feature_frame_version`` from ``run.runtime_info`` JSONB (PRP-36).
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Feature-frame version treats missing key as None here but as V=1 in filters, causing inconsistent staleness classification.
`_run_feature_frame_version` returns `None` when the key is missing, while `_feature_frame_version_filter` treats missing JSONB keys as `1`. As a result, legacy runs (no key) and runs with `feature_frame_version == 1` are considered equivalent in queries, but `_alias_staleness` and `get_model_health` will see `alias_v is None` vs `latest_v == 1` and report a `FEATURE_FRAME_VERSION_MISMATCH`. Please align the semantics by either normalising missing/invalid values to `1` here (or at callsites), or by changing `_feature_frame_version_filter` so that missing keys map to a distinct version instead of `1`.
</issue_to_address>
### Comment 2
<location path="app/features/explainability/explainers.py" line_range="280-289" />
<code_context>
+ arr = np.sort(arr)[1:-1]
+ forecast = float(arr.mean())
+ trim_note = " after trimming the min + max samples" if used_trim else ""
+ drivers = [
+ DriverContribution(
+ name="seasonal_window_mean",
+ feature_value=forecast,
+ contribution=forecast,
+ direction="positive",
+ description=(
+ f"The forecast averages the values from the last {len(samples)} "
+ f"matching seasonal positions (every {self.season_length} days){trim_note}."
+ ),
+ ),
+ DriverContribution(
+ name="sample_dispersion",
+ feature_value=float(np.std(samples)),
+ contribution=0.0,
+ direction="neutral",
</code_context>
<issue_to_address>
**suggestion:** SeasonalAverageExplainer mixes trimmed and untrimmed samples for dispersion, which may be inconsistent with the description.
When `trim_outliers` is enabled and `arr.size >= 4`, you trim min/max into `arr` for the forecast but still compute `sample_dispersion` from `np.std(samples)` (untrimmed). To avoid confusion, either compute dispersion from the trimmed `arr` as well, or update the description to state that dispersion uses the raw samples before trimming.
Suggested implementation:
```python
DriverContribution(
name="sample_dispersion",
feature_value=float(np.std(arr if used_trim else samples)),
contribution=0.0,
direction="neutral",
```
This change assumes:
1. `arr` is defined as a NumPy array containing either the trimmed or untrimmed seasonal samples immediately before this `drivers` list.
2. `used_trim` is a boolean flag indicating whether trimming was applied (`True` when `trim_outliers` is enabled and `arr.size >= 4` before trimming).
If `used_trim` does not yet exist, you should set it where you trim: e.g., `used_trim = trim_outliers and arr.size >= 4` and then conditionally apply `arr = np.sort(arr)[1:-1]` when `used_trim` is `True`.
</issue_to_address>
### Comment 3
<location path="app/features/forecasting/tests/test_random_forest_forecaster.py" line_range="135-138" />
<code_context>
+ with pytest.raises(ValueError, match="n_estimators"):
+ RandomForestForecaster(n_estimators=0)
+
+ def test_invalid_max_depth_raises(self) -> None:
+ """max_depth below the minimum surfaces a clear error."""
+ with pytest.raises(ValueError, match="max_depth"):
+ RandomForestForecaster(max_depth=0)
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test for `min_samples_leaf < 1` to cover all constructor validation branches
To round this out, please add a test asserting that `min_samples_leaf < 1` raises `ValueError` with the expected message, so the remaining constructor validation branch is covered and stays aligned with the implementation and docs.
</issue_to_address>
### Comment 4
<location path="docs/optional-features/09-model-champion-challenger-governance.md" line_range="30-32" />
<code_context>
+`OpsService.get_summary` uses the same predicate to classify staleness.
+When an alias's run has `V_a` and a newer comparable SUCCESS run has
+`V_b != V_a`, the alias is marked `is_stale=true` with stale-reason
+`feature_frame_version_mismatch` (a distinct value from
+`newer_success_run`) so Slice C can render the mismatch separately.
</code_context>
<issue_to_address>
**issue (typo):** Use `stale_reason` consistently instead of `stale-reason`.
This doc uses `stale-reason` while other references (e.g., DOMAIN_MODEL.md) use the field name `stale_reason`. If `stale_reason` is canonical, please update this to match for consistency and to avoid confusion.
```suggestion
`OpsService.get_summary` uses the same predicate to classify staleness.
When an alias's run has `V_a` and a newer comparable SUCCESS run has
`V_b != V_a`, the alias is marked `is_stale=true` with `stale_reason`
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| return "stable", delta | ||
|
|
||
|
|
||
| def _run_feature_frame_version(run: ModelRun) -> int | None: |
There was a problem hiding this comment.
issue (bug_risk): Feature-frame version treats missing key as None here but as V=1 in filters, causing inconsistent staleness classification.
_run_feature_frame_version returns None when the key is missing, while _feature_frame_version_filter treats missing JSONB keys as 1. As a result, legacy runs (no key) and runs with feature_frame_version == 1 are considered equivalent in queries, but _alias_staleness and get_model_health will see alias_v is None vs latest_v == 1 and report a FEATURE_FRAME_VERSION_MISMATCH. Please align the semantics by either normalising missing/invalid values to 1 here (or at callsites), or by changing _feature_frame_version_filter so that missing keys map to a distinct version instead of 1.
| drivers = [ | ||
| DriverContribution( | ||
| name="weighted_window_mean", | ||
| feature_value=forecast, | ||
| contribution=forecast, | ||
| direction="positive", | ||
| description=( | ||
| f"The forecast is the {self.weight_strategy}-weighted mean of the " | ||
| f"last {self.window_size} observed values" | ||
| + (f" (decay={self.decay})." if self.weight_strategy == "exponential" else ".") |
There was a problem hiding this comment.
suggestion: SeasonalAverageExplainer mixes trimmed and untrimmed samples for dispersion, which may be inconsistent with the description.
When trim_outliers is enabled and arr.size >= 4, you trim min/max into arr for the forecast but still compute sample_dispersion from np.std(samples) (untrimmed). To avoid confusion, either compute dispersion from the trimmed arr as well, or update the description to state that dispersion uses the raw samples before trimming.
Suggested implementation:
DriverContribution(
name="sample_dispersion",
feature_value=float(np.std(arr if used_trim else samples)),
contribution=0.0,
direction="neutral",This change assumes:
arris defined as a NumPy array containing either the trimmed or untrimmed seasonal samples immediately before thisdriverslist.used_trimis a boolean flag indicating whether trimming was applied (Truewhentrim_outliersis enabled andarr.size >= 4before trimming).
Ifused_trimdoes not yet exist, you should set it where you trim: e.g.,used_trim = trim_outliers and arr.size >= 4and then conditionally applyarr = np.sort(arr)[1:-1]whenused_trimisTrue.
| def test_invalid_max_depth_raises(self) -> None: | ||
| """max_depth below the minimum surfaces a clear error.""" | ||
| with pytest.raises(ValueError, match="max_depth"): | ||
| RandomForestForecaster(max_depth=0) |
There was a problem hiding this comment.
suggestion (testing): Add a test for min_samples_leaf < 1 to cover all constructor validation branches
To round this out, please add a test asserting that min_samples_leaf < 1 raises ValueError with the expected message, so the remaining constructor validation branch is covered and stays aligned with the implementation and docs.
| `OpsService.get_summary` uses the same predicate to classify staleness. | ||
| When an alias's run has `V_a` and a newer comparable SUCCESS run has | ||
| `V_b != V_a`, the alias is marked `is_stale=true` with stale-reason |
There was a problem hiding this comment.
issue (typo): Use stale_reason consistently instead of stale-reason.
This doc uses stale-reason while other references (e.g., DOMAIN_MODEL.md) use the field name stale_reason. If stale_reason is canonical, please update this to match for consistency and to avoid confusion.
| `OpsService.get_summary` uses the same predicate to classify staleness. | |
| When an alias's run has `V_a` and a newer comparable SUCCESS run has | |
| `V_b != V_a`, the alias is marked `is_stale=true` with stale-reason | |
| `OpsService.get_summary` uses the same predicate to classify staleness. | |
| When an alias's run has `V_a` and a newer comparable SUCCESS run has | |
| `V_b != V_a`, the alias is marked `is_stale=true` with `stale_reason` |
CodeRabbit review on PR #303 surfaced one bug-risk + one consistency issue + one missing test + one doc typo + an overall refactor request. All five addressed. 1. BUG-RISK — _run_feature_frame_version returned None for missing JSONB keys while _feature_frame_version_filter treats them as V=1. _alias_staleness compared None != 1 and spuriously surfaced FEATURE_FRAME_VERSION_MISMATCH for a legacy alias against an explicit-V=1 comparable run. Normalized the ops helper to return V=1 for missing keys (matches the registry filter contract). The schema-side RunResponse.feature_frame_version still surfaces None so UIs can distinguish "no V info" from "V=1". 2. REFACTOR — Extracted shared pure helpers in forecasting/models.py: - compute_weighted_average_weights - compute_seasonal_average_for_offset - build_trend_baseline_design_row The forecasters' fit/predict + the three new explainers now call them as the single source of truth. No more two-place drift risk when a default changes. 3. CONSISTENCY — SeasonalAverageExplainer.sample_dispersion now measures the same array the forecast was averaged from (post-trim when trim_outliers is on; raw otherwise). Description updated to match. 4. TESTING — Added test_invalid_min_samples_leaf_raises to round out RandomForestForecaster's constructor-validation branches. 5. TYPO — docs/optional-features/09-…governance.md uses the `stale_reason` field-name form (no hyphen) to match DOMAIN_MODEL.md / API_CONTRACTS.md. Plus: two new ops tests pin the new V=1 normalization contract (`_run_feature_frame_version_rejects_unsupported_value`, `_alias_staleness_legacy_run_treated_as_v1_no_spurious_mismatch`). Validation: ruff / mypy --strict / pyright --strict clean (same 3+8 pre-existing xgboost/lightgbm errors only). 1577 non-integration tests pass (+3 new). Leakage specs unchanged.
Summary
PRP-36 (Forecast Intelligence — Slice B) on top of the PRP-35 Feature Frame V2 contract. Promotes the model layer from "a regression model + three baselines" to a disciplined model zoo with fair, leakage-safe comparison.
Closes #302.
Implemented PRP-36 Tasks
devWeightedMovingAverageForecaster(linear + exponential, always-on)SeasonalAverageForecaster(N-cycle average, optional outlier-trim)TrendRegressionBaselineForecaster(Ridge over elapsed-day + dow/month)RandomForestForecaster(sklearn,n_jobs=1, gated byforecast_enable_random_forest)bundle_hashimpactHORIZON_BUCKETS+compute_bucket_metrics+aggregate_bucket_metricsFoldResult.horizon_bucket_metrics,bucketed_aggregated_metricsRegistryService._find_duplicate+find_comparable_runsV-awareRunResponse.feature_frame_version+feature_groupscomputed fromruntime_infoOpsServiceV-mismatch stale-reason pathStaleReasonenum + V exposure onAliasHealth/ModelHealthEntryweighted_moving_average,seasonal_average,trend_regression_baseline)examples/forecasting/model_zoo_compare.pydiagnostic scriptalembic check— no new migrationExplicitly Deferred
backtesting/service.pykeeps using the V1 builders. The redesigned surface (additivefeature_frame_version+feature_groupsonBacktestConfig+ V2 fold feature construction + loader sharing) is tracked at feat(forecast): add feature frame v2 (PRP-35) #299 per the PR feat(forecast): add feature frame v2 #300 deferral./explain/forecastforrandom_forest— needs bundle reload; outside PRP-36 scope.Deviations from the PRP
await registry_service.find_comparable_runs(...)" insideOpsService.get_summary. The existing batch query (one SQL withDISTINCT ON (store_id, product_id)) is significantly more efficient than per-grainfind_comparable_runscalls. I kept the batch query and added V-aware classification on top of it — the canonicalfind_comparable_runsis still defined onRegistryServicefor future callers (Slice C will use it from REST).CHANGELOG.mdis release-please-managed and picks up the commit subject automatically.Validation Results
Files Changed
Stash Status
stash@{0}: On dev: local qwen3 rag demo changes before prp-35— untouched throughout PRP-36 execution.Test Plan
dev(ruff + mypy + pyright + pytest + migration-check)pytest -m integration)POST /backtesting/runresponse shape includesaggregated_metrics.rmseandmain_model_results.bucketed_aggregated_metricsGET /registry/runs/{id}response shape carriesfeature_frame_version+feature_groups(ornullfor legacy runs)examples/forecasting/model_zoo_compare.pyagainst the local seeded DBSummary by Sourcery
Introduce new forecasting baselines and an optional Random Forest model, extend backtesting metrics with RMSE and horizon buckets, and make registry/ops version-aware for feature frame governance and explainability.
New Features:
Enhancements:
Tests: