feat(forecast): add model zoo and backtesting comparison by w7-mgfcode · Pull Request #303 · w7-mgfcode/ForecastLabAI

w7-mgfcode · 2026-05-26T06:10:32Z

Summary

PRP-36 (Forecast Intelligence — Slice B) on top of the PRP-35 Feature Frame V2 contract. Promotes the model layer from "a regression model + three baselines" to a disciplined model zoo with fair, leakage-safe comparison.

Closes #302.

Implemented PRP-36 Tasks

#	Task	Status
1	Contract Refresh — re-probe PRP-35 surface against current `dev`	✅
2	`WeightedMovingAverageForecaster` (linear + exponential, always-on)	✅
3	`SeasonalAverageForecaster` (N-cycle average, optional outlier-trim)	✅
4	`TrendRegressionBaselineForecaster` (Ridge over elapsed-day + dow/month)	✅
5	`RandomForestForecaster` (sklearn, `n_jobs=1`, gated by `forecast_enable_random_forest`)	✅
6	Confirm existing feature-aware defaults — no `bundle_hash` impact	✅
7	RMSE + `HORIZON_BUCKETS` + `compute_bucket_metrics` + `aggregate_bucket_metrics`	✅
8	Wire bucket metrics into backtest service (V1 fold path only — V2 deferred to #299)	✅
9	Backtesting schemas — `FoldResult.horizon_bucket_metrics`, `bucketed_aggregated_metrics`	✅
10	`RegistryService._find_duplicate` + `find_comparable_runs` V-aware	✅
11	`RunResponse.feature_frame_version` + `feature_groups` computed from `runtime_info`	✅
12	`OpsService` V-mismatch stale-reason path	✅
13	`StaleReason` enum + V exposure on `AliasHealth` / `ModelHealthEntry`	✅
14	Explainers for new baselines (`weighted_moving_average`, `seasonal_average`, `trend_regression_baseline`)	✅
15	Test updates — feature_metadata, metrics, backtest service, registry, ops, explainability	✅
16	`examples/forecasting/model_zoo_compare.py` diagnostic script	✅
17	Docs (API_CONTRACTS, DOMAIN_MODEL, optional-features 05 + 09)	✅
18	`alembic check` — no new migration	✅
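
To make task 2 concrete, here is a minimal sketch of how linear and exponential weights could be built and applied. The function names, the `decay` semantics (a factor in (0, 1) applied once per step back in time), and the default values are assumptions for illustration, not the PR's actual implementation.

```python
from typing import Literal, Sequence


def weighted_average_weights(
    window_size: int,
    strategy: Literal["linear", "exponential"] = "linear",
    decay: float = 0.5,
) -> list[float]:
    """Return normalized weights, oldest observation first (hypothetical helper)."""
    if window_size < 1:
        raise ValueError("window_size must be >= 1")
    if strategy == "linear":
        # Oldest value gets weight 1, newest gets weight window_size.
        raw = [float(i) for i in range(1, window_size + 1)]
    elif strategy == "exponential":
        if not 0.0 < decay < 1.0:
            raise ValueError("decay must be in (0, 1)")
        # Newest value gets decay**0 == 1; each step back multiplies by decay.
        raw = [decay ** (window_size - 1 - i) for i in range(window_size)]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    total = sum(raw)
    return [w / total for w in raw]


def weighted_moving_average(
    history: Sequence[float],
    window_size: int,
    strategy: Literal["linear", "exponential"] = "linear",
    decay: float = 0.5,
) -> float:
    """Forecast the next value as a weighted mean of the last `window_size` points."""
    window = list(history[-window_size:])
    weights = weighted_average_weights(len(window), strategy, decay)
    return sum(w * v for w, v in zip(weights, window))
```

Because the weights are normalized to sum to 1, a constant history always forecasts that constant under either strategy.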

Explicitly Deferred

V2 backtesting dispatch — backtesting/service.py keeps using the V1 builders. The redesigned surface (additive feature_frame_version + feature_groups on BacktestConfig, V2 fold feature construction, and loader sharing) is tracked in #299 (feat(forecast): add feature frame v2 (PRP-35)), per the deferral in PR #300 (feat(forecast): add feature frame v2).
PRP-37 UI / dashboard — Slice C, follows this PR.
/explain/forecast for random_forest — needs bundle reload; outside PRP-36 scope.

Deviations from the PRP

The PRP cites "REPLACE comparable-run selection with await registry_service.find_comparable_runs(...)" inside OpsService.get_summary. The existing batch query (one SQL with DISTINCT ON (store_id, product_id)) is significantly more efficient than per-grain find_comparable_runs calls. I kept the batch query and added V-aware classification on top of it — the canonical find_comparable_runs is still defined on RegistryService for future callers (Slice C will use it from REST).
The PRP CHANGELOG entry instruction is moot — CHANGELOG.md is release-please-managed and picks up the commit subject automatically.

Validation Results

✅ uv run ruff check .          — All checks passed
✅ uv run ruff format --check . — 332 files already formatted
✅ uv run mypy app/             — 3 pre-existing xgboost errors only (CI uses --all-extras)
✅ uv run pyright app/          — 8 pre-existing lightgbm/xgboost errors only (same)
✅ uv run pytest -m "not integration"
                                — 1574 passed, 12 skipped, 264 deselected
✅ All four load-bearing leakage specs unchanged (62 tests green)
✅ uv run alembic check         — local DB at stale revision (pre-existing); no migration files added (0-diff under alembic/)

Files Changed

31 files changed, 2820 insertions(+), 41 deletions(-)

backend code:
  M app/core/config.py                                       +1   (forecast_enable_random_forest flag)
  M app/features/forecasting/models.py                       +375 (4 new forecaster classes + factory)
  M app/features/forecasting/schemas.py                      +140 (4 new ModelConfig + union extension)
  M app/features/forecasting/feature_metadata.py             +4   (_MODEL_FAMILY_MAP coverage)
  M app/features/backtesting/metrics.py                      +156 (RMSE + HORIZON_BUCKETS + bucket helpers)
  M app/features/backtesting/schemas.py                      +22  (horizon_bucket_metrics + bucketed_aggregated_metrics)
  M app/features/backtesting/service.py                      +28  (bucket-metric emission in fold loop)
  M app/features/registry/schemas.py                         +50  (RunCreate.runtime_info_extras + RunResponse computed V/groups)
  M app/features/registry/service.py                         +110 (V-aware _find_duplicate + find_comparable_runs)
  M app/features/ops/schemas.py                              +50  (StaleReason enum + V exposure fields)
  M app/features/ops/service.py                              +60  (V-mismatch staleness path)
  M app/features/explainability/explainers.py                +280 (3 new explainer classes + extended factory)
  M app/features/explainability/schemas.py                   +35  (request-schema extensions)
  M app/features/explainability/service.py                   +30  (plumb new params through _explain)

tests:
  A app/features/forecasting/tests/test_weighted_moving_average_forecaster.py
  A app/features/forecasting/tests/test_seasonal_average_forecaster.py
  A app/features/forecasting/tests/test_trend_regression_baseline_forecaster.py
  A app/features/forecasting/tests/test_random_forest_forecaster.py
  M app/features/forecasting/tests/test_feature_metadata.py  (new model_type coverage)
  M app/features/backtesting/tests/test_metrics.py           (TestRMSE + TestComputeBucketMetrics + TestAggregateBucketMetrics)
  M app/features/backtesting/tests/test_service.py           (bucket-metric shape on FoldResult)
  M app/features/registry/tests/test_service.py              (TestRegistryServiceFeatureFrameVersion)
  M app/features/registry/tests/test_schemas.py              (TestRunResponseFeatureFrameVersion)
  M app/features/ops/tests/test_service.py                   (PRP-36 _alias_staleness V-mismatch path)
  M app/features/explainability/tests/test_explainers.py     (new explainer routing + arithmetic checks)

example + docs + env:
  A examples/forecasting/model_zoo_compare.py
  M .env.example                                             (FORECAST_ENABLE_RANDOM_FOREST hint)
  M docs/_base/API_CONTRACTS.md                              (backtest + registry response shape)
  M docs/_base/DOMAIN_MODEL.md                               (comparable-run rule)
  M docs/optional-features/05-advanced-ml-model-zoo.md
  M docs/optional-features/09-model-champion-challenger-governance.md

Stash Status

stash@{0}: On dev: local qwen3 rag demo changes before prp-35 — untouched throughout PRP-36 execution.

Test Plan

CI green on dev (ruff + mypy + pyright + pytest + migration-check)
Integration tests pass on PR's Postgres service (pytest -m integration)
Spot-check POST /backtesting/run response shape includes aggregated_metrics.rmse and main_model_results.bucketed_aggregated_metrics
Spot-check GET /registry/runs/{id} response shape carries feature_frame_version + feature_groups (or null for legacy runs)
Optional: run examples/forecasting/model_zoo_compare.py against the local seeded DB

Summary by Sourcery

Introduce new forecasting baselines and an optional Random Forest model, extend backtesting metrics with RMSE and horizon buckets, and make registry/ops version-aware for feature frame governance and explainability.

New Features:

Add weighted moving average, seasonal average, and trend regression baseline forecasters with corresponding configs and explainers.
Introduce an optional RandomForest-based forecaster behind a feature flag and integrate it into the forecasting model factory and model zoo diagnostic script.
Expose richer explainability API options and new baseline explainers aligned with the added models.
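
As an illustration of the trend regression baseline and its coefficient-based explanation, the sketch below builds one design row (elapsed days plus day-of-week and month one-hots) and splits a linear prediction into per-term contributions. The column layout, the function names, and the example coefficients are assumptions; the PR's real encoding may differ.

```python
from datetime import date


def build_design_row(day: date, origin: date) -> list[float]:
    """Assumed layout: [elapsed_days, 7 day-of-week one-hots, 12 month one-hots]."""
    elapsed = float((day - origin).days)
    dow = [0.0] * 7
    dow[day.weekday()] = 1.0  # Monday == 0
    month = [0.0] * 12
    month[day.month - 1] = 1.0
    return [elapsed, *dow, *month]


def decompose(
    coefs: list[float], intercept: float, row: list[float]
) -> tuple[float, list[float]]:
    """Split a linear prediction into per-column contributions (coef * value)."""
    contributions = [c * x for c, x in zip(coefs, row)]
    return intercept + sum(contributions), contributions


# 2024-01-01 is a Monday, one week after the assumed training origin.
row = build_design_row(date(2024, 1, 1), origin=date(2023, 12, 25))
coefs = [0.5] + [1.0] * 7 + [2.0] * 12  # made-up coefficients for the demo
prediction, contributions = decompose(coefs, intercept=10.0, row=row)
```

Only the active one-hot columns contribute, so the non-zero entries of `contributions` read directly as trend, day-of-week, and month effects.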

Enhancements:

Extend backtesting metrics to include RMSE and per-horizon-bucket calculations, aggregating these across folds and surfacing them in backtest results.
Make registry run creation, duplicate detection, and comparable-run queries aware of feature_frame_version and feature groups without adding new migrations.
Enhance ops health and staleness reporting to classify aliases using structured stale reasons and feature frame version mismatches.
Update feature metadata, API contracts, and domain docs to reflect the expanded model zoo, backtesting outputs, and governance rules.

Tests:

Add focused unit tests for the new forecasters, explainers, backtesting bucket metrics, registry feature frame handling, and ops staleness logic.
Include an example script to compare all available models via backtesting without changing core APIs.

PRP-36 (Forecast Intelligence — Slice B). Promote the model layer from "a regression model + three baselines" to a disciplined model zoo with fair, leakage-safe comparison on top of the PRP-35 Feature Frame V2 contract.

Models (under model_factory + _MODEL_FAMILY_MAP):
- weighted_moving_average — linear or exponential weighting (always-on)
- seasonal_average — average of last N seasonal cycles, optional trim (always-on)
- trend_regression_baseline — Ridge over elapsed-day + dow/month one-hots (always-on)
- random_forest — sklearn RandomForestRegressor, n_jobs=1, gated by forecast_enable_random_forest

Backtesting metrics (additive — V1 fold path only):
- aggregated_metrics gains rmse alongside mae/smape/wape/bias
- FoldResult.horizon_bucket_metrics — per-bucket dict keyed by h_1_7 / h_8_14 / h_15_28 / h_29_plus (empty buckets dropped)
- ModelBacktestResult.bucketed_aggregated_metrics — per-bucket means across folds
- V2 backtesting dispatch remains DEFERRED to #299

Registry comparable-run rule:
- RegistryService._find_duplicate now distinguishes V1 vs V2 (runs with different feature_frame_version are NOT duplicates)
- New find_comparable_runs(grain, overlapping window, same V, status==SUCCESS)
- RunCreate.runtime_info_extras lets callers pin feature_frame_version + feature_groups
- RunResponse.feature_frame_version + feature_groups computed from runtime_info (legacy runs surface None)

Ops staleness:
- New StaleReason enum value FEATURE_FRAME_VERSION_MISMATCH — a V1 alias with a newer V2 comparable run reports this instead of NEWER_SUCCESS_RUN
- AliasHealth and ModelHealthEntry expose alias_feature_frame_version + comparable_run_feature_frame_version

Explainability:
- New explainers: WeightedMovingAverageExplainer, SeasonalAverageExplainer, TrendRegressionBaselineExplainer
- Factory + service plumb weight_strategy / decay / lookback_cycles / trim_outliers
- HGBR keeps raising FeatureImportanceUnavailableError (422 path unchanged)

Other:
- examples/forecasting/model_zoo_compare.py — read-only diagnostic that backtests every available model on a single grain and prints aggregate + per-bucket WAPE
- docs/_base/API_CONTRACTS.md, DOMAIN_MODEL.md, docs/optional-features/05 + 09 updated

Validation:
- ruff check / format clean
- mypy --strict / pyright --strict clean (3 mypy + 8 pyright pre-existing xgboost/lightgbm errors only; CI runs --all-extras)
- 1574 non-integration tests pass; load-bearing leakage specs unchanged
- alembic check — NO new migration (all new state rides existing JSONB columns)

Out of scope (deferred):
- V2 backtesting fold dispatch — #299
- PRP-37 UI / dashboard — Slice C
- /explain/forecast handler for random_forest — needs bundle reload, separate PRP
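
The comparable-run rule in the commit message (same grain, overlapping window, same V, SUCCESS status) can be pictured as a plain predicate. This is a sketch only: the `Run` record and its field names are hypothetical stand-ins, and the real `find_comparable_runs` is described as a database query rather than an in-memory check.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any


@dataclass
class Run:
    """Hypothetical stand-in for a registry run record."""

    store_id: int
    product_id: int
    data_window_start: date
    data_window_end: date
    status: str
    runtime_info: dict[str, Any] = field(default_factory=dict)


def frame_version(run: Run) -> int:
    # A missing key is treated as V1, as the registry filter is described as doing.
    return int(run.runtime_info.get("feature_frame_version", 1))


def is_comparable(reference: Run, candidate: Run) -> bool:
    """True when `candidate` may be compared fairly against `reference`."""
    same_grain = (reference.store_id, reference.product_id) == (
        candidate.store_id,
        candidate.product_id,
    )
    windows_overlap = (
        reference.data_window_start <= candidate.data_window_end
        and candidate.data_window_start <= reference.data_window_end
    )
    same_version = frame_version(reference) == frame_version(candidate)
    return same_grain and windows_overlap and same_version and candidate.status == "SUCCESS"
```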

coderabbitai · 2026-05-26T06:10:39Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.


sourcery-ai · 2026-05-26T06:10:49Z

Reviewer's Guide

Introduces a richer forecasting model zoo (three new target-only baselines plus an optional RandomForest forecaster), adds RMSE and horizon-bucket metrics to backtesting, and wires feature-frame-version awareness into registry and ops so runs and aliases can be compared fairly across model and feature-frame versions, with matching explainability, tests, example script, and docs updates.

Sequence diagram for backtesting with bucketed metrics

sequenceDiagram
    actor Client
    participant BacktestingService as BacktestingService
    participant MetricsCalculator as MetricsCalculator
    participant BucketMetrics as compute_bucket_metrics

    Client->>BacktestingService: POST /backtesting/run
    loop per_fold
        BacktestingService->>BacktestingService: model.predict(horizon, X_test)
        BacktestingService->>MetricsCalculator: calculate_all(actuals, predictions)
        MetricsCalculator-->>BacktestingService: metrics{mae,rmse,smape,wape,bias}
        BacktestingService->>BucketMetrics: compute_bucket_metrics(actuals, predictions, horizon_offsets)
        BucketMetrics-->>BacktestingService: horizon_bucket_metrics
    end
    BacktestingService->>MetricsCalculator: aggregate_fold_metrics(fold_metrics)
    MetricsCalculator-->>BacktestingService: aggregated_metrics, metric_std
    BacktestingService->>MetricsCalculator: aggregate_bucket_metrics(fold_bucket_metrics)
    MetricsCalculator-->>BacktestingService: bucketed_aggregated_metrics
    BacktestingService-->>Client: ModelBacktestResult(aggregated_metrics.rmse, fold_results.horizon_bucket_metrics, bucketed_aggregated_metrics)
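
The two bucket steps in the diagram can be sketched in plain Python as follows. The bucket keys come from the PR description; the bucket bounds are inferred from those keys, and the metric subset, the function signatures, and WAPE expressed as a fraction are assumptions rather than the project's actual code.

```python
import math
from statistics import fmean

# Inclusive day offsets per bucket; None means "no upper bound".
HORIZON_BUCKETS = {
    "h_1_7": (1, 7),
    "h_8_14": (8, 14),
    "h_15_28": (15, 28),
    "h_29_plus": (29, None),
}


def _metrics(actuals: list[float], predictions: list[float]) -> dict[str, float]:
    errors = [p - a for a, p in zip(actuals, predictions)]
    abs_error_sum = sum(abs(e) for e in errors)
    actual_sum = sum(abs(a) for a in actuals)
    return {
        "mae": abs_error_sum / len(errors),
        "rmse": math.sqrt(fmean([e * e for e in errors])),
        "wape": abs_error_sum / actual_sum if actual_sum else float("nan"),
    }


def compute_bucket_metrics(actuals, predictions, horizon_offsets):
    """Per-fold metrics grouped by horizon bucket; empty buckets are dropped."""
    out = {}
    for name, (low, high) in HORIZON_BUCKETS.items():
        idx = [
            i
            for i, h in enumerate(horizon_offsets)
            if h >= low and (high is None or h <= high)
        ]
        if idx:
            out[name] = _metrics([actuals[i] for i in idx], [predictions[i] for i in idx])
    return out


def aggregate_bucket_metrics(fold_bucket_metrics):
    """Mean of each metric across the folds that actually contain the bucket."""
    aggregated = {}
    for name in HORIZON_BUCKETS:
        folds = [f[name] for f in fold_bucket_metrics if name in f]
        if folds:
            aggregated[name] = {m: fmean([f[m] for f in folds]) for m in folds[0]}
    return aggregated
```

With a 10-day horizon only `h_1_7` and `h_8_14` appear in the result, which is why consumers of the response should not assume all four keys are present.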

File-Level Changes

Change	Details	Files
Add four new forecaster implementations and wire them into the forecasting model factory and configs, including an optional feature-aware RandomForest model behind a feature flag.	Implement WeightedMovingAverageForecaster, SeasonalAverageForecaster, TrendRegressionBaselineForecaster, and RandomForestForecaster with sklearn-style fit/predict/param APIs and deterministic behavior. Extend ModelType literal, feature_metadata model-family mapping, and forecasting schemas with new ModelConfig types (including RandomForestModelConfig) and add them to the discriminated union. Update model_factory to instantiate the new forecasters, enforcing the forecast_enable_random_forest flag for RandomForest and validating config types.	`app/features/forecasting/models.py` `app/features/forecasting/schemas.py` `app/features/forecasting/feature_metadata.py` `app/core/config.py` `app/features/forecasting/tests/test_weighted_moving_average_forecaster.py` `app/features/forecasting/tests/test_seasonal_average_forecaster.py` `app/features/forecasting/tests/test_trend_regression_baseline_forecaster.py` `app/features/forecasting/tests/test_random_forest_forecaster.py` `app/features/forecasting/tests/test_feature_metadata.py`
Extend explainability to cover the new baseline models while explicitly guarding feature-aware models, and plumb new parameters through the explainability API.	Add WeightedMovingAverageExplainer, SeasonalAverageExplainer, and TrendRegressionBaselineExplainer, with logic mirroring the corresponding forecasters’ h=1 behavior and coefficient-based driver decomposition for the trend baseline. Extend explainer_factory and ExplainForecastRequest to support new model_type values and their extra parameters, and update ExplainabilityService._explain to forward these through. Guard feature-aware models (lightgbm, regression, xgboost, random_forest, prophet_like) in explainer_factory with a clear error, and add tests that factory routing and new explainers behave as expected and align with their forecasters.	`app/features/explainability/explainers.py` `app/features/explainability/schemas.py` `app/features/explainability/service.py` `app/features/explainability/tests/test_explainers.py`
Enhance backtesting metrics with RMSE and per-horizon-bucket aggregation, and expose the new structures through schemas and service behavior.	Add RMSE to MetricsCalculator, include it in calculate_all, and create HORIZON_BUCKETS plus compute_bucket_metrics and aggregate_bucket_metrics helpers for per-bucket metrics across folds. Update BacktestingService._run_model_backtest to compute per-fold horizon offsets, derive per-bucket metrics per fold, attach them to FoldResult, and compute bucketed_aggregated_metrics for ModelBacktestResult. Extend backtesting schemas to carry horizon_bucket_metrics on each fold and bucketed_aggregated_metrics on the model result, and update tests to assert RMSE presence and bucket metric shapes/aggregation behavior.	`app/features/backtesting/metrics.py` `app/features/backtesting/service.py` `app/features/backtesting/schemas.py` `app/features/backtesting/tests/test_metrics.py` `app/features/backtesting/tests/test_service.py`
Make the registry and ops layers feature-frame-version aware for duplicate detection, comparable-run selection, and alias/model health staleness classification.	Allow callers to pass runtime_info_extras on RunCreate, merge it into captured runtime_info, and extract feature_frame_version from it with sensible defaults for legacy runs. Include feature_frame_version in RegistryService._find_duplicate via a JSONB filter that treats missing keys as V1, and add a new find_comparable_runs method that enforces same grain, overlapping windows, same feature_frame_version, and SUCCESS status. Add computed fields feature_frame_version and feature_groups to RunResponse, and extend OpsService alias staleness logic with helper functions and a StaleReason enum to distinguish stale causes (including FEATURE_FRAME_VERSION_MISMATCH) while exposing alias/comparable feature_frame_version on AliasHealth and ModelHealthEntry. Update tests to cover feature_frame_version extraction, duplicate/comparable-run behavior, and the new stale-reason/V-mismatch paths.	`app/features/registry/schemas.py` `app/features/registry/service.py` `app/features/registry/tests/test_schemas.py` `app/features/registry/tests/test_service.py` `app/features/ops/schemas.py` `app/features/ops/service.py` `app/features/ops/tests/test_service.py`
Add a model-zoo comparison example script and update documentation and env examples to describe the expanded model set, backtesting contract, and governance rules.	Introduce examples/forecasting/model_zoo_compare.py, which calls /backtesting/run for all always-on and optional models, handling feature-flagged models gracefully and printing aggregate and bucketed WAPE comparisons. Document the expanded model zoo, new backtesting outputs (RMSE, horizon_bucket_metrics, bucketed_aggregated_metrics), and governance/comparable-run rules including feature_frame_version semantics in API_CONTRACTS, DOMAIN_MODEL, and optional-features docs. Expose the FORECAST_ENABLE_RANDOM_FOREST setting in .env.example so operators can toggle the new optional model.	`examples/forecasting/model_zoo_compare.py` `.env.example` `docs/_base/API_CONTRACTS.md` `docs/_base/DOMAIN_MODEL.md` `docs/optional-features/05-advanced-ml-model-zoo.md` `docs/optional-features/09-model-champion-challenger-governance.md`
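
The flag enforcement described for `model_factory` could look roughly like the gate below. `Settings`, the lookup tables, `resolve_model_type`, the error type, and the disabled-by-default value are all hypothetical; only the model type names and the `forecast_enable_random_forest` flag come from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical stand-in for the app settings object."""

    forecast_enable_random_forest: bool = False  # assumed default


ALWAYS_ON = {"weighted_moving_average", "seasonal_average", "trend_regression_baseline"}
FLAG_GATED = {"random_forest": "forecast_enable_random_forest"}


def resolve_model_type(model_type: str, settings: Settings) -> str:
    """Return the model type if it may be built, otherwise raise ValueError."""
    if model_type in ALWAYS_ON:
        return model_type
    flag = FLAG_GATED.get(model_type)
    if flag is None:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    if not getattr(settings, flag):
        raise ValueError(f"{model_type!r} is disabled; set {flag}=true to enable it")
    return model_type
```

Keeping the gate in one lookup table means a caller such as a comparison script can catch the error and skip a disabled model instead of failing the whole run.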

Possibly linked issues

feat(forecast): model zoo and backtesting comparison (PRP-36) #302: The PR fully implements the PRP-36 model zoo and backtesting comparison features exactly as defined in the issue.


sourcery-ai

Hey - I've found 4 issues, and left some high level feedback:

The seasonal/weighted moving average explainers reimplement parts of the corresponding forecaster logic (e.g. horizon=1 sampling and weight construction); consider factoring these into shared helpers or methods on the forecasters to keep behaviour changes in one place and avoid future drift.
TrendRegressionBaselineExplainer reconstructs the elapsed-day/DOW/month design row separately from TrendRegressionBaselineForecaster; it would be safer to reuse the forecaster’s design helpers (or a shared utility) so any future encoding changes remain consistent between training and explanation.

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="app/features/ops/service.py" line_range="138" />
<code_context>
     return "stable", delta


+def _run_feature_frame_version(run: ModelRun) -> int | None:
+    """Read ``feature_frame_version`` from ``run.runtime_info`` JSONB (PRP-36).
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Feature-frame version treats missing key as None here but as V=1 in filters, causing inconsistent staleness classification.

`_run_feature_frame_version` returns `None` when the key is missing, while `_feature_frame_version_filter` treats missing JSONB keys as `1`. As a result, legacy runs (no key) and runs with `feature_frame_version == 1` are considered equivalent in queries, but `_alias_staleness` and `get_model_health` will see `alias_v is None` vs `latest_v == 1` and report a `FEATURE_FRAME_VERSION_MISMATCH`. Please align the semantics by either normalising missing/invalid values to `1` here (or at callsites), or by changing `_feature_frame_version_filter` so that missing keys map to a distinct version instead of `1`.
</issue_to_address>

### Comment 2
<location path="app/features/explainability/explainers.py" line_range="280-289" />
<code_context>
+            arr = np.sort(arr)[1:-1]
+        forecast = float(arr.mean())
+        trim_note = " after trimming the min + max samples" if used_trim else ""
+        drivers = [
+            DriverContribution(
+                name="seasonal_window_mean",
+                feature_value=forecast,
+                contribution=forecast,
+                direction="positive",
+                description=(
+                    f"The forecast averages the values from the last {len(samples)} "
+                    f"matching seasonal positions (every {self.season_length} days){trim_note}."
+                ),
+            ),
+            DriverContribution(
+                name="sample_dispersion",
+                feature_value=float(np.std(samples)),
+                contribution=0.0,
+                direction="neutral",
</code_context>
<issue_to_address>
**suggestion:** SeasonalAverageExplainer mixes trimmed and untrimmed samples for dispersion, which may be inconsistent with the description.

When `trim_outliers` is enabled and `arr.size >= 4`, you trim min/max into `arr` for the forecast but still compute `sample_dispersion` from `np.std(samples)` (untrimmed). To avoid confusion, either compute dispersion from the trimmed `arr` as well, or update the description to state that dispersion uses the raw samples before trimming.

Suggested implementation:

```python
            DriverContribution(
                name="sample_dispersion",
                feature_value=float(np.std(arr if used_trim else samples)),
                contribution=0.0,
                direction="neutral",

```

This change assumes:
1. `arr` is defined as a NumPy array containing either the trimmed or untrimmed seasonal samples immediately before this `drivers` list.
2. `used_trim` is a boolean flag indicating whether trimming was applied (`True` when `trim_outliers` is enabled and `arr.size >= 4` before trimming).
If `used_trim` does not yet exist, you should set it where you trim: e.g., `used_trim = trim_outliers and arr.size >= 4` and then conditionally apply `arr = np.sort(arr)[1:-1]` when `used_trim` is `True`.
</issue_to_address>

### Comment 3
<location path="app/features/forecasting/tests/test_random_forest_forecaster.py" line_range="135-138" />
<code_context>
+        with pytest.raises(ValueError, match="n_estimators"):
+            RandomForestForecaster(n_estimators=0)
+
+    def test_invalid_max_depth_raises(self) -> None:
+        """max_depth below the minimum surfaces a clear error."""
+        with pytest.raises(ValueError, match="max_depth"):
+            RandomForestForecaster(max_depth=0)
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test for `min_samples_leaf < 1` to cover all constructor validation branches

To round this out, please add a test asserting that `min_samples_leaf < 1` raises `ValueError` with the expected message, so the remaining constructor validation branch is covered and stays aligned with the implementation and docs.
</issue_to_address>

### Comment 4
<location path="docs/optional-features/09-model-champion-challenger-governance.md" line_range="30-32" />
<code_context>
+`OpsService.get_summary` uses the same predicate to classify staleness.
+When an alias's run has `V_a` and a newer comparable SUCCESS run has
+`V_b != V_a`, the alias is marked `is_stale=true` with stale-reason
+`feature_frame_version_mismatch` (a distinct value from
+`newer_success_run`) so Slice C can render the mismatch separately.

</code_context>
<issue_to_address>
**issue (typo):** Use `stale_reason` consistently instead of `stale-reason`.

This doc uses `stale-reason` while other references (e.g., DOMAIN_MODEL.md) use the field name `stale_reason`. If `stale_reason` is canonical, please update this to match for consistency and to avoid confusion.

```suggestion
`OpsService.get_summary` uses the same predicate to classify staleness.
When an alias's run has `V_a` and a newer comparable SUCCESS run has
`V_b != V_a`, the alias is marked `is_stale=true` with `stale_reason`
```
</issue_to_address>


sourcery-ai · 2026-05-26T06:13:30Z

    return "stable", delta


+def _run_feature_frame_version(run: ModelRun) -> int | None:


issue (bug_risk): Feature-frame version treats missing key as None here but as V=1 in filters, causing inconsistent staleness classification.

_run_feature_frame_version returns None when the key is missing, while _feature_frame_version_filter treats missing JSONB keys as 1. As a result, legacy runs (no key) and runs with feature_frame_version == 1 are considered equivalent in queries, but _alias_staleness and get_model_health will see alias_v is None vs latest_v == 1 and report a FEATURE_FRAME_VERSION_MISMATCH. Please align the semantics by either normalising missing/invalid values to 1 here (or at callsites), or by changing _feature_frame_version_filter so that missing keys map to a distinct version instead of 1.
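
The first option in the comment, normalising missing or invalid values to 1, might look roughly like the sketch below. The helper names and the dict-based signature are assumptions (the real helper takes a `ModelRun`), and the handling of malformed values is a guess; the two enum values are the ones named in the governance doc.

```python
from enum import Enum
from typing import Any, Mapping, Optional


class StaleReason(str, Enum):
    NEWER_SUCCESS_RUN = "newer_success_run"
    FEATURE_FRAME_VERSION_MISMATCH = "feature_frame_version_mismatch"


def run_feature_frame_version(runtime_info: Optional[Mapping[str, Any]]) -> int:
    """Missing or malformed values normalize to 1 (assumed legacy default)."""
    value = (runtime_info or {}).get("feature_frame_version")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        return 1
    return value


def classify_staleness(
    alias_runtime_info: Mapping[str, Any],
    newer_runtime_info: Optional[Mapping[str, Any]],
) -> Optional[StaleReason]:
    """None means no newer comparable run exists, so the alias is not stale.

    Pass an empty mapping for a legacy run that has no runtime info.
    """
    if newer_runtime_info is None:
        return None
    alias_v = run_feature_frame_version(alias_runtime_info)
    newer_v = run_feature_frame_version(newer_runtime_info)
    if alias_v != newer_v:
        return StaleReason.FEATURE_FRAME_VERSION_MISMATCH
    return StaleReason.NEWER_SUCCESS_RUN
```

With this normalization a legacy alias compared against an explicit V=1 run is classified as `NEWER_SUCCESS_RUN`, not as a version mismatch.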

sourcery-ai · 2026-05-26T06:13:31Z

+        drivers = [
+            DriverContribution(
+                name="weighted_window_mean",
+                feature_value=forecast,
+                contribution=forecast,
+                direction="positive",
+                description=(
+                    f"The forecast is the {self.weight_strategy}-weighted mean of the "
+                    f"last {self.window_size} observed values"
+                    + (f" (decay={self.decay})." if self.weight_strategy == "exponential" else ".")


suggestion: SeasonalAverageExplainer mixes trimmed and untrimmed samples for dispersion, which may be inconsistent with the description.

When trim_outliers is enabled and arr.size >= 4, you trim min/max into arr for the forecast but still compute sample_dispersion from np.std(samples) (untrimmed). To avoid confusion, either compute dispersion from the trimmed arr as well, or update the description to state that dispersion uses the raw samples before trimming.

Suggested implementation:

            DriverContribution(
                name="sample_dispersion",
                feature_value=float(np.std(arr if used_trim else samples)),
                contribution=0.0,
                direction="neutral",

This change assumes:

arr is defined as a NumPy array containing either the trimmed or untrimmed seasonal samples immediately before this drivers list.

used_trim is a boolean flag indicating whether trimming was applied (True when trim_outliers is enabled and arr.size >= 4 before trimming).
If used_trim does not yet exist, you should set it where you trim: e.g., used_trim = trim_outliers and arr.size >= 4 and then conditionally apply arr = np.sort(arr)[1:-1] when used_trim is True.
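
To show the consistency the suggestion is after, here is a small standard-library sketch in which the forecast and the dispersion are computed from the same array. The function name, the signature, and the indexing convention for "matching seasonal positions" are assumptions; `pstdev` is used because `np.std` defaults to the population standard deviation.

```python
from statistics import fmean, pstdev
from typing import Sequence


def seasonal_average_h1(
    history: Sequence[float],
    season_length: int,
    lookback_cycles: int,
    trim_outliers: bool = False,
) -> tuple[float, float, bool]:
    """Return (forecast, dispersion, used_trim) for the next step after `history`."""
    samples = []
    for k in range(1, lookback_cycles + 1):
        # Same seasonal position as the target, k seasons earlier.
        idx = len(history) - k * season_length
        if idx < 0:
            break
        samples.append(history[idx])
    if not samples:
        raise ValueError("history is shorter than one season")
    used_trim = trim_outliers and len(samples) >= 4
    # Drop the min and max only when at least four samples are available.
    used = sorted(samples)[1:-1] if used_trim else samples
    return fmean(used), pstdev(used), used_trim
```

For `history = list(range(1, 29))` with a 7-day season and four cycles, the samples are 22, 15, 8 and 1; trimming leaves 15 and 8, so the reported dispersion shrinks along with the set the mean was taken over.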

sourcery-ai · 2026-05-26T06:13:31Z

+    def test_invalid_max_depth_raises(self) -> None:
+        """max_depth below the minimum surfaces a clear error."""
+        with pytest.raises(ValueError, match="max_depth"):
+            RandomForestForecaster(max_depth=0)


suggestion (testing): Add a test for min_samples_leaf < 1 to cover all constructor validation branches

To round this out, please add a test asserting that min_samples_leaf < 1 raises ValueError with the expected message, so the remaining constructor validation branch is covered and stays aligned with the implementation and docs.
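
A sketch of the requested test, written against a stand-in class so that it runs on its own. `StandInForecaster` and its error messages are hypothetical; in the project the test would target `RandomForestForecaster` and would presumably use `pytest.raises(ValueError, match="min_samples_leaf")` like the neighbouring tests.

```python
import re
from typing import Optional


class StandInForecaster:
    """Hypothetical stand-in mirroring the constructor checks under discussion."""

    def __init__(
        self,
        n_estimators: int = 100,
        max_depth: Optional[int] = None,
        min_samples_leaf: int = 1,
    ) -> None:
        if n_estimators < 1:
            raise ValueError("n_estimators must be >= 1")
        if max_depth is not None and max_depth < 1:
            raise ValueError("max_depth must be >= 1 or None")
        if min_samples_leaf < 1:
            raise ValueError("min_samples_leaf must be >= 1")
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf


def test_invalid_min_samples_leaf_raises() -> None:
    """min_samples_leaf below the minimum surfaces a clear error."""
    try:
        StandInForecaster(min_samples_leaf=0)
    except ValueError as exc:
        assert re.search("min_samples_leaf", str(exc))
    else:
        raise AssertionError("expected ValueError for min_samples_leaf=0")
```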

sourcery-ai · 2026-05-26T06:13:31Z

+`OpsService.get_summary` uses the same predicate to classify staleness.
+When an alias's run has `V_a` and a newer comparable SUCCESS run has
+`V_b != V_a`, the alias is marked `is_stale=true` with stale-reason


issue (typo): Use stale_reason consistently instead of stale-reason.

This doc uses stale-reason while other references (e.g., DOMAIN_MODEL.md) use the field name stale_reason. If stale_reason is canonical, please update this to match for consistency and to avoid confusion.

Suggested change

-`OpsService.get_summary` uses the same predicate to classify staleness.
-When an alias's run has `V_a` and a newer comparable SUCCESS run has
-`V_b != V_a`, the alias is marked `is_stale=true` with stale-reason
+`OpsService.get_summary` uses the same predicate to classify staleness.
+When an alias's run has `V_a` and a newer comparable SUCCESS run has
+`V_b != V_a`, the alias is marked `is_stale=true` with `stale_reason`

CodeRabbit review on PR #303 surfaced one bug-risk + one consistency issue + one missing test + one doc typo + an overall refactor request. All five addressed.

1. BUG-RISK: _run_feature_frame_version returned None for missing JSONB keys while _feature_frame_version_filter treats them as V=1. _alias_staleness compared None != 1 and spuriously surfaced FEATURE_FRAME_VERSION_MISMATCH for a legacy alias against an explicit-V=1 comparable run. Normalized the ops helper to return V=1 for missing keys (matches the registry filter contract). The schema-side RunResponse.feature_frame_version still surfaces None so UIs can distinguish "no V info" from "V=1".

2. REFACTOR: Extracted shared pure helpers in forecasting/models.py:
   - compute_weighted_average_weights
   - compute_seasonal_average_for_offset
   - build_trend_baseline_design_row

   The forecasters' fit/predict + the three new explainers now call them as the single source of truth. No more two-place drift risk when a default changes.

3. CONSISTENCY: SeasonalAverageExplainer.sample_dispersion now measures the same array the forecast was averaged from (post-trim when trim_outliers is on; raw otherwise). Description updated to match.

4. TESTING: Added test_invalid_min_samples_leaf_raises to round out RandomForestForecaster's constructor-validation branches.

5. TYPO: docs/optional-features/09-…governance.md uses the `stale_reason` field-name form (no hyphen) to match DOMAIN_MODEL.md / API_CONTRACTS.md.

Plus: two new ops tests pin the new V=1 normalization contract (`_run_feature_frame_version_rejects_unsupported_value`, `_alias_staleness_legacy_run_treated_as_v1_no_spurious_mismatch`).

Validation: ruff / mypy --strict / pyright --strict clean (same 3+8 pre-existing xgboost/lightgbm errors only). 1577 non-integration tests pass (+3 new). Leakage specs unchanged.
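The BUG-RISK fix described above can be illustrated with a small standalone sketch. The function names, the plain-mapping argument, and the `feature_frame_version` key inside `runtime_info` are assumptions for illustration. The real helpers operate on `ModelRun` rows, and their rejection of unsupported values is not modeled here.

```python
from typing import Any, Mapping, Optional

LEGACY_FEATURE_FRAME_VERSION = 1  # missing key is treated as V=1


def run_feature_frame_version(runtime_info: Optional[Mapping[str, Any]]) -> int:
    """Normalize a run's feature frame version; a missing key maps to V=1."""
    if not runtime_info:
        return LEGACY_FEATURE_FRAME_VERSION
    raw = runtime_info.get("feature_frame_version")  # assumed key name
    if raw is None:
        return LEGACY_FEATURE_FRAME_VERSION
    return int(raw)


def has_version_mismatch(
    alias_info: Optional[Mapping[str, Any]],
    newer_info: Optional[Mapping[str, Any]],
) -> bool:
    """Compare normalized versions so None != 1 can no longer occur."""
    return run_feature_frame_version(alias_info) != run_feature_frame_version(newer_info)


# A legacy alias (no key) against an explicit V=1 run is not a mismatch.
legacy_vs_v1 = has_version_mismatch({}, {"feature_frame_version": 1})
# A legacy alias against a V=2 run is still flagged.
legacy_vs_v2 = has_version_mismatch({}, {"feature_frame_version": 2})
```

Normalizing in one helper keeps the staleness comparison aligned with the registry filter, while a response schema can still expose the raw `None` separately.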

sourcery-ai Bot reviewed May 26, 2026


w7-mgfcode merged commit 0e2ad9e into dev May 26, 2026
10 checks passed

w7-mgfcode deleted the feat/forecast-model-zoo-and-backtesting branch May 26, 2026 06:31

w7-mgfcode mentioned this pull request May 26, 2026

docs(prp): refresh prp37 after model zoo contracts #304

Merged

6 tasks

w7-mgfcode merged 2 commits into dev from feat/forecast-model-zoo-and-backtesting

coderabbitai Bot commented May 26, 2026: Review skipped

		return "stable", delta


		def _run_feature_frame_version(run: ModelRun) -> int | None:
