
feat(forecast): add model zoo and backtesting comparison #303

Merged
w7-mgfcode merged 2 commits into dev from feat/forecast-model-zoo-and-backtesting
May 26, 2026
@w7-mgfcode w7-mgfcode commented May 26, 2026

Summary

PRP-36 (Forecast Intelligence — Slice B) on top of the PRP-35 Feature Frame V2 contract. Promotes the model layer from "a regression model + three baselines" to a disciplined model zoo with fair, leakage-safe comparison.

Closes #302.

Implemented PRP-36 Tasks

# Task Status
1 Contract Refresh — re-probe PRP-35 surface against current dev
2 WeightedMovingAverageForecaster (linear + exponential, always-on)
3 SeasonalAverageForecaster (N-cycle average, optional outlier-trim)
4 TrendRegressionBaselineForecaster (Ridge over elapsed-day + dow/month)
5 RandomForestForecaster (sklearn, n_jobs=1, gated by forecast_enable_random_forest)
6 Confirm existing feature-aware defaults — no bundle_hash impact
7 RMSE + HORIZON_BUCKETS + compute_bucket_metrics + aggregate_bucket_metrics
8 Wire bucket metrics into backtest service (V1 fold path only — V2 deferred to #299)
9 Backtesting schemas — FoldResult.horizon_bucket_metrics, bucketed_aggregated_metrics
10 RegistryService._find_duplicate + find_comparable_runs V-aware
11 RunResponse.feature_frame_version + feature_groups computed from runtime_info
12 OpsService V-mismatch stale-reason path
13 StaleReason enum + V exposure on AliasHealth / ModelHealthEntry
14 Explainers for new baselines (weighted_moving_average, seasonal_average, trend_regression_baseline)
15 Test updates — feature_metadata, metrics, backtest service, registry, ops, explainability
16 examples/forecasting/model_zoo_compare.py diagnostic script
17 Docs (API_CONTRACTS, DOMAIN_MODEL, optional-features 05 + 09)
18 alembic check — no new migration

Explicitly Deferred

  • V2 backtesting dispatch: backtesting/service.py keeps using the V1 builders. The redesigned surface (additive feature_frame_version + feature_groups on BacktestConfig + V2 fold feature construction + loader sharing) is tracked in #299 (feat(forecast): add feature frame v2 (PRP-35)), per the deferral in PR #300 (feat(forecast): add feature frame v2).
  • PRP-37 UI / dashboard — Slice C, follows this PR.
  • /explain/forecast for random_forest — needs bundle reload; outside PRP-36 scope.

Deviations from the PRP

  • The PRP cites "REPLACE comparable-run selection with await registry_service.find_comparable_runs(...)" inside OpsService.get_summary. The existing batch query (one SQL with DISTINCT ON (store_id, product_id)) is significantly more efficient than per-grain find_comparable_runs calls. I kept the batch query and added V-aware classification on top of it — the canonical find_comparable_runs is still defined on RegistryService for future callers (Slice C will use it from REST).
  • The PRP CHANGELOG entry instruction is moot — CHANGELOG.md is release-please-managed and picks up the commit subject automatically.

Validation Results

✅ uv run ruff check .          — All checks passed
✅ uv run ruff format --check . — 332 files already formatted
✅ uv run mypy app/             — 3 pre-existing xgboost errors only (CI uses --all-extras)
✅ uv run pyright app/          — 8 pre-existing lightgbm/xgboost errors only (same)
✅ uv run pytest -m "not integration"
                                — 1574 passed, 12 skipped, 264 deselected
✅ All four load-bearing leakage specs unchanged (62 tests green)
✅ uv run alembic check         — local DB at stale revision (pre-existing); no migration files added (0-diff under alembic/)

Files Changed

31 files changed, 2820 insertions(+), 41 deletions(-)

backend code:
  M app/core/config.py                                       +1   (forecast_enable_random_forest flag)
  M app/features/forecasting/models.py                       +375 (4 new forecaster classes + factory)
  M app/features/forecasting/schemas.py                      +140 (4 new ModelConfig + union extension)
  M app/features/forecasting/feature_metadata.py             +4   (_MODEL_FAMILY_MAP coverage)
  M app/features/backtesting/metrics.py                      +156 (RMSE + HORIZON_BUCKETS + bucket helpers)
  M app/features/backtesting/schemas.py                      +22  (horizon_bucket_metrics + bucketed_aggregated_metrics)
  M app/features/backtesting/service.py                      +28  (bucket-metric emission in fold loop)
  M app/features/registry/schemas.py                         +50  (RunCreate.runtime_info_extras + RunResponse computed V/groups)
  M app/features/registry/service.py                         +110 (V-aware _find_duplicate + find_comparable_runs)
  M app/features/ops/schemas.py                              +50  (StaleReason enum + V exposure fields)
  M app/features/ops/service.py                              +60  (V-mismatch staleness path)
  M app/features/explainability/explainers.py                +280 (3 new explainer classes + extended factory)
  M app/features/explainability/schemas.py                   +35  (request-schema extensions)
  M app/features/explainability/service.py                   +30  (plumb new params through _explain)

tests:
  A app/features/forecasting/tests/test_weighted_moving_average_forecaster.py
  A app/features/forecasting/tests/test_seasonal_average_forecaster.py
  A app/features/forecasting/tests/test_trend_regression_baseline_forecaster.py
  A app/features/forecasting/tests/test_random_forest_forecaster.py
  M app/features/forecasting/tests/test_feature_metadata.py  (new model_type coverage)
  M app/features/backtesting/tests/test_metrics.py           (TestRMSE + TestComputeBucketMetrics + TestAggregateBucketMetrics)
  M app/features/backtesting/tests/test_service.py           (bucket-metric shape on FoldResult)
  M app/features/registry/tests/test_service.py              (TestRegistryServiceFeatureFrameVersion)
  M app/features/registry/tests/test_schemas.py              (TestRunResponseFeatureFrameVersion)
  M app/features/ops/tests/test_service.py                   (PRP-36 _alias_staleness V-mismatch path)
  M app/features/explainability/tests/test_explainers.py     (new explainer routing + arithmetic checks)

example + docs + env:
  A examples/forecasting/model_zoo_compare.py
  M .env.example                                             (FORECAST_ENABLE_RANDOM_FOREST hint)
  M docs/_base/API_CONTRACTS.md                              (backtest + registry response shape)
  M docs/_base/DOMAIN_MODEL.md                               (comparable-run rule)
  M docs/optional-features/05-advanced-ml-model-zoo.md
  M docs/optional-features/09-model-champion-challenger-governance.md

Stash Status

stash@{0}: On dev: local qwen3 rag demo changes before prp-35 (untouched throughout PRP-36 execution).

Test Plan

  • CI green on dev (ruff + mypy + pyright + pytest + migration-check)
  • Integration tests pass on PR's Postgres service (pytest -m integration)
  • Spot-check POST /backtesting/run response shape includes aggregated_metrics.rmse and main_model_results.bucketed_aggregated_metrics
  • Spot-check GET /registry/runs/{id} response shape carries feature_frame_version + feature_groups (or null for legacy runs)
  • Optional: run examples/forecasting/model_zoo_compare.py against the local seeded DB

Summary by Sourcery

Introduce new forecasting baselines and an optional Random Forest model, extend backtesting metrics with RMSE and horizon buckets, and make registry/ops version-aware for feature frame governance and explainability.

New Features:

  • Add weighted moving average, seasonal average, and trend regression baseline forecasters with corresponding configs and explainers.
  • Introduce an optional RandomForest-based forecaster behind a feature flag and integrate it into the forecasting model factory and model zoo diagnostic script.
  • Expose richer explainability API options and new baseline explainers aligned with the added models.
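
A minimal sketch of how a flag-gated model could be refused by a factory. Only the flag name `forecast_enable_random_forest` and the `random_forest` model type come from this PR; the `Settings` dataclass, the `naive` model type, and both placeholder classes are hypothetical stand-ins, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Hypothetical stand-in for the app settings; only the flag name is from the PR.
    forecast_enable_random_forest: bool = False


class NaiveForecaster:
    """Placeholder for an always-on model (illustrative only)."""


class RandomForestStub:
    """Placeholder for the optional, flag-gated model (illustrative only)."""


def model_factory(model_type: str, settings: Settings) -> object:
    """Return a forecaster, refusing flag-gated models while the flag is off."""
    if model_type == "random_forest":
        if not settings.forecast_enable_random_forest:
            raise ValueError(
                "random_forest is disabled; set forecast_enable_random_forest=True"
            )
        return RandomForestStub()
    if model_type == "naive":
        return NaiveForecaster()
    raise ValueError(f"unknown model_type: {model_type!r}")
```

Failing fast in the factory means a disabled model is rejected before any fit or predict work starts, which is also what lets a comparison script skip it gracefully.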

Enhancements:

  • Extend backtesting metrics to include RMSE and per-horizon-bucket calculations, aggregating these across folds and surfacing them in backtest results.
  • Make registry run creation, duplicate detection, and comparable-run queries aware of feature_frame_version and feature groups without adding new migrations.
  • Enhance ops health and staleness reporting to classify aliases using structured stale reasons and feature frame version mismatches.
  • Update feature metadata, API contracts, and domain docs to reflect the expanded model zoo, backtesting outputs, and governance rules.

Tests:

  • Add focused unit tests for the new forecasters, explainers, backtesting bucket metrics, registry feature frame handling, and ops staleness logic.
  • Include an example script to compare all available models via backtesting without changing core APIs.

PRP-36 (Forecast Intelligence — Slice B). Promote the model layer from
"a regression model + three baselines" to a disciplined model zoo with
fair, leakage-safe comparison on top of the PRP-35 Feature Frame V2
contract.

Models (under model_factory + _MODEL_FAMILY_MAP):
- weighted_moving_average — linear or exponential weighting (always-on)
- seasonal_average — average of last N seasonal cycles, optional trim (always-on)
- trend_regression_baseline — Ridge over elapsed-day + dow/month one-hots (always-on)
- random_forest — sklearn RandomForestRegressor, n_jobs=1, gated by forecast_enable_random_forest
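
As an illustration of the weighted_moving_average idea, here is a small sketch under stated assumptions: linear weights grow as 1..N toward the newest observation, and exponential weights shrink geometrically by `decay` per step into the past. The function names and the exact weighting formulas are hypothetical; the PR only states that linear and exponential strategies exist.

```python
from typing import Sequence


def weighted_average_weights(
    window_size: int, strategy: str = "linear", decay: float = 0.5
) -> list[float]:
    """Return normalized weights ordered oldest -> newest (they sum to 1)."""
    if window_size < 1:
        raise ValueError("window_size must be >= 1")
    if strategy == "linear":
        raw = [float(i) for i in range(1, window_size + 1)]
    elif strategy == "exponential":
        if not 0.0 < decay < 1.0:
            raise ValueError("decay must be in (0, 1)")
        # Newest observation gets decay**0 == 1; each older step is multiplied by decay.
        raw = [decay ** (window_size - 1 - i) for i in range(window_size)]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    total = sum(raw)
    return [w / total for w in raw]


def weighted_moving_average(
    history: Sequence[float],
    window_size: int,
    strategy: str = "linear",
    decay: float = 0.5,
) -> float:
    """Forecast the next value as a weighted mean of the last window_size values."""
    window = list(history)[-window_size:]
    weights = weighted_average_weights(len(window), strategy, decay)
    return sum(w * y for w, y in zip(weights, window))
```

For example, `weighted_moving_average([10, 20, 30], 3)` weights the values 1:2:3, so the most recent observation dominates the forecast.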

Backtesting metrics (additive — V1 fold path only):
- aggregated_metrics gains rmse alongside mae/smape/wape/bias
- FoldResult.horizon_bucket_metrics — per-bucket dict keyed by h_1_7 / h_8_14 / h_15_28 / h_29_plus (empty buckets dropped)
- ModelBacktestResult.bucketed_aggregated_metrics — per-bucket means across folds
- V2 backtesting dispatch remains DEFERRED to #299
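
The per-fold bucketing described above can be sketched as follows. The bucket keys and the rule that empty buckets are dropped come from this PR; the day ranges are read off the key names, `horizon_offsets` is assumed to be 1-based days ahead, and the function bodies are an illustrative guess rather than the code in backtesting/metrics.py. Only MAE and RMSE are shown to keep the sketch short.

```python
import math
from typing import Optional, Sequence

# (key, first day, last day); None means open-ended.
HORIZON_BUCKETS: tuple[tuple[str, int, Optional[int]], ...] = (
    ("h_1_7", 1, 7),
    ("h_8_14", 8, 14),
    ("h_15_28", 15, 28),
    ("h_29_plus", 29, None),
)


def mae(actuals: Sequence[float], predictions: Sequence[float]) -> float:
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)


def rmse(actuals: Sequence[float], predictions: Sequence[float]) -> float:
    squared = sum((a - p) ** 2 for a, p in zip(actuals, predictions))
    return math.sqrt(squared / len(actuals))


def compute_bucket_metrics(
    actuals: Sequence[float],
    predictions: Sequence[float],
    horizon_offsets: Sequence[int],
) -> dict[str, dict[str, float]]:
    """Group points by horizon bucket and score each non-empty bucket."""
    if not (len(actuals) == len(predictions) == len(horizon_offsets)):
        raise ValueError("inputs must have equal length")
    result: dict[str, dict[str, float]] = {}
    for key, first, last in HORIZON_BUCKETS:
        idx = [
            i
            for i, h in enumerate(horizon_offsets)
            if h >= first and (last is None or h <= last)
        ]
        if not idx:
            continue  # empty buckets are dropped, not reported as NaN
        a = [actuals[i] for i in idx]
        p = [predictions[i] for i in idx]
        result[key] = {"mae": mae(a, p), "rmse": rmse(a, p)}
    return result
```

With a 10-day horizon only `h_1_7` and `h_8_14` appear in the result, which is why consumers should treat missing bucket keys as "no data" rather than as zero error.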

Registry comparable-run rule:
- RegistryService._find_duplicate now distinguishes V1 vs V2 (runs with different feature_frame_version are NOT duplicates)
- New find_comparable_runs(grain, overlapping window, same V, status==SUCCESS)
- RunCreate.runtime_info_extras lets callers pin feature_frame_version + feature_groups
- RunResponse.feature_frame_version + feature_groups computed from runtime_info (legacy runs surface None)
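
The comparable-run rule can be read as a four-part predicate. The sketch below expresses it in plain Python over an in-memory record; the `Run` dataclass and its field names are hypothetical, and the real check is a SQL/JSONB filter inside RegistryService. The one behavior taken from the review discussion is that a missing `feature_frame_version` key counts as V1.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Run:
    # Hypothetical in-memory stand-in for a registry run row.
    store_id: int
    product_id: int
    window_start: date
    window_end: date
    status: str
    runtime_info: dict = field(default_factory=dict)


def frame_version(run: Run) -> int:
    """Missing key is treated as V1, mirroring the registry filter contract."""
    return int(run.runtime_info.get("feature_frame_version", 1))


def is_comparable(reference: Run, candidate: Run) -> bool:
    """Same grain, overlapping window, same V, and the candidate succeeded."""
    same_grain = (reference.store_id, reference.product_id) == (
        candidate.store_id,
        candidate.product_id,
    )
    overlapping = (
        reference.window_start <= candidate.window_end
        and candidate.window_start <= reference.window_end
    )
    same_version = frame_version(reference) == frame_version(candidate)
    return same_grain and overlapping and same_version and candidate.status == "success"
```

Keeping the version check inside the predicate is what prevents a V1 run and a V2 run on the same grain from being ranked against each other as if they shared a feature contract.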

Ops staleness:
- New StaleReason enum value FEATURE_FRAME_VERSION_MISMATCH — a V1 alias with a newer V2 comparable run reports this instead of NEWER_SUCCESS_RUN
- AliasHealth and ModelHealthEntry expose alias_feature_frame_version + comparable_run_feature_frame_version
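
A hedged sketch of the classification described above. The two enum values match the strings quoted in the governance doc (`feature_frame_version_mismatch`, `newer_success_run`); the `classify_staleness` function and its signature are hypothetical, and it assumes both versions have already been normalized to integers.

```python
from enum import Enum
from typing import Optional


class StaleReason(str, Enum):
    NEWER_SUCCESS_RUN = "newer_success_run"
    FEATURE_FRAME_VERSION_MISMATCH = "feature_frame_version_mismatch"


def classify_staleness(
    alias_version: int, newer_run_version: Optional[int]
) -> Optional[StaleReason]:
    """Return why an alias is stale, or None when no newer SUCCESS run exists.

    newer_run_version is the feature-frame version of the newest SUCCESS run
    on the alias's grain, or None when there is no such run.
    """
    if newer_run_version is None:
        return None
    if newer_run_version != alias_version:
        # A different feature contract is reported separately from "just newer".
        return StaleReason.FEATURE_FRAME_VERSION_MISMATCH
    return StaleReason.NEWER_SUCCESS_RUN
```

Checking the version before falling back to `NEWER_SUCCESS_RUN` gives the mismatch reason priority, which matches the stated behavior for a V1 alias facing a newer V2 run.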

Explainability:
- New explainers: WeightedMovingAverageExplainer, SeasonalAverageExplainer, TrendRegressionBaselineExplainer
- Factory + service plumb weight_strategy / decay / lookback_cycles / trim_outliers
- HGBR keeps raising FeatureImportanceUnavailableError (422 path unchanged)

Other:
- examples/forecasting/model_zoo_compare.py — read-only diagnostic that backtests every available model on a single grain and prints aggregate + per-bucket WAPE
- docs/_base/API_CONTRACTS.md, DOMAIN_MODEL.md, docs/optional-features/05 + 09 updated

Validation:
- ruff check / format clean
- mypy --strict / pyright --strict clean (3 mypy + 8 pyright pre-existing xgboost/lightgbm errors only; CI runs --all-extras)
- 1574 non-integration tests pass; load-bearing leakage specs unchanged
- alembic check — NO new migration (all new state rides existing JSONB columns)

Out of scope (deferred):
- V2 backtesting fold dispatch — #299
- PRP-37 UI / dashboard — Slice C
- /explain/forecast handler for random_forest — needs bundle reload, separate PRP

coderabbitai Bot commented May 26, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 52c58852-a03a-4024-b61d-6991653bcabd



sourcery-ai Bot commented May 26, 2026

Reviewer's Guide

Introduces a richer forecasting model zoo (three new target-only baselines plus an optional RandomForest forecaster), adds RMSE and horizon-bucket metrics to backtesting, and wires feature-frame-version awareness into registry and ops so runs and aliases can be compared fairly across model and feature-frame versions, with matching explainability, tests, example script, and docs updates.

Sequence diagram for backtesting with bucketed metrics

sequenceDiagram
    actor Client
    participant BacktestingService as BacktestingService
    participant MetricsCalculator as MetricsCalculator
    participant BucketMetrics as compute_bucket_metrics

    Client->>BacktestingService: POST /backtesting/run
    loop per_fold
        BacktestingService->>BacktestingService: model.predict(horizon, X_test)
        BacktestingService->>MetricsCalculator: calculate_all(actuals, predictions)
        MetricsCalculator-->>BacktestingService: metrics{mae,rmse,smape,wape,bias}
        BacktestingService->>BucketMetrics: compute_bucket_metrics(actuals, predictions, horizon_offsets)
        BucketMetrics-->>BacktestingService: horizon_bucket_metrics
    end
    BacktestingService->>MetricsCalculator: aggregate_fold_metrics(fold_metrics)
    MetricsCalculator-->>BacktestingService: aggregated_metrics, metric_std
    BacktestingService->>MetricsCalculator: aggregate_bucket_metrics(fold_bucket_metrics)
    MetricsCalculator-->>BacktestingService: bucketed_aggregated_metrics
    BacktestingService-->>Client: ModelBacktestResult(aggregated_metrics.rmse, fold_results.horizon_bucket_metrics, bucketed_aggregated_metrics)
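
The final aggregation step in the diagram (per-bucket means across folds) could look like the sketch below. The function name matches the one in the diagram, but the body is an assumption: in particular, a bucket missing from a fold is assumed to be averaged only over the folds that reported it.

```python
from statistics import fmean


def aggregate_bucket_metrics(
    fold_bucket_metrics: list[dict[str, dict[str, float]]],
) -> dict[str, dict[str, float]]:
    """Average each metric per bucket across the folds that reported it."""
    collected: dict[str, dict[str, list[float]]] = {}
    for fold in fold_bucket_metrics:
        for bucket, metrics in fold.items():
            for name, value in metrics.items():
                collected.setdefault(bucket, {}).setdefault(name, []).append(value)
    return {
        bucket: {name: fmean(values) for name, values in metrics.items()}
        for bucket, metrics in collected.items()
    }
```

Under this assumption a bucket that appears in only one fold keeps that fold's value unchanged, so short final folds do not drag long-horizon buckets toward zero.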

File-Level Changes

Change Details Files
Add four new forecaster implementations and wire them into the forecasting model factory and configs, including an optional feature-aware RandomForest model behind a feature flag.
  • Implement WeightedMovingAverageForecaster, SeasonalAverageForecaster, TrendRegressionBaselineForecaster, and RandomForestForecaster with sklearn-style fit/predict/param APIs and deterministic behavior.
  • Extend ModelType literal, feature_metadata model-family mapping, and forecasting schemas with new ModelConfig types (including RandomForestModelConfig) and add them to the discriminated union.
  • Update model_factory to instantiate the new forecasters, enforcing the forecast_enable_random_forest flag for RandomForest and validating config types.
app/features/forecasting/models.py
app/features/forecasting/schemas.py
app/features/forecasting/feature_metadata.py
app/core/config.py
app/features/forecasting/tests/test_weighted_moving_average_forecaster.py
app/features/forecasting/tests/test_seasonal_average_forecaster.py
app/features/forecasting/tests/test_trend_regression_baseline_forecaster.py
app/features/forecasting/tests/test_random_forest_forecaster.py
app/features/forecasting/tests/test_feature_metadata.py
Extend explainability to cover the new baseline models while explicitly guarding feature-aware models, and plumb new parameters through the explainability API.
  • Add WeightedMovingAverageExplainer, SeasonalAverageExplainer, and TrendRegressionBaselineExplainer, with logic mirroring the corresponding forecasters’ h=1 behavior and coefficient-based driver decomposition for the trend baseline.
  • Extend explainer_factory and ExplainForecastRequest to support new model_type values and their extra parameters, and update ExplainabilityService._explain to forward these through.
  • Guard feature-aware models (lightgbm, regression, xgboost, random_forest, prophet_like) in explainer_factory with a clear error, and add tests that factory routing and new explainers behave as expected and align with their forecasters.
app/features/explainability/explainers.py
app/features/explainability/schemas.py
app/features/explainability/service.py
app/features/explainability/tests/test_explainers.py
Enhance backtesting metrics with RMSE and per-horizon-bucket aggregation, and expose the new structures through schemas and service behavior.
  • Add RMSE to MetricsCalculator, include it in calculate_all, and create HORIZON_BUCKETS plus compute_bucket_metrics and aggregate_bucket_metrics helpers for per-bucket metrics across folds.
  • Update BacktestingService._run_model_backtest to compute per-fold horizon offsets, derive per-bucket metrics per fold, attach them to FoldResult, and compute bucketed_aggregated_metrics for ModelBacktestResult.
  • Extend backtesting schemas to carry horizon_bucket_metrics on each fold and bucketed_aggregated_metrics on the model result, and update tests to assert RMSE presence and bucket metric shapes/aggregation behavior.
app/features/backtesting/metrics.py
app/features/backtesting/service.py
app/features/backtesting/schemas.py
app/features/backtesting/tests/test_metrics.py
app/features/backtesting/tests/test_service.py
Make the registry and ops layers feature-frame-version aware for duplicate detection, comparable-run selection, and alias/model health staleness classification.
  • Allow callers to pass runtime_info_extras on RunCreate, merge it into captured runtime_info, and extract feature_frame_version from it with sensible defaults for legacy runs.
  • Include feature_frame_version in RegistryService._find_duplicate via a JSONB filter that treats missing keys as V1, and add a new find_comparable_runs method that enforces same grain, overlapping windows, same feature_frame_version, and SUCCESS status.
  • Add computed fields feature_frame_version and feature_groups to RunResponse, and extend OpsService alias staleness logic with helper functions and a StaleReason enum to distinguish stale causes (including FEATURE_FRAME_VERSION_MISMATCH) while exposing alias/comparable feature_frame_version on AliasHealth and ModelHealthEntry.
  • Update tests to cover feature_frame_version extraction, duplicate/comparable-run behavior, and the new stale-reason/V-mismatch paths.
app/features/registry/schemas.py
app/features/registry/service.py
app/features/registry/tests/test_schemas.py
app/features/registry/tests/test_service.py
app/features/ops/schemas.py
app/features/ops/service.py
app/features/ops/tests/test_service.py
Add a model-zoo comparison example script and update documentation and env examples to describe the expanded model set, backtesting contract, and governance rules.
  • Introduce examples/forecasting/model_zoo_compare.py, which calls /backtesting/run for all always-on and optional models, handling feature-flagged models gracefully and printing aggregate and bucketed WAPE comparisons.
  • Document the expanded model zoo, new backtesting outputs (RMSE, horizon_bucket_metrics, bucketed_aggregated_metrics), and governance/comparable-run rules including feature_frame_version semantics in API_CONTRACTS, DOMAIN_MODEL, and optional-features docs.
  • Expose the FORECAST_ENABLE_RANDOM_FOREST setting in .env.example so operators can toggle the new optional model.
examples/forecasting/model_zoo_compare.py
.env.example
docs/_base/API_CONTRACTS.md
docs/_base/DOMAIN_MODEL.md
docs/optional-features/05-advanced-ml-model-zoo.md
docs/optional-features/09-model-champion-challenger-governance.md


@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 4 issues, and left some high level feedback:

  • The seasonal/weighted moving average explainers reimplement parts of the corresponding forecaster logic (e.g. horizon=1 sampling and weight construction); consider factoring these into shared helpers or methods on the forecasters to keep behaviour changes in one place and avoid future drift.
  • TrendRegressionBaselineExplainer reconstructs the elapsed-day/DOW/month design row separately from TrendRegressionBaselineForecaster; it would be safer to reuse the forecaster’s design helpers (or a shared utility) so any future encoding changes remain consistent between training and explanation.
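
To make the shared-helper suggestion concrete, here is a sketch of a single design-row builder that both a forecaster and an explainer could call. The layout (one elapsed-day column, then 7 day-of-week and 12 month one-hot columns) is an assumption based on the "elapsed-day + dow/month one-hots" description; the function name and the choice not to drop a reference category are hypothetical.

```python
from datetime import date


def build_design_row(target: date, origin: date) -> list[float]:
    """Return [elapsed_days, dow one-hot (7), month one-hot (12)] for one date."""
    elapsed_days = float((target - origin).days)
    # date.weekday(): Monday == 0 ... Sunday == 6
    dow = [1.0 if target.weekday() == d else 0.0 for d in range(7)]
    month = [1.0 if target.month == m else 0.0 for m in range(1, 13)]
    return [elapsed_days, *dow, *month]
```

If training and explanation both go through one function like this, a later change to the encoding (for example dropping a reference column) cannot silently desynchronize the coefficients from the row they are multiplied with.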
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="app/features/ops/service.py" line_range="138" />
<code_context>
     return "stable", delta


+def _run_feature_frame_version(run: ModelRun) -> int | None:
+    """Read ``feature_frame_version`` from ``run.runtime_info`` JSONB (PRP-36).
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Feature-frame version treats missing key as None here but as V=1 in filters, causing inconsistent staleness classification.

`_run_feature_frame_version` returns `None` when the key is missing, while `_feature_frame_version_filter` treats missing JSONB keys as `1`. As a result, legacy runs (no key) and runs with `feature_frame_version == 1` are considered equivalent in queries, but `_alias_staleness` and `get_model_health` will see `alias_v is None` vs `latest_v == 1` and report a `FEATURE_FRAME_VERSION_MISMATCH`. Please align the semantics by either normalising missing/invalid values to `1` here (or at callsites), or by changing `_feature_frame_version_filter` so that missing keys map to a distinct version instead of `1`.
</issue_to_address>

### Comment 2
<location path="app/features/explainability/explainers.py" line_range="280-289" />
<code_context>
+            arr = np.sort(arr)[1:-1]
+        forecast = float(arr.mean())
+        trim_note = " after trimming the min + max samples" if used_trim else ""
+        drivers = [
+            DriverContribution(
+                name="seasonal_window_mean",
+                feature_value=forecast,
+                contribution=forecast,
+                direction="positive",
+                description=(
+                    f"The forecast averages the values from the last {len(samples)} "
+                    f"matching seasonal positions (every {self.season_length} days){trim_note}."
+                ),
+            ),
+            DriverContribution(
+                name="sample_dispersion",
+                feature_value=float(np.std(samples)),
+                contribution=0.0,
+                direction="neutral",
</code_context>
<issue_to_address>
**suggestion:** SeasonalAverageExplainer mixes trimmed and untrimmed samples for dispersion, which may be inconsistent with the description.

When `trim_outliers` is enabled and `arr.size >= 4`, you trim min/max into `arr` for the forecast but still compute `sample_dispersion` from `np.std(samples)` (untrimmed). To avoid confusion, either compute dispersion from the trimmed `arr` as well, or update the description to state that dispersion uses the raw samples before trimming.

Suggested implementation:

```python
            DriverContribution(
                name="sample_dispersion",
                feature_value=float(np.std(arr if used_trim else samples)),
                contribution=0.0,
                direction="neutral",

```

This change assumes:
1. `arr` is defined as a NumPy array containing either the trimmed or untrimmed seasonal samples immediately before this `drivers` list.
2. `used_trim` is a boolean flag indicating whether trimming was applied (`True` when `trim_outliers` is enabled and `arr.size >= 4` before trimming).
If `used_trim` does not yet exist, you should set it where you trim: e.g., `used_trim = trim_outliers and arr.size >= 4` and then conditionally apply `arr = np.sort(arr)[1:-1]` when `used_trim` is `True`.
</issue_to_address>

### Comment 3
<location path="app/features/forecasting/tests/test_random_forest_forecaster.py" line_range="135-138" />
<code_context>
+        with pytest.raises(ValueError, match="n_estimators"):
+            RandomForestForecaster(n_estimators=0)
+
+    def test_invalid_max_depth_raises(self) -> None:
+        """max_depth below the minimum surfaces a clear error."""
+        with pytest.raises(ValueError, match="max_depth"):
+            RandomForestForecaster(max_depth=0)
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test for `min_samples_leaf < 1` to cover all constructor validation branches

To round this out, please add a test asserting that `min_samples_leaf < 1` raises `ValueError` with the expected message, so the remaining constructor validation branch is covered and stays aligned with the implementation and docs.
</issue_to_address>

### Comment 4
<location path="docs/optional-features/09-model-champion-challenger-governance.md" line_range="30-32" />
<code_context>
+`OpsService.get_summary` uses the same predicate to classify staleness.
+When an alias's run has `V_a` and a newer comparable SUCCESS run has
+`V_b != V_a`, the alias is marked `is_stale=true` with stale-reason
+`feature_frame_version_mismatch` (a distinct value from
+`newer_success_run`) so Slice C can render the mismatch separately.

</code_context>
<issue_to_address>
**issue (typo):** Use `stale_reason` consistently instead of `stale-reason`.

This doc uses `stale-reason` while other references (e.g., DOMAIN_MODEL.md) use the field name `stale_reason`. If `stale_reason` is canonical, please update this to match for consistency and to avoid confusion.

```suggestion
`OpsService.get_summary` uses the same predicate to classify staleness.
When an alias's run has `V_a` and a newer comparable SUCCESS run has
`V_b != V_a`, the alias is marked `is_stale=true` with `stale_reason`
```
</issue_to_address>


Sourcery review on PR #303 surfaced one bug-risk + one consistency
issue + one missing test + one doc typo + an overall refactor request.
All five addressed.

1. BUG-RISK — _run_feature_frame_version returned None for missing
   JSONB keys while _feature_frame_version_filter treats them as V=1.
   _alias_staleness compared None != 1 and spuriously surfaced
   FEATURE_FRAME_VERSION_MISMATCH for a legacy alias against an
   explicit-V=1 comparable run. Normalized the ops helper to return
   V=1 for missing keys (matches the registry filter contract). The
   schema-side RunResponse.feature_frame_version still surfaces None
   so UIs can distinguish "no V info" from "V=1".

2. REFACTOR — Extracted shared pure helpers in forecasting/models.py:
   - compute_weighted_average_weights
   - compute_seasonal_average_for_offset
   - build_trend_baseline_design_row
   The forecasters' fit/predict + the three new explainers now call
   them as the single source of truth. No more two-place drift risk
   when a default changes.

3. CONSISTENCY — SeasonalAverageExplainer.sample_dispersion now
   measures the same array the forecast was averaged from
   (post-trim when trim_outliers is on; raw otherwise). Description
   updated to match.

4. TESTING — Added test_invalid_min_samples_leaf_raises to round
   out RandomForestForecaster's constructor-validation branches.

5. TYPO — docs/optional-features/09-…governance.md uses the
   `stale_reason` field-name form (no hyphen) to match
   DOMAIN_MODEL.md / API_CONTRACTS.md.
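
Item 3 can be illustrated with a small sketch in which the forecast and the dispersion are computed from the same array. The trim rule (drop the min and max only when at least 4 samples exist) and the population standard deviation follow the review comment's `arr.size >= 4` and `np.std`; the function name, signature, and the way seasonal positions are indexed are assumptions.

```python
from statistics import fmean, pstdev
from typing import Sequence


def seasonal_average_with_dispersion(
    history: Sequence[float],
    season_length: int,
    lookback_cycles: int,
    trim_outliers: bool = False,
) -> tuple[float, float]:
    """One-step-ahead seasonal average plus the dispersion of the samples used."""
    n = len(history)
    # The next value sits at index n, so matching positions are n - k * season_length.
    samples = [
        history[n - k * season_length]
        for k in range(1, lookback_cycles + 1)
        if n - k * season_length >= 0
    ]
    if not samples:
        raise ValueError("history is shorter than one seasonal cycle")
    used_trim = trim_outliers and len(samples) >= 4
    used = sorted(samples)[1:-1] if used_trim else list(samples)
    # Both numbers come from `used`, so the description and the value agree.
    return fmean(used), pstdev(used)
```

Returning both values from one array avoids the earlier mismatch where the forecast was trimmed but the reported spread was not.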

Plus: two new ops tests pin the new V=1 normalization contract
(`_run_feature_frame_version_rejects_unsupported_value`,
`_alias_staleness_legacy_run_treated_as_v1_no_spurious_mismatch`).
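
The two read paths from item 1 can be contrasted in a short sketch. It assumes `runtime_info` behaves like a plain mapping, that V1 and V2 are the only supported versions, and that an unsupported value falls back to V=1 on the ops side; the function names are hypothetical, and how the real helper rejects unsupported values is not shown in this PR.

```python
from typing import Any, Mapping, Optional

SUPPORTED_VERSIONS = frozenset({1, 2})  # assumption: only V1 and V2 exist


def ops_feature_frame_version(runtime_info: Optional[Mapping[str, Any]]) -> int:
    """Ops-side view: missing or unsupported values normalize to V=1."""
    raw = (runtime_info or {}).get("feature_frame_version")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(raw, bool) or not isinstance(raw, int):
        return 1
    return raw if raw in SUPPORTED_VERSIONS else 1


def response_feature_frame_version(
    runtime_info: Optional[Mapping[str, Any]],
) -> Optional[int]:
    """Schema-side view: legacy runs surface None ("no V info" is not "V=1")."""
    raw = (runtime_info or {}).get("feature_frame_version")
    if isinstance(raw, bool) or not isinstance(raw, int):
        return None
    return raw
```

With this split, a legacy alias compared against an explicit V=1 run sees `1 == 1` on the ops side and raises no mismatch, while the API response can still show that the legacy run carried no version at all.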

Validation: ruff / mypy --strict / pyright --strict clean (same 3+8
pre-existing xgboost/lightgbm errors only). 1577 non-integration
tests pass (+3 new). Leakage specs unchanged.
@w7-mgfcode w7-mgfcode merged commit 0e2ad9e into dev May 26, 2026
10 checks passed
@w7-mgfcode w7-mgfcode deleted the feat/forecast-model-zoo-and-backtesting branch May 26, 2026 06:31