From 97c13f878c5c7234bfe125b4b65b0c1468d21bd6 Mon Sep 17 00:00:00 2001
From: Gabor Szabo
Date: Tue, 26 May 2026 05:51:05 +0200
Subject: [PATCH] docs(docs): add forecast intelligence planning docs (#295)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Lands the docs-only Forecast Intelligence roadmap — 4 INITIAL docs + 3 PRPs, no production code. Dependency-chained execution: PRP-35 first, PRP-36 + PRP-37 follow. Tracked by epic issue #295.

- INITIAL roadmap (A/B/C + index)
- PRP-35 Feature Frame V2 — V1 frozen, V2 ships as sibling builders, dispatch at service layer only, load-bearing leakage spec
- PRP-36 Model Zoo + Backtesting — new baselines, per-horizon-bucket metrics, comparable-runs with feature_frame_version key
- PRP-37 Interactive UI — partial-execution gates, shadcn@4.7.0 pin, per-component @radix-ui/react-X imports
---
 ...orecast-intelligence-A-feature-frame-v2.md | 245 +++
 ...st-intelligence-B-model-zoo-backtesting.md | 233 +++
 ...-forecast-intelligence-C-interactive-ui.md | 280 ++++
 .../INITIAL-forecast-intelligence-index.md | 217 +++
 ...orecast-intelligence-A-feature-frame-v2.md | 1103 ++++++++++++++
 ...st-intelligence-B-model-zoo-backtesting.md | 1356 +++++++++++++++++
 ...-forecast-intelligence-C-interactive-ui.md | 1221 +++++++++++++++
 7 files changed, 4655 insertions(+)
 create mode 100644 PRPs/INITIAL/INITIAL-forecast-intelligence-A-feature-frame-v2.md
 create mode 100644 PRPs/INITIAL/INITIAL-forecast-intelligence-B-model-zoo-backtesting.md
 create mode 100644 PRPs/INITIAL/INITIAL-forecast-intelligence-C-interactive-ui.md
 create mode 100644 PRPs/INITIAL/INITIAL-forecast-intelligence-index.md
 create mode 100644 PRPs/PRP-35-forecast-intelligence-A-feature-frame-v2.md
 create mode 100644 PRPs/PRP-36-forecast-intelligence-B-model-zoo-backtesting.md
 create mode 100644 PRPs/PRP-37-forecast-intelligence-C-interactive-ui.md
diff --git a/PRPs/INITIAL/INITIAL-forecast-intelligence-A-feature-frame-v2.md b/PRPs/INITIAL/INITIAL-forecast-intelligence-A-feature-frame-v2.md
new file mode 100644
index 00000000..1d5227aa
--- /dev/null
+++ b/PRPs/INITIAL/INITIAL-forecast-intelligence-A-feature-frame-v2.md
@@ -0,0 +1,245 @@
+# INITIAL-forecast-intelligence-A-feature-frame-v2.md - Forecast Intelligence A: Feature Frame V2
+
+## FEATURE:
+
+Create ForecastLabAI's second-generation feature-aware forecasting frame.
+
+This slice expands the existing 14-column canonical feature frame into a richer, leakage-safe retail-demand feature contract that can support better historical forecasting across multiple levels:
+
+- weekly seasonality
+- monthly patterns
+- yearly seasonality
+- recent rolling demand level
+- medium-term trend
+- price and promotion effects
+- stockout-aware demand signals
+- product lifecycle signals
+- replenishment cadence
+- returns signals
+- exogenous weather and macro signals
+
+Current repo state:
+
+- `app/shared/feature_frames/contract.py` is the single source of truth for feature-aware forecasting columns.
+- Current canonical columns are `lag_1`, `lag_7`, `lag_14`, `lag_28`, calendar cyclic columns, `price_factor`, `promo_active`, `is_holiday`, and `days_since_launch`.
+- `app/features/forecasting/service.py` builds historical feature rows for feature-aware models.
+- `app/shared/feature_frames/rows.py` builds historical and future feature rows.
+- `app/features/featuresets/service.py` already has broader feature engineering families, including lag, rolling, calendar, exogenous, lifecycle, promotion, and replenishment. This PRP should reuse concepts without creating forbidden cross-slice imports.
+- `app/features/featuresets/tests/test_leakage.py` and `app/features/forecasting/tests/test_regression_features_leakage.py` are load-bearing leakage specs.
+
+Problem:
+
+The app already has feature-aware models, but the forecast-facing feature frame is still too small for retail demand planning. It can learn simple lags, calendar effects, price, promotion, holiday, and product age, but it does not yet expose the richer signals discussed in the brainstorming:
+
+- explicit yearly lookback such as `lag_364` or `same_week_last_year`
+- rolling averages such as `rolling_mean_7`, `rolling_mean_28`, `rolling_mean_90`
+- trend features such as `trend_30`, `trend_90`, and recent-vs-prior rolling ratios
+- stockout features such as `stockout_days_7`, `stockout_days_28`, and `inventory_available_ratio_28`
+- richer lifecycle features such as lifecycle stage, discontinued flag, and days to/from discontinuation
+- replenishment features such as days since last replenishment and replenishment count in a trailing window
+- returns features such as returns rate over trailing windows
+- weather and macro exogenous signals from `exogenous_signal`
+- markdown and bundle promotion signals where they are safely available
+
+Goals:
+
+- Introduce a versioned Feature Frame V2 contract under `app/shared/feature_frames`.
+- Keep Feature Frame V1 compatible for existing model bundles and registry artifacts.
+- Add a feature-frame version identifier to model metadata/config where needed.
+- Add safe column taxonomy for every new feature:
+  - safe calendar/static feature
+  - conditionally safe historical target feature
+  - unsafe unless supplied for future horizon
+  - observed-only training feature that must not be inferred at prediction time
+- Add pure builders for V2 historical rows and future rows.
+- Add DB loader plumbing in the forecasting slice to collect the required sidecar data without importing sibling feature services.
+- Preserve strict no-leakage rules:
+  - no future target values in future frames
+  - rolling features use history up to origin only
+  - stockout and inventory features use data knowable at or before origin unless explicitly supplied as a scenario assumption
+  - future price and promotion are only allowed when supplied by a caller as planned assumptions
+- Add tests that prove the new feature families are leakage-safe and aligned column-for-column between training, backtesting, scenarios, and prediction.
+
+Recommended V2 feature groups:
+
+1. Target history:
+   - `lag_1`, `lag_7`, `lag_14`, `lag_28`
+   - `lag_56`
+   - `lag_364` or `lag_365` with an explicit retail-calendar decision
+   - `same_dow_mean_4`
+   - `same_dow_mean_8`
+
+2. Rolling demand level:
+   - `rolling_mean_7`
+   - `rolling_mean_28`
+   - `rolling_mean_90`
+   - `rolling_median_28`
+   - `rolling_std_28`
+
+3. Trend:
+   - `trend_30`
+   - `trend_90`
+   - `rolling_mean_7_vs_28`
+   - `rolling_mean_28_vs_prev_28`
+
+4. Calendar:
+   - keep `dow_sin`, `dow_cos`, `month_sin`, `month_cos`, `is_weekend`, `is_month_end`
+   - consider `week_of_year_sin`, `week_of_year_cos`
+   - consider `day_of_month_sin`, `day_of_month_cos`
+
+5. Price and promotion:
+   - keep `price_factor`
+   - keep `promo_active`
+   - add `promo_discount_pct`
+   - add `promo_kind_markdown_active`
+   - add `promo_kind_bundle_active`
+
+6. Inventory and stockout:
+   - `is_stockout_lag1`
+   - `stockout_days_7`
+   - `stockout_days_28`
+   - `inventory_available_ratio_28`
+   - optional `lost_sales_proxy_28` if defensible and documented as a proxy, not true demand
+
+7. Lifecycle:
+   - keep `days_since_launch`
+   - add `is_new_product`
+   - add `is_mature_product`
+   - add `is_discontinued`
+   - add `days_until_discontinue` where known
+
+8. Replenishment:
+   - `days_since_last_replenishment`
+   - `replenishment_count_14`
+   - `replenishment_qty_28`
+
+9. Returns:
+   - `returns_qty_7`
+   - `returns_qty_28`
+   - `returns_rate_28`
+
+10. Exogenous:
+    - weather feature set from `exogenous_signal` where local/store-specific signals exist
+    - macro feature set where global signals exist
+    - all future exogenous values must be explicit assumptions or calendar-known facts
+
+Out of scope:
+
+- Adding new ML model classes. That belongs to Forecast Intelligence B.
+- Frontend controls. That belongs to Forecast Intelligence C.
+- Replacing the existing `featuresets` slice.
+- Adding unmanaged cloud services.
+- Using direct SQL string concatenation.
+- Weakening leakage tests.
+
+Success criteria:
+
+- Existing feature-aware models can continue using V1 bundles.
+- New training requests can select or default into V2 where appropriate.
+- V2 feature columns are stable, versioned, and persisted in model metadata.
+- Backtesting and scenario future frames can reproduce the exact column order.
+- Unit and integration tests prove V2 does not leak future target values.
+
+## EXAMPLES:
+
+Reference existing repo examples and patterns:
+
+- `app/shared/feature_frames/contract.py`
+  - Existing V1 single source of truth for canonical feature columns and safety classification.
+
+- `app/shared/feature_frames/rows.py`
+  - Existing pure row builders for historical and future feature frames.
+
+- `app/features/forecasting/service.py`
+  - Existing `_build_regression_features` loader and model training path.
+
+- `app/features/featuresets/service.py`
+  - Existing time-safe lag, rolling, calendar, exogenous, lifecycle, promotion, and replenishment compute patterns.
+
+- `app/features/featuresets/tests/test_leakage.py`
+  - The leakage test style to mirror. Do not weaken these tests.
+
+- `app/features/forecasting/tests/test_regression_features_leakage.py`
+  - Existing forecasting-specific leakage guard.
+
+- `app/features/scenarios/feature_frame.py`
+  - Future feature frame construction for scenario simulation and `model_exogenous`.
+
+- `app/features/backtesting/service.py`
+  - Fold-level feature construction and evaluation path must stay time-safe.
+
+- `docs/DATA-SEEDER.md`
+  - Describes stockouts, exogenous signals, returns, markdowns, bundles, and replenishment data generated by the seeder.
+
+- `docs/_base/DOMAIN_MODEL.md`
+  - Existing domain language for `model_exogenous`, replenishment events, and scenario methods.
+
+Potential example artifact to add:
+
+- `examples/forecasting/feature_frame_v2_preview.py`
+  - Read one `(store_id, product_id)` series and print V1 vs V2 feature columns, null counts, and sample rows up to a cutoff.
+  - This should be read-only and local-development only.
+
+## DOCUMENTATION:
+
+External references to review during PRP creation and implementation:
+
+- scikit-learn lagged features example: https://scikit-learn.org/stable/auto_examples/applications/plot_time_series_lagged_features.html
+- scikit-learn time-related/cyclical feature engineering: https://sklearn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html
+- scikit-learn TimeSeriesSplit: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
+- Darts covariates guide, for past vs future covariate terminology: https://unit8co.github.io/darts/userguide/covariates.html
+- Prophet seasonality, holidays, and regressors, for additive component vocabulary: https://facebook.github.io/prophet/docs/seasonality%2C_holiday_effects%2C_and_regressors.html
+- LightGBM LGBMRegressor API, for feature-aware tree model compatibility: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html
+- XGBoost Python API: https://xgboost.readthedocs.io/en/stable/python/
+- pandas time series user guide: https://pandas.pydata.org/docs/user_guide/timeseries.html
+
+Internal docs to review:
+
+- `docs/optional-features/05-advanced-ml-model-zoo.md`
+- `docs/optional-features/10-baseforecaster-feature-contract.md`
+- `docs/optional-features/11-feature-aware-predict-serving.md`
+- `docs/PHASE/3-FEATURE_ENGINEERING.md`
+- `docs/PHASE/4-FORECASTING.md`
+- `docs/DATA-SEEDER.md`
+- `docs/_base/ARCHITECTURE.md`
+- `docs/_base/RULES.md`
+
+## OTHER CONSIDERATIONS:
+
+Implementation constraints:
+
+- Keep `app/shared/feature_frames` leaf-level. It must not import from `app/features/**`.
+- Do not make the canonical feature list horizon-dependent.
+- Do not silently fill missing future exogenous inputs with zero if that changes business meaning.
+- Do not read future target values to compute future rolling, trend, or lag features.
+- Keep V1/V2 compatibility explicit so older artifacts remain loadable.
+- Use Pydantic v2 strict schemas for any new request config.
+- If DB sidecar loading is required, keep it in the forecasting/backtesting/scenarios services, not in `app/shared`.
+- Avoid large abstractions unless they remove real duplication across forecasting, backtesting, and scenarios.
+
+Testing requirements:
+
+- Add pure unit tests for each V2 feature group.
+- Add leakage regression tests for rolling, trend, yearly lag, stockout, replenishment, returns, and exogenous signals.
+- Add tests that V2 future frames emit `NaN` or reject when a feature cannot be known at the forecast origin.
+- Add metadata tests proving V2 model bundles persist `feature_frame_version` and column order.
+- Add route/service tests for training a V2 feature-aware model.
+- Keep all existing baseline model tests green.
+
+Open design decisions for the PRP:
+
+- Use `lag_364` or `lag_365` for "same weekday last year". Retail daily data often benefits from `364` because it preserves day-of-week alignment.
+- Decide whether rolling features are recursively updated for multi-day horizon or remain origin-fixed. The safer MVP is origin-fixed or `NaN` when unknown.
+- Decide whether stockout correction is a feature only or also adjusts the target. MVP should use features only; target rewriting needs a separate explicit design.
+- Decide whether Phase 2 exogenous signals are included in V2 MVP or exposed as optional feature groups.
+- Decide how the UI will label V2 feature groups later, so metadata should include group names and safety classes.
+
+Recommended validation commands:
+
+```bash
+uv run ruff check app/shared app/features/forecasting app/features/backtesting app/features/scenarios
+uv run ruff format --check app/shared app/features/forecasting app/features/backtesting app/features/scenarios
+uv run mypy app/
+uv run pyright app/
+uv run pytest -v app/features/forecasting/tests app/features/backtesting/tests app/features/scenarios/tests app/features/featuresets/tests/test_leakage.py -m "not integration"
+```
diff --git a/PRPs/INITIAL/INITIAL-forecast-intelligence-B-model-zoo-backtesting.md b/PRPs/INITIAL/INITIAL-forecast-intelligence-B-model-zoo-backtesting.md
new file mode 100644
index 00000000..9b2d01b5
--- /dev/null
+++ b/PRPs/INITIAL/INITIAL-forecast-intelligence-B-model-zoo-backtesting.md
@@ -0,0 +1,233 @@
+# INITIAL-forecast-intelligence-B-model-zoo-backtesting.md - Forecast Intelligence B: Model Zoo and Backtesting
+
+## FEATURE:
+
+Upgrade ForecastLabAI's forecasting model layer so richer historical features actually improve forecasts, backtests, model selection, and registry decisions.
+
+This slice depends on Forecast Intelligence A if it needs Feature Frame V2. It should not redefine the feature contract itself. Its job is to consume the richer feature frame through the existing forecasting model interface and make the model zoo easier to compare and operationalize.
+
+Current repo state:
+
+- Existing model types:
+  - `naive`
+  - `seasonal_naive`
+  - `moving_average`
+  - `regression` using `HistGradientBoostingRegressor`
+  - `prophet_like` using a Ridge additive pipeline
+  - `lightgbm` behind optional `ml-lightgbm` extra and `forecast_enable_lightgbm`
+  - `xgboost` behind optional `ml-xgboost` extra and `forecast_enable_xgboost`
+- `model_family_for()` maps models into `baseline`, `tree`, and `additive`.
+- Feature-aware models require `X` for fit and predict.
+- Plain `POST /forecasting/predict` rejects feature-aware models because it cannot provide future `X`; scenario simulation handles feature-aware re-forecasting through `model_exogenous`.
+- Backtesting already exists and must remain leakage-safe.
+- Registry stores model runs, metrics, artifacts, aliases, and model family metadata.
+
+Problem:
+
+The app has advanced model classes, but the user workflow still makes it easy to think only in terms of simple baselines. The next step is not just "add more algorithms"; it is to create a disciplined comparison path:
+
+- better baseline variants
+- better feature-aware model configs
+- fair backtesting across the same data windows
+- metric-driven champion/challenger decisions
+- model health that distinguishes "newer" from "better"
+- artifact and feature metadata that explain why a model won
+
+Goals:
+
+1. Add stronger baseline models:
+   - `weighted_moving_average`
+   - `seasonal_average`
+   - optionally `trend_regression_baseline`
+
+2. Improve feature-aware model configs:
+   - allow selecting Feature Frame V1 or V2 where supported
+   - expose conservative hyperparameters for `regression`, `prophet_like`, `lightgbm`, and `xgboost`
+   - optionally add `random_forest` as a pure scikit-learn feature-aware model if the PRP finds it valuable and reviewable
+
+3. Improve backtesting:
+   - support V2 feature frames per fold without leakage
+   - compare baselines and feature-aware models on identical folds
+   - report metrics by horizon bucket, not only aggregate metrics
+   - include WAPE, sMAPE, MAE, bias, and optional RMSE
+   - record fold-level metadata needed for UI inspection
+
+4. Improve registry/model selection:
+   - store enough metadata to know which feature frame and feature groups trained each run
+   - distinguish created-at freshness from data-window freshness
+   - make stale alias logic metric-aware where possible
+   - support champion/challenger comparison for the same `(store_id, product_id)` and comparable data windows
+
+5. Improve explainability/metadata:
+   - feature-aware models should expose feature importances where available
+   - `prophet_like` should keep additive decomposition into trend, seasonality, and regressor components
+   - baseline models should retain simple arithmetic explanations
+
+Recommended user stories:
+
+- As a demand planner, I want to compare `seasonal_naive`, `seasonal_average`, `weighted_moving_average`, `regression`, and `prophet_like` on the same history so I can see whether extra complexity is justified.
+- As a forecasting engineer, I want backtests to use the exact feature frame that prediction will use so that model rankings are trustworthy.
+- As an operator, I want a champion alias to be stale only when a newer comparable run is better or requires review, not merely because any newer run exists.
+
+Out of scope:
+
+- Building the frontend control surface. That belongs to Forecast Intelligence C.
+- Redesigning the database registry from scratch.
+- Adding managed-cloud model services.
+- AutoML or large hyperparameter sweeps.
+- Changing audit timestamps to make historical demo runs "look old".
+- Any model that cannot be deterministic enough for this repo's reproducibility goals.
+
+Expected model additions:
+
+1. `weighted_moving_average`
+   - Target-only baseline.
+   - Gives more weight to recent observations.
+   - Good for short-term trend without full feature-aware machinery.
+   - Config fields: `window_size`, `decay` or explicit weight strategy.
+
+2. `seasonal_average`
+   - Target-only baseline.
+   - Forecasts each horizon day from the average of prior matching seasonal positions.
+   - Example: next Wednesday = average of last N Wednesdays.
+   - More stable than `seasonal_naive`, which copies one prior cycle.
+   - Config fields: `season_length`, `lookback_cycles`, optional `trim_outliers`.
+
+3. `trend_regression_baseline`
+   - Optional if scope permits.
+   - Pure target/calendar model using elapsed time and simple calendar features.
+   - Helps explain demand that rises or falls steadily.
+
+4. `random_forest`
+   - Optional feature-aware model.
+   - Pure scikit-learn dependency, exposes `feature_importances_`.
+   - Trade-off: weaker extrapolation for trend than additive/linear models, but useful as a robust non-linear baseline.
+
+Feature-aware models to improve, not duplicate:
+
+- `regression`
+- `prophet_like`
+- `lightgbm`
+- `xgboost`
+
+Backtesting expectations:
+
+- Backtests must build training and future fold frames with the same feature-frame version.
+- Do not slice future rows from a historical matrix if that would leak target values.
+- Use gap-aware fold logic when configured.
+- Store fold metrics in a shape the UI can render as:
+  - total metric
+  - metric by horizon bucket
+  - metric by model family
+  - metric by feature frame version
+
+## EXAMPLES:
+
+Reference existing repo examples and patterns:
+
+- `app/features/forecasting/models.py`
+  - Existing `BaseForecaster`, target-only models, feature-aware models, and factory.
+
+- `app/features/forecasting/schemas.py`
+  - Model config schema pattern and model family concepts.
+
+- `app/features/forecasting/feature_metadata.py`
+  - Feature importance extraction for tree/additive families.
+
+- `app/features/backtesting/service.py`
+  - Existing fold orchestration and metric calculation path.
+
+- `app/features/backtesting/metrics.py`
+  - Existing WAPE, sMAPE, MAE, bias, and related metric behavior.
+
+- `app/features/registry/service.py`
+  - Existing model run and alias persistence.
+
+- `app/features/ops/service.py`
+  - Existing model health and stale alias logic should be inspected before changing operational semantics.
+
+- `app/features/explainability/service.py`
+  - Baseline explanation path and retail signal warnings.
+
+- `scripts/run_demo.py`
+  - Existing end-to-end train/backtest/register/alias flow.
+
+- `scripts/seed_historical_activity.py`
+  - Local demo helper currently uncommitted in the working tree, if present, can inspire historical activity generation but should not be treated as merged project API.
+
+Potential example artifact to add:
+
+- `examples/forecasting/model_zoo_compare.py`
+  - Runs a small local comparison for one `(store_id, product_id)` across baseline and feature-aware models.
+  - Prints metrics and registry candidate summary.
+  - Should rely on public services/API where practical.
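As a rough illustration of what such a comparison might exercise, the sketch below implements the two proposed target-only baselines on a toy weekly series and scores them with WAPE. It is not the repo's `BaseForecaster` API: the function names, the exponential `decay` weighting, the `wape` helper, and the toy data are all assumptions made for illustration.

```python
"""Hypothetical sketch of two target-only baselines; not the repo's BaseForecaster API."""
import numpy as np


def weighted_moving_average(y, horizon, window_size=7, decay=0.8):
    """Flat forecast from an exponentially weighted mean of the last window."""
    window = np.asarray(y, dtype=float)[-window_size:]
    # Newest point gets weight 1; each step back multiplies the weight by `decay`.
    weights = decay ** np.arange(len(window) - 1, -1, -1)
    level = float(np.dot(window, weights) / weights.sum())
    return np.full(horizon, level)


def seasonal_average(y, horizon, season_length=7, lookback_cycles=4):
    """Forecast each step as the mean of the last N values at the same seasonal position."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    forecast = np.empty(horizon)
    for h in range(horizon):
        # Most recent historical index sharing the seasonal position of step h.
        newest = n + h - season_length * (h // season_length + 1)
        positions = [newest - k * season_length for k in range(lookback_cycles)]
        positions = [p for p in positions if p >= 0]
        forecast[h] = y[positions].mean()
    return forecast


def wape(actual, predicted):
    """Weighted absolute percentage error: sum|error| / sum|actual|."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.abs(actual - predicted).sum() / np.abs(actual).sum())


# Toy data (assumption): four identical weeks of history, one held-out week.
week = [10, 12, 14, 16, 18, 30, 40]
history = np.array(week * 4, dtype=float)
holdout = np.array(week, dtype=float)

wma_forecast = weighted_moving_average(history, horizon=7)
sa_forecast = seasonal_average(history, horizon=7)
print(f"weighted_moving_average WAPE: {wape(holdout, wma_forecast):.3f}")
print(f"seasonal_average WAPE: {wape(holdout, sa_forecast):.3f}")
```

On a perfectly periodic toy series the seasonal average reproduces the weekly shape while the weighted moving average can only emit a flat level, which is the kind of contrast an identical-fold comparison is meant to surface. Real retail series are noisier, so the ranking there is an empirical question.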
+
+## DOCUMENTATION:
+
+External references to review during PRP creation and implementation:
+
+- scikit-learn lagged features with `HistGradientBoostingRegressor`: https://scikit-learn.org/stable/auto_examples/applications/plot_time_series_lagged_features.html
+- scikit-learn TimeSeriesSplit: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
+- scikit-learn RandomForestRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
+- LightGBM LGBMRegressor: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html
+- XGBoost Python API: https://xgboost.readthedocs.io/en/stable/python/
+- Prophet seasonality, holidays, and regressors: https://facebook.github.io/prophet/docs/seasonality%2C_holiday_effects%2C_and_regressors.html
+- Darts forecasting covariates: https://unit8co.github.io/darts/userguide/covariates.html
+- Nixtla StatsForecast model overview, useful baseline vocabulary: https://nixtlaverse.nixtla.io/statsforecast/src/core/models.html
+
+Internal docs to review:
+
+- `docs/optional-features/05-advanced-ml-model-zoo.md`
+- `docs/optional-features/09-model-champion-challenger-governance.md`
+- `docs/optional-features/10-baseforecaster-feature-contract.md`
+- `docs/optional-features/11-feature-aware-predict-serving.md`
+- `docs/_base/API_CONTRACTS.md`
+- `docs/_base/DOMAIN_MODEL.md`
+- `docs/_base/REPO_MAP_INDEX.md`
+
+## OTHER CONSIDERATIONS:
+
+Implementation constraints:
+
+- Preserve the scikit-learn-style `fit(y, X=None)` and `predict(horizon, X=None)` contract.
+- Do not break target-only baseline forecasters.
+- Gate optional dependencies exactly as the repo already does for LightGBM and XGBoost.
+- Keep deterministic fitting where possible:
+  - fixed `random_state`
+  - single-threaded where needed
+  - no stochastic sampling unless explicitly configured and reproducible
+- Do not make model selection prefer "newer" when metrics are worse.
+- Do not compare runs as champion/challenger unless they share a comparable grain and data window.
+- Keep artifact hash verification intact.
+- Keep all errors in API routes compatible with the project's RFC 7807 rules where routes are touched.
+
+Testing requirements:
+
+- Unit tests for each new model class.
+- Factory tests for each new model config.
+- Schema tests for strict config validation.
+- Backtesting tests proving fold-level V2 features are leakage-safe.
+- Registry tests for feature-frame metadata and comparable-run logic.
+- Explainability/metadata tests for any new family.
+- Route tests for training/backtesting new model types where route behavior changes.
+- Integration tests for at least one feature-aware backtest path against real Docker Postgres if DB sidecar data is used.
+
+Open design decisions for the PRP:
+
+- Whether `random_forest` is worth adding now or should wait until Feature Frame V2 proves value through existing tree models.
+- Whether `seasonal_average` should average by last N cycles or all available matching seasonal positions.
+- Whether `weighted_moving_average` uses exponential decay or a simple linear weight ramp.
+- How to mark "comparable" runs for stale alias and champion/challenger logic.
+- Whether model health should classify `degrading` from all successful runs or only comparable successful runs.
+- Whether registry should store the feature frame version as first-class columns or only in JSON metadata.
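One possible answer to the "comparable runs" question, sketched under stated assumptions: two runs count as comparable only when they share grain, feature frame version, and an identical data window, and a champion alias is flagged stale only when a newer comparable run has a lower WAPE. The `RunSummary` fields, the exact-window rule, and the `alias_status` helper are hypothetical and do not reflect the registry's actual schema.

```python
"""Hypothetical sketch of metric-aware stale-alias logic; not the registry schema."""
from dataclasses import dataclass, replace
from datetime import date, datetime


@dataclass(frozen=True)
class RunSummary:
    """Assumed subset of run metadata needed for a comparability check."""
    run_id: str
    store_id: int
    product_id: int
    feature_frame_version: str
    data_window_start: date
    data_window_end: date
    created_at: datetime
    wape: float


def is_comparable(a: RunSummary, b: RunSummary) -> bool:
    """Same grain, same feature frame version, same data window (an assumption)."""
    return (
        (a.store_id, a.product_id) == (b.store_id, b.product_id)
        and a.feature_frame_version == b.feature_frame_version
        and (a.data_window_start, a.data_window_end)
        == (b.data_window_start, b.data_window_end)
    )


def alias_status(champion: RunSummary, candidates: list[RunSummary]) -> str:
    """Return "stale" only if a newer comparable run beats the champion's WAPE."""
    newer = [
        c for c in candidates
        if c.created_at > champion.created_at and is_comparable(champion, c)
    ]
    if not newer:
        return "current"
    best = min(newer, key=lambda run: run.wape)
    return "stale" if best.wape < champion.wape else "current"


champion = RunSummary(
    run_id="run-a", store_id=1, product_id=101, feature_frame_version="v2",
    data_window_start=date(2025, 1, 1), data_window_end=date(2025, 12, 31),
    created_at=datetime(2026, 1, 5), wape=0.20,
)
newer_but_worse = replace(champion, run_id="run-b", created_at=datetime(2026, 2, 1), wape=0.25)
newer_and_better = replace(champion, run_id="run-c", created_at=datetime(2026, 2, 2), wape=0.15)
other_frame = replace(newer_and_better, run_id="run-d", feature_frame_version="v1")

print(alias_status(champion, [newer_but_worse]))
print(alias_status(champion, [newer_but_worse, newer_and_better]))
```

Requiring an identical window is the strictest reading of "comparable"; a PRP could relax it to overlapping or same-length windows, at the cost of less like-for-like metrics.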
+
+Recommended validation commands:
+
+```bash
+uv run ruff check app/features/forecasting app/features/backtesting app/features/registry app/features/ops app/features/explainability
+uv run ruff format --check app/features/forecasting app/features/backtesting app/features/registry app/features/ops app/features/explainability
+uv run mypy app/
+uv run pyright app/
+uv run pytest -v app/features/forecasting/tests app/features/backtesting/tests app/features/registry/tests app/features/ops/tests app/features/explainability/tests -m "not integration"
+uv run pytest -v -m integration app/features/backtesting/tests app/features/registry/tests
+```
diff --git a/PRPs/INITIAL/INITIAL-forecast-intelligence-C-interactive-ui.md b/PRPs/INITIAL/INITIAL-forecast-intelligence-C-interactive-ui.md
new file mode 100644
index 00000000..53e8ef59
--- /dev/null
+++ b/PRPs/INITIAL/INITIAL-forecast-intelligence-C-interactive-ui.md
@@ -0,0 +1,280 @@
+# INITIAL-forecast-intelligence-C-interactive-ui.md - Forecast Intelligence C: Interactive UI and Operator Workflow
+
+## FEATURE:
+
+Build the UI and interactive workflow layer for richer forecast intelligence.
+
+This slice makes the Forecast Intelligence A/B backend capabilities usable by planners and operators without forcing them to understand every detail of model internals. The UI should let users apply, compare, and vary feature-aware forecasting choices easily.
+
+Current repo state:
+
+- Frontend uses React 19, Vite 7, Tailwind 4, shadcn/ui New York, TanStack Query/Table, Recharts.
+- Existing relevant pages:
+  - `frontend/src/pages/visualize/forecast.tsx`
+  - `frontend/src/pages/visualize/backtest.tsx`
+  - `frontend/src/pages/visualize/demand.tsx`
+  - `frontend/src/pages/visualize/planner.tsx`
+  - `frontend/src/pages/visualize/batch.tsx`
+  - `frontend/src/pages/explorer/runs.tsx`
+  - `frontend/src/pages/explorer/run-detail.tsx`
+  - `frontend/src/pages/explorer/run-compare.tsx`
+  - `frontend/src/pages/ops.tsx`
+- Existing components include charts, feature importance panels, explanation panels, data tables, status badges, job pickers, and batch controls.
+- Existing backend surfaces include forecasting, backtesting, registry, ops/model health, explainability, scenarios, batch, and RAG/agents.
+
+Problem:
+
+The backend can gain richer features and models, but users need a clear control surface:
+
+- choose model families
+- choose feature frame version or feature packs
+- compare simple baselines against richer models
+- see why one model is better or worse
+- vary assumptions interactively
+- understand stale aliases, degrading model health, stockouts, and feature effects
+- promote a model only after seeing metric and artifact context
+
+Goals:
+
+1. Forecast training UI:
+   - Add model-family segmented controls:
+     - Baseline
+     - Tree
+     - Additive
+   - Add model type selector:
+     - `naive`
+     - `seasonal_naive`
+     - `moving_average`
+     - `weighted_moving_average` if Forecast Intelligence B adds it
+     - `seasonal_average` if Forecast Intelligence B adds it
+     - `regression`
+     - `prophet_like`
+     - `lightgbm` when enabled
+     - `xgboost` when enabled
+     - `random_forest` if added
+   - Add feature-frame selector:
+     - V1 safe/default
+     - V2 extended when available
+   - Add feature pack toggles if the backend exposes optional groups:
+     - rolling
+     - trend
+     - yearly seasonality
+     - price/promo
+     - stockout
+     - lifecycle
+     - replenishment
+     - returns
+     - exogenous
+   - Keep defaults conservative and beginner-safe.
+
+2. Backtest/comparison UI:
+   - Compare multiple models on the same store/product and same folds.
+   - Show metric cards:
+     - WAPE
+     - sMAPE
+     - MAE
+     - bias
+     - optional RMSE
+   - Show horizon-bucket metrics if backend supports them.
+   - Show "newer vs better" distinction so users do not promote a worse fresh run by mistake.
+   - Add clear badges:
+     - best WAPE
+     - lowest bias
+     - stale alias
+     - degrading
+     - stockout-constrained history
+     - feature-aware
+     - baseline
+
+3. Run detail and compare UI:
+   - Show feature frame version.
+   - Show enabled feature groups.
+   - Show top feature importances or additive components.
+   - Show stockout and inventory caveats near forecasts where relevant.
+   - Show artifact hash verification status in a visible but compact way.
+   - Show whether the data window is comparable with the current champion.
+
+4. Interactive planner UI:
+   - Allow quick what-if variation:
+     - price delta slider
+     - promotion toggle
+     - holiday toggle
+     - inventory/stockout assumption
+     - lifecycle stage assumption where supported
+   - For feature-aware baselines, use `model_exogenous`.
+   - For target-only baselines, clearly label heuristic adjustments.
+   - Show side-by-side baseline vs scenario forecast.
+   - Show which assumptions are "known future inputs" vs hypothetical.
+
+5. Model health UI:
+   - Make "degrading" explainable:
+     - latest WAPE
+     - previous comparable WAPE
+     - delta WAPE
+     - number of comparable runs
+     - data window freshness
+   - Make Promote safer:
+     - require confirmation when latest WAPE is worse
+     - show artifact verification
+     - show champion/challenger comparison
+     - show why the alias is stale
+
+6. Batch UI:
+   - Let users submit model sweeps across multiple model types and feature packs.
+   - Add presets:
+     - quick baseline sweep
+     - feature-aware comparison
+     - champion challenger refresh
+     - stockout-sensitive products
+     - high-WAPE recovery
+   - Keep PRP-34 parallel execution controls compatible.
+
+7. Agent/RAG support:
+   - Add copyable context/actions from UI where useful:
+     - "Explain why this model degraded"
+     - "Summarize champion vs challenger"
+     - "Recommend next backtest"
+   - RAG should cite user-guide docs and app run context, not invent unsupported model behavior.
+
+Out of scope:
+
+- Replacing the whole dashboard IA.
+- Creating a marketing-style landing page.
+- Adding auth/roles.
+- Adding managed-cloud SDKs.
+- Adding backend model logic that belongs to Forecast Intelligence A or B.
+- Adding agent mutation tools without updating `agent_require_approval`.
+
+Expected UX principles:
+
+- Dense but readable operational UI, not a marketing page.
+- Use shadcn/ui controls:
+  - segmented controls or tabs for model family
+  - Select for model type and feature frame
+  - Checkbox/toggle for feature packs
+  - Slider for numeric what-if assumptions
+  - Dialog/AlertDialog for risky promote actions
+  - Tooltip for unfamiliar model/metric labels
+  - DataTable for run comparisons
+  - Recharts for forecast, error, and metric trends
+- Avoid nested cards.
+- Keep controls stable in size so labels and dynamic values do not shift layout.
+- Do not use in-app tutorial prose for obvious UI behavior.
+- Make the first screen an actual working tool, not a landing page.
+
+## EXAMPLES:
+
+Reference existing repo examples and patterns:
+
+- `frontend/src/pages/visualize/forecast.tsx`
+  - Existing train/predict workflow.
+
+- `frontend/src/pages/visualize/backtest.tsx`
+  - Existing backtest workflow and charts.
+
+- `frontend/src/pages/visualize/planner.tsx`
+  - Existing what-if scenario workflow.
+
+- `frontend/src/pages/visualize/batch.tsx`
+  - Existing batch submit/cancel/parallel controls.
+
+- `frontend/src/pages/explorer/run-detail.tsx`
+  - Existing model run detail page.
+
+- `frontend/src/pages/explorer/run-compare.tsx`
+  - Existing run comparison page.
+
+- `frontend/src/pages/ops.tsx`
+  - Existing model health / stale alias operational page.
+ +- `frontend/src/components/explainability/explanation-panel.tsx` + - Existing forecast explanation UI. + +- `frontend/src/components/explainability/feature-importance-panel.tsx` + - Existing feature metadata UI. + +- `frontend/src/components/charts/backtest-folds-chart.tsx` + - Existing backtest fold visualization. + +- `frontend/src/hooks/use-runs.ts` +- `frontend/src/hooks/use-ops.ts` +- `frontend/src/hooks/use-batches.ts` +- `frontend/src/hooks/use-feature-metadata.ts` + - Existing API integration patterns. + +- `frontend/src/types/api.ts` + - Update API types here when backend responses add feature frame metadata. + +Potential example artifact to add: + +- `docs/user-guide/advanced-forecasting-guide.md` + - User-facing explanation of model families, feature packs, WAPE, stale aliases, and safe promotion. + - Should be indexable by RAG. + +## DOCUMENTATION: + +External references to review during PRP creation and implementation: + +- shadcn/ui docs: https://ui.shadcn.com/docs +- Radix UI Slider: https://www.radix-ui.com/primitives/docs/components/slider +- TanStack Query docs: https://tanstack.com/query/latest +- TanStack Table docs: https://tanstack.com/table/latest +- Recharts docs: https://recharts.org/en-US/ +- scikit-learn lagged feature forecasting example, for UI labels and mental model: https://scikit-learn.org/stable/auto_examples/applications/plot_time_series_lagged_features.html +- Darts covariates guide, for "past vs future covariates" language: https://unit8co.github.io/darts/userguide/covariates.html +- Prophet seasonality/regressor docs, for additive component vocabulary: https://facebook.github.io/prophet/docs/seasonality%2C_holiday_effects%2C_and_regressors.html + +Internal docs to review: + +- `.claude/rules/ui-design.md` +- `docs/user-guide/dashboard-guide.md` +- `docs/user-guide/feature-reference.md` +- `docs/user-guide/agents-and-rag-guide.md` +- `docs/_base/API_CONTRACTS.md` +- `docs/_base/DOMAIN_MODEL.md` +- 
`docs/_base/REPO_MAP_INDEX.md` +- `PRPs/INITIAL/INITIAL-forecast-intelligence-A-feature-frame-v2.md` +- `PRPs/INITIAL/INITIAL-forecast-intelligence-B-model-zoo-backtesting.md` + +## OTHER CONSIDERATIONS: + +Backend/API prerequisites: + +- This UI slice should be generated after the backend response contracts are clear. +- If Forecast Intelligence A/B have not landed, this PRP should first create UI affordances only for existing fields: + - existing model types + - existing model family + - existing feature metadata + - existing WAPE/model health data +- Do not fake backend values in the UI. + +Frontend constraints: + +- Use existing shadcn/ui components or add them through the repo's shadcn workflow. +- Do not hand-roll components when a local `components/ui/*` component exists. +- Keep URL-shareable filters/sort/page state where existing Explorer pages already do this. +- Keep TypeScript strict and tests green. +- Add component/hook tests for risky conditional rendering: + - missing feature metadata + - optional LightGBM/XGBoost disabled + - stale alias with worse latest WAPE + - artifact verification failure + - target-only model using heuristic scenario method + - feature-aware model using `model_exogenous` + +UX gotchas: + +- "Promote" must not imply "better" when the latest run has worse metrics. +- "Feature-aware" must not imply causal truth. Feature importance explains model arithmetic or split usage, not business causality. +- Stockout caveats must be visible because observed sales can understate true demand. +- "Future covariates" should be labeled as planned or assumed inputs, not known facts unless the business actually knows them. +- Avoid overwhelming users with every raw feature column by default. Show groups first, drill down on demand. 
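The first gotcha above ("Promote" must not imply "better") can be sketched as a small guard that lists the reasons a promotion should ask for confirmation. This is a minimal illustration, not the dashboard's implementation: `wape`, `RunMetrics`, and `promote_needs_confirmation` are hypothetical names, and WAPE is assumed to be the sum of absolute errors divided by the sum of absolute actuals.

```python
from dataclasses import dataclass


def wape(actuals: list[float], forecasts: list[float]) -> float:
    """Weighted absolute percentage error: sum|y - yhat| / sum|y|."""
    if len(actuals) != len(forecasts):
        raise ValueError("actuals and forecasts must have the same length")
    denominator = sum(abs(y) for y in actuals)
    if denominator == 0:
        raise ValueError("WAPE is undefined when all actuals are zero")
    return sum(abs(y - f) for y, f in zip(actuals, forecasts)) / denominator


@dataclass(frozen=True)
class RunMetrics:
    """Hypothetical summary of one comparable run (not a repo type)."""

    run_id: str
    wape: float
    artifact_verified: bool


def promote_needs_confirmation(latest: RunMetrics, champion: RunMetrics) -> list[str]:
    """Return the reasons a Promote action should ask for confirmation.

    An empty list means the newer run is not worse on WAPE and its
    artifact is verified, so no extra confirmation is suggested.
    """
    reasons: list[str] = []
    if latest.wape > champion.wape:
        delta = latest.wape - champion.wape
        reasons.append(f"latest WAPE is worse by {delta:.3f}")
    if not latest.artifact_verified:
        reasons.append("artifact hash is not verified")
    return reasons
```

A UI would render the returned reasons inside the confirmation dialog, so "newer" and "better" stay visibly separate at the moment of promotion.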
+ +Recommended validation commands: + +```bash +cd frontend && pnpm tsc --noEmit +cd frontend && pnpm lint +cd frontend && pnpm test --run +uv run pytest -v app/features/forecasting/tests app/features/backtesting/tests app/features/registry/tests app/features/ops/tests -m "not integration" +``` diff --git a/PRPs/INITIAL/INITIAL-forecast-intelligence-index.md b/PRPs/INITIAL/INITIAL-forecast-intelligence-index.md new file mode 100644 index 00000000..ec616b85 --- /dev/null +++ b/PRPs/INITIAL/INITIAL-forecast-intelligence-index.md @@ -0,0 +1,217 @@ +# INITIAL-forecast-intelligence-index.md - Forecast Intelligence Roadmap + +## FEATURE: + +Split the Forecast Intelligence upgrade into three PRP-ready INITIAL briefs. + +This roadmap captures the full extended context from the forecasting brainstorming: + +- The current app already has basic and advanced model families. +- The current app does not yet use enough multi-level historical signals for high-quality retail forecasting. +- The desired direction is feature-aware forecasting that can learn from: + - weekly seasonality + - monthly patterns + - yearly seasonality + - rolling averages + - demand trend + - price effects + - promotion effects + - stockout and inventory signals + - lifecycle signals + - replenishment cadence + - returns + - weather and macro exogenous signals +- The desired UI direction is an interactive Forecast Lab where users can choose, vary, compare, and promote models safely. + +Current repo evidence: + +- Forecasting models exist in `app/features/forecasting/models.py`. +- Model configs exist in `app/features/forecasting/schemas.py`. +- Feature-aware training uses `ForecastingService._build_regression_features`. +- The feature-aware frame contract lives in `app/shared/feature_frames`. +- Scenario simulation uses `model_exogenous` for feature-aware re-forecasting. +- Backtesting, registry, ops/model health, explainability, batch, and frontend pages already exist. 
+ +Important clarification: + +ForecastLabAI does not need a "start from zero" model zoo PRP. It needs an upgrade sequence that preserves existing behavior while expanding the feature signal, comparison rigor, and UI workflow. + +Recommended PRP sequence: + +| Order | INITIAL | Purpose | +| --- | --- | --- | +| 1 | `INITIAL-forecast-intelligence-A-feature-frame-v2.md` | Expand the leakage-safe feature frame to include rolling, trend, yearly, stockout, lifecycle, replenishment, returns, and exogenous signals. | +| 2 | `INITIAL-forecast-intelligence-B-model-zoo-backtesting.md` | Add stronger baseline variants, improve feature-aware model/backtest/registry comparison, and make champion/challenger logic metric-aware. | +| 3 | `INITIAL-forecast-intelligence-C-interactive-ui.md` | Build the UI controls, comparison surfaces, what-if variations, model health explanations, and safe promote workflow. | + +Dependency graph: + +```text +A. Feature Frame V2 + -> B. Model Zoo and Backtesting + -> C. Interactive UI and Operator Workflow +``` + +Parallelism: + +- C can start with design and existing-field UI planning, but implementation should wait for A/B response contracts. +- B can add new target-only baselines before A lands, but any V2 feature-aware work should wait for A. +- A must land before any model relies on V2 columns. + +Full extended context: + +The desired forecasting system should move beyond a single rule such as `seasonal_naive`, where tomorrow is copied from seven days ago. 
It should let the app reason over several historical layers at once: + +```text +forecast = + weekly seasonality + + monthly pattern + + yearly seasonality + + recent rolling demand level + + medium-term trend + + price effect + + promotion effect + + stockout/inventory correction signal + + lifecycle signal + + replenishment/returns/exogenous signals +``` + +The current app already supports: + +- weekly seasonality through `seasonal_naive`, `lag_7`, and day-of-week features +- monthly calendar features through month sin/cos and month-end +- price through `price_factor` +- promotion through `promo_active` +- holiday through `is_holiday` +- product age through `days_since_launch` +- feature-aware models through `regression`, `prophet_like`, optional `lightgbm`, and optional `xgboost` + +The important gaps are: + +- no explicit yearly lag such as `lag_364` / same-week-last-year +- no forecast-facing rolling averages such as `rolling_mean_7`, `rolling_mean_28`, `rolling_mean_90` +- no explicit trend features such as `trend_30`, `trend_90`, recent-vs-prior ratios +- no model-consumed stockout/inventory correction features +- no model-consumed replenishment, returns, weather, or macro signals +- no stronger baseline variants such as weighted moving average or seasonal average +- no UI-level feature-pack selection +- no easy interactive model comparison across simple vs feature-aware models +- no guardrail that explains "newer run" vs "better run" before promotion + +Brainstormed improvements: + +- Feature packs: + - Basic history + - Rolling demand + - Trend + - Yearly seasonality + - Price/promotion + - Stockout/inventory + - Lifecycle + - Replenishment/returns + - Exogenous weather/macro + +- Better baselines: + - weighted moving average + - seasonal average over last N matching weekdays + - target/calendar trend regression + +- Better feature-aware models: + - richer `regression` + - richer `prophet_like` + - optional `random_forest` + - existing optional `lightgbm` 
and `xgboost` + +- Better model health: + - classify drift from comparable successful runs + - show WAPE deltas with enough context + - distinguish freshness from quality + - make Promote confirm metric regression + +- Better UI: + - model family segmented control + - model type select + - feature-frame selector + - feature-pack toggles + - price/promo/inventory/lifecycle what-if controls + - side-by-side model comparison + - run detail feature importance and artifact verification + - batch presets for model sweeps + - RAG/agent actions to explain model degradation + +## EXAMPLES: + +Read these before creating PRPs from this roadmap: + +- `PRPs/INITIAL/INITIAL-forecast-intelligence-A-feature-frame-v2.md` +- `PRPs/INITIAL/INITIAL-forecast-intelligence-B-model-zoo-backtesting.md` +- `PRPs/INITIAL/INITIAL-forecast-intelligence-C-interactive-ui.md` +- `PRPs/INITIAL/INITIAL-MLZOO-index.md` +- `PRPs/INITIAL/INITIAL-MLZOO-A-foundation-feature-frames.md` +- `PRPs/INITIAL/INITIAL-MLZOO-B.2-feature-aware-backtesting.md` +- `PRPs/INITIAL/INITIAL-MLZOO-D-frontend-registry-explainability.md` +- `docs/optional-features/05-advanced-ml-model-zoo.md` +- `docs/optional-features/10-baseforecaster-feature-contract.md` +- `docs/optional-features/11-feature-aware-predict-serving.md` +- `app/shared/feature_frames/contract.py` +- `app/shared/feature_frames/rows.py` +- `app/features/forecasting/models.py` +- `app/features/forecasting/service.py` +- `app/features/backtesting/service.py` +- `app/features/scenarios/feature_frame.py` +- `frontend/src/pages/visualize/forecast.tsx` +- `frontend/src/pages/visualize/backtest.tsx` +- `frontend/src/pages/visualize/planner.tsx` +- `frontend/src/pages/explorer/run-detail.tsx` +- `frontend/src/pages/ops.tsx` + +## DOCUMENTATION: + +External references: + +- scikit-learn lagged features with gradient boosting: https://scikit-learn.org/stable/auto_examples/applications/plot_time_series_lagged_features.html +- scikit-learn cyclical/time-related feature 
engineering: https://sklearn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html +- scikit-learn TimeSeriesSplit: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html +- scikit-learn RandomForestRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html +- LightGBM LGBMRegressor: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html +- XGBoost Python API: https://xgboost.readthedocs.io/en/stable/python/ +- Prophet seasonality, holidays, and regressors: https://facebook.github.io/prophet/docs/seasonality%2C_holiday_effects%2C_and_regressors.html +- Darts covariates guide: https://unit8co.github.io/darts/userguide/covariates.html +- Nixtla StatsForecast model docs: https://nixtlaverse.nixtla.io/statsforecast/src/core/models.html +- shadcn/ui docs: https://ui.shadcn.com/docs +- TanStack Query docs: https://tanstack.com/query/latest +- TanStack Table docs: https://tanstack.com/table/latest +- Recharts docs: https://recharts.org/en-US/ + +## OTHER CONSIDERATIONS: + +Global constraints: + +- Preserve the vertical-slice architecture. +- Do not import one feature slice's service directly from another slice; use `app/shared` or lazy imports where the repo already uses that pattern. +- Do not weaken leakage tests. +- Do not add managed-cloud SDKs. +- Do not add heavy optional ML dependencies to the core install path. +- Keep feature-frame versions explicit for old artifact compatibility. +- Keep UI implementation consistent with existing shadcn/TanStack/Recharts patterns. +- Keep every PRP reviewable; do not combine A, B, and C into one implementation branch. + +Recommended execution: + +1. Generate a PRP from A first. +2. Implement and merge A. +3. Generate B, adjusting to the actual A result. +4. Implement and merge B. +5. Generate C against the final backend/API contracts. 
+ +Validation expectations: + +- A validates leakage safety and feature-frame compatibility. +- B validates model quality/comparison/backtesting/registry behavior. +- C validates TypeScript, UI behavior, and manual dashboard workflows. + +Suggested future issue titles: + +- `feat(forecasting): add feature frame v2 for retail demand signals` +- `feat(forecasting): add stronger baselines and v2 backtesting comparison` +- `feat(dashboard): add interactive forecast intelligence controls` diff --git a/PRPs/PRP-35-forecast-intelligence-A-feature-frame-v2.md b/PRPs/PRP-35-forecast-intelligence-A-feature-frame-v2.md new file mode 100644 index 00000000..a535ca4d --- /dev/null +++ b/PRPs/PRP-35-forecast-intelligence-A-feature-frame-v2.md @@ -0,0 +1,1103 @@ +name: "PRP-35 — Forecast Intelligence A: Feature Frame V2" +description: | + Expand `app/shared/feature_frames/` from V1 (14 columns) to V2 — a richer, + versioned, leakage-safe feature contract for retail demand forecasting. + Preserve V1 byte-for-byte so existing model bundles, registry rows, and the + load-bearing leakage spec stay green. Slice A of the Forecast Intelligence + roadmap (`PRPs/INITIAL/INITIAL-forecast-intelligence-index.md`). Slice B + (model zoo + backtesting comparison) and Slice C (interactive UI) are + explicitly **out of scope** here. + +## Purpose +A one-pass implementation contract for an AI agent (or human) who has access +to the codebase but no prior session context. The goal is to land V2 as an +additive surface — V1 callers never change, V2 callers opt in by request. + +## Core Principles +1. **V1 is frozen.** Every V1 function, constant, and exported symbol keeps + its current signature, return type, and behaviour. The load-bearing + leakage spec (`app/shared/feature_frames/tests/test_leakage.py`) MUST stay + green without modification. +2. 
**Leakage safety is the central design constraint.** Every V2 column + carries an explicit `FeatureSafety` class; the future-frame builder + structurally cannot read an observed target at a horizon day. +3. **Version metadata is on the bundle, not on `ModelConfig`.** Adding a + field to `ModelConfigBase` would change every existing `config_hash()` + value (`config_hash()` hashes the full `model_dump_json()`); we instead + put `feature_frame_version` on `TrainRequest` + bundle `metadata`. V1 + registry rows / dedup keys stay stable. +4. **Pure builders, DB-side loaders.** `app/shared/feature_frames/` stays + leaf-level (no `app.features.*` import). Async sidecar loaders live in + `app/features/forecasting/v2_loaders.py`. +5. **NaN-where-unknown.** Every V2 column whose source data lies in the + future (rolling, trend, stockout windows, replenishment count, returns + count, exogenous signal) emits `NaN` at that horizon row. + `HistGradientBoostingRegressor` tolerates `NaN` natively (verified, see + "Known Gotchas"). +6. **No target rewriting.** Stockout is exposed as features only; the target + `quantity` is never adjusted for stockouts in V2 (that needs a separate + PRP). + +--- + +## Goal + +Deliver a working `feature_frame_version = 2` end-to-end: + +- A train request can opt into V2 via `TrainRequest.feature_frame_version=2` + and optional `feature_groups=[…]`. +- `_build_regression_features_v2` produces an `[n_observations × N]` feature + matrix (`N` ≥ 14 + V2 additions, ≤ ~30 depending on enabled groups). +- The trained bundle persists `feature_frame_version`, `feature_columns`, + `feature_groups`, and `feature_safety_classes` in `metadata`. +- Scenario `model_exogenous` and backtesting fold construction read those + metadata fields and dispatch to V1 or V2 builders accordingly. +- V1 bundles trained before this PRP still load, predict, scenario-simulate, + and backtest unchanged.
+- Every V2 column has a unit test, and the V2 leakage spec parallels the V1 + load-bearing spec. + +## Why + +The current 14-column feature frame can learn weekly seasonality (`lag_7`), +calendar shape, holidays, price, and promotion. It cannot learn: + +- yearly seasonality (`lag_364` preserves DOW; `lag_365` does not — verified) +- recent demand level (rolling means) +- trend (rolling-vs-prior-window ratios) +- stockout-aware demand (lost-sales proxies) +- richer lifecycle (`is_new_product`, `is_mature_product`, `is_discontinued`) +- replenishment cadence (Phase-2 `replenishment_event` data is already in + the DB and unused by the regression frame today) +- returns intensity (Phase-2 `sales_returns` rows, also unused) +- exogenous weather/macro signals (Phase-2 `exogenous_signal` rows, unused) +- richer promotion shape (`promo_kind_markdown_active`, + `promo_kind_bundle_active`, `promo_discount_pct`) + +The local DB already holds all of these (see HANDOFF.md — 31,420 +`replenishment_event` rows, 9,647 `exogenous_signal` rows, 8,585 +`sales_returns`, 50/50 products with `lifecycle_stage` + `launch_date`). +V2 makes them available to the feature-aware regressor without changing +the model class, the dashboard, or the registry/champion logic. + +## What + +### User-visible behaviour + +- `POST /forecasting/train` accepts an optional + `feature_frame_version: int = 1` and `feature_groups: list[str] | None = None` + on the request body. When omitted, V1 behaviour is preserved exactly. +- `POST /backtesting/run` and `POST /scenarios/simulate` work with both V1 + and V2 bundles transparently. +- `GET /forecasting/runs/{run_id}/feature-metadata` returns the bundle's + `feature_columns`, `feature_groups`, `feature_safety_classes`, + `feature_frame_version`. (UI in Slice C will surface this; we just make + it accessible.)
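The transparent handling of V1 and V2 bundles described above hinges on one lookup: read the version from the bundle metadata and fall back to 1 when the key is absent. A minimal sketch, assuming the metadata is a plain dict; `resolve_feature_frame_version` and `SUPPORTED_VERSIONS` are hypothetical names used only for illustration, not functions from the repo.

```python
SUPPORTED_VERSIONS = (1, 2)  # hypothetical constant for this sketch


def resolve_feature_frame_version(metadata: dict[str, object]) -> int:
    """Return the recorded feature frame version, defaulting to 1.

    Legacy bundles saved before V2 carry no `feature_frame_version` key,
    so they resolve to V1 and keep using the V1 builders.
    """
    version = metadata.get("feature_frame_version", 1)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(version, bool) or not isinstance(version, int):
        raise ValueError(f"unsupported feature_frame_version: {version!r}")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported feature_frame_version: {version!r}")
    return version


legacy_bundle_metadata = {"feature_columns": ["lag_1", "lag_7"]}
v2_bundle_metadata = {
    "feature_columns": ["lag_1", "lag_7", "lag_364"],
    "feature_frame_version": 2,
}

print(resolve_feature_frame_version(legacy_bundle_metadata))  # legacy -> V1
print(resolve_feature_frame_version(v2_bundle_metadata))  # explicit V2
```

Failing loudly on an unknown version, rather than silently falling back to V1, is a design choice of this sketch: a bundle trained against a future frame version should not be scored with V1 columns by accident.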
+ +### Technical requirements + +- Pydantic v2 strict mode on every new request schema + (`ConfigDict(strict=True)` + `Field(strict=False, ...)` on `date`/`datetime`/`UUID`/`Decimal` + fields — see `docs/_base/SECURITY.md` § "Pydantic v2 strict mode on + FastAPI request bodies"). +- All new SQL queries use SQLAlchemy 2.0 parameter binding and time-safe + `<= cutoff_date` filters at the SQL boundary. +- All five validation gates pass: `ruff check` + `ruff format --check` + + `mypy --strict` + `pyright --strict` + `pytest`. +- `app/shared/feature_frames/**` remains leaf-level (the AST-walk invariant + in `tests/test_contract.py` continues to assert no `app.features.*` + import). + +### Success Criteria + +- [ ] V1 leakage spec (`app/shared/feature_frames/tests/test_leakage.py`) + passes unchanged. **Not weakened.** +- [ ] New V2 leakage spec (`app/shared/feature_frames/tests/test_leakage_v2.py`) + passes; every V2 column has at least one assertion proving it cannot read + a future target. +- [ ] A V1 bundle saved before this PRP loads, predicts, scenario-simulates, + and backtests with no errors — V1/V2 dispatch is transparent. +- [ ] A V2 training request produces a bundle whose `metadata` carries + `feature_frame_version=2`, + `feature_columns=[…]`, + `feature_groups={group_name: [columns]}`, and + `feature_safety_classes={column: "safe"|"conditionally_safe"|"unsafe_unless_supplied"}`. +- [ ] V2 future-frame assembly emits `NaN` for every cell whose source day + > T (long lag, rolling, trend, stockout-window, replenishment-window, + returns-window). +- [ ] All four `lag_*` and `same_dow_mean_*` cells at horizon day `j` are + `NaN` exactly when `(j-1) - k >= 0` (the V1 invariant generalised). +- [ ] `lag_364` (not `lag_365`) is the canonical yearly lag (verified DOW + preservation). +- [ ] No cross-slice import — `app/shared/feature_frames/**` imports + nothing from `app.features.**` (AST-walk invariant test passes).
+- [ ] All five validation gates green: `uv run ruff check . && uv run ruff + format --check . && uv run mypy app/ && uv run pyright app/ && uv run + pytest -v -m "not integration"`. + +--- + +## All Needed Context + +### Documentation & References + +```yaml +- file: app/shared/feature_frames/contract.py + why: V1 single source of truth — pinned constants, canonical columns, FeatureSafety taxonomy, pure long-lag + calendar builders. The "shape of V2" must mirror this file exactly. + +- file: app/shared/feature_frames/rows.py + why: V1 row assemblers (historical + future). V2 row assemblers mirror these two functions (build_historical_feature_rows_v2 / build_future_feature_rows_v2). + +- file: app/shared/feature_frames/__init__.py + why: V1 public surface. V2 names are added to __all__ alongside (not replacing) V1. + +- file: app/shared/feature_frames/tests/test_contract.py + why: V1 contract tests AND the AST-walk invariant that pins "shared/** never imports features/**". V2 tests follow the same style; the AST-walk must still pass on V2 modules. + +- file: app/shared/feature_frames/tests/test_leakage.py + why: V1 load-bearing leakage spec. MUST stay byte-stable. V2's parallel spec at tests/test_leakage_v2.py uses the same idioms (sequential targets so leakage is mathematically detectable; disjoint future-target set; pytest.mark.parametrize over gap values). + +- file: app/features/forecasting/service.py + why: Where `_build_regression_features` lives (line 515). V2 adds a sibling `_build_regression_features_v2` and a router method `_build_regression_features` (no version) that dispatches on `request.feature_frame_version`. Bundle metadata is enriched at line 280-287. + +- file: app/features/forecasting/persistence.py + why: ModelBundle and save/load. No schema change — `metadata: dict[str, object]` already accepts arbitrary keys. V2 metadata fields ride in there. Load-side back-compat: `bundle.metadata.get("feature_frame_version", 1)` defaults V1. 
+ +- file: app/features/forecasting/schemas.py + why: TrainRequest at line 284 — strict=True with date_type Field(strict=False) for FastAPI JSON-body compatibility (docs/_base/SECURITY.md). New `feature_frame_version: int = 1` and `feature_groups: list[str] | None = None` fields added here. + +- file: app/features/scenarios/feature_frame.py + why: build_future_frame (line 232) already reads `feature_columns` from the bundle and threads it through. V2 work here: the assemble_future_frame function (line 181) needs a V2 branch that consumes V2 sidecars (lifecycle, knowable-only) for assumption-driven V2 columns. Where a V2 column has no future input (e.g. weather forecast), it stays NaN. + +- file: app/features/backtesting/service.py + why: Calls build_historical_feature_rows (line 493) and build_future_feature_rows (line 553) WITHOUT a feature_columns argument — so today's path hard-uses canonical_feature_columns() (V1). V2 work: pass the bundle's recorded version + columns through, dispatch to V1 or V2 builders. + +- file: app/features/featuresets/service.py + why: PATTERN ONLY (no import). Existing rolling / trend / stockout / lifecycle / promotion / replenishment compute idioms — V2 builders mirror the safety idioms (groupby(entity).shift(1).rolling(window) for time-safe rolling) without importing this slice. + +- file: app/features/data_platform/models.py + why: Authoritative ORM for `inventory_snapshot_daily` (lines 345-383), `replenishment_event` (471-514), `sales_returns` (439-468), `exogenous_signal` (386-436), `promotion` (274-342), `product` (68-126). V2 loaders read these tables directly. + +- file: app/features/forecasting/tests/test_regression_features_leakage.py + why: V1 forecasting-specific leakage spec — pattern for V2 to mirror at app/features/forecasting/tests/test_regression_features_v2_leakage.py. 
+ +- file: app/features/scenarios/tests/test_future_frame_leakage.py + why: V1 scenarios leakage spec — pattern for V2 future-frame leakage tests in scenarios slice. + +- url: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html + section: "Missing values support" + critical: HGBR tolerates NaN natively in both fit() and predict(). Verified in this codebase at `uv run python -c "...HistGradientBoostingRegressor; m.fit(X_with_nan, y)..."` (PRP § Known Gotchas). + +- url: https://pandas.pydata.org/docs/user_guide/timeseries.html + section: "Rolling windows" + critical: Default `min_periods` equals the window size. Verified: `pd.Series([1..8]).rolling(3).mean()` returns [nan, nan, 2.0, 3.0, ...]. The leakage-safe idiom is `s.shift(1).rolling(window).mean()` — V2 rolling features use this composition. + +- url: https://scikit-learn.org/stable/auto_examples/applications/plot_time_series_lagged_features.html + section: "Cyclical / lagged features" + critical: The lag + calendar + cyclical pattern this PRP extends. + +- docfile: PRPs/ai_docs/exogenous-regressor-forecasting.md + why: Pre-existing ai_doc on past vs future covariates terminology — useful framing for the V2 OBSERVED_ONLY note. + +- file: docs/DATA-SEEDER.md + why: Documents what the seeder produces for inventory, replenishment, returns, exogenous signals, markdowns, bundles — i.e. what V2 sidecar loaders will see. + +- file: docs/_base/SECURITY.md + section: "Pydantic v2 strict mode on FastAPI request bodies" + critical: Every new request-body field whose Python type lacks a native JSON representation (date, datetime, UUID, Decimal) MUST carry `Field(strict=False, ...)` to avoid breaking JSON-string inputs. `feature_frame_version: int` and `feature_groups: list[str] | None` are JSON-native so they need no override. 
+ +- file: docs/_base/RULES.md + why: NEVER weaken the leakage tests; NEVER skip mypy/pyright strict; NEVER edit a merged Alembic migration; NEVER widen the agent's mutation surface without updating agent_require_approval. (None of those are violated by this PRP — it adds no migrations, no agent tools, no mutating endpoints.) + +- file: PRPs/PRP-29-feature-aware-forecasting-foundation.md + why: The V1 PRP. Read for tone, structure, and to see how the "feature contract is the source of truth" principle was originally landed. V2 inherits all of its safety idioms. + +- file: PRPs/PRP-MLZOO-B.2-feature-aware-backtesting.md + why: The PRP that promoted the row assemblers from forecasting to app/shared. Documents how the historical / future asymmetry was solved. +``` + +### Current Codebase tree (relevant slice) + +``` +app/ +├── shared/ +│ └── feature_frames/ +│ ├── __init__.py # V1 public surface (and where V2 names will be added) +│ ├── contract.py # V1 — pinned constants, canonical columns, taxonomy, pure builders +│ ├── rows.py # V1 — historical and future row assemblers +│ └── tests/ +│ ├── __init__.py +│ ├── test_contract.py # V1 contract tests + AST-walk leaf-level invariant +│ └── test_leakage.py # V1 load-bearing leakage spec — DO NOT WEAKEN +├── features/ +│ ├── forecasting/ +│ │ ├── service.py # _build_regression_features (V1) at line 515 +│ │ ├── persistence.py # ModelBundle.metadata is dict[str, object] — V2 metadata rides here +│ │ ├── schemas.py # TrainRequest at line 284; ModelConfig union at 268 +│ │ ├── models.py # BaseForecaster.requires_features at line 109 +│ │ └── tests/test_regression_features_leakage.py # V1 forecasting leakage spec — DO NOT WEAKEN +│ ├── backtesting/ +│ │ └── service.py # calls V1 row builders at lines 493, 553 — V2 dispatch lands here +│ ├── scenarios/ +│ │ ├── feature_frame.py # assemble_future_frame at line 181, build_future_frame at line 232 +│ │ └── tests/ +│ │ ├── test_future_frame_leakage.py # V1 scenarios leakage spec — 
DO NOT WEAKEN +│ │ └── test_leakage.py +│ ├── featuresets/ +│ │ ├── service.py # PATTERN ONLY (rolling/trend/stockout/lifecycle compute idioms) +│ │ └── tests/test_leakage.py # other load-bearing leakage spec — DO NOT WEAKEN +│ └── data_platform/ +│ └── models.py # sidecar ORM: InventorySnapshotDaily, ReplenishmentEvent, SalesReturn, ExogenousSignal, Promotion, Product +└── core/ + └── config.py # Settings — no new keys needed +``` + +### Desired Codebase tree (new files) + +``` +app/ +├── shared/ +│ └── feature_frames/ +│ ├── __init__.py # MODIFIED — adds V2 exports next to V1 +│ ├── contract.py # UNCHANGED +│ ├── contract_v2.py # NEW — V2 column manifest, group taxonomy, pure pandas-free builders +│ ├── rows.py # UNCHANGED +│ ├── rows_v2.py # NEW — V2 historical + future row assemblers +│ ├── sidecar.py # NEW — V2HistoricalSidecar / V2FutureSidecar dataclasses (pure data carriers) +│ └── tests/ +│ ├── test_contract.py # UNCHANGED (still asserts AST-walk against new files) +│ ├── test_leakage.py # UNCHANGED — DO NOT WEAKEN +│ ├── test_contract_v2.py # NEW — V2 contract + taxonomy + group manifest tests +│ └── test_leakage_v2.py # NEW — LOAD-BEARING V2 leakage spec (mirror of test_leakage.py) +├── features/ +│ ├── forecasting/ +│ │ ├── service.py # MODIFIED — V2 dispatch + _build_regression_features_v2 + V2 metadata persistence +│ │ ├── schemas.py # MODIFIED — TrainRequest gains feature_frame_version + feature_groups; FeatureMetadataResponse gains V2 fields (additive) +│ │ ├── v2_loaders.py # NEW — async sidecar loaders (inventory, replenishment, returns, exogenous, promotion, lifecycle); leaf-level wrt other slices +│ │ └── tests/ +│ │ ├── test_regression_features_leakage.py # UNCHANGED — DO NOT WEAKEN +│ │ ├── test_regression_features_v2_leakage.py # NEW — V2 leakage spec at the forecasting-slice layer +│ │ ├── test_v2_loaders.py # NEW — DB integration tests for the loaders +│ │ └── test_service_v2.py # NEW — end-to-end V2 train test (integration; uses 
docker-compose Postgres) +│ ├── backtesting/ +│ │ ├── service.py # MODIFIED — read feature_frame_version from bundle; dispatch row builders +│ │ └── tests/ +│ │ └── test_feature_aware_backtest_v2.py # NEW — V2 fold leakage test +│ └── scenarios/ +│ ├── feature_frame.py # MODIFIED — assemble_future_frame dispatches on feature_frame_version from bundle/metadata +│ └── tests/ +│ └── test_future_frame_v2_leakage.py # NEW — V2 scenarios leakage spec +└── examples/ + └── forecasting/ + └── feature_frame_v2_preview.py # NEW — read-only V1 vs V2 column dump for a (store, product) pair +``` + +### Known Gotchas of our codebase & Library Quirks + +```python +# ───────────────────────────────────────────────────────────────────────── +# CRITICAL: V1 must keep working. Three concrete risks to avoid: +# ───────────────────────────────────────────────────────────────────────── + +# 1. config_hash() drift +# `app/features/forecasting/schemas.py:43-50` hashes the entire +# `model_dump_json()`. Adding `feature_frame_version` to `ModelConfigBase` +# would silently change *every* V1 config's hash, breaking the registry +# dedup key and orphaning every "champion"/"production" alias. +# POLICY: put `feature_frame_version` on `TrainRequest`, NOT on +# `ModelConfigBase`. Bundle metadata records the resolved version. + +# 2. Backtesting hard-codes canonical_feature_columns() at the builder call +# site (`app/features/backtesting/service.py:493, 553`). The V1 builders +# today internally call canonical_feature_columns(); they have no +# `feature_columns` or `feature_frame_version` parameter and are NOT to +# be modified by this PRP (V1 is frozen — Core Principle #1). For V2: +# - DO NOT add `feature_frame_version`, `feature_columns`, or any other +# parameter to V1 `build_historical_feature_rows` / +# `build_future_feature_rows` — V1 signatures, return types, and +# bodies remain byte-stable. 
+# - DO ship NEW sibling functions `build_historical_feature_rows_v2` and +# `build_future_feature_rows_v2` in `app/shared/feature_frames/rows_v2.py` +# (Task 3). V2 callers invoke the V2 functions; V1 callers continue to +# invoke the V1 functions unchanged. +# - Dispatch (V1 vs V2) happens EXCLUSIVELY at the service layer — +# `forecasting/service.py` train_model branches on +# `request.feature_frame_version`; `backtesting/service.py` and +# `scenarios/feature_frame.py` read `feature_frame_version` from the +# bundle metadata. `app/shared/feature_frames/` itself contains no +# runtime dispatch logic. +# - When `feature_frame_version` is absent from a bundle's metadata, +# service-layer code defaults it to 1 (`bundle.metadata.get( +# "feature_frame_version", 1)`) — legacy bundles route to V1 builders +# unchanged. + +# 3. The load-bearing leakage tests use SEQUENTIAL targets (1.0, 2.0, ..., +# 60.0) so any leakage is mathematically detectable. V2 leakage tests +# use the same trick PLUS a DISJOINT future-target set +# ({9000.0..9999.0}) for the future-frame builder so leakage is +# detectable by set membership. Mirror exactly. + +# ───────────────────────────────────────────────────────────────────────── +# Library verifications (run before locking PRP claims, mandated by +# the prp-create skill's "Third-party API runtime verification" rule): +# ───────────────────────────────────────────────────────────────────────── + +# VERIFIED: HistGradientBoostingRegressor tolerates NaN in fit() and predict() +# uv run python -c " +# from sklearn.ensemble import HistGradientBoostingRegressor +# import numpy as np +# X = np.array([[1.0, np.nan], [2.0, 0.5], [3.0, 1.5], [4.0, np.nan], [5.0, 2.5]]) +# y = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) +# m = HistGradientBoostingRegressor(max_iter=5); m.fit(X, y) +# print(m.predict(np.array([[6.0, np.nan]]))[0]) +# " +# Output: 3.0 (no exception). sklearn 1.8.0. 
+ +# VERIFIED: pandas .rolling(window).mean() default min_periods == window +# uv run python -c " +# import pandas as pd +# s = pd.Series([1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0]) +# print(list(s.rolling(3).mean())) +# " +# Output: [nan, nan, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]. pandas 3.0.3. +# Use this default — the leading NaNs are the leakage-safe answer. +# V2 rolling uses `s.shift(1).rolling(window).mean()` so row i reads +# strictly earlier observations only. + +# VERIFIED: lag_364 preserves day-of-week; lag_365 does NOT +# uv run python -c " +# from datetime import date, timedelta +# d = date(2026, 6, 15) # Monday +# print((d - timedelta(days=364)).weekday(), # 0 = Monday — PRESERVED +# (d - timedelta(days=365)).weekday()) # 6 = Sunday — shifted +# " +# POLICY: V2 uses `lag_364` for "same weekday last year". The INITIAL's +# open design decision is RESOLVED in favour of lag_364. + +# VERIFIED: joblib round-trips arbitrary metadata dicts; legacy-bundle +# back-compat via dict.get(key, default) +# uv run python -c " +# import joblib, tempfile, os +# sample = {'feature_columns': ['lag_1','lag_7'], 'feature_frame_version': 2} +# with tempfile.NamedTemporaryFile(suffix='.joblib', delete=False) as f: +# joblib.dump(sample, f.name); fname = f.name +# loaded = joblib.load(fname); os.unlink(fname) +# legacy = {'feature_columns': ['lag_1','lag_7']} +# print(loaded == sample, legacy.get('feature_frame_version', 1)) +# " +# Output: True 1. joblib 1.5.3. + +# ───────────────────────────────────────────────────────────────────────── +# Repo-specific failure modes to avoid (anchored in memory + prior PRPs): +# ───────────────────────────────────────────────────────────────────────── + +# - DO NOT cite `HistGradientBoostingRegressor.feature_importances_`: it +# does not exist on HGBR; sklearn exposes it on `GradientBoostingRegressor` +# and other non-histogram tree ensembles, not on the histogram-based +# estimators (memory `histgbr-no-feature-importances`, issue #258).
V2 leaves +# feature-importance extraction untouched in this PRP; Slice B owns model +# work. + +# - SimpleImputer in sklearn 1.2+ defaults to `keep_empty_features=False`, +# silently dropping all-NaN columns and shortening downstream coef arrays +# (memory `simpleimputer-drops-empty-columns`). V2 does NOT use +# SimpleImputer at the row-builder layer — the matrix carries NaN +# directly to HGBR. If a downstream consumer adds imputation later +# (Slice B / a new ridge model), it MUST pass `keep_empty_features=True`. + +# - Pydantic v2 strict mode + FastAPI: `ConfigDict(strict=True)` on a request +# body causes FastAPI to reject ISO-string date inputs (a 422 storm). +# `feature_frame_version: int` and `feature_groups: list[str] | None` are +# JSON-native so they need no `Field(strict=False, ...)` override. + +# - app/shared/** never imports app/features/** — the AST-walk invariant in +# tests/test_contract.py catches violations. V2 sidecar dataclasses live +# in app/shared/feature_frames/sidecar.py and stay leaf-level; the DB +# loading lives in app/features/forecasting/v2_loaders.py. + +# - Backtesting cross-slice rule: `backtesting -> forecasting` is forbidden; +# `backtesting -> app/shared` is allowed. V2 dispatch in backtesting reads +# feature_frame_version from the bundle.metadata (not from a forecasting +# service call) and routes to app/shared/feature_frames/rows_v2. + +# - Mixed line endings warning (memory `repo-line-endings-crlf`): on this +# host some files are CRLF and Edit/Write emit LF. Check `git diff --stat` +# before committing any modified file to avoid whole-file noise diffs. 
+``` + +--- + +## Implementation Blueprint + +### Data models and structure + +```python +# ─── app/shared/feature_frames/contract_v2.py ───────────────────────────── +from enum import Enum +from dataclasses import dataclass + +# Version tag (also persisted to bundle metadata) +FEATURE_FRAME_VERSION_V1: int = 1 +FEATURE_FRAME_VERSION_V2: int = 2 + +# Pinned V2 modelling constants — DECISIONS LOCKED in this PRP +EXOGENOUS_LAGS_V2: tuple[int, ...] = (1, 7, 14, 28, 56, 364) # lag_364 (DOW-aligned) +ROLLING_WINDOWS_V2: tuple[int, ...] = (7, 28, 90) # same-DOW-mean uses (4, 8) +TREND_WINDOWS_V2: tuple[int, ...] = (30, 90) +STOCKOUT_WINDOWS_V2: tuple[int, ...] = (7, 28) +REPLENISHMENT_WINDOWS_V2: tuple[int, ...] = (14, 28) +RETURNS_WINDOWS_V2: tuple[int, ...] = (7, 28) +INVENTORY_AVAILABILITY_WINDOW_V2: int = 28 +# Observed-target tail length: max(EXOGENOUS_LAGS_V2 + ROLLING_WINDOWS_V2) + safety +HISTORY_TAIL_DAYS_V2: int = 400 # >= 364 + 28 buffer + +# Feature groups (used to enable/disable + label in Slice C metadata) +class FeatureGroup(str, Enum): + TARGET_HISTORY = "target_history" # lag_1, lag_7, ..., lag_364, same_dow_mean_* + ROLLING = "rolling" # rolling_mean_7/28/90, rolling_median_28, rolling_std_28 + TREND = "trend" # trend_30, trend_90, rolling_mean_7_vs_28, rolling_mean_28_vs_prev_28 + CALENDAR = "calendar" # V1 calendar + week_of_year_sin/cos, day_of_month_sin/cos + PRICE_PROMO = "price_promo" # V1 price_factor/promo_active + promo_discount_pct, promo_kind_markdown_active, promo_kind_bundle_active + INVENTORY = "inventory" # is_stockout_lag1, stockout_days_7/28, inventory_available_ratio_28 + LIFECYCLE = "lifecycle" # days_since_launch, is_new_product, is_mature_product, is_discontinued, days_until_discontinue + REPLENISHMENT = "replenishment" # days_since_last_replenishment, replenishment_count_14, replenishment_qty_28 + RETURNS = "returns" # returns_qty_7, returns_qty_28, returns_rate_28 + EXOGENOUS_WEATHER = "exogenous_weather" # store-specific weather 
signals (NaN if unavailable in future) + EXOGENOUS_MACRO = "exogenous_macro" # global macro signals (NaN if unavailable in future) + +# Default V2 groups when feature_groups is None — every group with a fully- +# determinate future projection. Phase 2 sidecars off by default to keep +# the MVP green on smaller seeded DBs. +DEFAULT_V2_GROUPS: tuple[FeatureGroup, ...] = ( + FeatureGroup.TARGET_HISTORY, + FeatureGroup.ROLLING, + FeatureGroup.TREND, + FeatureGroup.CALENDAR, + FeatureGroup.PRICE_PROMO, + FeatureGroup.LIFECYCLE, +) + +@dataclass(frozen=True) +class V2ColumnSpec: + """One V2 feature column — name, group, safety class.""" + name: str + group: FeatureGroup + safety: FeatureSafety # SAFE | CONDITIONALLY_SAFE | UNSAFE_UNLESS_SUPPLIED + +def v2_column_manifest( + groups: tuple[FeatureGroup, ...] = DEFAULT_V2_GROUPS, +) -> list[V2ColumnSpec]: + """The ordered, canonical V2 column manifest for the given groups. + Order: target_history → calendar → rolling → trend → price_promo → + inventory → lifecycle → replenishment → returns → exogenous_* + """ + ... + +def canonical_feature_columns_v2( + groups: tuple[FeatureGroup, ...] = DEFAULT_V2_GROUPS, +) -> list[str]: + """Equivalent of canonical_feature_columns() for V2.""" + return [spec.name for spec in v2_column_manifest(groups)] + + +# ─── app/shared/feature_frames/sidecar.py ───────────────────────────────── +from datetime import date + +@dataclass(frozen=True) +class V2HistoricalSidecar: + """Pure data carrier for everything V2 historical builder needs beyond + the V1 inputs. + + Alignment contract (ENFORCED — violation → ValueError in the builder): + - Every per-day array (on_hand_qty, is_stockout_per_day, returns_qty_per_day, + promo_kinds_per_day, promo_discount_pct_per_day) has length equal to + `len(dates)` whenever its owning group is enabled. 
+ - Sets / mappings (promo_dates, holiday_dates, weather_per_day, + macro_per_day) are queried by membership; absent keys for a given date + → NaN at that cell, never zero-fill. + - replenishment_event_dates / replenishment_event_qty are event-time + (one entry per event), NOT per-day-aligned; length parity between + these two tuples is the only alignment invariant. + + Group enablement vs. data presence: + - If a FeatureGroup is NOT passed in the builder's `groups` argument, + this sidecar's corresponding fields MAY be empty (the builder won't + read them) and NO column for that group is emitted. + - If a FeatureGroup IS in `groups` but a specific day has no source + data inside the matching sidecar field (e.g. `on_hand_qty[i] is None`, + no replenishment event before day i, missing weather entry for the + date), the column cell at row i is NaN. HGBR consumes NaN directly. + - If a FeatureGroup IS in `groups` and its sidecar field's per-day array + length disagrees with `len(dates)`, the builder raises ValueError — + that's a programmer/contract error, not a "missing data" case. + """ + # V1 carryover + promo_dates: set[date] + holiday_dates: set[date] + launch_date: date | None + # Lifecycle + discontinue_date: date | None + # Inventory (per-day, aligned with dates) + on_hand_qty: tuple[float | None, ...] + is_stockout_per_day: tuple[bool, ...] + # Replenishment (timestamps, NOT per-day) + replenishment_event_dates: tuple[date, ...] + replenishment_event_qty: tuple[int, ...] + # Returns (per-day quantity, 0 when no return) + returns_qty_per_day: tuple[int, ...] + # Promotion (per-day kind set + discount pct) + promo_kinds_per_day: tuple[frozenset[str], ...] # {"pct_off","markdown","bogo","bundle"} subset per day + promo_discount_pct_per_day: tuple[float, ...] 
# 0.0 when no discount; else 0.0..1.0 + # Exogenous (date → signal_name → value) + weather_per_day: dict[date, dict[str, float]] + macro_per_day: dict[date, dict[str, float]] + +@dataclass(frozen=True) +class V2FutureSidecar: + """Inputs the future-frame builder accepts when re-forecasting. + EVERY field is either knowable at origin T (calendar, launch date, + discontinue_date), or *posited by the caller as an assumption* + (price, promotion, holiday); for the truly-unknowable groups + (weather, macro) the caller MAY supply observed-then-projected values + or leave them None → the future column is NaN. + """ + holiday_dates: set[date] # calendar + scenario assumption + launch_date: date | None + discontinue_date: date | None + # Future inputs — None means "not posited" → corresponding column = NaN + price_factor_per_day: tuple[float | None, ...] + promo_active_per_day: tuple[bool, ...] + promo_kinds_per_day: tuple[frozenset[str], ...] + promo_discount_pct_per_day: tuple[float, ...] + # Phase 2 future inputs — typically None for V2 MVP + inventory_on_hand_per_day: tuple[float | None, ...] + weather_per_day: dict[date, dict[str, float]] + macro_per_day: dict[date, dict[str, float]] +``` + +### List of tasks to be completed (dependency-ordered) + +```yaml +Task 1 — CREATE app/shared/feature_frames/contract_v2.py: + - DEFINE FEATURE_FRAME_VERSION_V1 = 1, FEATURE_FRAME_VERSION_V2 = 2 + - DEFINE pinned constants (EXOGENOUS_LAGS_V2, ROLLING_WINDOWS_V2, etc.) 
+ - DEFINE FeatureGroup enum with the 11 groups from the data model above + - DEFINE V2ColumnSpec frozen dataclass + - IMPLEMENT v2_column_manifest(groups) → list[V2ColumnSpec] (ordered: target_history → calendar → rolling → trend → price_promo → inventory → lifecycle → replenishment → returns → weather → macro) + - IMPLEMENT canonical_feature_columns_v2(groups) → list[str] + - IMPLEMENT v2_feature_groups_dict(columns) → dict[str, list[str]] (group_name → columns) + - IMPLEMENT v2_feature_safety_classes(columns) → dict[str, str] (column → safety.value) + - PURE: stdlib only (math, datetime, dataclasses, enum); never imports app.features.* + - MIRROR the V1 docstring conventions (load-bearing leakage rule restated) + - VERIFY: every column in DEFAULT_V2_GROUPS resolves through feature_safety_v2(column) + +Task 2 — CREATE app/shared/feature_frames/sidecar.py: + - DEFINE V2HistoricalSidecar frozen dataclass (per data-model section above) + - DEFINE V2FutureSidecar frozen dataclass (per data-model section above) + - PURE: stdlib only; never imports app.features.* + - DOC: explain the alignment invariants (all per-day arrays align with `dates`; replenishment_event_* is event-time not day-time) + +Task 3 — CREATE app/shared/feature_frames/rows_v2.py: + - IMPLEMENT build_historical_feature_rows_v2( + *, dates, quantities, prices, baseline_price, sidecar: V2HistoricalSidecar, groups: tuple[FeatureGroup, ...] 
+ ) -> list[list[float]] + - IMPLEMENT build_future_feature_rows_v2( + *, test_dates, history_tail, gap, baseline_price, sidecar: V2FutureSidecar, history_tail_dates: list[date], history_tail_stockouts: list[bool], history_tail_replenishment_dates: list[date], history_tail_returns_qty: list[int], groups + ) -> list[list[float]] + - REUSE V1 builders: build_long_lag_columns, build_calendar_columns + - EXTEND lags: add lag_56, lag_364 by parameterising V1 build_long_lag_columns with EXOGENOUS_LAGS_V2 + - ADD same_dow_mean_4, same_dow_mean_8: helper that picks the 4 (or 8) same-weekday observations before each row + - ADD rolling_mean_7/28/90, rolling_median_28, rolling_std_28: leakage-safe via "history_tail[-W..-1]" indexing (pure Python; no pandas needed — the tail is at most HISTORY_TAIL_DAYS_V2) + - ADD trend_30, trend_90: linear-slope over the trailing W days (numpy.polyfit on the tail) + - ADD rolling_mean_7_vs_28, rolling_mean_28_vs_prev_28: ratio columns (NaN-safe division) + - ADD week_of_year_sin/cos, day_of_month_sin/cos: pure date functions + - ADD promo_discount_pct, promo_kind_markdown_active, promo_kind_bundle_active: from sidecar.promo_kinds_per_day and promo_discount_pct_per_day + - ADD is_stockout_lag1, stockout_days_7/28, inventory_available_ratio_28: stockout windows + on_hand / max(on_hand-history) ratio + - ADD is_new_product, is_mature_product, is_discontinued, days_until_discontinue: derived from launch_date + discontinue_date thresholds (intro ≤ 30d, mature ≥ 180d) + - ADD days_since_last_replenishment, replenishment_count_14, replenishment_qty_28: from sidecar.replenishment_event_dates + - ADD returns_qty_7/28, returns_rate_28: from sidecar.returns_qty_per_day; rate = returns_qty / max(sales_qty, 1) + - For future builder: NaN-where-future is enforced cell-by-cell — NEVER read history_tail beyond the supplied tail; NEVER fabricate a value when source day > T + - GROUP-GATED COLUMN EMISSION: the column manifest is derived ENTIRELY from the 
`groups` parameter. If a FeatureGroup is NOT in `groups`, NO column from that group appears in the output matrix or in `feature_columns`. (i.e. disabled group = silent omission, not NaN-filled placeholder.) + - PER-CELL NaN: when a group IS enabled but a specific day lacks source data (e.g. INVENTORY enabled but `sidecar.on_hand_qty[i] is None`, REPLENISHMENT enabled but no event has occurred before day i, EXOGENOUS_WEATHER enabled but `sidecar.weather_per_day` has no entry for that date), the corresponding cell is `NaN`. HGBR tolerates NaN; downstream consumers MUST NOT impute with zero. + - LOUD failure (ValueError) — ONLY for programmer / contract errors: + * `groups` is empty (would produce a zero-column matrix — that's a misuse, not "no features"). + * `groups` contains a name that does not match any `FeatureGroup` enum value (unsupported requested group). + * A sidecar per-day array length does not match `len(dates)` (alignment contract violated). + * A sidecar mapping references a date outside the `dates` range when the column's spec requires alignment. + * Required scalar inputs are missing for an enabled group (e.g. INVENTORY enabled but `sidecar.on_hand_qty` field is entirely absent — distinct from "present but all None"). + NEVER raise ValueError merely because a specific day has no source data within an enabled group; that's the NaN case. + - NEVER silent zero-fill any sidecar source — zero is a real demand-domain value (0 units returned, 0 stockout days, $0 discount) and would corrupt the feature signal. Use NaN for "unknown" and let the model see it. 
+ - PURE: stdlib + numpy (for polyfit only); never imports app.features.* + +Task 4 — CREATE app/shared/feature_frames/tests/test_contract_v2.py: + - MIRROR app/shared/feature_frames/tests/test_contract.py structure + - TEST: pinned constants (EXOGENOUS_LAGS_V2, ROLLING_WINDOWS_V2, …) + - TEST: every column in v2_column_manifest(DEFAULT_V2_GROUPS) is classifiable (no KeyError) + - TEST: enabling a subset of groups produces a strict subset of columns + - TEST: column order is stable and deterministic for the same groups input + - TEST: V2 manifest INCLUDES every V1 column at the SAME relative position (V1-then-extensions order in the V1-group subset) + - TEST: the AST-walk in test_contract.py STILL passes (extend it to walk contract_v2.py + rows_v2.py + sidecar.py) + +Task 5 — CREATE app/shared/feature_frames/tests/test_leakage_v2.py — LOAD-BEARING: + - MIRROR app/shared/feature_frames/tests/test_leakage.py exactly in style + - USE sequential targets (1.0..N.0) so leakage is detectable by arithmetic + - USE disjoint future-target set ({9000.0..9999.0}) — any future-target value appearing in a feature cell is a leak + - TEST for every V2 column: the cell at horizon day j is NaN exactly when its source day > T + - PARAMETRIZE over gap = 0, 3, 7 for the future builder + - TEST: rolling_mean_7 at horizon day j=1 is computable (window T-6..T); at j=2 it is NaN (window touches T+1) + - TEST: lag_364 at j=1 is history_tail[-364] (verified DOW-preserving); at j=365 it is NaN + - TEST: stockout_days_7 at j=1 reads only observed stockout flags; at j=2 it is NaN unless the caller supplies projected stockout flags (and the V2 MVP does NOT support that — so always NaN for j>=2) + - DOCSTRING: load-bearing — must never be weakened to make a feature pass (mirror the V1 spec docstring) + +Task 6 — MODIFY app/shared/feature_frames/__init__.py: + - ADD V2 exports (FEATURE_FRAME_VERSION_V1/V2, EXOGENOUS_LAGS_V2, ROLLING_WINDOWS_V2, …, FeatureGroup, V2ColumnSpec, V2HistoricalSidecar, 
V2FutureSidecar, v2_column_manifest, canonical_feature_columns_v2, v2_feature_groups_dict, v2_feature_safety_classes, build_historical_feature_rows_v2, build_future_feature_rows_v2) + - KEEP every V1 export at the same position (back-compat) + - DO NOT introduce a circular import — V2 contract module imports nothing from V1 module (they share constants by VALUE, not by re-export) + +Task 7 — MODIFY app/features/forecasting/schemas.py: + - FIND class TrainRequest (line 284) + - INJECT after line containing `config: ModelConfig` two new fields: + feature_frame_version: int = Field(default=1, ge=1, le=2, description="Which feature contract version to build for this training run. 1 = V1 (default, back-compat); 2 = V2 (opt-in, requires regression / additive / tree feature-aware models).") + feature_groups: list[str] | None = Field(default=None, description="When feature_frame_version=2: optional list of FeatureGroup names to enable (None → DEFAULT_V2_GROUPS). When feature_frame_version=1: MUST be None / omitted; supplying any value returns 422.") + - VALIDATE (model_validator, mode="after"): when feature_frame_version == 1 AND feature_groups is not None → raise ValueError("feature_groups is only valid when feature_frame_version=2"). FastAPI surfaces this as a 422 RFC 7807 problem+json — V1 does NOT silently ignore feature_groups. + - VALIDATE (model_validator, mode="after"): when feature_frame_version == 2 AND feature_groups is not None → every string in feature_groups MUST match a FeatureGroup enum value (raise ValueError → 422 with the offending name). When feature_groups is None at V2, the service layer resolves it to DEFAULT_V2_GROUPS. 
+ - DO NOT touch ModelConfigBase or any ModelConfig — preserves all V1 config_hash values byte-for-byte + - PRESERVE: ConfigDict(strict=True) at the model level + - PRESERVE: train_start_date/train_end_date Field(strict=False) override + +Task 8 — CREATE app/features/forecasting/v2_loaders.py: + - DEFINE async load_lifecycle_attrs(db, product_id) -> tuple[date|None, date|None, str|None] + (launch_date, discontinue_date, lifecycle_stage) + - DEFINE async load_inventory_history(db, store_id, product_id, start_date, end_date) -> dict[date, tuple[int, bool]] + Returns: {date: (on_hand_qty, is_stockout)} — TIME-SAFE filter date <= end_date at SQL boundary + - DEFINE async load_replenishment_history(db, store_id, product_id, start_date, end_date) -> tuple[list[date], list[int]] + Returns: (event_dates, event_qty) sorted ascending — TIME-SAFE filter + - DEFINE async load_returns_history(db, store_id, product_id, start_date, end_date) -> dict[date, int] + Returns: {date: total_return_quantity} — TIME-SAFE filter + - DEFINE async load_promotion_history(db, store_id, product_id, start_date, end_date) -> list[PromoSpan] + PromoSpan = (start_date, end_date, kind, discount_pct) — expand to per-day kind sets at caller + - DEFINE async load_exogenous_history(db, store_id, start_date, end_date, signal_names: list[str] | None) -> dict[date, dict[str, float]] + Returns: {date: {signal_name: value}} — TIME-SAFE filter; per-store + global rows merged + - HELPER: assemble_v2_historical_sidecar(...) — pure synchronous assembly of V2HistoricalSidecar from the loader outputs, given the `dates` list + - HELPER: assemble_v2_future_sidecar(...) 
— pure synchronous assembly of V2FutureSidecar + - PATTERN: mirror app/features/forecasting/service.py:_build_regression_features (uses `select(ColumnSet).where(...).order_by(date)` and `await db.execute(stmt)`) + - SECURITY: every where clause uses SQLAlchemy 2.0 parameter binding (NEVER string concat) + - LOGGING: structlog INFO event per loader on completion with row counts + +Task 9 — MODIFY app/features/forecasting/service.py: + - ADD an enum-style helper `_resolve_feature_frame_version(request_version: int) -> int` (clamp + validate against {1, 2}) + - FIND _build_regression_features (line 515) + - ADD a sibling async method `_build_regression_features_v2(db, store_id, product_id, start_date, end_date, groups: tuple[FeatureGroup, ...]) -> RegressionFeatureMatrix` + - LOAD: sales (already in V1 loader), holidays, promotions (with kind + discount_pct), lifecycle, inventory, replenishment, returns, exogenous (when groups include them) + - ASSEMBLE: V2HistoricalSidecar via the new helper + - BUILD: feature_rows = build_historical_feature_rows_v2(dates=…, quantities=…, prices=…, baseline_price=…, sidecar=…, groups=…) + - history_tail length = HISTORY_TAIL_DAYS_V2 (400) not HISTORY_TAIL_DAYS (90) + - feature_columns = canonical_feature_columns_v2(groups) + - FIND train_model (line 201) + - INJECT a branch on `request.feature_frame_version` (passed in via the routes layer): + if version == 2: + features = await self._build_regression_features_v2(...) + else: + features = await self._build_regression_features(...) 
# unchanged + - EXTEND extra_metadata (line 254) when features were built via V2: + extra_metadata["feature_frame_version"] = 2 + extra_metadata["feature_groups"] = v2_feature_groups_dict(features.feature_columns) + extra_metadata["feature_safety_classes"] = v2_feature_safety_classes(features.feature_columns) + extra_metadata["feature_pinned_constants"] = {"exogenous_lags": list(EXOGENOUS_LAGS_V2), "rolling_windows": list(ROLLING_WINDOWS_V2), ...} + - EXTEND extra_metadata when V1 (additive, harmless): + extra_metadata["feature_frame_version"] = 1 + - PRESERVE: ModelBundle persistence path; persistence.py is unchanged + - PRESERVE: _build_regression_features signature, return type, and body — byte-stable for V1 callers + +Task 10 — MODIFY app/features/forecasting/routes.py: + - FIND the /forecasting/train handler + - THREAD request.feature_frame_version (and request.feature_groups when version=2) into ForecastingService.train_model + - NO change to /forecasting/predict (predict path is version-agnostic; bundle metadata is self-describing) + +Task 11 — MODIFY app/features/scenarios/feature_frame.py: + - FIND build_future_frame (line 232) + - ADD an optional `feature_frame_version: int = 1` parameter (default = 1 → V1 path unchanged byte-for-byte) + - WHEN version == 2: + - PARSE the requested groups from `feature_columns` (read group via v2_feature_groups_dict reverse mapping) + - LOAD discontinue_date + lifecycle attrs via load_lifecycle_attrs (NOT a forecasting-service call; either move the helper to app/shared or duplicate the tiny query — the latter mirrors the existing same-slice ORM-only pattern at lines 271-281) + - ASSEMBLE V2FutureSidecar: holiday_dates (from Calendar table + assumptions.holiday); price_factor_per_day / promo_active_per_day / promo_kinds_per_day / promo_discount_pct_per_day from assumptions; weather/macro/inventory left None (NaN columns in the future frame are acceptable) + - CALL build_future_feature_rows_v2(...) 
+ - WRAP in FutureFeatureFrame + - PRESERVE: V1 dispatch via the assemble_future_frame path (line 181) is byte-stable + - DO NOT cross-slice-import — keep the lifecycle loader inline in this slice (mirror the data_platform.models import already used at line 55) + +Task 12 — MODIFY app/features/scenarios/service.py: + - FIND the `feature_columns = …` cast at the model_exogenous path (~line 213-222 per the explorer report) + - INJECT a sibling read: feature_frame_version = int(bundle.metadata.get("feature_frame_version", 1)) + - THREAD feature_frame_version into build_future_frame (new optional parameter from Task 11) + - V1 bundles (without the metadata key) default to 1 → byte-stable V1 path + +Task 13 — MODIFY app/features/backtesting/service.py: + - FIND the calls to build_historical_feature_rows (line 493) and build_future_feature_rows (line 553) + - READ feature_frame_version from the fitted bundle BEFORE the fold loop: + version = int(getattr(bundle, "metadata", {}).get("feature_frame_version", 1)) + feature_columns = bundle.metadata.get("feature_columns") if version == 2 else None + - WHEN version == 2: + - BEFORE the per-fold work, load the V2 sidecar data ONCE for the full training window and slice per fold + - CALL build_historical_feature_rows_v2(...) 
instead of the V1 builder + - PER fold: CALL build_future_feature_rows_v2(..., test_dates=split.test_dates, history_tail=history_tail_slice, gap=split.gap, sidecar=fold_future_sidecar, groups=…) + - WHEN version == 1: unchanged byte-for-byte + - LOGGING: include feature_frame_version in the fold-start log line + +Task 14 — CREATE app/features/forecasting/tests/test_regression_features_v2_leakage.py: + - MIRROR app/features/forecasting/tests/test_regression_features_leakage.py + - SEQUENTIAL targets so leakage is mathematically detectable + - TEST every V2 column emitted by build_historical_feature_rows_v2: cells read strictly earlier observations only + - TEST: with sequential targets, rolling_mean_7 at row i == mean of quantities[i-7..i-1]; NEVER includes quantities[i] or later + - DOCSTRING: LOAD-BEARING — never weaken + +Task 15 — CREATE app/features/forecasting/tests/test_v2_loaders.py (integration, requires docker-compose): + - SEED a minimal fixture: 1 store, 1 product, 60 days of sales + inventory + a handful of replenishment events + returns + exogenous signals + - TEST load_inventory_history: rows beyond cutoff are NOT returned (time-safe) + - TEST load_replenishment_history: same + - TEST load_returns_history: same + - TEST load_exogenous_history: per-store + global rows merge correctly; signal_name filter narrows the result set + +Task 16 — CREATE app/features/forecasting/tests/test_service_v2.py (integration, requires docker-compose): + - End-to-end: POST a V2 TrainRequest, verify the response, load the saved bundle, assert bundle.metadata contains feature_frame_version=2 and the expected feature_columns / feature_groups / feature_safety_classes + - Assert HGBR can fit + predict on the V2 matrix (the existing model code path) + - Assert V1 → V2 → V1 round-trip: a V1 train + V2 train coexist; no shared state mutation + +Task 17 — CREATE app/features/scenarios/tests/test_future_frame_v2_leakage.py: + - MIRROR test_future_frame_leakage.py + - Build a V2 
future frame against a synthetic V2 bundle (metadata-only — no real estimator needed) + - Assert: every V2 column whose safety class is CONDITIONALLY_SAFE is NaN at j>=2 unless the corresponding sidecar slice was supplied + - Assert: assumption-driven columns (price_factor, promo_active, promo_discount_pct, promo_kind_*) reflect the assumptions exactly + - Assert: weather/macro columns are NaN when sidecar.*_per_day is empty + +Task 18 — CREATE app/features/backtesting/tests/test_feature_aware_backtest_v2.py: + - End-to-end: train a V2 regression model, run a backtest, verify the fold loop dispatched to rows_v2 (assert a fold-start log carries feature_frame_version=2) + - Verify the fold's X_future has the V2 column count + +Task 19 — CREATE examples/forecasting/feature_frame_v2_preview.py: + - Read-only diagnostic script — given a (store_id, product_id) pair and a cutoff_date, prints: + - V1 feature columns + first 3 rows of the V1 matrix + - V2 feature columns + first 3 rows of the V2 matrix + - Per-group NaN counts in V2 (to flag missing sidecar data on smaller seeded DBs) + - Local-development only — no network egress, no DB writes + +Task 20 — UPDATE docs/optional-features/10-baseforecaster-feature-contract.md: + - ADD a "V2" section after the existing V1 contract documentation + - Document the FeatureGroup enum, the default groups, the safety classes, and the NaN-where-future contract + - Cross-reference test_leakage_v2.py as the load-bearing spec + +Task 21 — UPDATE docs/PHASE/3-FEATURE_ENGINEERING.md and docs/PHASE/4-FORECASTING.md: + - Note: V2 is opt-in via TrainRequest.feature_frame_version=2; V1 remains the default and the back-compat path + +Task 22 — VERIFY no Alembic migration is needed: + - V2 reads only existing tables (inventory_snapshot_daily, replenishment_event, sales_returns, exogenous_signal, promotion, product) + - V2 writes nothing to the DB + - No schema change → no migration. 
Verify by running `uv run alembic current` and `uv run alembic check` (no pending revisions). +``` + +### Per task pseudocode (the leakage-critical parts) + +```python +import math +from datetime import date + +# Task 3 — build_historical_feature_rows_v2 (rolling-mean column) +def _rolling_mean_column( + quantities: list[float], + window: int, +) -> list[float]: + """Leakage-safe rolling mean: row i reads quantities[i-window..i-1] ONLY. + The first `window` rows are NaN. + """ + out = [] + for i in range(len(quantities)): + if i < window: + out.append(math.nan) + else: + out.append(sum(quantities[i - window : i]) / window) + return out +# CRITICAL: NEVER include quantities[i] in the slice — that's current-day leakage. + +# Task 3 — build_future_feature_rows_v2 (rolling-mean future column) +def _future_rolling_mean_column( + history_tail: list[float], + horizon: int, + window: int, +) -> list[float]: + """For horizon day j (1..horizon) and window W, the rolling-mean source + window is [T+j-W .. T+j-1]. Its last source day T+j-1 is > T exactly when + j >= 2, for any W >= 1, so the window is fully observed only at j=1 + (window T-W+1..T). For j >= 2, or when the tail is shorter than W, emit NaN. + Assumes gap == 0 (horizon day 1 is T+1); with gap > 0 even the j=1 window + would include unobserved days. + """ + out = [] + for j in range(1, horizon + 1): + if j == 1 and len(history_tail) >= window: + out.append(sum(history_tail[-window:]) / window) + else: + out.append(math.nan) + return out +# CRITICAL: This is the canonical V2 NaN-where-future rule for rolling/trend/window-aggregate features. + +# Task 3 — same_dow_mean_4 +def _same_dow_mean_column( + dates: list[date], + quantities: list[float], + n_back: int, +) -> list[float]: + """For row i with weekday w, average the `n_back` most recent earlier + observations whose weekday is also w. NaN when fewer than n_back are + available.
+ """ + out = [] + for i, day in enumerate(dates): + same_dow = [quantities[j] for j in range(i) if dates[j].weekday() == day.weekday()] + if len(same_dow) >= n_back: + out.append(sum(same_dow[-n_back:]) / n_back) + else: + out.append(math.nan) + return out + +# Task 9 — train_model dispatch (key lines, NOT full code) +async def train_model(self, db, store_id, product_id, train_start_date, train_end_date, config, *, feature_frame_version: int = 1, feature_groups: list[str] | None = None): + model = model_factory(config, random_state=self.settings.forecast_random_seed) + extra_metadata: dict[str, object] = {} + if model.requires_features: + if feature_frame_version == 2: + groups = _resolve_groups(feature_groups) + features = await self._build_regression_features_v2( + db, store_id, product_id, train_start_date, train_end_date, groups=groups, + ) + extra_metadata["feature_frame_version"] = 2 + extra_metadata["feature_groups"] = v2_feature_groups_dict(features.feature_columns) + extra_metadata["feature_safety_classes"] = v2_feature_safety_classes(features.feature_columns) + else: + features = await self._build_regression_features( # unchanged V1 + db, store_id, product_id, train_start_date, train_end_date, + ) + extra_metadata["feature_frame_version"] = 1 # additive; legacy bundles default via .get(..., 1) + model.fit(features.y, features.X) + n_observations = features.n_observations + extra_metadata.update({ + "feature_columns": features.feature_columns, + "history_tail": features.history_tail, + "history_tail_dates": features.history_tail_dates, + "launch_date": features.launch_date_iso, + }) + else: + # … V1 baseline path unchanged … + pass + # … bundle save unchanged … +``` + +### Integration Points + +```yaml +DATABASE: + - migration: NONE — V2 reads only existing tables. Verify with `uv run alembic check`. 
+ - read-only loaders: app/features/forecasting/v2_loaders.py + - time-safe filter: every `where` clause includes `<= cutoff_date` + +CONFIG: + - app/core/config.py: NO new settings keys. V2 reuses forecast_model_artifacts_dir, etc. + - .env.example: unchanged + +ROUTES: + - app/features/forecasting/routes.py: thread request.feature_frame_version and request.feature_groups into ForecastingService.train_model + - app/features/backtesting/routes.py: no change (dispatch happens inside service via bundle metadata) + - app/features/scenarios/routes.py: no change (dispatch happens inside build_future_frame) + - No new endpoint paths + +SCHEMAS: + - app/features/forecasting/schemas.py: + TrainRequest: + + feature_frame_version: int = Field(default=1, ge=1, le=2, description="V1 (default) or V2 feature contract") + + feature_groups: list[str] | None = Field(default=None, description="V2 groups; MUST be None when version=1, else 422") + + @model_validator (mode="after"): when version=1 AND feature_groups is not None → reject (422). When version=2 AND feature_groups is not None → every name must match FeatureGroup (reject unknown names → 422). V1 does NOT silently ignore feature_groups. 
+ - app/features/forecasting/schemas.py (FeatureMetadataResponse): no breaking change — feature_columns already exists; consider adding optional feature_frame_version + feature_groups (purely additive) + +BUNDLE METADATA (additive — no schema migration): + - feature_frame_version: int + - feature_columns: list[str] # already exists for V1 + - feature_groups: dict[str, list[str]] # NEW (V2) + - feature_safety_classes: dict[str, str] # NEW (V2) + - feature_pinned_constants: dict[str, list[int]] # NEW (V2) — for reproducibility audits +``` + +--- + +## Validation Loop + +### Level 1: Syntax & Style + +```bash +# Auto-fix what you can, then re-check +uv run ruff check app/shared/feature_frames app/features/forecasting \ + app/features/backtesting app/features/scenarios --fix +uv run ruff format app/shared/feature_frames app/features/forecasting \ + app/features/backtesting app/features/scenarios +uv run ruff format --check . + +# Strict type checks (BOTH gate merge) +uv run mypy app/ +uv run pyright app/ + +# Expected: zero errors. If errors, READ the message and fix; never silence. 
+``` + +### Level 2: Pure unit tests (no DB) + +```bash +# V1 leakage spec must stay byte-stable +uv run pytest -v app/shared/feature_frames/tests/test_leakage.py +uv run pytest -v app/features/forecasting/tests/test_regression_features_leakage.py +uv run pytest -v app/features/scenarios/tests/test_future_frame_leakage.py +uv run pytest -v app/features/featuresets/tests/test_leakage.py + +# V2 leakage spec — load-bearing, MUST pass on first green run +uv run pytest -v app/shared/feature_frames/tests/test_leakage_v2.py +uv run pytest -v app/shared/feature_frames/tests/test_contract_v2.py +uv run pytest -v app/features/forecasting/tests/test_regression_features_v2_leakage.py +uv run pytest -v app/features/scenarios/tests/test_future_frame_v2_leakage.py + +# Full pure-Python suite — pretest gate +uv run pytest -v -m "not integration" +# Expected: every test in the V1 baseline passes (unchanged); every new V2 test passes. +``` + +### Level 3: Integration tests (real Postgres) + +```bash +# Ensure docker-compose is up +docker compose up -d +uv run alembic upgrade head +uv run python scripts/check_db.py + +# Verify no new migration was introduced (V2 reads only existing tables) +uv run alembic check +# Expected: "no problems detected" — V2 introduces no schema change. 
+ +# DB-touching V2 tests +uv run pytest -v -m integration app/features/forecasting/tests/test_v2_loaders.py +uv run pytest -v -m integration app/features/forecasting/tests/test_service_v2.py +uv run pytest -v -m integration app/features/backtesting/tests/test_feature_aware_backtest_v2.py +``` + +### Level 4: Smoke — V1 round-trip + V2 happy path against the live demo DB + +```bash +# Start backend (or reuse the running one) +uv run uvicorn app.main:app --reload --port 8123 + +# V1 train (back-compat) — feature_frame_version omitted → defaults to 1 +curl -sS -X POST http://localhost:8123/forecasting/train \ + -H 'Content-Type: application/json' \ + -d '{ + "store_id": 15, "product_id": 52, + "train_start_date": "2025-01-01", "train_end_date": "2025-12-31", + "config": {"model_type": "regression"} + }' | jq . +# Expected: 200; bundle saved; the saved bundle metadata.get("feature_frame_version", 1) == 1. + +# V2 train — opt in +curl -sS -X POST http://localhost:8123/forecasting/train \ + -H 'Content-Type: application/json' \ + -d '{ + "store_id": 15, "product_id": 52, + "train_start_date": "2025-01-01", "train_end_date": "2025-12-31", + "config": {"model_type": "regression"}, + "feature_frame_version": 2, + "feature_groups": ["target_history","rolling","trend","calendar","price_promo","lifecycle"] + }' | jq . +# Expected: 200; bundle metadata carries feature_frame_version=2 with the +# right feature_columns / feature_groups / feature_safety_classes shape. + +# V2 scenario simulation against the V2 bundle (no API change required) +# Slice C will surface this in the UI; here we just confirm the dispatch. +curl -sS -X POST http://localhost:8123/scenarios/simulate \ + -H 'Content-Type: application/json' \ + -d '{ "run_id": "", "horizon": 14, "assumptions": {"price": {"start_date":"2026-01-01","end_date":"2026-01-07","change_pct":-0.15}} }' | jq . +# Expected: 200; method="model_exogenous"; comparison populated. 
+ +# Optional: run the preview script +uv run python examples/forecasting/feature_frame_v2_preview.py --store-id 15 --product-id 52 --cutoff-date 2025-12-31 +``` + +--- + +## Final validation Checklist + +- [ ] V1 leakage spec passes unchanged (`app/shared/feature_frames/tests/test_leakage.py`) +- [ ] V1 forecasting leakage spec passes unchanged (`app/features/forecasting/tests/test_regression_features_leakage.py`) +- [ ] V1 scenarios leakage spec passes unchanged (`app/features/scenarios/tests/test_future_frame_leakage.py`) +- [ ] V1 featuresets leakage spec passes unchanged (`app/features/featuresets/tests/test_leakage.py`) +- [ ] AST-walk leaf-level invariant passes — `app/shared/feature_frames/**` imports nothing from `app/features/**` +- [ ] V2 leakage spec passes on first green run (`app/shared/feature_frames/tests/test_leakage_v2.py`) +- [ ] V2 contract tests pass (`app/shared/feature_frames/tests/test_contract_v2.py`) +- [ ] V2 forecasting integration test passes (`app/features/forecasting/tests/test_service_v2.py`) +- [ ] V2 backtest integration test passes +- [ ] V2 scenarios integration test passes +- [ ] V1 bundle (saved pre-PRP) loads and predicts; bundle.metadata.get("feature_frame_version", 1) == 1 +- [ ] V2 bundle round-trip: save → load → predict (via scenarios) → backtest +- [ ] `uv run ruff check . 
&& uv run ruff format --check .` clean +- [ ] `uv run mypy app/` clean (strict) +- [ ] `uv run pyright app/` clean (strict) +- [ ] `uv run pytest -v -m "not integration"` green +- [ ] `uv run pytest -v -m integration` green (with docker-compose up) +- [ ] `uv run alembic check` — no new migration +- [ ] examples/forecasting/feature_frame_v2_preview.py runs against the local DB +- [ ] No new endpoint paths added +- [ ] No new dependencies in pyproject.toml +- [ ] No managed-cloud SDK introduced +- [ ] No agent tool added (no change to `agent_require_approval`) +- [ ] CHANGELOG entry under "Unreleased" (release-please rules — `feat(forecast): …` → PATCH bump pre-1.0) +- [ ] Manual smoke: V1 curl → 200, V2 curl → 200, both bundles round-trip + +--- + +## Open Design Decisions — RESOLVED in this PRP + +The INITIAL listed open design decisions; each is locked here so the +implementer does not relitigate them. + +| # | Decision | Resolution | Why | +|---|----------|------------|-----| +| 1 | `lag_364` vs `lag_365` | **lag_364** | Verified: 364 = 52×7, preserves day-of-week; 365 shifts DOW (verified with `(date - timedelta(days=364)).weekday() == date.weekday()`). | +| 2 | Recursive rolling vs origin-fixed | **Origin-fixed / NaN-where-future** | The leakage-safe MVP. Any rolling window at horizon day j whose source covers a future day emits NaN. Recursion is a separate, riskier feature (Slice B at earliest). | +| 3 | Stockout: feature only or target rewriting | **Feature only** | Target rewriting changes the loss surface and the metric semantics — needs its own PRP. V2 exposes `is_stockout_lag1` / `stockout_days_7/28` / `inventory_available_ratio_28` as features. | +| 4 | Phase 2 exogenous in V2 MVP or optional | **Optional groups** | Defaults are `(TARGET_HISTORY, ROLLING, TREND, CALENDAR, PRICE_PROMO, LIFECYCLE)`. `INVENTORY`, `REPLENISHMENT`, `RETURNS`, `EXOGENOUS_WEATHER`, `EXOGENOUS_MACRO` are off by default — opt-in via `feature_groups` on the request. 
Keeps the MVP green on smaller seeded DBs. | +| 5 | UI labelling | **Bundle metadata carries group names** | `feature_groups: dict[str, list[str]]` in bundle metadata maps every column to its group; Slice C consumes this in the UI. No UI code in this PRP. | +| 6 | Where to put `feature_frame_version` | **`TrainRequest` + bundle metadata** | NOT on `ModelConfigBase` — that would change every existing `config_hash()` value and orphan registry rows / aliases. Put it on the request and persist it to bundle metadata. | +| 7 | History tail length for V2 | **400 days** | max(EXOGENOUS_LAGS_V2) + max(ROLLING_WINDOWS_V2) + buffer = 364 + 28 + 8 = 400. V1's 90 is too short for lag_364. | + +--- + +## Anti-Patterns to Avoid + +- ❌ Don't add `feature_frame_version` to `ModelConfigBase` — it changes every V1 hash. +- ❌ Don't recursively project rolling/trend/stockout features into the future — emit NaN. +- ❌ Don't introduce a new SafetyClass enum value — the three existing classes cover every V2 column. +- ❌ Don't import any sibling slice (`forecasting → featuresets`, `backtesting → forecasting`, `scenarios → forecasting`). Use `app/shared/feature_frames` only. +- ❌ Don't silently zero-fill a sidecar cell when a specific day has no source data — emit NaN and let HGBR handle it. Zero is a real demand-domain value (0 returns, 0 stockout days, $0 discount) and zero-filling would corrupt the signal. +- ❌ Don't NaN-fill columns for a DISABLED feature group — omit those columns entirely. Group enablement (controlled by `groups`) decides which columns appear; data presence decides only their values. +- ❌ Don't raise ValueError because a single day inside an enabled group has no data — that's the NaN case. ValueError is reserved for misaligned sidecar array lengths, an empty `groups` parameter, an unknown group name, or a sidecar field that's entirely missing for an enabled group. +- ❌ Don't weaken any existing leakage spec to make a V2 test pass. 
+- ❌ Don't add an Alembic migration; V2 reads only existing tables. +- ❌ Don't introduce a new endpoint path; opt-in to V2 via the existing `/forecasting/train` body. +- ❌ Don't use SimpleImputer with the default `keep_empty_features=False` (memory `simpleimputer-drops-empty-columns`) — V2 doesn't impute; the matrix carries NaN directly to HGBR. +- ❌ Don't cite `HistGradientBoostingRegressor.feature_importances_` — it does not exist (memory `histgbr-no-feature-importances`). V2 leaves feature-importance extraction untouched in this PRP; that's Slice B / a future PRP. + +--- + +## Confidence + +**Confidence: 8/10** for one-pass implementation success. + +What grounds the 8: +- Every seam is anchored to a file:line, including the surprising ones (backtesting hard-coding `canonical_feature_columns()` at the builder call site; `config_hash()` hashing the full `model_dump_json`). +- Every "open design decision" from the INITIAL is locked with a justification. +- Every cited library default is verified by an executed `uv run python -c …` command, with the output captured in "Known Gotchas". +- The PRP keeps Slice B (new model classes) and Slice C (UI) explicitly out of scope, so the surface stays reviewable. +- V1 byte-stability is enforced by keeping `_build_regression_features` and the V1 builders unchanged; the AST-walk invariant still passes. + +What costs the 2 points: +- The V2 surface is large (≈25 new columns × historical + future builder × leakage tests). A diligent implementer can land it in one branch but it's not a tiny PRP. +- The exact column emission order inside each V2 group has freedom; the PRP locks the group order but allows the implementer to choose within-group ordering as long as the bundle metadata records it. +- Phase 2 sidecar groups (replenishment / returns / exogenous / inventory) are off by default — they get fewer integration tests against the small CI DB. 
Mitigation: the live local DB (HANDOFF.md — 31,420 replenishment events, 9,647 exogenous signals, 8,585 returns) is sufficient to smoke-test them manually before merge. diff --git a/PRPs/PRP-36-forecast-intelligence-B-model-zoo-backtesting.md b/PRPs/PRP-36-forecast-intelligence-B-model-zoo-backtesting.md new file mode 100644 index 00000000..443b3891 --- /dev/null +++ b/PRPs/PRP-36-forecast-intelligence-B-model-zoo-backtesting.md @@ -0,0 +1,1356 @@ +name: "PRP-36 — Forecast Intelligence B: Model Zoo + Backtesting" +description: | + Promote ForecastLabAI's model layer from "a regression model + 3 baselines" + to a disciplined model zoo with fair, leakage-safe comparison. Slice B of the + Forecast Intelligence roadmap + (`PRPs/INITIAL/INITIAL-forecast-intelligence-index.md`). Slice A (PRP-35 — + Feature Frame V2) is a HARD PREREQUISITE; Slice C (PRP-37 — Interactive UI) + is the downstream consumer of every contract added here. + + > **PREREQUISITE — HARD DEPENDENCY ON PRP-35.** + > This PRP MUST NOT execute until PRP-35 (Feature Frame V2) is merged to + > `dev`. The V2 contract — `feature_frame_version`, `feature_columns`, + > `feature_groups`, `feature_safety_classes`, `feature_pinned_constants` + > in `ModelBundle.metadata`, plus `TrainRequest.feature_frame_version` / + > `feature_groups` — is the load-bearing surface this PRP plugs into. + > Task 1 below is a Contract Refresh gate that verifies PRP-35 actually + > landed and patches any drift between the field names this PRP cites + > and what PRP-35 ultimately shipped. **DO NOT start Task 2 if Task 1 + > flags drift; resolve the drift first.** + +## Purpose +A one-pass implementation contract for an AI agent (or human) with access to +the codebase but no prior session context. 
Land richer baselines, sharper +metrics, feature-frame-aware backtests, comparable-run logic for registry + +ops, and full explainability metadata — all without weakening any of the four +load-bearing leakage specs and without modifying the V1 builders (frozen by +PRP-35). + +## Core Principles +1. **PRP-35 is the contract.** V2 surface — `FeatureGroup` enum, the V2 builders, + `bundle.metadata.feature_frame_version`, `TrainRequest.feature_frame_version` + — is imported as-is. This PRP NEVER redefines, extends, or shadows it. +2. **`fit(y, X=None)` / `predict(horizon, X=None)` is the only forecaster + contract.** Every new model class implements `BaseForecaster` exactly, + sets `requires_features` correctly, and is dispatched through `model_factory`. +3. **Leakage safety is the central design constraint.** The four load-bearing + leakage specs MUST stay byte-stable. New backtesting code dispatches via + `bundle.metadata.feature_frame_version` (the seam PRP-35 already built); + it never weakens a leakage assertion to fit a new model in. +4. **Deterministic by default.** Every new model takes a `random_state`, + respects `forecast_random_seed`, runs single-threaded (`n_jobs=1` / + `nthread=1`) when the library has thread-nondeterminism. No stochastic + sampling unless explicitly configured AND reproducible. +5. **Comparable-run discipline.** Champion/challenger and stale-alias + detection MUST require: same `(store_id, product_id)` grain AND + overlapping `data_window_*` AND same `feature_frame_version`. A run with + a different feature_frame_version is NOT comparable — promoting one + would silently change the contract the alias points at. +6. **HGBR has no `feature_importances_`.** Verified at runtime (see "Known + Gotchas"). The existing `FeatureImportanceUnavailableError` keeps this + honest; this PRP does not relitigate it. New tree models + (`random_forest` if added) DO expose `feature_importances_` and use it. +7. 
**Optional extras stay opt-in.** `lightgbm` and `xgboost` are off in the + default environment. New optional model `random_forest` uses + `scikit-learn` (already a core dep) so it can ship without a new extra. + +--- + +## Goal + +Deliver, on branch `feat/forecast-model-zoo-and-backtesting`, an end-to-end +disciplined model zoo against the V2 feature contract that PRP-35 lands: + +- New target-only baseline models `weighted_moving_average` and + `seasonal_average` (always-on); `trend_regression_baseline` OPTIONAL but + scoped here; `random_forest` OPTIONAL feature-aware model (pure-sklearn). +- Conservative, deterministic config tightening for existing feature-aware + models (`regression`, `prophet_like`, `lightgbm`, `xgboost`) — no new + classes, no behavioural surprise for in-flight bundles. +- Backtesting that: + - Compares baselines AND feature-aware models on identical fold boundaries; + - Routes each fold to the V1 or V2 row builder via `bundle.metadata. + feature_frame_version` (dispatch already added by PRP-35 Task 13); + - Returns `RMSE` alongside MAE / sMAPE / WAPE / bias; + - Returns per-horizon-bucket metrics (`h_1_7`, `h_8_14`, `h_15_28`, `h_29+`). +- Registry + ops that: + - Persist `feature_frame_version` + `feature_groups` to every new + `model_run.runtime_info`, AND surface them on `RunResponse` / + `RunDetailResponse`; + - Restrict the "comparable run" predicate to `(grain, overlapping + data_window, same feature_frame_version)`; + - Mark a stale-alias reason `feature_frame_version_mismatch` when the + alias's run is V1 but a newer comparable V2 SUCCESS run exists (and + vice versa). +- Explainability that: + - Recognises every new model_type in `_MODEL_FAMILY_MAP`; + - Preserves the additive decomposition for `prophet_like`; + - Preserves simple arithmetic explanations for baselines; + - Exposes `feature_importances_` for `random_forest` (when added) — never + cites it for HGBR. 
+- Artifact hash verification intact (no change to `bundle_hash` flow). +- All five validation gates green. + +## Why + +Today the model zoo is heavily backloaded onto the four feature-aware models; +the three target-only baselines are weak comparators (`naive` = +last-observation, `seasonal_naive` = single-cycle copy, `moving_average` = +flat mean). After PRP-35 unlocks 25+ richer V2 columns, planners need: + +- Stronger baselines (so "extra complexity is justified" actually means + something). +- Per-horizon metrics (a model that wins WAPE on h=1..7 but loses on h=29+ + is a different operational tool than one that's even across the horizon). +- A way to compare same-grain same-window runs across feature_frame_version + without accidentally promoting a V1 alias over a V2 challenger. +- Honest feature-importance plumbing — including the "feature importance is + unavailable for HGBR; use permutation_importance" path PRP-31 / issue + #258 added — so Slice C's UI never invents a number that doesn't exist. + +## What + +### User-visible behaviour + +- `POST /forecasting/train` accepts new `model_type` values: + `weighted_moving_average`, `seasonal_average` (always), and OPTIONALLY + `trend_regression_baseline`, `random_forest`. +- `POST /forecasting/predict` still rejects feature-aware models without `X` + (no change to that contract). +- `POST /backtesting/run` returns: + - The existing aggregate metrics (MAE, sMAPE, WAPE, bias, stability) PLUS + `rmse`. + - A NEW per-fold `horizon_bucket_metrics: dict[str, dict[str, float]]` + block keyed by bucket id (`h_1_7`, `h_8_14`, `h_15_28`, `h_29+`) with + the same metric names inside each bucket. +- `GET /registry/runs/{run_id}` exposes + `feature_frame_version` + `feature_groups` on the response (additive — + optional fields, default to V1 when absent). 
+- `GET /ops/model-health` and the stale-alias view classify a champion + alias as `stale` with `reason=feature_frame_version_mismatch` when a + newer comparable SUCCESS run on a different feature_frame_version exists. +- `GET /explain/runs/{run_id}` works for every NEW baseline (simple + arithmetic explanation) AND for `random_forest` (tree feature + importances). + +### Technical requirements + +- Pydantic v2 strict mode on every new request schema + (`ConfigDict(strict=True)` + `Field(strict=False, ...)` for + date / datetime / UUID / Decimal — see `docs/_base/SECURITY.md` § + "Pydantic v2 strict mode on FastAPI request bodies"). Enforced by the + AST-walker invariant in `app/core/tests/test_strict_mode_policy.py`. +- All new SQL uses SQLAlchemy 2.0 parameter binding. +- All five validation gates pass: `ruff check` + `ruff format --check` + + `mypy --strict` + `pyright --strict` + `pytest -m "not integration"` + + `pytest -m integration`. +- No new Alembic migration (verified by `alembic check`): feature + metadata rides in existing JSONB columns (`model_run.runtime_info`, + `model_run.metrics`). +- No new endpoint paths — existing endpoints gain additive optional fields. +- No managed-cloud SDK introduced. No AutoML. No hyperparameter sweep. + +### Success Criteria + +- [ ] Contract Refresh (Task 1) succeeds: the V2 symbols PRP-35 promised + ALL import cleanly, AND every field name this PRP assumes matches what + PRP-35 actually shipped. +- [ ] `weighted_moving_average` model trains, predicts, persists, loads. +- [ ] `seasonal_average` model trains, predicts, persists, loads. +- [ ] If included: `trend_regression_baseline` trains/predicts/persists/loads. +- [ ] If included: `random_forest` trains/predicts/persists/loads AND + exposes `feature_importances_` through `extract_feature_importance`. 
+- [ ] `BacktestResponse.main_model_results.fold_results[*]` carries a + `horizon_bucket_metrics` block; baseline AND feature-aware backtests + run on identical folds and return mutually-comparable summaries. +- [ ] `BacktestResponse.main_model_results.aggregate_metrics` carries + `rmse` alongside the existing four metrics. +- [ ] Backtesting routes V1 bundles through the V1 builder path and V2 + bundles through the V2 builder path — the dispatch PRP-35 Task 13 + added — and a V2 fold's `X_future` matches the V2 column count from + `bundle.metadata.feature_columns`. +- [ ] V2 leakage spec at the backtesting layer + (`app/features/backtesting/tests/test_feature_aware_backtest_v2.py`, + introduced by PRP-35) stays green; this PRP adds NO weakening edits. +- [ ] `RegistryService._find_duplicate` includes + `feature_frame_version` in its match key (an existing V1 run is NOT a + duplicate of a new V2 run with the same other fields). +- [ ] `RegistryService.create_alias` keeps the "run.status == SUCCESS" + precondition; aliases on V1 runs continue to work. +- [ ] `OpsService` comparable-run selection requires same grain, overlapping + data window, AND same feature_frame_version. +- [ ] A V1 alias whose grain has a newer V2 SUCCESS run reports + `is_stale=true, reason=feature_frame_version_mismatch`. +- [ ] `_MODEL_FAMILY_MAP` covers every new model_type; unknown family + fallback path (existing) untouched. +- [ ] `extract_feature_importance` accepts the new feature-aware class + (when `random_forest` added) and returns a 1-D importance vector of + shape `(len(feature_columns),)`. HGBR remains the only feature-aware + class that raises `FeatureImportanceUnavailableError`. +- [ ] `app/features/explainability` builds simple arithmetic explanations + for every new baseline (the same shape it already builds for `naive`, + `seasonal_naive`, `moving_average`). +- [ ] All five validation gates green. +- [ ] All four load-bearing leakage specs unchanged. 
+- [ ] `uv run alembic check` — no new migration. +- [ ] An `examples/forecasting/model_zoo_compare.py` script runs against + the local seeded DB and prints a per-model metrics table. + +--- + +## All Needed Context + +### Documentation & References + +```yaml +# ─── PRP-35 SURFACE — load first; everything downstream depends on it ──── +- file: PRPs/PRP-35-forecast-intelligence-A-feature-frame-v2.md + why: The V2 contract. This PRP imports `FeatureGroup`, the V2 builders, and the bundle.metadata fields PRP-35 added. + +- file: app/shared/feature_frames/contract_v2.py # CREATED BY PRP-35 + why: Source of FEATURE_FRAME_VERSION_V2, FeatureGroup, DEFAULT_V2_GROUPS, v2_column_manifest, v2_feature_groups_dict, v2_feature_safety_classes. + +- file: app/shared/feature_frames/rows_v2.py # CREATED BY PRP-35 + why: build_historical_feature_rows_v2 / build_future_feature_rows_v2. + +- file: app/features/forecasting/v2_loaders.py # CREATED BY PRP-35 + why: async sidecar loaders for inventory / replenishment / returns / exogenous / promotion / lifecycle. Reused by the model_zoo backtest path; never duplicated. + +# ─── Forecasting model layer ──────────────────────────────────────────── +- file: app/features/forecasting/models.py + why: BaseForecaster (L109 `requires_features` ClassVar, L129 fit, L148 predict). NaiveForecaster L196, SeasonalNaiveForecaster L281, MovingAverageForecaster L384, RegressionForecaster L483 (HistGradientBoostingRegressor), LightGBMForecaster L625 (lazy import L706), XGBoostForecaster L787 (lazy import L870), ProphetLikeForecaster L950 (Ridge pipeline; `decompose()` L1069). `model_factory(config, random_state)` L1138-1227 (if-elif dispatch; lightgbm gate L1178, xgboost gate L1193). New model classes mirror the existing pattern. + +- file: app/features/forecasting/schemas.py + why: ModelConfigBase L23-51 (frozen=True; `config_hash()` L43-50). 
NaiveModelConfig L53, SeasonalNaiveModelConfig L66, MovingAverageModelConfig L87, LightGBMModelConfig L108, XGBoostModelConfig L148, RegressionModelConfig L191, ProphetLikeModelConfig L236. `ModelConfig` discriminated union L268-276 (discriminator=`model_type`). TrainRequest L284. FeatureMetadataResponse L462. ModelFamily enum L422-435. + +- file: app/features/forecasting/feature_metadata.py + why: `_MODEL_FAMILY_MAP` L42-50 — must be extended with every new model_type. `model_family_for(model_type)` L53-69 logs a warning and defaults BASELINE for unknowns (forward-compat, but every NEW model_type added here MUST appear in the map to avoid the warning in CI). `FeatureImportanceUnavailableError` L72-83 — the HGBR-specific 422 path; NEVER weaken. `importance_type_for(model)` L86-108. `extract_feature_importance(model, feature_columns)` L111-228 — sklearn imputer realignment for ProphetLike L169-200 (per memory `simpleimputer-drops-empty-columns`). + +- file: app/features/forecasting/persistence.py + why: ModelBundle dataclass L31-76 (metadata: dict[str, object] — additive; no schema change for any new field). save_model_bundle L78-133 (auto-populates created_at, sklearn/lightgbm/xgboost versions, bundle_hash). load_model_bundle L136-235 (path-traversal guard L157-171; version-mismatch warnings L178-226). + +- file: app/features/forecasting/service.py + why: ForecastingService.train_model L201 — branches on `requires_features` L244 and dispatches to V1 or V2 builder per PRP-35 Task 9. `_assemble_regression_rows` L132-182 (delegates to `build_historical_feature_rows`). `RegressionFeatureMatrix` L109-130. Constant `_MIN_REGRESSION_TRAIN_ROWS = 30` at L99. New target-only models bypass the feature-build branch entirely. + +- file: app/features/forecasting/routes.py + why: POST /forecasting/train handler ~L55-145 — flag-gates LightGBM and XGBoost (`forecast_enable_lightgbm` / `forecast_enable_xgboost`). New baselines do NOT need flag-gates. 
`random_forest` (if added) is an additional pure-sklearn model — no gate. + +# ─── Backtesting layer ────────────────────────────────────────────────── +- file: app/features/backtesting/service.py + why: BacktestingService.run_backtest L213 — validates config L240, loads series data L259, branches on `requires_features` L280, calls `_load_exogenous_frame()` L281. The V1 builder calls live at L493 (build_historical_feature_rows) and L553 (build_future_feature_rows) — PRP-35 Task 13 already added the V1/V2 dispatch around those sites. ExogenousFrame L65-87. `_MIN_FEATURE_AWARE_TRAIN_ROWS = 30` L61. Imports `build_historical_feature_rows`, `build_future_feature_rows` from `app.shared.feature_frames` at L46-50. + +- file: app/features/backtesting/metrics.py + why: MetricsCalculator with `mae` L57, `smape` L90, `wape` L148, `bias` L195, `stability_index` L242, `calculate_all` L294, `aggregate_fold_metrics` L315. `EPSILON = 1e-10` L54. **RMSE does NOT exist today** — added by this PRP. Per-horizon-bucket metrics do NOT exist today — added by this PRP. + +- file: app/features/backtesting/schemas.py + why: BacktestRequest L198-231. BacktestResponse L233-259 (`main_model_results`, `baseline_results`, `comparison_summary`, `leakage_check_passed`). FoldResult L147-165 (`fold_index`, `split: SplitBoundary`, `dates`, `actuals`, `predictions`, `metrics: dict[str, float]`). New per-horizon-bucket field is added to FoldResult and reflected in the aggregate. + +# ─── Registry / Ops ───────────────────────────────────────────────────── +- file: app/features/registry/models.py + why: ModelRun ORM L51-142 (run_id 32-char hex UUID; status RunStatus enum L36-49; `model_config` JSONB; `feature_config` JSONB nullable; `data_window_start/end`; `metrics` JSONB; `runtime_info` JSONB — feature_frame_version + feature_groups ride here). DeploymentAlias ORM L145-168. + +- file: app/features/registry/service.py + why: RegistryService.create_run L183-261. update_run L357-419. 
**_find_duplicate L629-672 — TODAY MATCHES ON (config_hash, store_id, product_id, data_window_start, data_window_end) ONLY.** This PRP extends the match key with feature_frame_version. create_alias / update_alias L421-495 (status == SUCCESS precondition — preserved). list_aliases L534-565. + +- file: app/features/registry/schemas.py + why: RunResponse / RunDetailResponse L118-167 — exposes `model_config_data`, `feature_config`, `config_hash`, `data_window_*`, `metrics`, `artifact_*`, `runtime_info`, `error_message`, timestamps. **TODAY DOES NOT EXPOSE feature_frame_version OR feature_groups** — added by this PRP as additive optional fields. + +- file: app/features/ops/service.py + why: Stale-alias detection `_alias_staleness(run, latest_success_by_grain)` L137-159 — currently stale iff `run.status != SUCCESS OR newer SUCCESS run exists for same (store_id, product_id)`. **TODAY READS ZERO FEATURE METADATA.** This PRP extends the comparable-run selection (L412-427) AND the staleness rule to honour feature_frame_version. Model-health classification `drift_direction ∈ {degrading, improving, stable, unknown}` L464-543 (rank map L534). + +- file: app/features/ops/routes.py + why: GET /ops/model-health and GET /ops/stale-aliases handlers — additive response fields, no path change. + +# ─── Explainability ───────────────────────────────────────────────────── +- file: app/features/explainability/service.py + why: TODAY HANDLES BASELINE ONLY (naive, seasonal_naive, moving_average). `explainer_factory` L205 rejects feature-aware with 400. Baseline explainers produce simple arithmetic explanations (last-value, season mean, moving-avg). New baselines MUST get explainers in the same shape. + +- file: app/features/explainability/explainers.py + why: Individual baseline explainer classes — pattern for new ones. Drives `ForecastExplanation` shape from `schemas.py`. + +- file: app/features/explainability/reason_codes.py + why: Retail signal warnings (correlation, not causation). 
Untouched by this PRP — preserved verbatim. + +# ─── Configuration ────────────────────────────────────────────────────── +- file: app/core/config.py + why: forecast_random_seed L97 (=42); forecast_default_horizon L98 (=14); forecast_max_horizon L99 (=90); forecast_model_artifacts_dir L100; forecast_enable_lightgbm L101 (=False); forecast_enable_xgboost L102 (=False). No new keys needed; `forecast_enable_random_forest` is OPTIONAL — only add it if `random_forest` ships in this PRP. Per the rule, it defaults False. + +- file: pyproject.toml + why: `[project.optional-dependencies]` L34-50 — `ml-lightgbm = ["lightgbm>=4.5.0"]` L47, `ml-xgboost = ["xgboost>=2.1.0"]` L50. NO new extra needed for `random_forest` (uses sklearn, already a core dep). NO new extra for the new baselines (pure numpy / stdlib). + +# ─── Rules ────────────────────────────────────────────────────────────── +- file: docs/_base/RULES.md + why: Never weaken leakage specs; never edit a merged migration; never widen agent mutation surface; never `git push --force`. None violated by this PRP. + +- file: .claude/rules/test-requirements.md + why: Every new model class + new metric + new schema field ships with a unit test; every new endpoint behavior ships with a route test; every bug fix ships a regression test. + +- file: .claude/rules/commit-format.md + why: Commit scope must match the dominant touched area. This PRP touches forecast / backtest / registry / ops / explainability — use a comma-pair scope: `feat(forecast,backtest): …` for the model + metrics work, `feat(registry,ops): …` for the comparability work, `feat(forecast,api): …` if the response shape changes hit the API surface. Each commit MUST reference the tracking issue. 
+ +# ─── Library / API references (load on demand) ────────────────────────── +- url: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html + section: "Parameters" + "Attributes" + critical: `n_estimators` default 100; `random_state` and `n_jobs=1` for deterministic fits; `feature_importances_` is the 1-D Gini importance vector (verified — shape `(n_features,)`). + +- url: https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html + section: "Notes" + critical: The documented replacement for "tree models without feature_importances_". HGBR explainability uses this (or — if too slow — punts to the existing FeatureImportanceUnavailableError). DO NOT add permutation_importance behind /explain in this PRP; the existing 422 path is the contract until a separate PRP funds the compute budget. + +- url: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html + section: "Notes" + critical: Existing splitter is already gap-aware (see `app/features/backtesting/splitter.py`). No change to the splitter; only the per-fold metric output. + +- url: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html + section: "Parameters" + "Attributes" + critical: `deterministic=True` + `n_jobs=1` + `seed=random_state` for bit-reproducible fits. Library is OPT-IN (`pyproject.toml` extra); see "Known Gotchas" for the find_spec guard. + +- url: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor + section: "Parameters" + critical: `tree_method="hist"` (deterministic) + `n_jobs=1` + `random_state=random_state` + `verbosity=0`. Library is OPT-IN. + +- url: https://facebook.github.io/prophet/docs/seasonality%2C_holiday_effects%2C_and_regressors.html + section: "Additional regressors" + critical: Vocabulary inspiration only — the in-repo ProphetLikeForecaster is a Ridge additive pipeline, NOT real Prophet. 
`decompose()` returns the trend / seasonality / regressor components from the Ridge coefficients (`app/features/forecasting/models.py:1069`). + +- url: https://unit8co.github.io/darts/userguide/covariates.html + section: "Past vs Future Covariates" + critical: Useful framing for the per-horizon-bucket metric labels in Slice C. Not loaded at runtime here. + +- url: https://nixtlaverse.nixtla.io/statsforecast/src/core/models.html + section: "WeightedAverage" + "SeasonalAverage" + critical: Vocabulary alignment — `weighted_moving_average` and `seasonal_average` are not novel; pin the existing nomenclature in docstrings. + +# ─── Memory anchors (load on conflict) ────────────────────────────────── +- memory: histgbr-no-feature-importances + why: HGBR has no `feature_importances_` — verified at runtime in this PRP's "Known Gotchas". The existing FeatureImportanceUnavailableError path stays. + +- memory: simpleimputer-drops-empty-columns + why: ProphetLikeForecaster handles this in `extract_feature_importance` (L169-200 in feature_metadata.py). Any new pipeline that uses SimpleImputer MUST pass `keep_empty_features=True` OR replicate the imputer-statistics realignment. + +- memory: computed-field-cross-slice-cycle + why: `RunResponse.model_family` is a Pydantic computed_field whose return type lives in `forecasting`. The lazy in-method import pattern stays; new RunResponse fields MUST NOT introduce a similar cycle. + +- memory: scenario-run-id-vs-registry-run-id + why: Scenarios `/scenarios/simulate` uses the forecast-artifact `run_id` (model_{id}.joblib), NOT the registry `model_run.run_id`. Stays load-bearing for ops/comparable-run logic — do not conflate. + +- memory: data-platform-shared-orm-layer + why: CodeRabbit flags cross-slice imports of `data_platform.models`. This PRP keeps the existing pattern; it does NOT refactor. 
+``` + +### Current Codebase tree (relevant after PRP-35 merges) + +``` +app/ +├── shared/ +│ └── feature_frames/ +│ ├── __init__.py # V1 + V2 surface (PRP-35) +│ ├── contract.py # V1 (frozen) +│ ├── contract_v2.py # V2 (PRP-35) +│ ├── rows.py # V1 (frozen) +│ ├── rows_v2.py # V2 (PRP-35) +│ ├── sidecar.py # V2 (PRP-35) +│ └── tests/ +│ ├── test_contract.py +│ ├── test_contract_v2.py +│ ├── test_leakage.py # load-bearing +│ └── test_leakage_v2.py # load-bearing (PRP-35) +├── features/ +│ ├── forecasting/ +│ │ ├── models.py # BaseForecaster, 7 forecasters, model_factory +│ │ ├── schemas.py # ModelConfig union; TrainRequest +│ │ ├── persistence.py # ModelBundle.metadata dict[str, object] +│ │ ├── service.py # train_model + V1/V2 dispatch (PRP-35) +│ │ ├── feature_metadata.py # _MODEL_FAMILY_MAP + extract_feature_importance +│ │ ├── v2_loaders.py # PRP-35 — reused here +│ │ └── routes.py +│ ├── backtesting/ +│ │ ├── service.py # fold loop + V1/V2 dispatch (PRP-35 Task 13) +│ │ ├── metrics.py # MetricsCalculator (mae/smape/wape/bias/stability) +│ │ ├── schemas.py # FoldResult + BacktestResponse +│ │ └── splitter.py # TimeSeriesSplit-style +│ ├── registry/ +│ │ ├── models.py # ModelRun + DeploymentAlias +│ │ ├── schemas.py # RunResponse / RunDetailResponse +│ │ ├── service.py # _find_duplicate + create_alias +│ │ └── routes.py +│ ├── ops/ +│ │ ├── service.py # stale-alias + model-health +│ │ ├── schemas.py +│ │ └── routes.py +│ └── explainability/ +│ ├── service.py # baselines only today +│ ├── explainers.py +│ └── reason_codes.py +└── core/ + └── config.py +``` + +### Desired Codebase tree (new + modified files) + +``` +app/ +├── features/ +│ ├── forecasting/ +│ │ ├── models.py # MODIFIED — add WeightedMovingAverageForecaster, SeasonalAverageForecaster, [optional] TrendRegressionBaselineForecaster, [optional] RandomForestForecaster + factory dispatch +│ │ ├── schemas.py # MODIFIED — add WeightedMovingAverageModelConfig, SeasonalAverageModelConfig, [optional] 
TrendRegressionBaselineModelConfig, [optional] RandomForestModelConfig + extend ModelConfig union +│ │ ├── feature_metadata.py # MODIFIED — extend _MODEL_FAMILY_MAP with new model_types; extend extract_feature_importance to recognise RandomForestForecaster +│ │ ├── service.py # MODIFIED — train_model branch for new target-only models (no feature build); persist feature_frame_version + feature_groups when V2 (additive over PRP-35) +│ │ └── tests/ +│ │ ├── test_weighted_moving_average_forecaster.py # NEW +│ │ ├── test_seasonal_average_forecaster.py # NEW +│ │ ├── test_trend_regression_baseline_forecaster.py # NEW (optional) +│ │ ├── test_random_forest_forecaster.py # NEW (optional) +│ │ ├── test_feature_metadata.py # MODIFIED — assert new model_types map to families; assert random_forest exposes feature_importances_ +│ │ └── test_models.py # MODIFIED — factory dispatch table covers new types +│ ├── backtesting/ +│ │ ├── metrics.py # MODIFIED — add MetricsCalculator.rmse + bucket_metrics helper +│ │ ├── service.py # MODIFIED — emit per-fold horizon_bucket_metrics + per-bucket aggregates +│ │ ├── schemas.py # MODIFIED — FoldResult gains horizon_bucket_metrics; aggregate gains rmse + bucketed dict +│ │ └── tests/ +│ │ ├── test_metrics.py # MODIFIED — rmse + bucket helper unit tests +│ │ ├── test_service.py # MODIFIED — assert bucketed payload shape +│ │ └── test_feature_aware_backtest_v2.py # PRP-35 — unchanged; new tests do NOT weaken +│ ├── registry/ +│ │ ├── service.py # MODIFIED — _find_duplicate includes feature_frame_version; comparable_runs predicate (new helper) +│ │ ├── schemas.py # MODIFIED — RunResponse / RunDetailResponse expose feature_frame_version + feature_groups (Optional) +│ │ └── tests/ +│ │ ├── test_service.py # MODIFIED — V1-vs-V2 not a duplicate; comparable_runs helper tests +│ │ └── test_schemas.py # MODIFIED — new fields round-trip +│ ├── ops/ +│ │ ├── service.py # MODIFIED — comparable-run selection by (grain, overlap window, same V); add 
stale-reason `feature_frame_version_mismatch` +│ │ ├── schemas.py # MODIFIED — stale-reason enum extended; comparable-run metadata exposed +│ │ └── tests/ +│ │ ├── test_service.py # MODIFIED — assert stale-reason mismatch path; assert V1 alias not compared to V2 newer run as "degrading" +│ │ └── test_routes_integration.py # MODIFIED — happy path + mismatch path +│ └── explainability/ +│ ├── service.py # MODIFIED — register new baseline explainers; route `random_forest` to existing tree feature-importance path through extract_feature_importance +│ ├── explainers.py # MODIFIED — WeightedMovingAverageExplainer + SeasonalAverageExplainer + (optional) TrendRegressionBaselineExplainer +│ └── tests/ +│ ├── test_explainers.py # MODIFIED — new explainer classes +│ └── test_service.py # MODIFIED — service routes new model_types correctly +└── examples/ + └── forecasting/ + └── model_zoo_compare.py # NEW — small local sweep + per-model metrics + registry candidate summary +``` + +### Known Gotchas of our codebase & Library Quirks + +```python +# ───────────────────────────────────────────────────────────────────────── +# CRITICAL: PRP-35 prerequisite — Task 1 (Contract Refresh) is the gate. +# ───────────────────────────────────────────────────────────────────────── +# +# If `from app.shared.feature_frames import (FEATURE_FRAME_VERSION_V2, +# FeatureGroup, build_historical_feature_rows_v2)` fails — STOP. PRP-35 has +# not landed. Do not execute any later task. +# +# If those imports succeed but PRP-35 shipped a different field name in +# bundle.metadata (e.g. `feature_safety` instead of `feature_safety_classes`), +# Task 1 PATCHES the names cited in this PRP before any code is written. +# +# ───────────────────────────────────────────────────────────────────────── +# Library verifications (executed at PRP-create time on the live env — +# sklearn 1.8.0, numpy 2.4.1, pandas 3.0.3). Re-verify after any library +# bump. 
Verification commands: +# ───────────────────────────────────────────────────────────────────────── + +# VERIFIED: HistGradientBoostingRegressor has NO `feature_importances_` +# uv run python -c " +# from sklearn.ensemble import HistGradientBoostingRegressor +# m = HistGradientBoostingRegressor() +# m.fit([[1.0],[2.0],[3.0]], [1.0,2.0,3.0]) +# print('HAS_attr:', hasattr(m, 'feature_importances_')) +# " +# Output: HAS_attr: False +# IMPLICATION: `extract_feature_importance` MUST continue to raise +# FeatureImportanceUnavailableError for RegressionForecaster. This PRP +# does NOT relitigate that contract. +# +# VERIFIED: RandomForestRegressor has `feature_importances_` as a 1-D vector +# uv run python -c " +# from sklearn.ensemble import RandomForestRegressor +# m = RandomForestRegressor(n_estimators=3, random_state=42, n_jobs=1) +# m.fit([[1.0,2.0],[2.0,1.0],[3.0,3.0],[4.0,2.0]], [1.0,2.0,3.0,4.0]) +# print('HAS_attr:', hasattr(m, 'feature_importances_'), +# 'NDIM:', m.feature_importances_.ndim, +# 'SHAPE:', m.feature_importances_.shape) +# " +# Output: HAS_attr: True NDIM: 1 SHAPE: (2,) +# IMPLICATION: RandomForestForecaster (optional) reuses the existing +# tree-importance branch in extract_feature_importance (L147-164) — +# just add `RandomForestForecaster` to the isinstance check. +# +# VERIFIED: RandomForestRegressor deterministic with random_state=42, n_jobs=1 +# uv run python -c " +# import numpy as np +# from sklearn.ensemble import RandomForestRegressor +# X = np.array([[i, i%7] for i in range(60)], dtype=float) +# y = np.array([float(i) for i in range(60)]) +# a = RandomForestRegressor(n_estimators=5, random_state=42, n_jobs=1).fit(X, y).predict([[60.0, 4.0]]) +# b = RandomForestRegressor(n_estimators=5, random_state=42, n_jobs=1).fit(X, y).predict([[60.0, 4.0]]) +# print('EQ:', np.array_equal(a, b)) +# " +# Output: EQ: True +# IMPLICATION: random_state + n_jobs=1 is the deterministic recipe. 
Use +# it in RandomForestForecaster.__init__; never set n_jobs > 1. +# +# VERIFIED: np.average(vals, weights=...) supports both linear + exponential +# uv run python -c " +# import numpy as np +# vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) +# weights_linear = np.arange(1, len(vals)+1) +# print('LIN_WMA:', np.average(vals, weights=weights_linear)) +# weights_exp = np.power(0.5, np.arange(len(vals)-1, -1, -1)) +# print('EXP_WMA:', np.average(vals, weights=weights_exp)) +# " +# Output: LIN_WMA: 3.666... EXP_WMA: 4.161... +# IMPLICATION: WeightedMovingAverageForecaster uses np.average with either +# a "linear" or "exponential" weights strategy. Coverage: both branches +# in unit tests. +# +# VERIFIED: Ridge deterministic by construction (closed-form solver) +# uv run python -c " +# import numpy as np +# from sklearn.linear_model import Ridge +# X = np.array([[i, i%7] for i in range(60)], dtype=float) +# y = np.array([float(i) for i in range(60)]) +# a = Ridge(random_state=42).fit(X, y).coef_ +# b = Ridge(random_state=42).fit(X, y).coef_ +# print('EQ:', np.array_equal(a, b)) +# " +# Output: EQ: True +# IMPLICATION: TrendRegressionBaselineForecaster (optional, Ridge-based) +# does not need n_jobs=1 to be deterministic. + +# VERIFIED: lightgbm + xgboost are NOT installed in the default venv +# uv run python -c " +# import importlib.util +# print('lightgbm:', importlib.util.find_spec('lightgbm') is not None, +# 'xgboost:', importlib.util.find_spec('xgboost') is not None) +# " +# Output: lightgbm: False xgboost: False +# IMPLICATION: This PRP does NOT add new lightgbm/xgboost code paths that +# require the libraries to be importable at module load time. ALL new +# model configurations for the existing lightgbm/xgboost forecasters +# adjust DEFAULTS in `LightGBMModelConfig` / `XGBoostModelConfig`. They +# stay behind the existing `forecast_enable_*` flags and the existing +# lazy-import-in-fit pattern (`models.py` L706 / L870). 
Unit tests for +# the config tightening do NOT require the libraries; integration tests +# that fit a real model MUST `pytest.mark.skipif(not importlib.util. +# find_spec("lightgbm"), reason="lightgbm extra not installed")`. + +# ───────────────────────────────────────────────────────────────────────── +# Repo-specific failure modes (anchored in memory + prior PRPs): +# ───────────────────────────────────────────────────────────────────────── + +# - model_run.metrics is JSONB; nested dicts round-trip fine. BUT date / +# datetime values DO NOT — store dates as ISO strings (the existing +# pattern is `bundle.metadata["created_at"] = datetime.now(UTC).isoformat()`). +# - `RegistryService._find_duplicate` is called from RegistryService.create_run +# BEFORE the run is persisted; adding feature_frame_version to its +# match key needs the V flag passed in via RunCreate.runtime_info — the +# forecasting service already populates runtime_info from +# `extra_metadata` (PRP-35 Task 9). Confirm during Task 1. +# - `RunResponse.model_family` is a Pydantic computed_field whose return +# type lives in forecasting. Adding `feature_frame_version` and +# `feature_groups` to RunResponse MUST NOT introduce a similar cross- +# slice cycle. Both new fields are plain Python types (int / dict[str, +# list[str]]) so no import is needed (memory `computed-field-cross-slice- +# cycle`). +# - Per-horizon-bucket metric names are stable string keys; do NOT keep +# them as enums in the response JSON (TypeScript on the Slice C side +# would have to map them). Bucket ids: "h_1_7", "h_8_14", "h_15_28", +# "h_29_plus". +# - When OPTIONAL libraries are missing, route handlers MUST surface a +# 422 RFC 7807 with `detail="lightgbm extra not installed; install with +# uv sync --extra ml-lightgbm and set forecast_enable_lightgbm=true"` — +# never a 500. The existing flag-gate check in forecasting/routes.py is +# the pattern. 
+# - `app/shared/feature_frames/**` remains leaf-level — backtesting and +# forecasting service may import from it; it MAY NOT import from any +# features slice (the AST-walk invariant catches this). +# - `make demo` / `scripts/run_demo.py` use the demo seeder and the existing +# model types. Confirm none of the new model types break the demo path +# — they shouldn't (demo trains naive / seasonal_naive / moving_average +# per `app/features/demo/pipeline.py`). DO NOT change the demo to use new +# models; that's Slice C territory. +# - Per-horizon-bucket aggregates MUST skip empty buckets (h_29+ on a 14-day +# forecast is empty); the aggregate returns NaN for empty bucket values +# AND drops empty buckets from the per-fold dict. Mirror the existing +# sMAPE / WAPE empty-array handling at metrics.py L78. +# - `bundle_hash` is computed from the model class + config dict; tightening +# the DEFAULTS on existing configs changes the hash for newly-trained +# models. Old bundles (with the old defaults) MUST still load + predict +# identically. The existing schema-version field at ModelConfigBase L37-41 +# IS the canary: bumping it triggers re-train; this PRP does NOT bump it. 
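# ─────────────────────────────────────────────────────────────────────────
# ILLUSTRATIVE SKETCH (not repo code): the bucket-id and empty-bucket
# gotchas above, worked through on a 14-day horizon. Assumptions: the
# names `HORIZON_BUCKETS_SKETCH` and `bucket_metrics_sketch` are
# hypothetical and exist only in this note; only MAE and RMSE are computed
# here with plain numpy, whereas the planned helper is expected to delegate
# to MetricsCalculator.calculate_all. Treat this as a reading aid, not as
# the implementation.
# ─────────────────────────────────────────────────────────────────────────
import numpy as np

# (bucket id, first horizon day, last horizon day or None for unbounded)
HORIZON_BUCKETS_SKETCH = (
    ("h_1_7", 1, 7),
    ("h_8_14", 8, 14),
    ("h_15_28", 15, 28),
    ("h_29_plus", 29, None),
)


def bucket_metrics_sketch(actuals, predictions, horizon_offsets):
    """Return {bucket_id: {"mae": ..., "rmse": ...}}, dropping empty buckets."""
    a = np.asarray(actuals, dtype=float)
    p = np.asarray(predictions, dtype=float)
    h = np.asarray(horizon_offsets, dtype=int)
    out = {}
    for bucket_id, start, end in HORIZON_BUCKETS_SKETCH:
        # Inclusive [start, end]; end=None means "start and beyond".
        mask = h >= start if end is None else (h >= start) & (h <= end)
        if not mask.any():
            continue  # empty bucket: no key is emitted, so no NaN leaks into JSON
        err = a[mask] - p[mask]
        out[bucket_id] = {
            "mae": float(np.mean(np.abs(err))),
            "rmse": float(np.sqrt(np.mean(err**2))),
        }
    return out


# Usage (14-day horizon, so "h_15_28" and "h_29_plus" are empty and dropped):
#   bucket_metrics_sketch([10.0] * 14, [10.0] * 7 + [12.0] * 7, list(range(1, 15)))
#   returns entries for "h_1_7" and "h_8_14" only, keyed by plain strings.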
+``` + +--- + +## Implementation Blueprint + +### Data models and structure + +```python +# ─── app/features/forecasting/schemas.py — new ModelConfigs (additive) ─── + +class WeightedMovingAverageModelConfig(ModelConfigBase): + """Target-only baseline: weighted average of last N observations.""" + model_type: Literal["weighted_moving_average"] = "weighted_moving_average" + window_size: int = Field(default=7, ge=2, le=90) + weight_strategy: Literal["linear", "exponential"] = "linear" + # 'linear' → weights = np.arange(1, window_size+1) + # 'exponential' → weights = np.power(decay, np.arange(window_size-1, -1, -1)) + decay: float = Field(default=0.7, gt=0.0, lt=1.0) + + +class SeasonalAverageModelConfig(ModelConfigBase): + """Target-only baseline: average of prior matching seasonal positions.""" + model_type: Literal["seasonal_average"] = "seasonal_average" + season_length: int = Field(default=7, ge=2, le=365) + lookback_cycles: int = Field(default=4, ge=2, le=12) + trim_outliers: bool = False # if True, drop top + bottom value before mean + + +class TrendRegressionBaselineModelConfig(ModelConfigBase): # OPTIONAL + """Target-only Ridge baseline: elapsed-time + simple calendar features.""" + model_type: Literal["trend_regression_baseline"] = "trend_regression_baseline" + alpha: float = Field(default=1.0, ge=0.0, le=1000.0) + include_dow: bool = True + include_month: bool = True + + +class RandomForestModelConfig(ModelConfigBase): # OPTIONAL + """Feature-aware sklearn RandomForest with feature_importances_.""" + model_type: Literal["random_forest"] = "random_forest" + n_estimators: int = Field(default=100, ge=10, le=500) + max_depth: int | None = Field(default=10, ge=2, le=64) + min_samples_leaf: int = Field(default=2, ge=1, le=50) + feature_config_hash: str | None = None # matches existing tree-config pattern + + +# ─── Extend the discriminated union (app/features/forecasting/schemas.py:268) ─ +ModelConfig = Annotated[ + NaiveModelConfig + | SeasonalNaiveModelConfig + 
| MovingAverageModelConfig + | WeightedMovingAverageModelConfig # NEW + | SeasonalAverageModelConfig # NEW + | TrendRegressionBaselineModelConfig # NEW (optional) + | RandomForestModelConfig # NEW (optional) + | LightGBMModelConfig + | XGBoostModelConfig + | RegressionModelConfig + | ProphetLikeModelConfig, + Field(discriminator="model_type"), +] + + +# ─── app/features/backtesting/schemas.py — new fields (additive) ───────── + +# FoldResult adds: +horizon_bucket_metrics: dict[str, dict[str, float]] = Field( + default_factory=dict, + description="Per-bucket metrics keyed by bucket id ('h_1_7','h_8_14'," + "'h_15_28','h_29_plus'). Empty bucket entries are dropped.", +) + +# main_model_results.aggregate_metrics gains a 'rmse' key and a new +# 'bucketed_aggregate_metrics: dict[str, dict[str, float]]' top-level dict +# whose keys are the same bucket ids. + + +# ─── app/features/registry/schemas.py — new fields (additive, Optional) ── + +# Both RunResponse and RunDetailResponse gain: +feature_frame_version: int | None = Field( + default=None, + description="Feature frame version recorded by the training run " + "(read from runtime_info; None when the run pre-dates PRP-35).", +) +feature_groups: dict[str, list[str]] | None = Field( + default=None, + description="Per-group canonical column manifest at training time " + "(None for V1 and pre-PRP-35 runs).", +) + + +# ─── app/features/ops/schemas.py — extend stale-reason enum ────────────── + +class StaleReason(str, Enum): + NEWER_SUCCESS_RUN = "newer_success_run" # existing + ARTIFACT_NOT_VERIFIED = "artifact_not_verified" # existing + RUN_NOT_SUCCESS = "run_not_success" # existing + FEATURE_FRAME_VERSION_MISMATCH = "feature_frame_version_mismatch" # NEW +``` + +### List of tasks to be completed (dependency-ordered) + +```yaml +Task 1 — CONTRACT REFRESH (gates every other task): + - VERIFY PRP-35 is merged. 
Run: + uv run python -c "from app.shared.feature_frames import FEATURE_FRAME_VERSION_V2, FeatureGroup, build_historical_feature_rows_v2, build_future_feature_rows_v2, v2_feature_groups_dict, v2_feature_safety_classes; print('PRP-35 surface OK')" + If ImportError: STOP. PRP-35 has not landed. Do not write any code. + - RE-READ PRPs/PRP-35-forecast-intelligence-A-feature-frame-v2.md § "Data models and structure" and § "Integration Points" — capture the FINAL bundle.metadata schema. + - DIFF the metadata field names this PRP cites against what PRP-35 shipped. The cited names are: + bundle.metadata["feature_frame_version"] -> int + bundle.metadata["feature_columns"] -> list[str] + bundle.metadata["feature_groups"] -> dict[str, list[str]] + bundle.metadata["feature_safety_classes"] -> dict[str, str] + bundle.metadata["feature_pinned_constants"] -> dict[str, list[int]] + - PATCH any drift between this PRP's assumed names and the merged contract by updating THIS PRP file in a `chore(docs): refresh PRP-36 against PRP-35 final contract (#)` commit BEFORE Task 2 starts. + - CONFIRM bundle.metadata["feature_frame_version"] defaults to 1 when absent (the load-side back-compat). + - VERIFY TrainRequest.feature_frame_version + TrainRequest.feature_groups exist in app/features/forecasting/schemas.py with the V1=default + V2-validator semantics PRP-35 locked. + - VERIFY backtesting/service.py dispatches at lines 493 / 553 between V1 and V2 builders via bundle.metadata.get("feature_frame_version", 1) — the PRP-35 Task 13 work. + - LOG the captured contract snapshot into PRPs/ai_docs/prp-35-final-contract-snapshot.md (one-off; gives Slice C a stable reference). + - DO NOT proceed to Task 2 if any drift is unresolved. 
+ +Task 2 — CREATE app/features/forecasting/tests/test_weighted_moving_average_forecaster.py + IMPLEMENT WeightedMovingAverageForecaster: + - TEST FIRST: write the unit-test file with: fit-raises-on-empty, fit-then-predict-shape, deterministic-with-seed, linear-weights-match-np.average, exponential-weights-match-np.average, window_size-larger-than-history-raises, persistence-round-trip. + - IMPLEMENT class WeightedMovingAverageForecaster(BaseForecaster) in app/features/forecasting/models.py — mirror MovingAverageForecaster (L384): + requires_features: ClassVar[bool] = False + fit(y, X=None): stores last `window_size` observations; raises ValueError if len(y) < window_size. + predict(horizon, X=None): np.average(self._tail, weights=self._weights) → np.full(horizon, mean_value) + - ADD WeightedMovingAverageModelConfig in app/features/forecasting/schemas.py (per data model above). + - EXTEND ModelConfig union at L268-276. + - EXTEND _MODEL_FAMILY_MAP in app/features/forecasting/feature_metadata.py — map "weighted_moving_average" → ModelFamily.BASELINE. + - WIRE INTO model_factory at L1138-1227 — add an elif branch that returns WeightedMovingAverageForecaster(window_size=config.window_size, weight_strategy=config.weight_strategy, decay=config.decay, random_state=random_state). + - GATE: NO new flag in settings; this baseline is always on. + +Task 3 — CREATE app/features/forecasting/tests/test_seasonal_average_forecaster.py + IMPLEMENT SeasonalAverageForecaster: + - TEST FIRST: fit-then-predict-shape; same-DOW averaging actually picks matching positions; lookback_cycles smaller than history works; trim_outliers drops the top + bottom value when True; deterministic; persistence round-trip. + - IMPLEMENT class SeasonalAverageForecaster(BaseForecaster) mirroring SeasonalNaiveForecaster (L281): + requires_features: ClassVar[bool] = False + fit(y, X=None): stores last (lookback_cycles * season_length) observations. 
+ predict(horizon, X=None): for each horizon day j, compute mean (or trimmed mean if trim_outliers) of y values at offsets {j - k*season_length} for k in [1..lookback_cycles] that lie within stored history. + - ADD SeasonalAverageModelConfig; extend union; extend _MODEL_FAMILY_MAP → BASELINE; wire into model_factory. + +Task 4 — OPTIONAL: CREATE TrendRegressionBaselineForecaster (decide YES/NO in the planning review; if NO, drop from this PRP): + - TEST FIRST: deterministic seed; intercept + slope coefficients match np.polyfit on a perfect-line series; calendar features add expected columns when toggled. + - IMPLEMENT class TrendRegressionBaselineForecaster(BaseForecaster) using sklearn.linear_model.Ridge inside a pure-numpy feature builder (elapsed-day + optional dow_one_hot + optional month_one_hot). DOES NOT use V1 or V2 row builders — its feature set is purely calendar-derived. + requires_features: ClassVar[bool] = False + - ADD TrendRegressionBaselineModelConfig; extend union; extend _MODEL_FAMILY_MAP → ADDITIVE; wire into model_factory. + +Task 5 — OPTIONAL: CREATE RandomForestForecaster (decide YES/NO; gate with `forecast_enable_random_forest: bool = False` IF YES): + - TEST FIRST: requires_features=True; fit with shape-matched X; deterministic with random_state=42 + n_jobs=1; feature_importances_ shape == (len(feature_columns),). + - IMPLEMENT class RandomForestForecaster(BaseForecaster) wrapping sklearn.ensemble.RandomForestRegressor: + requires_features: ClassVar[bool] = True + __init__: stores n_estimators, max_depth, min_samples_leaf, random_state. n_jobs=1 (REQUIRED for determinism — verified). + fit(y, X): self._estimator = RandomForestRegressor(...).fit(X, y); save self._feature_columns = X.columns if pandas else None. + predict(horizon, X): X is the future feature matrix (built by forecasting service via V1 or V2 builders — same dispatch as RegressionForecaster); return self._estimator.predict(X). 
+ - ADD RandomForestModelConfig; extend union; extend _MODEL_FAMILY_MAP → TREE. + - EXTEND extract_feature_importance L147 — add `RandomForestForecaster` to the isinstance tuple; nothing else changes (the existing tree-importance branch already reads `.feature_importances_`). + - ADD `forecast_enable_random_forest: bool = False` to app/core/config.py + `.env.example`. + - GATE in model_factory: `if not settings.forecast_enable_random_forest: raise ValueError("random_forest is opt-in; set forecast_enable_random_forest=true")`. + +Task 6 — TIGHTEN existing feature-aware config defaults (conservative + documented): + - app/features/forecasting/schemas.py: + RegressionModelConfig — defaults unchanged unless the implementer can justify a strictly-better conservative default via backtest evidence; documented in commit message. Otherwise: NO CHANGE. + LightGBMModelConfig — confirm defaults match the determinism recipe (deterministic=True is a runtime arg; n_jobs=1 is a runtime arg). EXPOSE: n_jobs (default 1, max 1 — fixed), deterministic (default True). + XGBoostModelConfig — confirm tree_method="hist", n_jobs=1, verbosity=0 are wired into the forecaster instantiation. EXPOSE: n_jobs (default 1, max 1). + ProphetLikeModelConfig — confirm Ridge alpha range. NO CHANGE without justification. + - DOCUMENT every config tightening with a one-line comment in the schema docstring AND in CHANGELOG under "Unreleased". + - CRITICAL: do NOT change any default that would break bundle_hash for already-trained models. The existing `schema_version` field (ModelConfigBase L37-41) is the canary; bump it ONLY if a backward-incompatible default change is unavoidable. Default position: no bump. 
+ +Task 7 — EXTEND backtesting metrics with RMSE + per-horizon-bucket helper: + - app/features/backtesting/metrics.py: + @staticmethod + def rmse(actuals, predictions) -> MetricResult: + # mirrors mae() shape; formula: sqrt(mean((A - F) ** 2)) + - ADD module-level constant HORIZON_BUCKETS: tuple[tuple[str, int, int | None], ...] = ( + ("h_1_7", 1, 7), + ("h_8_14", 8, 14), + ("h_15_28", 15, 28), + ("h_29_plus", 29, None), + ) + - ADD function compute_bucket_metrics(actuals, predictions, horizon_offsets: list[int]) -> dict[str, dict[str, float]]: + For each bucket, slice the (actuals, predictions) pair by horizon_offsets in [start, end] inclusive (end=None → unbounded). Skip a bucket if its slice is empty. Call calculate_all on each non-empty slice. Return dict keyed by bucket id. + - EXTEND MetricsCalculator.calculate_all to include rmse alongside mae/smape/wape/bias. + - DO NOT change aggregate_fold_metrics signature; ADD a sibling aggregate_bucket_metrics(fold_bucket_metrics: list[dict[str, dict[str, float]]]) -> dict[str, dict[str, float]] that returns per-bucket means across folds, skipping NaN. + +Task 8 — WIRE backtesting service to emit per-fold horizon_bucket_metrics: + - app/features/backtesting/service.py: + - For each fold, compute `horizon_offsets = [(test_dates[i] - test_dates[0]).days + 1 for i in range(len(test_dates))]` (test_dates[0] is horizon day 1). + - After computing the existing per-fold metrics, call compute_bucket_metrics(actuals, predictions, horizon_offsets) and attach to FoldResult.horizon_bucket_metrics. + - After the fold loop, compute aggregate_bucket_metrics across all fold_bucket_metric dicts → main_model_results.bucketed_aggregate_metrics. + - Mirror for baseline_results when baselines are run alongside. + - PRESERVE the V1/V2 dispatch PRP-35 Task 13 added — no change to it. + - PRESERVE leakage_check_passed flow. + - LOG: per-fold metric log lines now include feature_frame_version (already added by PRP-35) AND the bucket count. 
+ +Task 9 — EXTEND backtesting schemas: + - FoldResult: add horizon_bucket_metrics: dict[str, dict[str, float]] = Field(default_factory=dict, ...). + - main_model_results.aggregate_metrics: include rmse (additive — no breaking change). + - main_model_results: add bucketed_aggregate_metrics: dict[str, dict[str, float]] | None = None. + - Mirror for baseline_results. + - PRESERVE: ConfigDict(strict=True); plain numeric/string fields — no strict=False overrides needed (no date/UUID/Decimal involved). + +Task 10 — MODIFY app/features/registry/service.py — _find_duplicate AND comparable_runs: + - FIND _find_duplicate at L629-672. + - ADD a `feature_frame_version` parameter to its match key (read from RunCreate.runtime_info["feature_frame_version"] when present; default 1 when absent — back-compat). + - ADD a sibling helper async def find_comparable_runs(self, db, *, store_id: int, product_id: int, model_type: str | None, feature_frame_version: int, data_window_start: date, data_window_end: date, limit: int = 20) -> list[ModelRun]: + Returns: SUCCESS runs for the same (store_id, product_id) where the data window overlaps AND feature_frame_version matches; ordered by created_at desc; limit applied. + - DO NOT change create_alias / update_alias precondition (status == SUCCESS). + - PRESERVE artifact_hash verification flow. + +Task 11 — MODIFY app/features/registry/schemas.py: + - RunResponse: add feature_frame_version: int | None = None, feature_groups: dict[str, list[str]] | None = None — both Optional, both read from `runtime_info` JSONB via a Pydantic validator (the existing model_family computed_field is the precedent). + - RunDetailResponse: same additive fields. + - DO NOT introduce a cross-slice import for these fields (the field types are plain Python — no risk of the computed-field cycle from memory `computed-field-cross-slice-cycle`). 
+ +Task 12 — MODIFY app/features/ops/service.py — comparable-run + stale-reason mismatch path: + - FIND comparable-run selection L412-427. + - REPLACE the selection with `await registry_service.find_comparable_runs(db, store_id=..., product_id=..., model_type=..., feature_frame_version=..., data_window_start=..., data_window_end=...)`. + - FIND _alias_staleness at L137-159. + - ADD a new staleness branch: when an alias's run has feature_frame_version=V_a AND a newer comparable SUCCESS run has feature_frame_version=V_b WHERE V_a != V_b → is_stale=True, reason=StaleReason.FEATURE_FRAME_VERSION_MISMATCH. + - PRESERVE the existing reasons (NEWER_SUCCESS_RUN, ARTIFACT_NOT_VERIFIED, RUN_NOT_SUCCESS). + - PRESERVE the drift_direction rank map (degrading > improving > stable > unknown). + +Task 13 — MODIFY app/features/ops/schemas.py: + - StaleReason enum: add FEATURE_FRAME_VERSION_MISMATCH = "feature_frame_version_mismatch". + - StaleAliasResponse and ModelHealthEntry: expose `alias_feature_frame_version` and `comparable_run_feature_frame_version` (both Optional) so Slice C can render the mismatch. + +Task 14 — MODIFY app/features/explainability/service.py + explainers.py: + - explainers.py: ADD WeightedMovingAverageExplainer and SeasonalAverageExplainer — mirror the simple-arithmetic shape of MovingAverageExplainer / SeasonalNaiveExplainer. Reason codes from `reason_codes.py` flow through unchanged. + - service.py: REGISTER the new explainers in the explainer_factory (the existing if-elif at L205 or its successor). NEW model_types route to their new explainer classes; the 400 "unsupported model type" path keeps catching anything truly unsupported. + - If TrendRegressionBaselineForecaster ships: ADD TrendRegressionBaselineExplainer (Ridge coefficients → "trend coefficient X, dow coefficient Y_d for d in DOW"). 
+ - If RandomForestForecaster ships: route /explain/runs/{run_id} for `random_forest` to a path that calls `extract_feature_importance` (feature-aware code path) AND a simple "tree-importance" explanation. Mirror the existing prophet_like explainability shape — but DO NOT introduce a /explain/forecast handler for random_forest in this PRP (that requires a forecast horizon + bundle reload, which is out of scope here). + - PRESERVE: HGBR-not-supported path stays as is (FeatureImportanceUnavailableError continues to surface as 422). + - PRESERVE: every reason code from reason_codes.py. + +Task 15 — UPDATE tests: + - app/features/forecasting/tests/test_feature_metadata.py — assert every new model_type appears in _MODEL_FAMILY_MAP; assert model_family_for("random_forest") == ModelFamily.TREE (if shipped). + - app/features/backtesting/tests/test_metrics.py — rmse correctness + sign convention; compute_bucket_metrics on a hand-crafted horizon array with bucket boundary cases (empty h_29_plus on a 14-day horizon). + - app/features/backtesting/tests/test_service.py — assert FoldResult.horizon_bucket_metrics shape; assert empty bucket is dropped. + - app/features/backtesting/tests/test_feature_aware_backtest_v2.py — UNCHANGED (PRP-35 owns it; do not weaken). + - app/features/registry/tests/test_service.py — V1-vs-V2 not a duplicate; find_comparable_runs returns only matching feature_frame_version runs. + - app/features/registry/tests/test_schemas.py — RunResponse round-trips feature_frame_version + feature_groups from runtime_info. + - app/features/ops/tests/test_service.py — stale-reason mismatch path; comparable-run selection excludes different feature_frame_version. + - app/features/explainability/tests/test_service.py — every new baseline routes to its explainer; HGBR still 422; random_forest 200 with tree importances (if shipped). 
+ +Task 16 — CREATE examples/forecasting/model_zoo_compare.py: + - Read-only diagnostic script — given a (store_id, product_id) pair, train + backtest the seven (or nine) models against the seeded DB, print a metrics + registry-candidate summary table with per-bucket WAPE. + - Uses the public services (no DB writes outside the existing /forecasting/train + /backtesting/run flow). + - Documented in docs/optional-features/05-advanced-ml-model-zoo.md (existing optional-features doc). + +Task 17 — UPDATE docs: + - docs/optional-features/05-advanced-ml-model-zoo.md — describe each new model + bucketed metrics + the example script. + - docs/optional-features/09-model-champion-challenger-governance.md — describe the feature_frame_version comparability rule. + - docs/_base/API_CONTRACTS.md — update /backtesting/run response shape (FoldResult.horizon_bucket_metrics; main_model_results.bucketed_aggregate_metrics; rmse in aggregate); update /registry/runs/{id} response shape (Optional feature_frame_version + feature_groups). + - docs/_base/DOMAIN_MODEL.md — update the "comparable run" definition. + +Task 18 — VERIFY no Alembic migration is needed: + - All new state rides in existing JSONB columns. Run `uv run alembic check` → must report no pending revisions. +``` + +### Per task pseudocode (the load-bearing parts) + +```python +# Task 7 — RMSE +@staticmethod +def rmse(actuals, predictions) -> MetricResult: + """Root Mean Squared Error. 
Penalises large errors more than MAE.""" + warnings: list[str] = [] + if len(actuals) == 0: + return MetricResult(name="rmse", value=float("nan"), n_samples=0, warnings=["Empty array"]) + if len(actuals) != len(predictions): + raise ValueError(f"Length mismatch: actuals={len(actuals)}, predictions={len(predictions)}") + rmse_value = float(np.sqrt(np.mean((actuals - predictions) ** 2))) + return MetricResult(name="rmse", value=rmse_value, n_samples=len(actuals), warnings=warnings) + + +# Task 7 — bucket helper +HORIZON_BUCKETS: tuple[tuple[str, int, int | None], ...] = ( + ("h_1_7", 1, 7), + ("h_8_14", 8, 14), + ("h_15_28", 15, 28), + ("h_29_plus", 29, None), +) + +def compute_bucket_metrics( + actuals: np.ndarray, + predictions: np.ndarray, + horizon_offsets: list[int], +) -> dict[str, dict[str, float]]: + """Per-horizon-bucket metric block. Empty buckets are dropped.""" + if not (len(actuals) == len(predictions) == len(horizon_offsets)): + raise ValueError("array length mismatch") + calc = MetricsCalculator() + out: dict[str, dict[str, float]] = {} + h = np.asarray(horizon_offsets) + for bucket_id, start, end in HORIZON_BUCKETS: + mask = (h >= start) if end is None else ((h >= start) & (h <= end)) # end=None: unbounded; avoids h.max() on empty input + if not mask.any(): + continue + bucket = calc.calculate_all(actuals[mask], predictions[mask]) + bucket["rmse"] = calc.rmse(actuals[mask], predictions[mask]).value + out[bucket_id] = bucket + return out + + +# Task 2 — WeightedMovingAverageForecaster (key parts) +class WeightedMovingAverageForecaster(BaseForecaster): + """Target-only baseline: weighted average of last `window_size` observations.
+ + Weighting: + - 'linear': weights = [1, 2, ..., window_size] (most recent weighted highest) + - 'exponential': weights = [decay**(W-1), ..., decay**1, decay**0] + """ + + requires_features: ClassVar[bool] = False + + def __init__(self, *, window_size: int, weight_strategy: str, decay: float, random_state: int = 42) -> None: + super().__init__(random_state=random_state) + self.window_size = window_size + self.weight_strategy = weight_strategy + self.decay = decay + self._weights: np.ndarray[Any, np.dtype[np.floating[Any]]] | None = None + self._weighted_mean: float | None = None + + def fit(self, y, X=None): + y = np.asarray(y, dtype=np.float64) + if y.size < self.window_size: + raise ValueError(f"need at least {self.window_size} observations, got {y.size}") + tail = y[-self.window_size:] + if self.weight_strategy == "linear": + self._weights = np.arange(1, self.window_size + 1, dtype=np.float64) + else: # exponential + self._weights = np.power(self.decay, np.arange(self.window_size - 1, -1, -1, dtype=np.float64)) + self._weighted_mean = float(np.average(tail, weights=self._weights)) + self._is_fitted = True + return self + + def predict(self, horizon, X=None): + if not self._is_fitted or self._weighted_mean is None: + raise RuntimeError("WeightedMovingAverageForecaster is not fitted") + return np.full(horizon, self._weighted_mean, dtype=np.float64) + + +# Task 3 — SeasonalAverageForecaster (key parts) +class SeasonalAverageForecaster(BaseForecaster): + """Target-only baseline: average of prior matching seasonal positions. + + For horizon day j with season_length S, average the values at offsets + {j - k*S} for k in [1..lookback_cycles] that fall inside the stored history. 
+ """ + + requires_features: ClassVar[bool] = False + + def __init__(self, *, season_length: int, lookback_cycles: int, trim_outliers: bool, random_state: int = 42) -> None: + super().__init__(random_state=random_state) + self.season_length = season_length + self.lookback_cycles = lookback_cycles + self.trim_outliers = trim_outliers + self._history: np.ndarray[Any, np.dtype[np.floating[Any]]] | None = None + + def fit(self, y, X=None): + y = np.asarray(y, dtype=np.float64) + min_required = self.season_length * 2 # require two full seasonal cycles of history + if y.size < min_required: + raise ValueError(f"need at least {min_required} observations, got {y.size}") + self._history = y[-(self.season_length * self.lookback_cycles):] + self._is_fitted = True + return self + + def predict(self, horizon, X=None): + if not self._is_fitted or self._history is None: + raise RuntimeError("SeasonalAverageForecaster is not fitted") + out = np.zeros(horizon, dtype=np.float64) + H = self._history + S = self.season_length + for j in range(horizon): + target_offset = j + 1 # horizon day index, 1-based + samples: list[float] = [] + for k in range(1, self.lookback_cycles + 1): + idx_from_end = k * S - target_offset + if 0 <= idx_from_end < H.size: + samples.append(float(H[H.size - 1 - idx_from_end])) + if not samples: + # Fallback: use the last observed value. Reached whenever + # target_offset > lookback_cycles * S (every idx_from_end is negative). + out[j] = float(H[-1]) + continue + arr = np.asarray(samples) + if self.trim_outliers and arr.size >= 4: + arr = np.sort(arr)[1:-1] # drop min + max + out[j] = float(arr.mean()) + return out + + +# Task 10 — find_comparable_runs (key parts) +async def find_comparable_runs( + self, + db, + *, + store_id: int, + product_id: int, + model_type: str | None, + feature_frame_version: int, + data_window_start: date, + data_window_end: date, + limit: int = 20, +) -> list[ModelRun]: + """SUCCESS runs comparable to the (grain, window, V) tuple given.
+ + Comparable predicate: + - same store_id AND same product_id; + - data windows overlap (run.data_window_end >= start AND run.data_window_start <= end); + - same feature_frame_version (read from runtime_info JSONB; defaults to 1 when absent); + - status == SUCCESS. + + Ordered by created_at desc; capped by `limit`. `model_type=None` means + "any model_type" — caller filters further if narrower. + """ + stmt = ( + select(ModelRun) + .where(ModelRun.store_id == store_id) + .where(ModelRun.product_id == product_id) + .where(ModelRun.status == RunStatus.SUCCESS) + .where(ModelRun.data_window_end >= data_window_start) + .where(ModelRun.data_window_start <= data_window_end) + # JSONB extraction: cast the key to Integer; a missing key (NULL) matches only when V=1 is requested. + .where( + (cast(ModelRun.runtime_info["feature_frame_version"].astext, Integer) == feature_frame_version) + | (and_(feature_frame_version == 1, ModelRun.runtime_info["feature_frame_version"].astext.is_(None))) + ) + .order_by(ModelRun.created_at.desc()) + .limit(limit) + ) + if model_type is not None: + stmt = stmt.where(ModelRun.model_type == model_type) + result = await db.execute(stmt) + return list(result.scalars().all()) +# CRITICAL: the JSONB "missing key = V1" clause is the back-compat seam — +# legacy V1 runs never wrote feature_frame_version, so absent key MUST +# resolve to V=1. +``` + +### Integration Points + +```yaml +DATABASE: + - migration: NONE — all new state rides in existing JSONB columns. Verify with `uv run alembic check`. + - reads: ModelRun.runtime_info JSONB; ModelRun.data_window_start / data_window_end / store_id / product_id / status / created_at. + - writes: model_run.runtime_info gains `feature_frame_version: int` + `feature_groups: dict[str, list[str]]` (additive — PRP-35 already populates these via ForecastingService.train_model extra_metadata). + +CONFIG: + - app/core/config.py: NO new settings if random_forest is dropped from scope. IF random_forest ships: `forecast_enable_random_forest: bool = False`.
+ - .env.example: matches the new setting if added. + +ROUTES: + - No new endpoint paths. + - /forecasting/train: accepts new model_type values transparently via the discriminated union. + - /backtesting/run: response gains horizon_bucket_metrics + bucketed_aggregate_metrics + rmse (additive — Slice C reads these). + - /registry/runs/{id}: response gains feature_frame_version + feature_groups (additive). + - /ops/stale-aliases, /ops/model-health: response gains StaleReason.FEATURE_FRAME_VERSION_MISMATCH variant + comparable-run feature_frame_version (additive). + +SCHEMAS: + - app/features/forecasting/schemas.py: 2-4 new ModelConfig subclasses + extend ModelConfig union; ModelFamily enum unchanged. + - app/features/backtesting/schemas.py: FoldResult + aggregate gain bucketed fields + rmse. + - app/features/registry/schemas.py: RunResponse + RunDetailResponse gain Optional feature_frame_version + feature_groups. + - app/features/ops/schemas.py: StaleReason enum extended; StaleAliasResponse + ModelHealthEntry gain alias_feature_frame_version + comparable_run_feature_frame_version. + - app/features/forecasting/feature_metadata.py: _MODEL_FAMILY_MAP extended; isinstance tuple in extract_feature_importance gains RandomForestForecaster (if shipped). + +REGISTRY MUTATION SURFACE: + - No new agent tool — agent_require_approval is unchanged. (Tasks 10-13 are pure backend; the agent layer does not see them directly.) + +CHANGELOG: + - Under "Unreleased" → "feat(forecast,backtest,registry,ops): forecast intelligence B — model zoo + backtesting metrics + comparability (#)" (release-please-feed format). 
+``` + +--- + +## Validation Loop + +### Level 1: Syntax & Style + +```bash +uv run ruff check app/features/forecasting app/features/backtesting \ + app/features/registry app/features/ops \ + app/features/explainability examples/forecasting --fix +uv run ruff format app/features/forecasting app/features/backtesting \ + app/features/registry app/features/ops \ + app/features/explainability examples/forecasting +uv run ruff format --check . + +uv run mypy app/ +uv run pyright app/ + +# Expected: zero errors. If errors, READ the message and fix; never silence. +``` + +### Level 2: Pure unit tests (no DB) + +```bash +# Load-bearing leakage specs MUST stay byte-stable — re-run them first +uv run pytest -v app/shared/feature_frames/tests/test_leakage.py +uv run pytest -v app/shared/feature_frames/tests/test_leakage_v2.py +uv run pytest -v app/features/forecasting/tests/test_regression_features_leakage.py +uv run pytest -v app/features/forecasting/tests/test_regression_features_v2_leakage.py +uv run pytest -v app/features/scenarios/tests/test_future_frame_leakage.py +uv run pytest -v app/features/scenarios/tests/test_future_frame_v2_leakage.py +uv run pytest -v app/features/featuresets/tests/test_leakage.py + +# New / modified unit tests +uv run pytest -v app/features/forecasting/tests/test_weighted_moving_average_forecaster.py +uv run pytest -v app/features/forecasting/tests/test_seasonal_average_forecaster.py +uv run pytest -v app/features/forecasting/tests/test_feature_metadata.py +uv run pytest -v app/features/forecasting/tests/test_models.py +uv run pytest -v app/features/backtesting/tests/test_metrics.py +uv run pytest -v app/features/backtesting/tests/test_service.py +uv run pytest -v app/features/registry/tests/test_service.py +uv run pytest -v app/features/registry/tests/test_schemas.py +uv run pytest -v app/features/ops/tests/test_service.py +uv run pytest -v app/features/explainability/tests/test_service.py + +# If random_forest + trend_regression_baseline 
ship: +uv run pytest -v app/features/forecasting/tests/test_random_forest_forecaster.py +uv run pytest -v app/features/forecasting/tests/test_trend_regression_baseline_forecaster.py + +# Full unit suite gate +uv run pytest -v -m "not integration" +``` + +### Level 3: Integration tests (real Postgres) + +```bash +docker compose up -d +uv run alembic upgrade head +uv run python scripts/check_db.py + +uv run alembic check # expect "no problems detected" + +# Existing V2 backtest stays green +uv run pytest -v -m integration app/features/backtesting/tests/test_feature_aware_backtest_v2.py + +# New integration tests +uv run pytest -v -m integration app/features/ops/tests/test_routes_integration.py +uv run pytest -v -m integration app/features/backtesting/tests/test_service_integration.py +uv run pytest -v -m integration app/features/registry/tests/test_service.py +``` + +### Level 4: Smoke — model zoo end-to-end + +```bash +uv run uvicorn app.main:app --reload --port 8123 + +# Train each new baseline +curl -sS -X POST http://localhost:8123/forecasting/train \ + -H 'Content-Type: application/json' \ + -d '{ + "store_id": 15, "product_id": 52, + "train_start_date": "2025-01-01", "train_end_date": "2025-12-31", + "config": {"model_type": "weighted_moving_average", "window_size": 7, "weight_strategy": "linear", "decay": 0.7} + }' | jq . + +curl -sS -X POST http://localhost:8123/forecasting/train \ + -H 'Content-Type: application/json' \ + -d '{ + "store_id": 15, "product_id": 52, + "train_start_date": "2025-01-01", "train_end_date": "2025-12-31", + "config": {"model_type": "seasonal_average", "season_length": 7, "lookback_cycles": 4, "trim_outliers": false} + }' | jq . 
+ +# Backtest a feature-aware model and confirm horizon_bucket_metrics + rmse appear +curl -sS -X POST http://localhost:8123/backtesting/run \ + -H 'Content-Type: application/json' \ + -d '{ + "store_id": 15, "product_id": 52, + "start_date": "2025-01-01", "end_date": "2025-12-31", + "config": { + "model_config": {"model_type": "regression", "max_iter": 200, "learning_rate": 0.05, "max_depth": 6, "feature_config_hash": null}, + "split_config": {"n_splits": 4, "horizon": 14, "gap": 0, "strategy": "expanding"}, + "feature_frame_version": 2, + "include_baselines": true + } + }' | jq '.main_model_results.aggregate_metrics, .main_model_results.bucketed_aggregate_metrics' + +# Registry response carries feature_frame_version + feature_groups +curl -sS http://localhost:8123/registry/runs | jq '.[0]' + +# Stale-alias / model-health — when a V1 alias has a newer comparable V2 SUCCESS run +curl -sS http://localhost:8123/ops/stale-aliases | jq '.[] | select(.reason == "feature_frame_version_mismatch")' + +# Optional preview +uv run python examples/forecasting/model_zoo_compare.py --store-id 15 --product-id 52 +``` + +--- + +## Final validation Checklist + +> **GATE FIRST:** PRP-35 is merged. Task 1 succeeded. The bundle.metadata +> contract this PRP cites matches PRP-35's final shipped names. + +- [ ] Task 1 (Contract Refresh) succeeded with zero drift. +- [ ] V1 leakage spec passes unchanged (`app/shared/feature_frames/tests/test_leakage.py`). +- [ ] V2 leakage spec passes unchanged (`app/shared/feature_frames/tests/test_leakage_v2.py`). +- [ ] V1 forecasting leakage spec unchanged. +- [ ] V2 forecasting leakage spec unchanged. +- [ ] V1 + V2 scenarios leakage specs unchanged. +- [ ] V1 + V2 backtesting leakage specs unchanged. +- [ ] V1 featuresets leakage spec unchanged. +- [ ] AST-walk leaf-level invariant passes — `app/shared/feature_frames/**` imports nothing from `app/features/**`. 
+- [ ] Strict-mode policy linter (`app/core/tests/test_strict_mode_policy.py`) passes — every new request schema with date/UUID/Decimal carries `Field(strict=False)`. +- [ ] New model classes train, predict, persist, load: + - [ ] `weighted_moving_average` + - [ ] `seasonal_average` + - [ ] (optional) `trend_regression_baseline` + - [ ] (optional) `random_forest` AND exposes `feature_importances_` +- [ ] `_MODEL_FAMILY_MAP` covers every new model_type; unknown-fallback path unchanged. +- [ ] `extract_feature_importance` still raises `FeatureImportanceUnavailableError` for HGBR. RandomForestForecaster (if shipped) returns a 1-D importance vector matching `feature_columns` length. +- [ ] `BacktestResponse.main_model_results.aggregate_metrics` includes `rmse`. +- [ ] `BacktestResponse.main_model_results.bucketed_aggregate_metrics` is non-empty when the horizon spans bucket boundaries; empty buckets are dropped. +- [ ] `FoldResult.horizon_bucket_metrics` shape verified on a synthetic horizon. +- [ ] V1 bundle backtest + V2 bundle backtest both run on identical fold boundaries. +- [ ] `RegistryService._find_duplicate` distinguishes V1 vs V2 (a V1 run and a V2 run with otherwise-identical fields are NOT duplicates). +- [ ] `RegistryService.find_comparable_runs` returns only runs with matching feature_frame_version (and overlapping window, same grain). +- [ ] `RunResponse` + `RunDetailResponse` expose `feature_frame_version` + `feature_groups` (None for pre-PRP-35 runs). +- [ ] `OpsService` stale-alias reports `FEATURE_FRAME_VERSION_MISMATCH` when an alias's run V_a differs from a newer comparable run V_b. +- [ ] `OpsService` "comparable run" predicate honours feature_frame_version (no cross-version contamination). +- [ ] Explainability handles every new baseline AND `random_forest` (if shipped). HGBR continues to 422. +- [ ] No new endpoint paths. +- [ ] No new Alembic migration (`uv run alembic check` clean). +- [ ] No new managed-cloud SDK; no AutoML. 
+- [ ] No agent tool added (`agent_require_approval` unchanged). +- [ ] CHANGELOG entry under "Unreleased": `feat(forecast,backtest,registry,ops): forecast intelligence B — model zoo + backtesting metrics + comparability (#)`. +- [ ] `examples/forecasting/model_zoo_compare.py` runs against the local DB and prints the metrics table. +- [ ] Manual smoke (Level 4) — all curls 200; JSON shapes match this PRP's spec. +- [ ] `uv run ruff check .` + `uv run ruff format --check .` clean. +- [ ] `uv run mypy app/` clean (strict). +- [ ] `uv run pyright app/` clean (strict). +- [ ] `uv run pytest -v -m "not integration"` green. +- [ ] `uv run pytest -v -m integration` green (with docker-compose up). + +--- + +## Open Design Decisions + +Locked here; do not relitigate during execution unless Task 1 surfaces a +mismatch with PRP-35's final shape. + +| # | Decision | Resolution | Why | +|---|----------|------------|-----| +| 1 | `trend_regression_baseline` shipped now or deferred? | **Ship it** unless Task 1 surfaces unresolved drift. The Ridge baseline gives a clean target-only "trend + calendar" comparator and matches the existing prophet_like additive lineage. Cost: ~150 LoC + 1 test file. | Marginal scope; outsized comparator value. | +| 2 | `random_forest` shipped now or deferred? | **Ship it.** Pure sklearn dep (already core); exposes `feature_importances_` (verified); deterministic with `random_state=42, n_jobs=1` (verified). Compute cost on a single store/product is acceptable for the local-host vision. | Adds an honest tree comparator with feature_importances_ that HGBR cannot give. | +| 3 | `weighted_moving_average` decay strategy: linear vs exponential? | **Both, via `weight_strategy` enum.** Default = "linear" (simpler, more intuitive). "Exponential" is the StatsForecast canon. | One model class, two weighting schemes, two test paths. | +| 4 | `seasonal_average` averages over last N cycles or all available? 
| **Last N cycles (config: lookback_cycles, default 4).** All-available is a degenerate special case; the N-cycle window is the StatsForecast / Nixtla canon and keeps the estimator stable across long histories. | Bounded memory; predictable behaviour. | +| 5 | "Comparable run" must share `feature_frame_version`? | **Yes — same grain AND overlapping data window AND same feature_frame_version.** Cross-V comparison silently breaks the alias contract. | Champion alias must point at a stable training contract. | +| 6 | Per-horizon-bucket id naming: `h_1_7` vs `h_1-7` vs camelCase? | **Snake_case with underscore range (`h_1_7`, `h_8_14`, `h_15_28`, `h_29_plus`).** JSON-key-safe; TypeScript-friendly; matches the existing metric naming (`mae`, `wape`). | Stable string keys; no enum confusion in JSON. | +| 7 | Stale reason on V_a != V_b: separate enum value or NEWER_SUCCESS_RUN with extra metadata? | **Separate enum value `FEATURE_FRAME_VERSION_MISMATCH`.** The UI affordance is different (Slice C wants to surface a "this alias's V is now stale" badge separately from "a newer run exists"). | Distinct operational meaning → distinct enum. | +| 8 | Where does `feature_frame_version` ride on `RunResponse`? | **As an Optional top-level field, parsed from `runtime_info` JSONB via a Pydantic validator.** No DB-column promotion. | Avoids an Alembic migration; matches the additive pattern PRP-35 used. | +| 9 | Tightening existing model config defaults? | **NO change unless backtest evidence justifies it AND the implementer adds the regression test.** Defaults that change `bundle_hash` are forbidden in this PRP. | Don't break in-flight bundles. | +| 10 | Per-horizon-bucket aggregate dropped or NaN'd for empty buckets? | **DROPPED.** A 14-day horizon's `h_29_plus` bucket simply does not appear in the response — JSON stays terse and Slice C never has to interpret a NaN. | Slimmer payloads; clear semantics. 
| + +--- + +## Unresolved Contract Assumptions (waiting on PRP-35 execution) + +Each item below is an assumption this PRP makes about PRP-35's final shape. +Task 1 (Contract Refresh) MUST verify each one. If any assumption breaks, +patch the relevant Task in this PRP file BEFORE writing any new code. + +1. `bundle.metadata["feature_frame_version"]: int` exists for V2 bundles and + defaults to 1 for V1 bundles (via `.get(key, 1)` at the consumer side). + PRP-35 Tasks 9 + 12 + 13 promise this; Task 1 verifies. +2. `bundle.metadata["feature_columns"]: list[str]` is set for V1 AND V2 + bundles. PRP-35 Task 9 promises this; the V1 path already existed + pre-PRP-35 (we rely on PRP-35 keeping it). +3. `bundle.metadata["feature_groups"]: dict[str, list[str]]` is set for V2 + bundles ONLY (absent for V1). PRP-35 § Integration Points promises this. +4. `bundle.metadata["feature_safety_classes"]: dict[str, str]` is set for V2 + bundles ONLY. PRP-35 § Integration Points promises this. +5. `bundle.metadata["feature_pinned_constants"]: dict[str, list[int]]` is + set for V2 bundles ONLY. PRP-35 Task 9 promises this. +6. `TrainRequest.feature_frame_version: int = 1` and + `TrainRequest.feature_groups: list[str] | None = None` exist on the + schema with the V1-rejects-feature_groups validator (the post-patch + wording from this conversation). PRP-35 Task 7 promises this. +7. `backtesting/service.py` already reads + `bundle.metadata.get("feature_frame_version", 1)` BEFORE the fold loop + AND dispatches the build_*_feature_rows_v2 calls at the V1 call sites + (lines 493 / 553 in the V1 codebase). PRP-35 Task 13 promises this. +8. `forecasting/service.py` already writes `feature_frame_version` AND + `feature_groups` into `extra_metadata` (and thence `model_run.runtime_info` + via the registry create_run path). PRP-35 Task 9 promises this. +9. 
`app/features/forecasting/v2_loaders.py` exposes `load_lifecycle_attrs`, + `load_inventory_history`, `load_replenishment_history`, + `load_returns_history`, `load_promotion_history`, `load_exogenous_history`, + `assemble_v2_historical_sidecar`, `assemble_v2_future_sidecar`. PRP-35 + Task 8 promises this. The model_zoo backtest path reuses them. +10. `FeatureGroup` enum names match the values used in + `DEFAULT_V2_GROUPS = (TARGET_HISTORY, ROLLING, TREND, CALENDAR, + PRICE_PROMO, LIFECYCLE)`. PRP-35 Task 1 promises this. + +If ANY assumption above fails Task 1 verification: open a `chore(docs): +refresh PRP-36 against PRP-35 final contract (#)` PR that +edits THIS PRP file in place, THEN proceed to Task 2. + +--- + +## Anti-Patterns to Avoid + +- ❌ Don't modify any V1 builder signature, return type, or body — PRP-35 + froze V1. Dispatch lives at the service layer. +- ❌ Don't cite `HistGradientBoostingRegressor.feature_importances_` — it + does not exist on HGBR (memory `histgbr-no-feature-importances`). The + existing `FeatureImportanceUnavailableError` is the contract; don't + weaken it. +- ❌ Don't add `permutation_importance` behind the existing explainability + endpoints in this PRP — that's a separate PRP (compute budget + UI + question). +- ❌ Don't introduce a new Alembic migration; every new field rides in + existing JSONB columns. +- ❌ Don't change the demo pipeline (`scripts/run_demo.py` / + `app/features/demo/pipeline.py`) — it's Slice C territory. +- ❌ Don't change `bundle_hash` for in-flight bundles — every config-default + change must justify itself with a regression test AND a `schema_version` + bump if it's behaviour-changing. +- ❌ Don't compare across `feature_frame_version` in champion/challenger or + stale-alias logic — that silently breaks the alias contract. +- ❌ Don't import `lightgbm` or `xgboost` at module load time; the lazy + imports stay inside `fit`. +- ❌ Don't add an agent tool in this PRP — `agent_require_approval` is + unchanged. 
+- ❌ Don't widen `app/shared/feature_frames/**` to import from any features + slice — the AST-walk invariant catches it. +- ❌ Don't refactor `data_platform.models` consumers (memory + `data-platform-shared-orm-layer`) — that's a different PRP. +- ❌ Don't fabricate per-horizon-bucket data — if no test point falls in a + bucket, drop the bucket from the response. +- ❌ Don't promote a "newer-but-worse" run. The Promote affordance in + Slice C will surface the comparable-run metrics — this PRP's job is to + make those metrics correctly computed and correctly grouped. + +--- + +## Confidence + +**Confidence: 7/10** for one-pass implementation success after PRP-35 lands. + +What grounds the 7: +- The four library claims this PRP needs (HGBR exposes no `feature_importances_`, RandomForest exposes a 1-D `feature_importances_`, + RandomForest is deterministic with `random_state=42, n_jobs=1`, `np.average` accepts `weights`) are + verified at runtime against the live env (sklearn 1.8.0, numpy 2.4.1). + Commands captured in "Known Gotchas". +- Every seam is anchored at file:line — both the existing surfaces (model + factory, _MODEL_FAMILY_MAP, _find_duplicate, _alias_staleness, metrics + calculator) and the PRP-35-created surfaces. +- The "comparable run" rule resolves the ops semantic cleanly: same grain + + overlapping window + same V. The mismatch path gets its own enum + value so Slice C can surface it distinctly. +- The bucket ids are stable string keys; the empty-bucket drop rule + keeps the JSON terse for Slice C. + +What costs the 3 points: +- **PRP-35 has not landed.** Task 1 is the gate; until it succeeds, every + later task is conditional on assumptions matching reality. The + "Unresolved Contract Assumptions" list spells out exactly what to + re-verify. +- `lightgbm` / `xgboost` are not installed in the default venv. Config + tightening is paper-only until the extras are installed; this PRP cannot + prove the runtime tightening works without an integration step.
+- Optional models (`trend_regression_baseline`, `random_forest`) add + surface area; if the planning review punts either, several tasks shrink. + Recommended position: ship both. diff --git a/PRPs/PRP-37-forecast-intelligence-C-interactive-ui.md b/PRPs/PRP-37-forecast-intelligence-C-interactive-ui.md new file mode 100644 index 00000000..8cdd9be6 --- /dev/null +++ b/PRPs/PRP-37-forecast-intelligence-C-interactive-ui.md @@ -0,0 +1,1221 @@ +name: "PRP-37 — Forecast Intelligence C: Interactive UI + Operator Workflow" +description: | + Make the Forecast Intelligence A/B backend additions usable by planners + and operators through the React SPA — model-family + feature-frame + selectors, feature-pack toggles, per-horizon-bucket comparison surfaces, + champion/challenger safety affordances, and an explainability layer that + honours every backend caveat (HGBR-unavailable, stockout warning, V1-vs-V2 + alias mismatch). Slice C of the Forecast Intelligence roadmap + (`PRPs/INITIAL/INITIAL-forecast-intelligence-index.md`). + + > **PREREQUISITES — HARD DEPENDENCY ON PRP-35 AND PRP-36.** + > + > This PRP MUST NOT introduce UI affordances for backend fields that have + > not yet landed. Task 1 (Contract Probe) is the gate: it runs against the + > live backend (or `app/features/**/schemas.py`) and produces the EXACT + > field-name list this PRP wires UI to. If a cited PRP-35 / PRP-36 field + > is absent, the corresponding UI task is DEFERRED — not implemented with + > a placeholder, not faked. The INITIAL is explicit: "Do not fake backend + > values in the UI." This PRP honours that as a hard rule. + > + > **Partial-execution mode is supported.** If PRP-35 is merged but PRP-36 + > is not, Tasks tagged `[gate:PRP-35]` ship; tasks tagged `[gate:PRP-36]` + > are deferred to a follow-up PR. If neither is merged, only the + > existing-fields refinements ship (segmented-control polish, table + > refinements, stockout caveats from existing reason_codes). 
+ +## Purpose +A one-pass implementation contract for an AI agent (or human) with access +to the codebase but no prior session context. Land an operator-grade UI +that surfaces the backend semantics PRP-35 + PRP-36 add — feature_frame_version, +feature_groups, per-horizon-bucket metrics, comparable-run version +mismatch, RandomForest feature importances — without inventing values +that don't exist server-side and without bypassing the project's shadcn +component workflow. + +## Core Principles +1. **Backend contracts are read-only.** Every visible value originates from + a backend field. The UI NEVER fabricates a feature_frame_version, + NEVER invents a feature_group, NEVER displays a metric that the backend + did not return. +2. **shadcn workflow is the only path.** Per `.claude/rules/shadcn-ui.md`: + every shadcn component arrives through `pnpm dlx shadcn@4.7.0 …` AND + the `shadcn` skill / MCP. No raw GitHub fetches. Per memory + `radix-ui-vs-per-component-imports`: per-component + `@radix-ui/react-X` imports, never the `radix-ui` barrel. Per memory + `shadcn-cli-version-pin`: pin shadcn@4.7.0 (NOT 5.x — 5.x silently + writes a stub `pnpm-workspace.yaml` and skips the component). +3. **Dense, operator-grade UI.** Not a landing page. The first screen is + the working tool. shadcn controls only — Tabs (used as segmented + controls), Select, Checkbox/Toggle, Slider, Dialog/AlertDialog, Tooltip, + DataTable, Recharts. +4. **URL-shareable state.** Every filter / sort / page parameter flows + through `frontend/src/lib/url-params.ts` (existing). New + model-family / feature-frame-version / feature-groups state goes the + same route, parsed with the project's validation-at-read helpers. +5. **TypeScript strict + Vitest green.** `pnpm tsc --noEmit` + `pnpm lint` + + `pnpm test --run` are merge gates. 
Every conditional rendering branch + (missing feature_frame_version, no feature importances, stale-alias + with worse latest WAPE, artifact-verification failure, HGBR-unsupported + path) gets a test. +6. **No agent mutation surface widening.** This PRP touches the UI only; + `agent_require_approval` is unchanged. If a future PRP adds a + "Promote to alias via agent" tool, that's a separate scope. +7. **No backend logic.** Model classes, metric formulas, registry + comparability rules — all live in PRP-35 / PRP-36. This PRP consumes + them as JSON and renders them. + +--- + +## Goal + +Deliver, on branch `feat/forecast-ui-interactive-workflow`, an interactive +operator UI that exposes every backend capability PRP-35 + PRP-36 add: + +- **Forecast training control surface** — segmented model-family picker + (Tabs styled as segmented), model-type Select that toggles by family, + V1/V2 feature-frame Select, conditional FeatureGroup multi-select + toggle group, conservative defaults. +- **Backtest comparison surface** — multi-model fold comparison on + identical splits, metric cards (MAE / sMAPE / WAPE / bias / RMSE), + horizon-bucket metric table, "best WAPE / lowest bias" badges, + "stale alias / degrading / stockout-constrained / feature-aware / + baseline" badges, "newer-vs-better" callout. +- **Run detail + run-compare** — feature_frame_version + feature_groups + panel; top feature importances or additive components; artifact hash + verification badge; "comparable with current champion?" indicator. +- **What-if planner** — quick-vary sliders (price delta, promotion, + holiday, inventory, lifecycle), side-by-side baseline-vs-scenario + chart, "model_exogenous vs heuristic" method label, + known-future-input vs hypothetical labelling. 
+- **Ops control center** — degrading-status explainability (latest + WAPE, previous comparable WAPE, delta, n_comparable_runs, data-window + freshness); safer Promote (AlertDialog with worse-WAPE confirm + artifact + verify + champion/challenger comparison + stale-reason). +- **Batch sweeps** — multi-model + multi-feature-pack submission; + presets (quick baseline sweep / feature-aware comparison / champion- + challenger refresh / stockout-sensitive products / high-WAPE recovery); + PRP-34 parallel-execution controls preserved. +- **Agent/RAG affordances** — copyable context buttons ("Explain why this + model degraded", "Summarize champion vs challenger", "Recommend next + backtest") that pipe into the existing /chat flow. RAG continues to + cite user-guide docs; no new agent tool. + +## Why + +Without this PRP, the backend gains a model zoo + V2 features + per-horizon +metrics + feature_frame_version comparability rules — and operators see +none of it through the dashboard. They can't: + +- choose between same-grain models on identical folds without writing curl; +- distinguish a V1 alias from a V2 challenger (silent drift); +- read the stockout caveat that the backend already emits in reason codes; +- avoid promoting a newer-but-worse run. + +Slice C is the operator surface that makes the A/B work usable. + +## What + +### User-visible behaviour + +- `/visualize/forecast`: New control row — Tabs (Baseline / Tree / + Additive) → Select (model type, filtered by family) → Select + (Feature frame V1 / V2 — disabled+tooltip when backend does not + expose the field yet) → conditional Toggle group of feature packs + (only when V2 is selected AND backend exposes feature_groups). + Default selections are conservative: family=Baseline, + model_type=seasonal_naive, feature_frame=V1. +- `/visualize/backtest`: New per-horizon-bucket metric table beneath + the existing fold-metric chart, when `bucketed_aggregate_metrics` + is present in the response. 
New RMSE column when + `aggregate_metrics.rmse` is present. New baseline-vs-feature-aware + comparison view when `baseline_results` is non-empty AND + `comparison_summary` is populated. +- `/visualize/planner`: New "method" badge (`model_exogenous` | + `heuristic`) next to the run-id picker; "known future input" vs + "hypothetical" pill on each assumption row; baseline-vs-scenario + multi-series chart already exists — extended to label units delta + + revenue delta inline. +- `/explorer/run-detail`: New "Feature frame" panel showing + feature_frame_version + feature_groups when present; the panel + collapses gracefully (empty state) for pre-PRP-35 runs. +- `/explorer/run-compare`: New "Feature frame version" comparison row + in the metrics table; "Champion compatibility" badge that surfaces + the comparable-run rule's verdict (same grain + overlapping window + + same V). +- `/ops`: Stale-alias panel adds a `feature_frame_version_mismatch` + reason chip; degrading-status row exposes + `latest_wape / previous_wape / wape_delta / n_comparable_runs / + last_trained_at / staleness_days` (already in `ModelHealthEntry` — + this PRP surfaces them). +- `/visualize/batch`: Adds preset Select (5 presets) and a multi-model + multi-feature-pack matrix picker for batch sweeps. +- Every chat page: a "Use this context" copy button on the relevant + panels (run-detail, ops health card) that pre-fills the chat input + with a structured prompt; no new agent tool. + +### Technical requirements + +- TypeScript 5.9 strict — `pnpm tsc --noEmit` clean. +- ESLint clean — `pnpm lint` clean. +- Vitest 4 + @testing-library/react — every new component / hook / + conditional-rendering branch has a test; `pnpm test --run` clean. +- shadcn workflow per `.claude/rules/shadcn-ui.md` — every new component + arrives via `pnpm dlx shadcn@4.7.0 add …` from `frontend/` (NOT repo + root); no hand-rolled clones of components that exist in the registry. 
+- URL-shareable state preserved on every page that currently has it + (`/explorer/{stores,products,runs,jobs,sales}`, + `/visualize/{forecast,backtest,planner,demand,batch}`). +- RFC 7807 error mapping intact — surface `ApiError.detail.detail` (or + fallback to `.title`); never display the bare `.status`. +- No new backend routes. No new env vars. No managed-cloud SDK. + +### Success Criteria + +- [ ] Contract Probe (Task 1) succeeds: every PRP-35 / PRP-36 field this + PRP wires UI to is verified present (or its task is explicitly DEFERRED + with a note pointing at the absent field). +- [ ] `/visualize/forecast` segmented-control + model-type select + + feature-frame select + conditional feature-pack toggles render and + submit a TrainRequest the backend accepts. +- [ ] `/visualize/backtest` renders the horizon-bucket metric table when + the response contains `bucketed_aggregate_metrics`; falls back to a + no-buckets state when absent. +- [ ] `/visualize/backtest` shows RMSE column when `aggregate_metrics.rmse` + exists; column is omitted (not zero-padded) when absent. +- [ ] `/visualize/planner` labels each assumption row as + "known future input" or "hypothetical" per the existing + `is_known_future` flag (verify in Task 1; this PRP does NOT invent it). +- [ ] `/explorer/run-detail` "Feature frame" panel renders V1/V2 + groups + when present; renders empty-state when absent. +- [ ] `/explorer/run-compare` "Champion compatibility" badge follows the + comparable-run rule (same grain + overlap + same V); incompatible runs + display a warning chip. +- [ ] `/ops` stale-alias view supports the new + `feature_frame_version_mismatch` reason chip. +- [ ] `/ops` model-health view explains "degrading" via the WAPE delta + + comparable-run count + staleness fields already on + `ModelHealthEntry`. +- [ ] Promote dialog requires confirmation when latest WAPE > + previous_wape; surfaces artifact verification + champion/challenger + delta inline. 
+- [ ] `/visualize/batch` 5 presets work; the multi-model matrix picker + emits a valid `BatchSubmitRequest`. +- [ ] Every conditional-rendering branch has a Vitest test: + - missing feature_frame_version → empty state + - missing feature_groups → V2 toggles hidden + - HGBR explainability 422 → friendly "use lightgbm/xgboost for + importances" message (the existing pattern in + feature-importance-panel.tsx — confirm not weakened) + - random_forest (if shipped by PRP-36) → tree-importance variant + - stale alias with worse latest WAPE → Promote AlertDialog requires + explicit confirm + - artifact verification failed → red badge + tooltip with + `stored_hash` vs `computed_hash` +- [ ] No raw `from 'radix-ui'` imports introduced (verified by grep). +- [ ] No new `components/ui/*` file hand-rolled where a shadcn registry + component exists. +- [ ] `pnpm tsc --noEmit && pnpm lint && pnpm test --run` clean. +- [ ] Backend test suite still green + (`uv run pytest -v app/features/forecasting/tests app/features/backtesting/tests app/features/registry/tests app/features/ops/tests -m "not integration"`) + — this PRP touches no backend code. + +--- + +## All Needed Context + +### Documentation & References + +```yaml +# ─── Backend contract PRPs (Slice A + B) — load first ─────────────────── +- file: PRPs/PRP-35-forecast-intelligence-A-feature-frame-v2.md + why: V2 feature contract (FeatureGroup names, bundle.metadata fields, TrainRequest.feature_frame_version + feature_groups). Slice C consumes these as JSON. + +- file: PRPs/PRP-36-forecast-intelligence-B-model-zoo-backtesting.md + why: New model_types, RMSE, horizon_bucket_metrics shape, RunResponse.feature_frame_version + feature_groups, StaleReason.FEATURE_FRAME_VERSION_MISMATCH. Slice C consumes these as JSON. + +- file: PRPs/INITIAL/INITIAL-forecast-intelligence-C-interactive-ui.md + why: Source of truth for THIS PRP's scope. Re-read on disagreement. 
+ +# ─── Project rules (enforce mechanically) ──────────────────────────────── +- file: .claude/rules/ui-design.md + why: UI workflow rule — Stitch / frontend-design / webapp-testing skill orchestration. The shadcn layer is governed by shadcn-ui.md below; ui-design.md governs the surrounding workflow (design system, browser verification). + +- file: .claude/rules/shadcn-ui.md + why: Mandatory shadcn workflow — invoke the shadcn skill + mcp__shadcn__* tools BEFORE writing any shadcn-touching code. Pin shadcn@4.7.0. From frontend/, NOT repo root. Verify project context (new-york, lucide, aliases) from frontend/components.json:1-23 first. + +- file: .claude/rules/test-requirements.md + why: Frontend testing matrix — every new component owning non-trivial state SHOULD have a vitest test; type-level changes MUST keep `pnpm tsc --noEmit` clean. + +- file: .claude/rules/output-formatting.md + why: Skill report shape only — does not gate UI code. Skip unless writing a skill. + +- file: .claude/rules/security-patterns.md + why: RFC 7807 error envelope is the only error shape the UI may parse; `verify=False` on outbound clients is forbidden (not applicable to UI). No secrets in code/logs (no client-side env vars carry secrets — VITE_API_BASE_URL is public). + +# ─── Frontend codebase anchors ───────────────────────────────────────── +- file: frontend/components.json + why: shadcn config — style=new-york, iconLibrary=lucide, aliases @/components @/ui @/lib @/hooks → src/. The shadcn CLI runs from frontend/, not repo root (otherwise it fails to find this file). + +- file: frontend/package.json + why: Versions. React 19.2, Vite 7.2, Tailwind 4.1, react-router-dom 7.13, @tanstack/react-query 5.90, @tanstack/react-table 8.21, recharts 2.15, vitest 4.1, @testing-library/react 16.3, lucide-react 0.563, date-fns 4.1, react-day-picker 9.13, next-themes 0.4.6. Per-component @radix-ui/* pinned. + +- file: frontend/src/types/api.ts + why: Source of truth for backend wire types. 
Extended by THIS PRP additively when PRP-35 / PRP-36 field names are confirmed in Task 1. Existing shapes anchored — ForecastPoint L102-107, FeatureMetadataResponse L216-223, ModelRun L179-203, Alias L229-237, RunCompareResponse L239-244, Job L261-274, BatchSubmitRequest L347-355, BatchSubmitResponse L357-375, BatchItemResponse L377-395, OpsSummaryResponse L790-798, ModelHealthEntry L830-843, RetrainingCandidate L801-810, ScenarioAssumptions L884-890, ScenarioComparison L923-943, MultiScenarioComparison L1000-1008, ForecastExplanation L1036-1048, ProblemDetail L540-549. + +- file: frontend/src/lib/api.ts + why: Typed fetch wrapper L23-92. RFC 7807 parsed at the error path (matches `application/problem+json` MIME). `getErrorMessage(error)` is the canonical extractor; never display raw `.status`. + +- file: frontend/src/lib/url-params.ts + why: parsePageParam L17-25, parseIdParam L27-35, parseEnumParam L37-48. New URL params (e.g. `feature_frame_version`, `feature_groups`) use parseEnumParam against the FeatureGroup enum values delivered by PRP-35. + +- file: frontend/src/App.tsx + why: Routing skeleton. Routes via ROUTES constants — DASHBOARD '/', SHOWCASE, OPS, EXPLORER.* (SALES/STORES/PRODUCTS/RUNS/JOBS), VISUALIZE.* (FORECAST/BACKTEST/DEMAND/PLANNER/BATCH), CHAT, KNOWLEDGE, GUIDE, ADMIN. Lazy-loaded + Suspense fallback. NO new routes. + +- file: frontend/src/components/layout/top-nav.tsx + why: NavigationMenu + mobile Sheet pattern. NO change to nav entries; new affordances are page-internal. + +# ─── Pages this PRP modifies ─────────────────────────────────────────── +- file: frontend/src/pages/visualize/forecast.tsx + why: Current HORIZON_OPTIONS, train job picker, showInterval, CSV export. ADD: family Tabs, model_type Select filtered by family, feature_frame Select (V1/V2), feature_groups toggle group. Default = (Baseline, seasonal_naive, V1). 
+ +- file: frontend/src/pages/visualize/backtest.tsx + why: Current 7-model selector, date range, n_splits, BacktestFoldsChart. ADD: RMSE column when present; horizon-bucket metric table when `bucketed_aggregate_metrics` present; baseline-vs-feature-aware comparison view when both present. + +- file: frontend/src/pages/visualize/planner.tsx + why: Baseline job picker, ScenarioAssumptions form. ADD: method badge (`model_exogenous` | `heuristic`); known-future-input vs hypothetical pills. + +- file: frontend/src/pages/explorer/run-detail.tsx + why: Run metadata + ExplanationPanel + FeatureImportancePanel. ADD: Feature frame panel showing V1/V2 + groups + safety_classes. + +- file: frontend/src/pages/explorer/run-compare.tsx + why: Two-run side-by-side, DeltaCell, config_diff, metrics_diff. ADD: Feature frame version row; Champion compatibility badge. + +- file: frontend/src/pages/ops.tsx + why: OpsSummary + RetrainingCandidates + ModelHealth + Promote dialog. ADD: feature_frame_version_mismatch reason chip; degrading-explainer fields; safer Promote AlertDialog. + +- file: frontend/src/pages/visualize/batch.tsx + why: Current submit form, PRP-34 max_parallel slider + cancel AlertDialog. ADD: 5 preset Select; multi-model multi-feature-pack matrix picker. + +# ─── Hooks this PRP modifies or adds ─────────────────────────────────── +- file: frontend/src/hooks/use-runs.ts + why: Existing query keys L24-56. Extend useRuns query params to accept feature_frame_version filter (when backend supports it — verify Task 1). + +- file: frontend/src/hooks/use-ops.ts + why: useOpsSummary refetchInterval 15s, useRetrainingCandidates, useModelHealth. NO new hooks; consume new fields from existing response shapes. + +- file: frontend/src/hooks/use-feature-metadata.ts + why: useRunFeatureMetadata(runId, enabled). The existing retry:false stays. Slice C reads new feature_groups / safety_classes from the same response when present. 
+ +- file: frontend/src/hooks/use-jobs.ts + why: useJobs polling + useJob refetchInterval. NO change to logic; consume new fields when present. + +- file: frontend/src/hooks/use-batches.ts + why: Submit + cancel + items pagination. NO change to logic; presets are a UI concept that emits the same BatchSubmitRequest shape. + +# ─── Existing components this PRP modifies ───────────────────────────── +- file: frontend/src/components/charts/backtest-folds-chart.tsx + why: Bar chart over fold metrics. ADD a sibling `BacktestHorizonBucketsChart` for per-bucket WAPE / RMSE (do NOT extend this one — the data shape is different). + +- file: frontend/src/components/charts/multi-series-chart.tsx + why: Existing multi-scenario plotter. Reused for baseline-vs-feature-aware backtest comparison view. + +- file: frontend/src/components/data-table/data-table.tsx + why: Generic TanStack table wrapper L41-100. NEW columns added by passing ColumnDef arrays; no change to the generic. + +- file: frontend/src/components/common/status-badge.tsx + why: CVA variants — default/success/warning/error/info/pending. REUSED for "feature-aware" / "baseline" / "stale" / "degrading" / "stockout-constrained" badges (variant=info | warning). + +- file: frontend/src/components/common/model-family-badge.tsx + why: family ∈ ('baseline','tree','additive') → secondary+Activity / default+TreePine / outline+LineChart. REUSED. + +- file: frontend/src/components/explainability/explanation-panel.tsx + why: ForecastExplanation drivers + reason codes + confidence + caveats. NO weakening; reused as-is. + +- file: frontend/src/components/explainability/feature-importance-panel.tsx + why: Handles 400 (baseline), 404 (missing), 422 (HGBR — FeatureImportanceUnavailableError). The 422 path is the load-bearing user-facing message; DO NOT weaken. If PRP-36 ships random_forest, this panel renders a new "tree" variant. 
+ +# ─── shadcn registry components installed today ──────────────────────── +- file: frontend/src/components/ui/tabs.tsx + why: USED AS the segmented control for model-family picker (no separate segmented-control primitive exists in shadcn). + +- file: frontend/src/components/ui/select.tsx + why: Model-type, feature-frame-version, batch-preset. + +- file: frontend/src/components/ui/checkbox.tsx + why: Feature-pack toggles (conditional, V2-only). + +- file: frontend/src/components/ui/slider.tsx + why: Price-delta and quick-vary inputs in the planner page. + +- file: frontend/src/components/ui/dialog.tsx + alert-dialog.tsx + why: Promote confirmation when latest WAPE > previous_wape. + +- file: frontend/src/components/ui/tooltip.tsx + why: Disabled-state explanations (e.g. "V2 unavailable — server has not shipped Forecast Intelligence A"). + +- file: frontend/src/components/ui/badge.tsx + why: Status / family / mismatch chips. + +- file: frontend/src/components/ui/table.tsx + why: Horizon-bucket metric table; comparable-run table. + +# ─── Test patterns ───────────────────────────────────────────────────── +- file: frontend/src/components/common/model-family-badge.test.tsx + why: Pattern for badge-shape tests (asserts icon + variant per family). + +- file: frontend/src/components/explainability/feature-importance-panel.test.tsx + why: Pattern for conditional-rendering tests against error states (400/404/422). + +- file: frontend/src/lib/url-params.test.ts + why: Pattern for URL-param parsing unit tests. + +- file: frontend/src/hooks/use-batches.test.ts + why: Pattern for hook tests (query key shape + refetch interval). + +- file: frontend/src/hooks/use-demo-pipeline.test.ts + why: WebSocket-driven hook test pattern (NOT needed for this PRP but available as a reference). 
+ +# ─── External docs (load on demand) ──────────────────────────────────── +- url: https://ui.shadcn.com/docs/components/tabs + section: "Anatomy" + "Examples → Vertical" + critical: Tabs styled with `variant` + bold border-bottom is the project's segmented-control look. NEVER hand-roll a "SegmentedControl" component. + +- url: https://www.radix-ui.com/primitives/docs/components/slider + section: "API" + critical: Used for price-delta slider; `min`, `max`, `step`, `defaultValue: [number]` (array), `onValueChange: (vals: number[]) => void`. Note: shadcn's slider wraps `@radix-ui/react-slider`; do NOT import the Radix barrel. + +- url: https://tanstack.com/query/latest/docs/framework/react/guides/query-keys + section: "Query Key Hashing" + critical: New URL params land in the query key tuple after the page+pageSize prefix to keep invalidation stable. + +- url: https://tanstack.com/table/latest/docs/api/core/column-def + section: "ColumnDef" + critical: New horizon-bucket columns are dynamic — the bucket id set depends on `bucketed_aggregate_metrics` keys at response time. Build ColumnDef[] at render time, NOT module-load time. + +- url: https://recharts.org/en-US/api/ComposedChart + section: "Props" + critical: `data` MUST be an array of plain objects with stable keys. Bucket-aggregate chart maps {bucket_id: wape} into {bucket: 'h_1_7', value: 12.4}. + +- url: https://date-fns.org/v4.1.0/docs/format + section: "Format tokens" + critical: Use `format(date, 'yyyy-MM-dd')` for backend-facing ISO dates; never the locale-dependent `'PP'`. + +# ─── Memory anchors ──────────────────────────────────────────────────── +- memory: shadcn-cli-version-pin + why: Pin `shadcn@4.7.0`. 5.x silently writes a stub pnpm-workspace.yaml and skips the component install. + +- memory: radix-ui-vs-per-component-imports + why: This project uses `@radix-ui/react-X` per-component packages. Never `from 'radix-ui'` (the barrel shadcn 5.x emits). Grep + fix any newly-added file. 
+ +- memory: playwright-dogfood-snap-chromium + why: Dogfood via the `webapp-testing` skill (or native Python Playwright with executable_path=/snap/bin/chromium). Playwright MCP fails on this host. + +- memory: dogfood-stale-uvicorn-port-8123 + why: Check `ps -ef | grep uvicorn` for stale processes before claiming UI changes work; a previous-session uvicorn may serve stale code on :8123. + +- memory: stale uvicorn pattern + why: Same as above — surface as a HANDOFF note when smoke-testing. + +- memory: computed-field-cross-slice-cycle + why: Backend-side concern (Pydantic computed_field cycling across slices). Frontend simply consumes the resulting JSON; this memory is a sanity check, not a constraint here. +``` + +### Current Codebase tree (relevant frontend) + +``` +frontend/ +├── components.json # shadcn config +├── package.json # versions (React 19, Vite 7, Tailwind 4) +├── src/ +│ ├── App.tsx # routes +│ ├── lib/ +│ │ ├── api.ts # typed fetch + RFC 7807 +│ │ ├── url-params.ts # parsePageParam, parseIdParam, parseEnumParam +│ │ ├── scenario-utils.ts +│ │ ├── ops-actions.ts +│ │ └── ops-utils.ts +│ ├── types/ +│ │ └── api.ts # backend wire types — additive +│ ├── hooks/ +│ │ ├── use-runs.ts +│ │ ├── use-jobs.ts +│ │ ├── use-ops.ts +│ │ ├── use-batches.ts +│ │ ├── use-feature-metadata.ts +│ │ ├── use-explanations.ts +│ │ ├── use-scenarios.ts +│ │ ├── use-config.ts +│ │ ├── use-stores.ts / use-products.ts / use-kpis.ts / use-timeseries.ts / use-drilldowns.ts / use-inventory.ts / use-lifecycle-curve.ts / use-rag-sources.ts / use-seeder.ts / use-websocket.ts / use-demo-pipeline.ts +│ ├── pages/ +│ │ ├── visualize/{forecast,backtest,planner,demand,batch}.tsx +│ │ ├── explorer/{run-detail,run-compare,runs,jobs,stores,products,sales,store-detail,product-detail,job-detail}.tsx +│ │ ├── ops.tsx +│ │ └── … +│ ├── components/ +│ │ ├── ui/ # 27 shadcn components (tabs, select, checkbox, slider, dialog, alert-dialog, tooltip, table, badge, …) +│ │ ├── 
charts/{backtest-folds-chart,multi-series-chart,time-series-chart,kpi-card,revenue-bar-chart}.tsx +│ │ ├── common/{model-family-badge,status-badge,date-range-picker,job-picker,json-block,loading-state,error-display}.tsx +│ │ ├── data-table/{data-table,data-table-column-header,data-table-pagination,data-table-toolbar,data-table-view-options}.tsx +│ │ ├── explainability/{explanation-panel,feature-importance-panel}.tsx +│ │ ├── chat/{chat-message,chat-input,tool-call-display}.tsx +│ │ ├── admin/ai-models-panel.tsx +│ │ ├── demo/demo-step-card.tsx +│ │ └── layout/{app-shell,top-nav,theme-toggle}.tsx +│ └── providers/ +│ └── theme-provider.tsx # next-themes +└── vitest.config.ts # jsdom; src/**/*.test.{ts,tsx} +``` + +### Desired Codebase tree (additive + modified files) + +``` +frontend/ +├── src/ +│ ├── types/ +│ │ └── api.ts # MODIFIED — extend TrainRequest, BacktestResponse, RunResponse, StaleAliasResponse, FeatureMetadataResponse to mirror Task 1's confirmed contract. ALL new fields are Optional. 
+│ ├── lib/ +│ │ ├── feature-frame-utils.ts # NEW — FeatureGroup enum mirror (defensive copy of backend), labelForGroup(group), safetyClassChipVariant(safety), isV2Available(features) +│ │ └── horizon-bucket-utils.ts # NEW — HORIZON_BUCKET_IDS, labelForBucket(id), sortBuckets(ids[]) +│ ├── hooks/ +│ │ └── use-runs.ts # MODIFIED — accept optional feature_frame_version filter param when backend supports it (gated by isV2Available) +│ ├── pages/ +│ │ ├── visualize/ +│ │ │ ├── forecast.tsx # MODIFIED — segmented family Tabs + model_type Select + feature_frame Select + conditional feature_groups toggle group +│ │ │ ├── backtest.tsx # MODIFIED — RMSE column + horizon-bucket metric table + baseline-vs-feature-aware comparison view +│ │ │ ├── planner.tsx # MODIFIED — method badge + known-future-input vs hypothetical pills +│ │ │ ├── batch.tsx # MODIFIED — 5 preset Select + multi-model multi-feature-pack matrix picker +│ │ │ └── demand.tsx # UNCHANGED in this PRP (separate scope) +│ │ ├── explorer/ +│ │ │ ├── run-detail.tsx # MODIFIED — Feature frame panel +│ │ │ └── run-compare.tsx # MODIFIED — Feature frame version row + Champion compatibility badge +│ │ └── ops.tsx # MODIFIED — feature_frame_version_mismatch chip + degrading-explainer + safer Promote AlertDialog +│ ├── components/ +│ │ ├── forecast-intelligence/ # NEW folder (cohesive feature surface) +│ │ │ ├── model-family-tabs.tsx # NEW — Tabs styled as segmented control; (family: ModelFamily, onChange) +│ │ │ ├── model-type-select.tsx # NEW — Select filtered by family; (family, value, onChange, availableModels: list from Task 1) +│ │ │ ├── feature-frame-select.tsx # NEW — Select V1 | V2; (value, onChange, isV2Available: bool, disabledReason?) 
+│ │ │ ├── feature-groups-toggle.tsx # NEW — multi-select Checkbox group of FeatureGroup; (value, onChange, availableGroups: list from Task 1) +│ │ │ ├── horizon-bucket-table.tsx # NEW — Table rendering bucketed_aggregate_metrics +│ │ │ ├── champion-compatibility-badge.tsx # NEW — Badge with tooltip explaining same grain / window / V rule +│ │ │ ├── feature-frame-panel.tsx # NEW — read-only summary of feature_frame_version + feature_groups + safety_classes (used in run-detail) +│ │ │ ├── promote-confirmation-dialog.tsx # NEW — AlertDialog with artifact-verify + WAPE-delta warning when worse-newer +│ │ │ ├── batch-preset-select.tsx # NEW — 5 hardcoded presets +│ │ │ └── batch-matrix-picker.tsx # NEW — multi-model × multi-feature-pack matrix +│ │ ├── charts/ +│ │ │ └── backtest-horizon-buckets-chart.tsx # NEW — sibling to backtest-folds-chart for per-bucket WAPE +│ │ └── explainability/ +│ │ └── feature-importance-panel.tsx # MODIFIED ONLY IF PRP-36 ships random_forest — add a 'tree (random_forest)' label branch +│ └── pages/__tests__/ # not used; tests are colocated next to source +└── (No new directories outside src/) +``` + +### Known Gotchas of our codebase & Library Quirks + +```typescript +// ───────────────────────────────────────────────────────────────────────── +// CRITICAL: This PRP MUST NOT pretend PRP-35 / PRP-36 landed. +// ───────────────────────────────────────────────────────────────────────── +// +// Task 1 (Contract Probe) is the gate. 
It runs against the live backend +// schemas AND the test fixtures and produces a structured report: +// - feature_frame_version: PRESENT | ABSENT +// - feature_groups: PRESENT | ABSENT +// - rmse: PRESENT | ABSENT +// - bucketed_aggregate_metrics: PRESENT | ABSENT +// - StaleReason.FEATURE_FRAME_VERSION_MISMATCH: PRESENT | ABSENT +// - random_forest model_type: PRESENT | ABSENT +// - weighted_moving_average / seasonal_average / trend_regression_baseline: PRESENT | ABSENT +// If ANY field is ABSENT, the dependent UI task is DEFERRED — implementer +// MUST NOT scaffold a placeholder. The corresponding feature flag in +// lib/feature-frame-utils.ts (e.g. isV2Available()) reflects this at +// runtime so the affected control renders disabled + with a tooltip. + +// ───────────────────────────────────────────────────────────────────────── +// Repo + framework gotchas (verified or anchored): +// ───────────────────────────────────────────────────────────────────────── + +// - shadcn CLI: pin 4.7.0 (memory `shadcn-cli-version-pin`). Run from +// frontend/, not repo root. Example: +// cd frontend && pnpm dlx shadcn@4.7.0 add tabs # NO `@latest` +// shadcn 5.x silently writes a stub pnpm-workspace.yaml and the +// component never lands. + +// - Radix imports: per-component only (memory `radix-ui-vs-per-component- +// imports`). shadcn 5.x writes `from 'radix-ui'` for new primitives; +// if that happens, find/replace to `@radix-ui/react-X` before committing. +// Grep guard for CI: +// grep -rn "from 'radix-ui'" frontend/src # MUST be empty +// grep -rn 'from "radix-ui"' frontend/src # MUST be empty + +// - Tabs as segmented control: shadcn has no SegmentedControl primitive. +// Use `<Tabs>` with a `variant=segmented` class composition. NEVER +// hand-roll a SegmentedControl component. + +// - Recharts on Tailwind 4: chart colour vars are `--chart-1` … `--chart-5` +// (already wired into the project's `index.css`).
New charts pull from +// these CSS variables via the shadcn chart wrapper, not from raw hex. + +// - TanStack Query key shape: dataKey for ('runs', filters) where +// `filters` is an OBJECT (not a JSON-stringified key). New filter fields +// land in the same object — invalidation by `['runs']` continues to +// match every nested filter. + +// - TanStack Table sorting + pagination: SERVER-DRIVEN (manualPagination, +// manualSorting). Local state stays in the page component; pass +// `pageCount` from the response. + +// - URL params: every new param goes through `parseEnumParam` against +// a frozen tuple of allowed values. For `feature_frame_version`, the +// tuple is `['1', '2'] as const` and parsed → 1 | 2 | undefined. + +// - Lazy-loaded routes: every new page-level component is loaded via +// `React.lazy(() => import('./...'))`. The PageLoader fallback is +// already wired in App.tsx; do NOT introduce a new fallback. + +// - ApiError detail: ALWAYS read `error.detail?.detail || error.detail?.title +// || error.message`. NEVER display the raw `.status` number to the user +// (per security-patterns.md — information disclosure via stack traces). + +// - Date inputs: backend wants 'yyyy-MM-dd'. date-fns `format(d, 'yyyy-MM-dd')`. +// NEVER `d.toISOString().slice(0, 10)` — TZ-sensitive. + +// - vitest + jsdom: no globals enabled in vitest.config.ts. Import `describe, +// it, expect, vi` from 'vitest' in EVERY test file. + +// - Testing async hooks: wrap with `renderHook(... , { wrapper: QueryClient +// Wrapper })` per the existing pattern in use-batches.test.ts. Provide +// a fresh QueryClient per test to avoid cache leakage. + +// - shadcn workflow rule enforcement: per `.claude/rules/shadcn-ui.md` +// §"Workflow", invoke the `shadcn` skill BEFORE adding any new component +// from the registry. The skill loads project context and the +// composition rules; the MCP tools (`mcp__shadcn__*`) handle discovery +// + install commands. 
Audit checklist comes from +// `mcp__shadcn__get_audit_checklist` AFTER install. + +// - Dogfood: per memory `playwright-dogfood-snap-chromium`, the +// `webapp-testing` skill is the path (or native Python Playwright with +// executable_path=/snap/bin/chromium). Playwright MCP fails on this +// host. Per memory `dogfood-stale-uvicorn-port-8123`, check `ps etime` +// on uvicorn before trusting :8123 — a previous-session process may +// serve stale code. + +// - Tailwind 4 vs 3: arbitrary values use new syntax (e.g. `bg-(--chart-1)` +// for CSS variable refs). Most code uses semantic tokens, so this is +// rarely an issue. + +// - StatusBadge variants: 'default' | 'success' | 'warning' | 'error' | +// 'info' | 'pending'. For "feature-aware" use 'info'; "baseline" use +// 'default'; "stale" use 'warning'; "degrading" use 'warning'; +// "stockout-constrained" use 'warning'; "best WAPE" use 'success'; +// "artifact verified" use 'success'; "verification failed" use 'error'. + +// - Tooltip: use the existing Tooltip component for every disabled +// control. Disabled-without-explanation is a UX regression. + +// - ConditionalRendering: the implementer's pattern for "render if backend +// has the field" is `feature_frame_version !== undefined`. NEVER +// `feature_frame_version === 1` (that would render the V1 chip on V1 runs +// but hide the chip on pre-PRP-35 runs — semantically different). +// Hide the entire Feature-frame panel only when the field is `undefined` +// AND `feature_groups === undefined`. +``` + +--- + +## Implementation Blueprint + +### Data models and structure (additive types) + +```typescript +// frontend/src/types/api.ts — additions (CONFIRM each in Task 1) + +export type FeatureFrameVersion = 1 | 2; + +// Defensive copy of PRP-35 FeatureGroup enum. Implementer MUST keep this +// in sync with app/shared/feature_frames/contract_v2.py:FeatureGroup — +// Task 1 verifies the values match. 
+export type FeatureGroup = + | 'target_history' + | 'rolling' + | 'trend' + | 'calendar' + | 'price_promo' + | 'inventory' + | 'lifecycle' + | 'replenishment' + | 'returns' + | 'exogenous_weather' + | 'exogenous_macro'; + +export const FEATURE_GROUP_VALUES = [ + 'target_history','rolling','trend','calendar','price_promo','inventory', + 'lifecycle','replenishment','returns','exogenous_weather','exogenous_macro', +] as const satisfies readonly FeatureGroup[]; + +export type FeatureSafetyClass = + | 'safe' + | 'conditionally_safe' + | 'unsafe_unless_supplied'; + +// Backend wire shape additions — ALL Optional, all read defensively. + +export interface TrainRequest { + // existing fields preserved … + feature_frame_version?: FeatureFrameVersion; // PRP-35 + feature_groups?: FeatureGroup[]; // PRP-35 (V2 only) +} + +export interface ModelRun { + // existing fields preserved … + feature_frame_version?: FeatureFrameVersion; // PRP-36 + feature_groups?: Partial<Record<FeatureGroup, string[]>>; // PRP-36 +} + +export interface FeatureMetadataResponse { + // existing fields preserved … + feature_frame_version?: FeatureFrameVersion; // PRP-35 + feature_groups?: Partial<Record<FeatureGroup, string[]>>; // PRP-35 + feature_safety_classes?: Record<string, FeatureSafetyClass>; // PRP-35 +} + +// BacktestResponse additions — additive sub-fields.
+export interface FoldResult { + // existing fields … + horizon_bucket_metrics?: Record<string, Record<string, number>>; // PRP-36 +} +export interface AggregateMetrics { + // existing mae/smape/wape/bias/stability … + rmse?: number; // PRP-36 +} +export interface ModelBacktestResult { + // existing aggregate_metrics, fold_results, … + bucketed_aggregate_metrics?: Record<string, Record<string, number>>; // PRP-36 +} + +// Ops additions +export type StaleReason = + | 'newer_success_run' + | 'artifact_not_verified' + | 'run_not_success' + | 'feature_frame_version_mismatch'; // PRP-36 (NEW value) + +export interface StaleAliasResponse { + // existing fields … + alias_feature_frame_version?: FeatureFrameVersion; // PRP-36 + comparable_run_feature_frame_version?: FeatureFrameVersion; // PRP-36 +} +``` + +### List of tasks to be completed (dependency-ordered) + +```yaml +Task 1 — CONTRACT PROBE (gates every other task): + - VERIFY which PRP-35 / PRP-36 fields are present in the live backend by: + a) Reading `app/features/forecasting/schemas.py` and confirming `TrainRequest.feature_frame_version` + `feature_groups` exist. + b) Reading `app/features/backtesting/schemas.py` and confirming `FoldResult.horizon_bucket_metrics`, `AggregateMetrics.rmse`, `ModelBacktestResult.bucketed_aggregate_metrics`. + c) Reading `app/features/registry/schemas.py` and confirming `RunResponse.feature_frame_version` + `feature_groups`. + d) Reading `app/features/ops/schemas.py` and confirming `StaleReason.FEATURE_FRAME_VERSION_MISMATCH`. + e) Reading `app/features/forecasting/models.py` factory branch list and capturing the SUPERSET of `model_type` values the backend dispatches. + - PRODUCE a Task 1 report (commit as `PRPs/ai_docs/contract-probe-report.md`) listing every probed field with PRESENT / ABSENT + the source file:line. + - FOR each ABSENT field, FLAG the dependent Task as DEFERRED in the PR description AND in the comment block at the top of the affected file. Implementer MUST NOT scaffold a placeholder for an ABSENT field.
+ - VERIFY also that: + - The `BacktestRequest.config` (model_config field) accepts the new model_type values from PRP-36 (read the discriminated union in forecasting/schemas.py). + - `forecast_enable_random_forest` setting (if added by PRP-36 Task 5) is exposed to the UI via `/config/ai` or remains a server-side-only gate (the latter is acceptable — the UI catches the 422 from the train route and renders the unsupported message). + - If PRP-35 surface (FeatureGroup, feature_frame_version on TrainRequest) is ABSENT: STOP. This PRP cannot execute. + - If PRP-36 surface is partially ABSENT: continue with the [gate:PRP-35]-tagged tasks only. + +Task 2 — CREATE frontend/src/lib/feature-frame-utils.ts: + - EXPORT type FeatureFrameVersion = 1 | 2. + - EXPORT type FeatureGroup + FEATURE_GROUP_VALUES (mirror of PRP-35 enum). Note: this is a DEFENSIVE COPY; runtime backend membership is the authoritative check via Task 1. + - EXPORT labelForGroup(group: FeatureGroup): string — UI-facing labels ("Target history", "Rolling means", "Yearly seasonality"…). Map captured from `docs/optional-features/10-baseforecaster-feature-contract.md` (PRP-35 V2 section). + - EXPORT safetyClassChipVariant(safety: FeatureSafetyClass): BadgeVariant — 'safe' → 'success', 'conditionally_safe' → 'warning', 'unsafe_unless_supplied' → 'error'. + - EXPORT isV2Available(featureMetadata: FeatureMetadataResponse | undefined): bool — returns true iff `featureMetadata?.feature_frame_version === 2 || (featureMetadata?.feature_groups && Object.keys(featureMetadata.feature_groups).length > 0)`. + - EXPORT defaultV2Groups(): FeatureGroup[] — the V2 default subset for the UI's "use defaults" affordance. Sources from PRP-35 DEFAULT_V2_GROUPS = (target_history, rolling, trend, calendar, price_promo, lifecycle). Hard-coded here; Task 1 verifies match. + - ADD test file feature-frame-utils.test.ts: every exported function on every branch. 
+ +Task 3 — CREATE frontend/src/lib/horizon-bucket-utils.ts: + - EXPORT HORIZON_BUCKET_IDS = ['h_1_7', 'h_8_14', 'h_15_28', 'h_29_plus'] as const. + - EXPORT labelForBucket(id) → 'Days 1-7' | 'Days 8-14' | 'Days 15-28' | 'Days 29+'. + - EXPORT sortBuckets(ids: string[]): string[] — stable order matching HORIZON_BUCKET_IDS, unknown bucket ids appended at the end. + - ADD test file: every label + sort + unknown handling. + +Task 4 — MODIFY frontend/src/types/api.ts: + - ADD the type extensions in the "Data models and structure" section above. EVERY new field is Optional. + - PRESERVE every existing exported type. + - ADD a JSDoc note on each new field citing the PRP that ships it (PRP-35 or PRP-36). + - DO NOT remove or rename any existing field. + +Task 5 — CREATE frontend/src/components/forecast-intelligence/model-family-tabs.tsx [gate:always]: + - INVOKE the `shadcn` skill first; CONFIRM components/ui/tabs.tsx is present. + - IMPLEMENT a Tabs-as-segmented-control component: + props: { family: ModelFamily; onChange: (f: ModelFamily) => void; disabled?: boolean } + values: 'baseline' | 'tree' | 'additive' (mirror frontend/src/components/common/model-family-badge.tsx variants). + visual: shadcn Tabs primitive with a `variant=segmented` Tailwind class composition (a thin rounded-md border + a sliding-bg active state). NO custom segmented-control file. + - ADD test: each value selects + emits onChange; disabled state blocks emission. + +Task 6 — CREATE frontend/src/components/forecast-intelligence/model-type-select.tsx [gate:always]: + - props: { family: ModelFamily; value: string; onChange: (modelType: string) => void; availableModels: string[]; disabled?: boolean }. + - When family changes, the Select options narrow to model_types whose ModelFamily matches (computed via a static map mirroring backend `_MODEL_FAMILY_MAP`). 
+ - Defensive: if `value` is incompatible with the new family, the parent component MUST reset value — but the component itself does NOT reset (avoid unexpected resets if the parent has its own logic). + - ADD test: family change narrows options; emits onChange on selection. + +Task 7 — CREATE frontend/src/components/forecast-intelligence/feature-frame-select.tsx [gate:PRP-35]: + - props: { value: FeatureFrameVersion; onChange: (v: FeatureFrameVersion) => void; isV2Available: boolean; v2DisabledReason?: string }. + - Renders shadcn Select with 'V1' and 'V2' options; V2 disabled when !isV2Available with a Tooltip rendering `v2DisabledReason` (default: "V2 unavailable — server has not shipped Forecast Intelligence A"). + - ADD test: when isV2Available=false, the V2 option is disabled AND a Tooltip renders; onChange respected for V1. + +Task 8 — CREATE frontend/src/components/forecast-intelligence/feature-groups-toggle.tsx [gate:PRP-35]: + - props: { value: FeatureGroup[]; onChange: (groups: FeatureGroup[]) => void; availableGroups: FeatureGroup[]; defaults: FeatureGroup[]; disabled?: boolean }. + - Renders a vertical Checkbox list (shadcn Checkbox component); a "Use defaults" button resets to `defaults`; an empty selection emits a 0-element array (the parent decides whether to send `undefined` instead of `[]` to the backend). + - Each group label uses labelForGroup; each row shows a safety-class chip if safety_classes is available (otherwise omitted). + - ADD test: toggle on/off; use-defaults; empty selection emits []; safety chip renders when supplied. + +Task 9 — CREATE frontend/src/components/forecast-intelligence/horizon-bucket-table.tsx [gate:PRP-36]: + - props: { bucketed: Record<string, Record<string, number>> | undefined; metric: 'mae' | 'smape' | 'wape' | 'bias' | 'rmse'; metricLabel?: string }. + - Renders a shadcn Table with one row per bucket (sorted via sortBuckets); columns = bucket id, bucket label, metric value (formatted to 2 decimals).
+ - Empty state: when `bucketed` is undefined or empty, renders "No horizon-bucket metrics available" inside the Card. + - ADD test: renders 4 buckets in order; empty state when undefined; unknown-bucket appended. + +Task 10 — CREATE frontend/src/components/forecast-intelligence/feature-frame-panel.tsx [gate:PRP-35]: + - props: { feature_frame_version?: FeatureFrameVersion; feature_groups?: Partial<Record<FeatureGroup, string[]>>; feature_safety_classes?: Record<string, FeatureSafetyClass>; isLoading?: boolean }. + - Renders a Card with: + - the version chip (V1 / V2 — version=1 uses 'default', version=2 uses 'info' variant). + - per-group list when V2 — each group name (label) + collapsed columns (shadcn Collapsible — Slice C already uses it in /admin). + - per-column safety-class chip when safety_classes is supplied. + - Empty state: when both fields are undefined → "Feature frame information not available (pre-PRP-35 run)." + - ADD test: each branch (V1 / V2 with groups / V2 with safety / empty). + +Task 11 — CREATE frontend/src/components/forecast-intelligence/champion-compatibility-badge.tsx [gate:PRP-36]: + - props: { runA: ModelRun; runB: ModelRun }. + - Computes compatibility: SAME (store_id, product_id) AND windows OVERLAP AND SAME feature_frame_version (treating undefined as 1). + - Renders a Badge variant=success ("Comparable") or variant=warning ("Not comparable — different feature frame version" OR "Not comparable — no data window overlap" OR "Not comparable — different grain"). + - Tooltip carries the precise reason. + - ADD test: every reason branch + the "comparable" success branch. + +Task 12 — CREATE frontend/src/components/forecast-intelligence/promote-confirmation-dialog.tsx [gate:always]: + - props: { open: boolean; onOpenChange: (open: boolean) => void; run: ModelRun; currentChampion?: ModelRun; onConfirm: () => Promise<void>; isPromoting: boolean }. + - Renders shadcn AlertDialog: + - Headline: "Promote run {run.run_id.slice(0,8)} to alias `production`?"
+ - If `currentChampion` exists AND `run.metrics.wape > currentChampion.metrics.wape`: a red callout "Latest WAPE is HIGHER than current champion (X% > Y%)" — confirmation requires checking a "I understand promoting a worse run" checkbox. + - If `run.artifact_hash` does not match a freshly-computed verify: a red "Artifact verification failed" callout (the verify call is the existing useArtifactVerify hook). + - If `currentChampion?.feature_frame_version !== run.feature_frame_version`: an amber callout "Feature frame version mismatch — promotion will silently change the contract this alias represents". + - "Promote" button is disabled until every warning is acknowledged. + - ADD tests: each branch (worse-WAPE requires checkbox; verify-fail blocks; V-mismatch requires acknowledge; clean promote auto-enables). + +Task 13 — CREATE frontend/src/components/forecast-intelligence/batch-preset-select.tsx [gate:always]: + - props: { value: PresetId; onChange: (preset: PresetId) => void }. + - Hardcoded presets: + - 'quick_baseline_sweep' → naive + seasonal_naive + moving_average + (if PRP-36) weighted_moving_average + seasonal_average. + - 'feature_aware_comparison' → regression + (gated) lightgbm + (gated) xgboost + prophet_like + (PRP-36) random_forest; feature_frame_version=2 + defaultV2Groups(). + - 'champion_challenger_refresh' → current champion model_type + the next best WAPE family. + - 'stockout_sensitive_products' → regression + V2 with `inventory` + `replenishment` + `returns` groups enabled. + - 'high_wape_recovery' → all available feature-aware models + V2 with defaults. + - The component emits the preset id; the parent (`pages/visualize/batch.tsx`) translates the preset into a `BatchSubmitRequest`. 
+ +Task 14 — CREATE frontend/src/components/forecast-intelligence/batch-matrix-picker.tsx [gate:always]: + - props: { availableModels: string[]; availableGroups: FeatureGroup[]; value: { model_type: string; feature_frame_version: FeatureFrameVersion; feature_groups: FeatureGroup[] }[]; onChange: (rows: …) => void; max_rows?: number }. + - Renders a Checkbox grid: one row per available model, one column per (frame version × group set). User toggles cells to build a list of (model_type, version, groups) tuples the batch will sweep. + - Cap at max_rows (default 24); render an error chip when exceeded. + - ADD tests: add/remove rows; respect cap. + +Task 15 — CREATE frontend/src/components/charts/backtest-horizon-buckets-chart.tsx [gate:PRP-36]: + - props: { bucketed: Record<string, Record<string, number>> | undefined; metric: 'mae' | 'smape' | 'wape' | 'bias' | 'rmse' }. + - Recharts ComposedChart (or BarChart): X = bucket label, Y = metric. Data built from `bucketed` via sortBuckets. + - Empty state matches the bucket-table empty state. + - ADD test: renders bars for each bucket; empty state when undefined. + +Task 16 — MODIFY frontend/src/pages/visualize/forecast.tsx: + - INSERT the new control row above the existing form: <ModelFamilyTabs /> <ModelTypeSelect /> <FeatureFrameSelect /> <FeatureGroupsToggle />. + - Wire each control to local React state; on submit, build a TrainRequest with the new optional fields ONLY when set (avoid sending `feature_frame_version: 1` explicitly — backend treats absent as V1). + - PRESERVE the existing horizon selector + showInterval + CSV export. + - PRESERVE URL-shareable state. + +Task 17 — MODIFY frontend/src/pages/visualize/backtest.tsx: + - INSERT <HorizonBucketTable /> + <BacktestHorizonBucketsChart /> beneath the existing results component when `main_model_results.bucketed_aggregate_metrics` is present. + - INSERT RMSE column in the existing metric-card row when `aggregate_metrics.rmse` is present. + - PRESERVE the existing baseline-vs-feature-aware comparison logic (or extend it: when `baseline_results` is non-empty, render the comparison view above the single-model view).
+ - PRESERVE URL-shareable state + the existing model_type Select (replaced by <ModelTypeSelect /> tied to <ModelFamilyTabs />). + +Task 18 — MODIFY frontend/src/pages/visualize/planner.tsx: + - INSERT a method Badge near the run-id picker: 'model_exogenous' (variant=info) or 'heuristic' (variant=warning) per `ScenarioComparison.method`. + - INSERT a known-future-input vs hypothetical Pill next to each assumption row. + - PRESERVE the multi-scenario chart + save/clone/delete flow. + +Task 19 — MODIFY frontend/src/pages/explorer/run-detail.tsx: + - INSERT <FeatureFramePanel /> beneath the existing run metadata card. + - When PRP-36 ships random_forest: ensure the existing FeatureImportancePanel renders the new 'tree (random_forest)' variant (it already supports `kind=tree`; verify in Task 1). + - PRESERVE the artifact verify section + the existing Explanation/FeatureImportance panels. + +Task 20 — MODIFY frontend/src/pages/explorer/run-compare.tsx: + - INSERT a "Feature frame version" row in the metrics-diff table when at least one of the runs has `feature_frame_version` defined. + - INSERT <ChampionCompatibilityBadge /> beneath the picker row. + - PRESERVE the DeltaCell sign-only behaviour. + +Task 21 — MODIFY frontend/src/pages/ops.tsx: + - INSERT the new `feature_frame_version_mismatch` chip handling in the stale-alias table — map the reason via the existing StaleReason switch. + - INSERT degrading-status explanation row beneath each ModelHealthEntry: latest_wape, previous_wape, wape_delta (color-coded), n_comparable_runs, last_trained_at, staleness_days. All these fields ALREADY exist on `ModelHealthEntry` (frontend/src/types/api.ts:830-843); this PRP just surfaces them. + - REPLACE the existing Promote affordance with <PromoteConfirmationDialog />. + - PRESERVE the OpsSummary + RetrainingCandidates table. + +Task 22 — MODIFY frontend/src/pages/visualize/batch.tsx: + - INSERT <BatchPresetSelect /> at the top of the form. + - INSERT <BatchMatrixPicker /> below the preset (the preset prefills the matrix). + - PRESERVE the PRP-34 max_parallel Slider and cancel AlertDialog.
+ - When user picks a preset, the matrix populates; user can still toggle cells manually. + +Task 23 — MODIFY frontend/src/hooks/use-runs.ts: + - EXTEND the useRuns query-key tuple to include `feature_frame_version` when supplied (additive; backwards-compat). + - When the backend does not support filtering by feature_frame_version (Task 1 ABSENT), the hook accepts the param locally but does NOT forward it to the API — to avoid a 422. + +Task 24 — UPDATE tests: + - feature-frame-utils.test.ts (Task 2). + - horizon-bucket-utils.test.ts (Task 3). + - model-family-tabs.test.tsx; model-type-select.test.tsx; feature-frame-select.test.tsx; feature-groups-toggle.test.tsx (Tasks 5-8). + - horizon-bucket-table.test.tsx (Task 9). + - feature-frame-panel.test.tsx (Task 10). + - champion-compatibility-badge.test.tsx (Task 11). + - promote-confirmation-dialog.test.tsx (Task 12). + - batch-preset-select.test.tsx; batch-matrix-picker.test.tsx (Tasks 13-14). + - backtest-horizon-buckets-chart.test.tsx (Task 15). + - UPDATE forecast.tsx.test? (page-level tests are rare in this repo — colocate component tests; page tests only when there's nontrivial conditional logic in the page itself). + - REGRESSION: confirm feature-importance-panel.test.tsx still green; explanation-panel.test.tsx unchanged; model-family-badge.test.tsx unchanged. + +Task 25 — DOC UPDATE: + - CREATE docs/user-guide/advanced-forecasting-guide.md — user-facing explanation of model families, feature frame V1 vs V2, feature packs, WAPE / RMSE / per-horizon-buckets, stale aliases, safer Promote affordance. Indexable by RAG. + - UPDATE docs/user-guide/dashboard-guide.md — reference the new affordances on each touched page. + - UPDATE docs/_base/API_CONTRACTS.md — only if the BACKEND response shape changed and PRP-36 missed the doc update. + +Task 26 — DOGFOOD (per memory `playwright-dogfood-snap-chromium`): + - Run `pnpm dev` (via `./node_modules/.bin/vite --host 0.0.0.0` per the WSL workaround in CLAUDE.local.md). 
+ - Use the `webapp-testing` skill to exercise the golden paths: + a) Train a V1 baseline → confirm the existing-fields path still works (no regressions). + b) Train a V2 feature-aware run (gated on PRP-35) → confirm feature-groups toggles are visible. + c) Backtest a feature-aware run → confirm horizon-bucket table renders. + d) Open a V2 run in /explorer/run-detail → confirm FeatureFramePanel renders. + e) Open /ops → confirm stale-alias mismatch chip renders if seeded. + f) Open /visualize/batch → confirm preset prefills the matrix. + - Capture screenshots; attach to the PR. + - CHECK `ps -ef | grep uvicorn` BEFORE asserting "it works" (per memory `dogfood-stale-uvicorn-port-8123`). +``` + +### Per task pseudocode (the load-bearing parts) + +```typescript +// Task 2 — feature-frame-utils.ts (key parts) +import type { FeatureMetadataResponse } from '@/types/api'; + +export type FeatureFrameVersion = 1 | 2; +export type FeatureGroup = + | 'target_history' | 'rolling' | 'trend' | 'calendar' | 'price_promo' + | 'inventory' | 'lifecycle' | 'replenishment' | 'returns' + | 'exogenous_weather' | 'exogenous_macro'; + +const LABELS: Record<FeatureGroup, string> = { + target_history: 'Target history (lags + same-DOW mean)', + rolling: 'Rolling means', + trend: 'Trend (30/90-day)', + calendar: 'Calendar (DOW, month, sin/cos)', + price_promo: 'Price + promotion', + inventory: 'Inventory + stockout', + lifecycle: 'Product lifecycle', + replenishment: 'Replenishment cadence', + returns: 'Returns intensity', + exogenous_weather: 'Weather signals', + exogenous_macro: 'Macro signals', +}; + +export function labelForGroup(group: FeatureGroup): string { + return LABELS[group]; +} + +export function isV2Available(meta: FeatureMetadataResponse | undefined): boolean { + if (!meta) return false; + if (meta.feature_frame_version === 2) return true; + if (meta.feature_groups && Object.keys(meta.feature_groups).length > 0) return true; + return false; +} + +export function defaultV2Groups(): FeatureGroup[] { +
return ['target_history','rolling','trend','calendar','price_promo','lifecycle']; +} + +export function safetyClassChipVariant(safety: 'safe' | 'conditionally_safe' | 'unsafe_unless_supplied') { + switch (safety) { + case 'safe': return 'success' as const; + case 'conditionally_safe': return 'warning' as const; + case 'unsafe_unless_supplied': return 'error' as const; + } +} + + +// Task 11 — champion-compatibility-badge.tsx (key parts) +function computeCompatibility(a: ModelRun, b: ModelRun): { ok: boolean; reason?: string } { + if (a.store_id !== b.store_id || a.product_id !== b.product_id) { + return { ok: false, reason: 'Different grain (store + product)' }; + } + const a_start = new Date(a.data_window_start).getTime(); + const a_end = new Date(a.data_window_end).getTime(); + const b_start = new Date(b.data_window_start).getTime(); + const b_end = new Date(b.data_window_end).getTime(); + if (a_end < b_start || b_end < a_start) { + return { ok: false, reason: 'No data-window overlap' }; + } + const va = a.feature_frame_version ?? 1; + const vb = b.feature_frame_version ?? 1; + if (va !== vb) { + return { ok: false, reason: `Different feature frame version (V${va} vs V${vb})` }; + } + return { ok: true }; +} + + +// Task 12 — promote-confirmation-dialog.tsx (key parts) +function PromoteConfirmationDialog({ open, onOpenChange, run, currentChampion, onConfirm, isPromoting }: Props) { + const [worseAcknowledged, setWorseAcknowledged] = useState(false); + const [versionMismatchAcknowledged, setVersionMismatchAcknowledged] = useState(false); + const { data: verify } = useArtifactVerify(run.run_id, open); // existing hook + + const worseWape = + currentChampion?.metrics?.wape != null && + run.metrics?.wape != null && + run.metrics.wape > currentChampion.metrics.wape; + + const verifyFailed = verify?.verified === false; + + const versionMismatch = + (currentChampion?.feature_frame_version ?? 1) !== (run.feature_frame_version ?? 
1); + + const canConfirm = + !verifyFailed && + (!worseWape || worseAcknowledged) && + (!versionMismatch || versionMismatchAcknowledged) && + !isPromoting; + + // … AlertDialog body renders each callout + checkbox … +} +``` + +### Integration Points + +```yaml +BACKEND: + - No backend changes. Every new UI field reads an EXISTING backend response field that PRP-35 / PRP-36 add. Slice C does NOT ship backend code. + - Task 1 (Contract Probe) is the only "backend" interaction; it's a read-only schema audit. + +FRONTEND ROUTES: + - No new routes. Top-nav unchanged. + +FRONTEND HOOKS: + - use-runs.ts: query-key tuple gets an optional `feature_frame_version` filter (passthrough when supported). + - All other hooks unchanged in shape; they consume new Optional fields. + +CONFIG: + - No new VITE_* env vars. No `.env.example` change in frontend/. + +TESTING: + - vitest config unchanged. New `*.test.{ts,tsx}` files colocated next to source. + +CHANGELOG: + - Under "Unreleased": `feat(ui): forecast intelligence C — operator workflow surfaces for V2 features + model zoo + per-horizon metrics (#)`. +``` + +--- + +## Validation Loop + +### Level 1: Frontend syntax + types + lint + +```bash +cd frontend +pnpm tsc --noEmit # strict TypeScript +pnpm lint # ESLint clean + +# shadcn import guards +grep -rn "from 'radix-ui'" src && echo "FAIL: barrel import found" && exit 1 +grep -rn 'from "radix-ui"' src && echo "FAIL: barrel import found" && exit 1 +echo "OK: per-component radix imports only" + +# Expected: zero errors. Fix every reported issue; do not silence via @ts-ignore. +``` + +### Level 2: Unit tests + +```bash +cd frontend +pnpm test --run + +# Expected: every new test green; every existing test still green. +# If a snapshot file exists, only update it when the change is deliberate. 
+``` + +### Level 3: Backend regression (sanity check — this PRP touches no backend code) + +```bash +# Run from repo root +uv run pytest -v -m "not integration" \ + app/features/forecasting/tests \ + app/features/backtesting/tests \ + app/features/registry/tests \ + app/features/ops/tests + +# Expected: unchanged from pre-PR baseline. If anything changes, you +# accidentally touched backend code — back it out. +``` + +### Level 4: Dogfood the running UI + +```bash +# WSL workaround per CLAUDE.local.md +cd frontend && ./node_modules/.bin/vite --host 0.0.0.0 + +# In a separate shell: +ps -ef | grep '[u]vicorn' # verify backend is the current-session process +curl -s http://localhost:8123/health # should print {"status":"ok"} + +# Use the webapp-testing skill to exercise (no manual flow in this PRP — +# the skill is the orchestration; capture screenshots for the PR). +``` + +--- + +## Final validation Checklist + +> **GATE FIRST:** Task 1 produced a written contract-probe report. Every +> task tagged `[gate:PRP-35]` or `[gate:PRP-36]` has been verified +> against the live backend OR explicitly deferred with a note pointing +> at the absent field. + +- [ ] Task 1 (Contract Probe) report committed under `PRPs/ai_docs/contract-probe-report.md`. +- [ ] Every Optional field added to `frontend/src/types/api.ts` corresponds to a present backend field per Task 1. +- [ ] `pnpm tsc --noEmit` clean. +- [ ] `pnpm lint` clean. +- [ ] `pnpm test --run` clean. +- [ ] No `from 'radix-ui'` barrel imports introduced. +- [ ] No hand-rolled `components/ui/*` file where the shadcn registry has an equivalent component. +- [ ] `shadcn@4.7.0` was used for every new shadcn install (memory `shadcn-cli-version-pin`). +- [ ] URL-shareable state preserved on every page that has it today. +- [ ] `/visualize/forecast`: family Tabs + model-type Select + feature-frame Select + conditional feature-groups Toggles render; submit produces a valid TrainRequest. 
+- [ ] `/visualize/backtest`: RMSE column appears when present; horizon-bucket table + chart render when present; baseline-vs-feature-aware comparison renders when both present; empty states cover every absent field.
+- [ ] `/visualize/planner`: method badge + known-future-input pills present.
+- [ ] `/visualize/batch`: 5 presets prefill the matrix; matrix-picker emits a valid BatchSubmitRequest.
+- [ ] `/explorer/run-detail`: Feature frame panel renders V1/V2 + groups + safety; empty-state for pre-PRP-35 runs.
+- [ ] `/explorer/run-compare`: Feature frame version row + ChampionCompatibilityBadge per the comparable-run rule.
+- [ ] `/ops`: feature_frame_version_mismatch chip handled; degrading-status fields surfaced; PromoteConfirmationDialog blocks worse-WAPE without acknowledgement, blocks verify-fail, requires V-mismatch acknowledgement.
+- [ ] Every conditional-rendering branch has a Vitest test (missing feature_frame_version, missing feature_groups, HGBR 422, random_forest tree-importance, stale-with-worse-WAPE, artifact-fail, V-mismatch).
+- [ ] No backend code touched in this PRP (`git diff app/` and `git diff alembic/` empty).
+- [ ] No new agent tool; `agent_require_approval` unchanged.
+- [ ] No new VITE_* env vars; no `.env.example` change.
+- [ ] Documentation (advanced-forecasting-guide.md) created and indexed; dashboard-guide.md updated.
+- [ ] Dogfood (Level 4) screenshots attached to the PR.
+- [ ] CHANGELOG entry under "Unreleased": `feat(ui): forecast intelligence C — operator workflow surfaces for V2 features + model zoo + per-horizon metrics (#)`.
+
+---
+
+## Unresolved Contract Assumptions (waiting on PRP-35 + PRP-36 execution)
+
+Each assumption is verified by Task 1 (Contract Probe). If verification
+fails for an item, the corresponding UI task is DEFERRED — implementer
+patches THIS PRP file to mark the task `DEFERRED — pending {field}` and
+proceeds with the rest.
+
+1. PRP-35 ships `TrainRequest.feature_frame_version: int = 1` and
+   `TrainRequest.feature_groups: list[str] | None`. UI Tasks 7 + 8 + 16
+   depend on this. ASSUMPTION: when V1, `feature_groups` is rejected
+   (422) by the backend per the post-patch wording in PRP-35.
+2. PRP-35 ships `FeatureGroup` enum with the exact 11 values listed in
+   `lib/feature-frame-utils.ts`. Task 1 verifies value-by-value.
+3. PRP-35 ships `FeatureMetadataResponse.feature_frame_version`,
+   `feature_groups`, `feature_safety_classes`. Tasks 10 + 19 depend.
+4. PRP-36 ships `BacktestResponse.main_model_results.aggregate_metrics.rmse`,
+   `bucketed_aggregate_metrics`, and `FoldResult.horizon_bucket_metrics`.
+   Tasks 9 + 15 + 17 depend.
+5. PRP-36 ships `StaleReason.FEATURE_FRAME_VERSION_MISMATCH` AND
+   `StaleAliasResponse.alias_feature_frame_version` +
+   `comparable_run_feature_frame_version`. Tasks 11 + 21 depend.
+6. PRP-36 ships `RunResponse.feature_frame_version` +
+   `feature_groups`. Tasks 10 + 19 + 20 depend.
+7. PRP-36 ships new model_type values (`weighted_moving_average`,
+   `seasonal_average`, optionally `trend_regression_baseline` and
+   `random_forest`). Task 6 + Task 13 depend. If a value is ABSENT, the
+   model-type Select hides it AND the corresponding preset omits it.
+8. PRP-36 keeps `FeatureImportanceUnavailableError` 422 path intact
+   for HGBR. Feature-importance-panel.tsx already handles this; this
+   PRP must NOT weaken it.
+9. Backend rejects `feature_groups` when `feature_frame_version=1`.
+   Slice C MUST NOT send `feature_groups: []` when V1 is selected —
+   send `feature_groups: undefined` (i.e. omit the field).
+10. `ScenarioComparison.method` is `'heuristic' | 'model_exogenous'`
+    (no other values). Task 18 depends. If a future PRP adds a third
+    method, this PRP's badge defaults to neutral.
+
+---
+
+## Anti-Patterns to Avoid
+
+- ❌ Don't render a value the backend did not return. The
+  Feature-frame panel's empty state is the contract for absent fields.
+- ❌ Don't bypass `.claude/rules/shadcn-ui.md`. Every shadcn component
+  arrives through `pnpm dlx shadcn@4.7.0 add …` from `frontend/`.
+  No raw GitHub fetches; no copying a published component into
+  `components/ui/*` manually.
+- ❌ Don't introduce `from 'radix-ui'` (the barrel shadcn 5.x writes).
+  Per-component `@radix-ui/react-X` only.
+- ❌ Don't add `permutation_importance` calls in the UI — that's a
+  separate PRP (the existing 422 path is the operator-facing contract).
+- ❌ Don't fake a feature_frame_version on a run that doesn't carry
+  one — render the empty state.
+- ❌ Don't downgrade the FeatureImportancePanel's existing 422
+  (HGBR-unavailable) UX. The "use lightgbm/xgboost for native
+  importances" message is the contract; preserve it.
+- ❌ Don't send `feature_groups: []` to the backend when V1 is selected.
+  Omit the field entirely.
+- ❌ Don't introduce a new agent tool — `agent_require_approval`
+  unchanged. The "Use this context" copy buttons are pure DOM/
+  clipboard-API; they do NOT call the agent layer.
+- ❌ Don't compare runs across `feature_frame_version` in the
+  ChampionCompatibilityBadge — incompatibility is the explicit signal.
+- ❌ Don't widen the agent layer / backend / Alembic; this PRP touches
+  the frontend only.
+- ❌ Don't promote a worse run without explicit checkbox acknowledgement
+  in the PromoteConfirmationDialog.
+- ❌ Don't introduce a SegmentedControl component — Tabs styled as
+  segmented is the project pattern.
+- ❌ Don't trust `:8123` without checking `ps -ef | grep uvicorn` first
+  (memory `dogfood-stale-uvicorn-port-8123`).
+
+---
+
+## Confidence
+
+**Confidence: 6.5/10** for one-pass implementation success after PRP-35
++ PRP-36 land.
+
+What grounds the 6.5:
+- The frontend codebase research is anchored at file:line for every
+  hook + page + component + chart this PRP touches.
+- Every new component is colocated under `frontend/src/components/
+  forecast-intelligence/` so the review surface is cohesive.
+- The shadcn workflow is explicitly invoked (skill + MCP + 4.7.0 pin).
+- Every conditional-rendering branch has a Vitest test path called out.
+- The "do not fabricate backend values" rule has a single enforcement
+  point (Task 1's contract-probe report), and every dependent task is
+  tagged with its gate.
+
+What costs the 3.5 points:
+- **Two prior PRPs have not landed yet.** Even with Task 1, the UI
+  surface area is wide; a late field-name change in PRP-35 or PRP-36
+  rippling into the type extensions can require multiple cross-file
+  edits. Mitigation: every new field is Optional and read defensively.
+- Dogfood depends on a live backend with V2-aware runs seeded. The
+  current dev DB has 49 model_runs + 12 aliases (per HANDOFF.md), but
+  none are V2 — PRP-35's execution session needs to create at least one
+  V2 SUCCESS run before Slice C's dogfood is meaningful.
+- shadcn 5.x has known regressions (memories `shadcn-cli-version-pin`,
+  `radix-ui-vs-per-component-imports`); the 4.7.0 pin must hold
+  through this PRP's life. CI does not gate shadcn version drift today;
+  the implementer enforces it manually.
+- Recharts 2.x + Tailwind 4 + React 19 is a fresh combination — the
+  existing charts work, but new charts may surface visual regressions
+  on small terminals. Dogfood at 1024px and 1440px both.
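
The mitigation named above ("every new field is Optional and read defensively"), together with assumption 9 (omit `feature_groups` when V1 is selected), could be sketched in TypeScript roughly as follows. This is a minimal sketch under stated assumptions, not the PRP's implementation: only the field names `feature_frame_version` and `feature_groups` come from this document. `RunLike`, `FeatureFrameView`, `toFeatureFrameView`, `TrainRequestBody`, and `buildTrainRequestBody` are hypothetical names, and the real response and request shapes stay unverified until Task 1's contract probe has run.

```typescript
// Hypothetical shapes: only the two field names come from this PRP.
// Everything else here is illustrative and unverified until Task 1 runs.
interface RunLike {
  feature_frame_version?: number | null;
  feature_groups?: string[] | null;
}

// What a Feature-frame panel could render: an explicit empty state,
// or the values the backend actually returned.
type FeatureFrameView =
  | { kind: "empty" }
  | { kind: "frame"; version: number; groups: string[] };

// Read optional fields defensively. A run without a version maps to the
// empty state; no default version is fabricated for it.
function toFeatureFrameView(run: RunLike): FeatureFrameView {
  if (run.feature_frame_version == null) {
    return { kind: "empty" };
  }
  return {
    kind: "frame",
    version: run.feature_frame_version,
    groups: run.feature_groups ?? [],
  };
}

// Hypothetical request body; the real TrainRequest has more fields.
interface TrainRequestBody {
  feature_frame_version: number;
  feature_groups?: string[];
}

// For V1 the key is left off the object entirely (not set to []),
// so JSON.stringify never emits `feature_groups` for a V1 request.
// Sending the selected groups for other versions is an assumption.
function buildTrainRequestBody(
  version: number,
  selectedGroups: string[],
): TrainRequestBody {
  const body: TrainRequestBody = { feature_frame_version: version };
  if (version !== 1) {
    body.feature_groups = selectedGroups;
  }
  return body;
}
```

Leaving the key off, rather than assigning `undefined` or `[]`, keeps the serialized V1 payload free of `feature_groups`, which is the behaviour assumption 9 asks for. The discriminated `kind` field makes the empty-state branch explicit, which should make it straightforward to cover in a Vitest case.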