Commit 2091f2f

Merge pull request #251 from w7-mgfcode/feat/forecasting-xgboost-model
feat(forecast): add XGBoost feature-aware forecasting model (#247)
2 parents 82c457e + ca4dd4b commit 2091f2f

25 files changed

Lines changed: 1790 additions & 11 deletions

PRPs/INITIAL/INITIAL-MLZOO-C-xgboost-prophet-extensions.md

Lines changed: 15 additions & 0 deletions
@@ -1,5 +1,20 @@
 # INITIAL-MLZOO-C-xgboost-prophet-extensions.md - XGBoost and Prophet-like Extensions

+> **This brief is split into TWO PRPs — two branches, two review units. Never one.**
+> This INITIAL is the shared brief for both, but the two models are delivered separately:
+>
+> - **`PRPs/PRP-MLZOO-C1-xgboost-model.md`** — the XGBoost half. A low-risk follow-up that
+> mirrors the merged `LightGBMForecaster` design (optional `ml-xgboost` extra, feature
+> flag, lazy import, deterministic training, registry metadata).
+> - **`PRPs/PRP-MLZOO-C2-prophet-like-additive-model.md`** — the Prophet-like half. A
+> distinct model-family design task — a pure-scikit-learn additive linear model with
+> trend / seasonality / holiday-regressor decomposition; **not** a clone of the tree
+> models and **not** the real `prophet` dependency.
+>
+> Do not combine the two models into a single PRP or a single branch. The "Out of scope"
+> lists below still apply to *each* PRP individually (e.g. C1 does not touch Prophet-like
+> work; C2 does not touch XGBoost). See `INITIAL-MLZOO-index.md` for the updated roadmap.
+
 ## FEATURE:

 Extend the Advanced ML Model Zoo after the feature-frame foundation and first advanced model path are stable.

PRPs/INITIAL/INITIAL-MLZOO-index.md

Lines changed: 13 additions & 3 deletions
@@ -18,16 +18,24 @@ Recommended PRP sequence:
 | 1 | `INITIAL-MLZOO-A-foundation-feature-frames.md` | PRP-29 | Feature-aware forecasting foundation and leakage-safe frame contracts |
 | 2 | `INITIAL-MLZOO-B-lightgbm-first-model.md` | PRP-30 | First advanced model path with LightGBM (optional `ml-lightgbm` extra) |
 | 2.5 | `INITIAL-MLZOO-B.2-feature-aware-backtesting.md` | PRP-MLZOO-B.2 | Wire feature-aware models into the backtesting fold loop (per-fold leakage-safe `X_train` / `X_future`) |
-| 3 | `INITIAL-MLZOO-C-xgboost-prophet-extensions.md` | Future PRP | XGBoost and Prophet-like extensions |
+| 3a | `INITIAL-MLZOO-C-xgboost-prophet-extensions.md` (XGBoost half) | PRP-MLZOO-C1 | XGBoost feature-aware model — a low-risk follow-up mirroring the merged LightGBM design (optional `ml-xgboost` extra) |
+| 3b | `INITIAL-MLZOO-C-xgboost-prophet-extensions.md` (Prophet-like half) | PRP-MLZOO-C2 | Prophet-like additive model — a distinct model-family design (pure scikit-learn; trend / seasonality / regressor decomposition) |
 | 4 | `INITIAL-MLZOO-D-frontend-registry-explainability.md` | Future PRP | UI, registry surfacing, and explanation polish |

+**C is two PRPs, not one.** `INITIAL-MLZOO-C` briefs both XGBoost and a Prophet-like model,
+but they are deliberately split into **two separate PRPs, branches, and review units**:
+`PRP-MLZOO-C1` (XGBoost) and `PRP-MLZOO-C2` (Prophet-like). They are additive and
+order-independent; whichever merges second rebases cleanly. Do **not** combine them into a
+single branch or a single review unit (this honours the "one reviewable unit" rule below).
+
 Dependency graph:

 ```text
 A. Foundation feature frames
 -> B. LightGBM first model
 -> B.2 Feature-aware backtesting
--> C. XGBoost / Prophet-like extensions
+-> C1. XGBoost model (separate review unit)
+-> C2. Prophet-like model (separate review unit; parallel to C1)
 -> D. Frontend / registry / explainability
 ```

@@ -74,5 +82,7 @@ Read these before creating any MLZOO PRP:
 - Do not implement LightGBM before the feature-frame contracts and leakage tests are stable.
 - Do not implement XGBoost or Prophet-like models before the first advanced model path proves the architecture.
 - Do not add frontend/explainability scope before backend metadata and persistence contracts are stable.
-- Keep each PRP to one branch and one reviewable unit.
+- Keep each PRP to one branch and one reviewable unit. In particular, `INITIAL-MLZOO-C`'s
+two models (XGBoost, Prophet-like) are **two PRPs** (`PRP-MLZOO-C1` and `PRP-MLZOO-C2`),
+never one combined branch.

PRPs/PRP-MLZOO-C1-xgboost-model.md

Lines changed: 979 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 4 additions & 2 deletions
@@ -47,8 +47,9 @@ docker-compose up -d
 ```bash
 uv sync --extra dev
 # or: pip install -e ".[dev]"
-# LightGBM is an opt-in advanced model — add the extra to enable it:
+# LightGBM and XGBoost are opt-in advanced models — add the extra to enable each:
 # uv sync --extra dev --extra ml-lightgbm (then set forecast_enable_lightgbm=true)
+# uv sync --extra dev --extra ml-xgboost (then set forecast_enable_xgboost=true)
 ```

 4. **Run database migrations**
@@ -342,6 +343,7 @@ curl -X POST http://localhost:8123/forecasting/predict \
 - `moving_average` - Mean of last N observations
 - `regression` - Gradient-boosted exogenous-feature regressor (feature-aware)
 - `lightgbm` - LightGBM feature-aware regressor — opt-in: install the `ml-lightgbm` extra and set `forecast_enable_lightgbm=True`
+- `xgboost` - XGBoost feature-aware regressor — opt-in: install the `ml-xgboost` extra and set `forecast_enable_xgboost=True`

 See [examples/models/](examples/models/) for baseline model examples.

@@ -394,7 +396,7 @@ curl -X POST http://localhost:8123/backtesting/run \
 When `include_baselines=true`, automatically compares against naive and seasonal_naive models.

 **Feature-Aware Models:**
-`regression` and `lightgbm` models can be backtested too — set
+`regression`, `lightgbm`, and `xgboost` models can be backtested too — set
 `model_config_main.model_type` accordingly. Each fold builds a leakage-safe
 per-fold feature matrix (`min_train_size >= 30` required); the result carries
 `feature_aware: true` and `exogenous_policy: "observed"`.

app/core/config.py

Lines changed: 1 addition & 0 deletions
@@ -99,6 +99,7 @@ class Settings(BaseSettings):
     forecast_max_horizon: int = 90
     forecast_model_artifacts_dir: str = "./artifacts/models"
     forecast_enable_lightgbm: bool = False
+    forecast_enable_xgboost: bool = False

     # Backtesting
     backtest_max_splits: int = 20

app/features/backtesting/tests/test_feature_aware_backtest.py

Lines changed: 43 additions & 1 deletion
@@ -25,7 +25,11 @@
     SeriesData,
 )
 from app.features.backtesting.splitter import TimeSeriesSplitter
-from app.features.forecasting.schemas import NaiveModelConfig, RegressionModelConfig
+from app.features.forecasting.schemas import (
+    NaiveModelConfig,
+    RegressionModelConfig,
+    XGBoostModelConfig,
+)
 from app.shared.feature_frames import canonical_feature_columns

 _N_FEATURES = len(canonical_feature_columns())  # 14 — 4 lags + 6 calendar + 4 exogenous
@@ -135,6 +139,44 @@ def test_feature_aware_backtest_produces_per_fold_metrics(
         assert "mae" in fold.metrics


+def test_feature_aware_backtest_runs_with_xgboost_model(
+    sample_dates_120: list[date],
+    sample_values_120: np.ndarray,
+    sample_split_config_expanding: SplitConfig,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """An XGBoost backtest runs end-to-end and yields per-fold metrics.
+
+    Mirrors ``test_feature_aware_backtest_produces_per_fold_metrics`` for the
+    XGBoost feature-aware model (PRP-MLZOO-C1) — proving the B.2
+    ``requires_features`` probe needs no per-model backtesting-service wiring.
+    SKIPs when the optional ``ml-xgboost`` dependency is absent; the
+    ``forecast_enable_xgboost`` flag is enabled so ``model_factory`` dispatches.
+    """
+    pytest.importorskip("xgboost")
+    from app.core.config import get_settings
+
+    monkeypatch.setattr(get_settings(), "forecast_enable_xgboost", True)
+
+    service = BacktestingService()
+    series = _series(sample_dates_120, sample_values_120, with_exogenous=True)
+    splitter = TimeSeriesSplitter(sample_split_config_expanding)
+
+    result = service._run_model_backtest(
+        series_data=series,
+        splitter=splitter,
+        model_config=XGBoostModelConfig(),
+        store_fold_details=True,
+    )
+
+    assert result.model_type == "xgboost"
+    assert result.feature_aware is True
+    assert len(result.fold_results) > 0
+    assert "mae" in result.aggregated_metrics
+    for fold in result.fold_results:
+        assert "mae" in fold.metrics
+
+
 def test_feature_aware_result_records_observed_policy(
     sample_dates_120: list[date],
     sample_values_120: np.ndarray,
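
Review note on the skip behaviour: the new test relies on `pytest.importorskip("xgboost")` to skip when the optional dependency is absent. For readers unfamiliar with that pattern, here is a rough stdlib-only analogue using `unittest.skipUnless` and `importlib.util.find_spec`. It is a different mechanism from the one in the commit, and `is_installed` and `OptionalBackendTest` are hypothetical names introduced only for illustration.

```python
import importlib.util
import unittest


def is_installed(module_name: str) -> bool:
    """True when the top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


class OptionalBackendTest(unittest.TestCase):
    # Hypothetical test case: skipped (not failed) when xgboost is missing.
    @unittest.skipUnless(is_installed("xgboost"), "optional xgboost dependency is absent")
    def test_backend_imports(self) -> None:
        import xgboost  # only reached when the guard above passed

        self.assertTrue(hasattr(xgboost, "XGBRegressor"))


# Run the single test programmatically and collect the outcome.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(OptionalBackendTest)
result = unittest.TestResult()
suite.run(result)
```

Either way the run counts as successful: a skip is recorded separately from failures and errors, so a baseline-only install can still pass the suite.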

app/features/forecasting/models.py

Lines changed: 174 additions & 1 deletion
@@ -732,8 +732,166 @@ def set_params(self, **params: Any) -> LightGBMForecaster:  # noqa: ANN401
         return self


+class XGBoostForecaster(BaseForecaster):
+    """Feature-aware forecaster wrapping ``xgboost.XGBRegressor``.
+
+    The second ADVANCED feature-aware tree model (MLZOO-C1). Structurally a
+    twin of ``LightGBMForecaster``: it REQUIRES a non-``None`` exogenous ``X``
+    for both ``fit`` and ``predict``; the estimator is gradient-boosted trees
+    from the optional ``xgboost`` package.
+
+    ``xgboost`` is imported LAZILY inside ``fit`` — never at module scope and
+    never in ``__init__`` — so importing this module (which every forecasting
+    code path does, baseline models included) never requires the optional
+    ``ml-xgboost`` dependency.
+
+    Determinism: ``XGBRegressor`` has no ``deterministic`` switch (unlike
+    LightGBM). Bit-reproducibility comes from ``n_jobs=1`` + ``tree_method="hist"``
+    + a fixed ``random_state`` + the conservative config leaving ``subsample`` /
+    ``colsample_bytree`` at their ``1.0`` defaults (no stochastic sampling) —
+    all pinned in ``fit``. XGBoost tolerates ``NaN`` natively (``missing=np.nan``),
+    which matters because the future feature frame leaves lag cells ``NaN``
+    when their source target lies in the un-observed horizon.
+
+    Attributes:
+        n_estimators: Number of boosting rounds.
+        learning_rate: Gradient-boosting learning rate.
+        max_depth: Maximum depth of each tree.
+    """
+
+    requires_features: ClassVar[bool] = True
+    """A feature-aware model — ``fit``/``predict`` REQUIRE a non-None ``X``."""
+
+    def __init__(
+        self,
+        *,
+        n_estimators: int = 100,
+        learning_rate: float = 0.1,
+        max_depth: int = 6,
+        random_state: int = 42,
+    ) -> None:
+        """Initialize the XGBoost forecaster.
+
+        Args:
+            n_estimators: Number of boosting rounds.
+            learning_rate: Gradient-boosting learning rate.
+            max_depth: Maximum depth of each tree.
+            random_state: Random seed for reproducibility (determinism).
+        """
+        super().__init__(random_state)
+        self.n_estimators = n_estimators
+        self.learning_rate = learning_rate
+        self.max_depth = max_depth
+        self._estimator: Any = None
+
+    def fit(
+        self,
+        y: np.ndarray[Any, np.dtype[np.floating[Any]]],
+        X: np.ndarray[Any, np.dtype[np.floating[Any]]] | None = None,
+    ) -> XGBoostForecaster:
+        """Fit the gradient-boosted regressor on historical features.
+
+        Args:
+            y: Target values (1D array of shape ``[n_samples]``).
+            X: Exogenous features (2D array of shape ``[n_samples, n_features]``).
+                REQUIRED — unlike the baseline forecasters.
+
+        Returns:
+            self (for method chaining).
+
+        Raises:
+            ValueError: If ``X`` is ``None``, ``y`` is empty, or the row counts
+                of ``X`` and ``y`` do not match.
+        """
+        if X is None:
+            raise ValueError("XGBoostForecaster requires exogenous features X for fit()")
+        if len(y) == 0:
+            raise ValueError("Cannot fit on empty array")
+        if X.shape[0] != len(y):
+            raise ValueError(
+                f"X has {X.shape[0]} rows but y has {len(y)} — feature/target rows must match"
+            )
+        # LAZY import — the optional ``ml-xgboost`` dependency is only needed
+        # the first time an XGBoost model is actually fitted.
+        import xgboost as xgb
+
+        estimator: Any = xgb.XGBRegressor(
+            n_estimators=self.n_estimators,
+            learning_rate=self.learning_rate,
+            max_depth=self.max_depth,
+            random_state=self.random_state,
+            n_jobs=1,  # single-threaded — removes float-summation non-determinism
+            tree_method="hist",  # explicit; the default, and the reproducible path
+            verbosity=0,  # silence XGBoost's training chatter
+        )
+        estimator.fit(X, y)
+        self._estimator = estimator
+        self._last_values = np.asarray(y[-1:], dtype=np.float64)
+        self._is_fitted = True
+        return self
+
+    def predict(
+        self,
+        horizon: int,
+        X: np.ndarray[Any, np.dtype[np.floating[Any]]] | None = None,
+    ) -> np.ndarray[Any, np.dtype[np.floating[Any]]]:
+        """Generate forecasts from a future feature frame.
+
+        Args:
+            horizon: Number of steps to forecast.
+            X: Exogenous features for the forecast period, shape
+                ``[horizon, n_features]``. REQUIRED.
+
+        Returns:
+            Array of forecasts with shape ``[horizon]``.
+
+        Raises:
+            RuntimeError: If the model has not been fitted.
+            ValueError: If ``X`` is ``None`` or its row count is not ``horizon``.
+        """
+        if not self._is_fitted or self._estimator is None:
+            raise RuntimeError("Model must be fitted before predict")
+        if X is None:
+            raise ValueError("XGBoostForecaster requires exogenous features X for predict()")
+        if X.shape[0] != horizon:
+            raise ValueError(f"X has {X.shape[0]} rows but horizon is {horizon} — they must match")
+        predictions = self._estimator.predict(X)
+        result: np.ndarray[Any, np.dtype[np.floating[Any]]] = np.asarray(
+            predictions, dtype=np.float64
+        )
+        return result
+
+    def get_params(self) -> dict[str, Any]:
+        """Get model parameters.
+
+        Returns:
+            Dictionary with n_estimators, learning_rate, max_depth, random_state.
+        """
+        return {
+            "n_estimators": self.n_estimators,
+            "learning_rate": self.learning_rate,
+            "max_depth": self.max_depth,
+            "random_state": self.random_state,
+        }
+
+    def set_params(self, **params: Any) -> XGBoostForecaster:  # noqa: ANN401
+        """Set model parameters.
+
+        Args:
+            **params: Parameter names and values to set.
+
+        Returns:
+            self (for method chaining).
+        """
+        for key, value in params.items():
+            setattr(self, key, value)
+        return self
+
+
 # Type alias for model type literals
-ModelType = Literal["naive", "seasonal_naive", "moving_average", "lightgbm", "regression"]
+ModelType = Literal[
+    "naive", "seasonal_naive", "moving_average", "xgboost", "lightgbm", "regression"
+]


 def model_factory(config: ModelConfig, random_state: int = 42) -> BaseForecaster:
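
Review note on the lazy import: the `XGBoostForecaster` docstring states that `xgboost` is imported inside `fit`, never at module scope or in `__init__`, so that importing the module does not need the optional dependency. A minimal sketch of that pattern follows. `LazyBackendForecaster` is a hypothetical class, not part of this commit, and the stdlib `json` module stands in for an installed backend so the sketch runs anywhere.

```python
import importlib
from typing import Any


class LazyBackendForecaster:
    """Hypothetical stand-in for a forecaster with an optional backend."""

    def __init__(self, backend_module: str = "xgboost") -> None:
        # Only the module *name* is stored; nothing is imported here.
        self.backend_module = backend_module
        self._backend: Any = None

    def fit(self) -> "LazyBackendForecaster":
        # The optional dependency is resolved on first use only.
        self._backend = importlib.import_module(self.backend_module)
        return self


# Construction succeeds even when the backend is not installed.
missing = LazyBackendForecaster("backend_that_is_not_installed")
try:
    missing.fit()
    fit_failed = False
except ImportError:
    fit_failed = True

# A stdlib module stands in for an installed backend.
present = LazyBackendForecaster("json").fit()
```

The cost of the missing dependency is deferred to the first `fit` call, which is where a user who asked for the model can act on the error.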
@@ -790,6 +948,21 @@ def model_factory(config: ModelConfig, random_state: int = 42) -> BaseForecaster
                 random_state=random_state,
             )
         raise ValueError("Invalid config type for lightgbm")
+    elif model_type == "xgboost":
+        if not settings.forecast_enable_xgboost:
+            raise ValueError(
+                "XGBoost is not enabled. Set forecast_enable_xgboost=True in settings."
+            )
+        from app.features.forecasting.schemas import XGBoostModelConfig
+
+        if isinstance(config, XGBoostModelConfig):
+            return XGBoostForecaster(
+                n_estimators=config.n_estimators,
+                learning_rate=config.learning_rate,
+                max_depth=config.max_depth,
+                random_state=random_state,
+            )
+        raise ValueError("Invalid config type for xgboost")
     elif model_type == "regression":
         from app.features.forecasting.schemas import RegressionModelConfig
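
Review note on the feature flag: the `xgboost` branch of `model_factory` checks `settings.forecast_enable_xgboost` before anything else, so a disabled flag fails fast with a `ValueError` even if the package is installed. The gate can be sketched in isolation as below. `DemoSettings` and `build_model` are hypothetical stand-ins for the real `Settings` class and factory, and the returned string is only a label for the model that would be built.

```python
from dataclasses import dataclass


@dataclass
class DemoSettings:
    """Hypothetical stand-in for the application's Settings object."""

    forecast_enable_xgboost: bool = False


def build_model(model_type: str, settings: DemoSettings) -> str:
    """Return a label for the model that would be built, or raise if gated off."""
    if model_type == "xgboost":
        # The flag is checked first, before any optional import is attempted.
        if not settings.forecast_enable_xgboost:
            raise ValueError(
                "XGBoost is not enabled. Set forecast_enable_xgboost=True in settings."
            )
        return "XGBoostForecaster"
    raise ValueError(f"Unknown model type: {model_type}")


# Default settings keep the model disabled.
try:
    build_model("xgboost", DemoSettings())
    gated = False
except ValueError:
    gated = True

# Opting in lets the factory dispatch.
enabled_label = build_model("xgboost", DemoSettings(forecast_enable_xgboost=True))
```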

app/features/forecasting/persistence.py

Lines changed: 27 additions & 0 deletions
@@ -42,6 +42,8 @@ class ModelBundle:
         sklearn_version: Scikit-learn version used when saving.
         lightgbm_version: LightGBM version used when saving, ``None`` when the
             optional ``ml-lightgbm`` dependency was not installed.
+        xgboost_version: XGBoost version used when saving, ``None`` when the
+            optional ``ml-xgboost`` dependency was not installed.
         bundle_hash: Deterministic hash of bundle contents.
     """

@@ -54,6 +56,7 @@ class ModelBundle:
     python_version: str | None = None
     sklearn_version: str | None = None
     lightgbm_version: str | None = None
+    xgboost_version: str | None = None
     bundle_hash: str | None = None

     def compute_hash(self) -> str:
@@ -106,6 +109,14 @@ def save_model_bundle(bundle: ModelBundle, path: str | Path) -> Path:
         bundle.lightgbm_version = str(lightgbm.__version__)
     except ImportError:
         bundle.lightgbm_version = None
+    # Best-effort: XGBoost is an optional dependency, so a baseline-only
+    # install legitimately has no version to record.
+    try:
+        import xgboost
+
+        bundle.xgboost_version = str(xgboost.__version__)
+    except ImportError:
+        bundle.xgboost_version = None
     bundle.bundle_hash = bundle.compute_hash()

     # Save with compression
@@ -198,6 +209,22 @@ def load_model_bundle(path: str | Path, base_dir: str | Path | None = None) -> M
                 current_lightgbm=current_lightgbm,
             )

+    # XGBoost is optional — only warn when the bundle recorded a version AND
+    # the optional dependency is importable here AND the two differ.
+    if bundle.xgboost_version:
+        try:
+            import xgboost
+
+            current_xgboost: str | None = str(xgboost.__version__)
+        except ImportError:
+            current_xgboost = None
+        if current_xgboost is not None and bundle.xgboost_version != current_xgboost:
+            logger.warning(
+                "forecasting.xgboost_version_mismatch",
+                saved_xgboost=bundle.xgboost_version,
+                current_xgboost=current_xgboost,
+            )
+
     logger.info(
         "forecasting.model_bundle_loaded",
         path=str(path),
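
Review note on the version bookkeeping: the save path records the XGBoost version on a best-effort basis, and the load path warns only when a version was saved, the package is importable now, and the two differ. The three-way condition can be sketched as two small helpers. `optional_version` and `should_warn` are hypothetical names, the version strings in the usage lines are made up, and the `getattr` fallback is an addition of this sketch (the commit reads `__version__` directly).

```python
import importlib
from typing import Optional


def optional_version(module_name: str) -> Optional[str]:
    """Return the module's __version__ as a string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Assumption of this sketch: tolerate modules that expose no __version__.
    version = getattr(module, "__version__", None)
    return None if version is None else str(version)


def should_warn(saved: Optional[str], current: Optional[str]) -> bool:
    """Warn only when a version was saved, one is available now, and they differ."""
    return bool(saved) and current is not None and saved != current


# Made-up version strings, for illustration only.
mismatch = should_warn("2.0.3", "2.1.0")
no_saved_version = should_warn(None, "2.1.0")
not_installed_now = should_warn("2.0.3", None)
```

Staying silent when the package is not importable keeps baseline-only installs free of spurious warnings when they load a bundle that was saved elsewhere.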
