diff --git a/docs/assets/bayesian_calibration.png b/docs/assets/bayesian_calibration.png new file mode 100644 index 0000000..1601f98 Binary files /dev/null and b/docs/assets/bayesian_calibration.png differ diff --git a/docs/assets/bayesian_crps.png b/docs/assets/bayesian_crps.png new file mode 100644 index 0000000..3bb02a4 Binary files /dev/null and b/docs/assets/bayesian_crps.png differ diff --git a/docs/assets/bayesian_front.png b/docs/assets/bayesian_front.png new file mode 100644 index 0000000..fb0cda9 Binary files /dev/null and b/docs/assets/bayesian_front.png differ diff --git a/docs/assets/bayesian_parallel.png b/docs/assets/bayesian_parallel.png new file mode 100644 index 0000000..5acb7ed Binary files /dev/null and b/docs/assets/bayesian_parallel.png differ diff --git a/docs/guide/bayesian.md b/docs/guide/bayesian.md new file mode 100644 index 0000000..6725f19 --- /dev/null +++ b/docs/guide/bayesian.md @@ -0,0 +1,200 @@ +# Bayesian Model Criticism + +This tutorial walks through a **scoring-and-stacking** workflow for +Bayesian models. It demonstrates the features that set `trade-study` +apart from conventional hyperparameter optimizers: proper scoring +rules, calibration assessment, and score-based model averaging. + +The underlying model is a conjugate Bayesian linear regression — no +MCMC sampler or external probabilistic programming library needed, +only numpy. + +!!! tip "Run it yourself" + + The full runnable script is at + [`examples/bayesian_study.py`](https://github.com/jcm-sci/trade-study/blob/main/examples/bayesian_study.py). + + ```bash + uv run --extra examples python examples/bayesian_study.py + ``` + +## The problem + +We want to fit a Bayesian regression +$y = a + bx + \varepsilon$ +where $\varepsilon \sim \mathcal{N}(0,\sigma_\text{true}^2)$ and the +true coefficients are known ($a=2$, $b=3$, $\sigma_\text{true}=0.5$). + +The design question is: **which prior hyperparameters and sample sizes +produce well-calibrated, accurate posteriors?** Specifically, each +configuration controls three knobs: + +| Factor | Meaning | +|--------|---------| +| `prior_var` | Variance of the Normal prior on $(a, b)$ | +| `noise_scale` | Assumed observation noise $\sigma$ (may disagree with truth) | +| `n_obs` | Number of training observations | + +Choosing `noise_scale` too small produces overconfident posteriors; +too large gives conservative intervals. A tight prior (`prior_var` +small) biases the coefficients toward zero. More data always helps, +but at a computational cost. + +## Ground-truth model + +```python +--8<-- "examples/bayesian_study.py:model" +``` + +## Conjugate posterior + +Because we use a Normal prior with known noise variance, the +posterior over regression coefficients is available in closed form. +Given design matrix $\mathbf{X}$ and prior precision +$\boldsymbol{\Lambda}_0 = \sigma_\text{prior}^{-2}\mathbf{I}$: + +$$ +\boldsymbol{\Sigma}_n + = \left(\boldsymbol{\Lambda}_0 + + \frac{\mathbf{X}^\top\mathbf{X}}{\sigma^2}\right)^{-1} +\qquad +\boldsymbol{\mu}_n + = \boldsymbol{\Sigma}_n + \frac{\mathbf{X}^\top\mathbf{y}}{\sigma^2} +$$ + +We draw posterior predictive samples by projecting coefficient draws +through the test design matrix and adding observation noise. + +```python +--8<-- "examples/bayesian_study.py:posterior" +``` + +## Simulator and scorer + +The **simulator** generates training data from the true model, fits +the conjugate posterior, and returns posterior predictive samples at +held-out test locations. The **scorer** evaluates three complementary +metrics: + +- **CRPS** (Continuous Ranked Probability Score) — a proper scoring + rule that rewards sharp, well-calibrated predictive distributions. +- **Coverage (95 %)** — the fraction of test points whose true value + falls inside the 95 % posterior predictive interval. +- **RMSE** — root mean squared error of the posterior mean. + +```python +--8<-- "examples/bayesian_study.py:world" +``` + +## Observables and factors + +Coverage is given lower weight (0.5) because a model can trivially +achieve 100 % coverage by being very wide; CRPS already penalises +that. + +```python +--8<-- "examples/bayesian_study.py:observables" +``` + +```python +--8<-- "examples/bayesian_study.py:factors" +``` + +## Annotations + +An `Annotation` attaches external metadata — here, a simple linear +cost model for the number of observations: + +```python +--8<-- "examples/bayesian_study.py:annotation" +``` + +## Factor screening + +Before running the full grid, Morris screening identifies which +factors influence the scores most. `reduce_factors` drops any +factor whose $\mu^*$ falls below a threshold: + +```python +--8<-- "examples/bayesian_study.py:screening" +``` + +## Grid evaluation + +A 60-point Latin hypercube covers the three-factor space. `run_grid` +evaluates every configuration and collects scores and annotations into +a `ResultsTable`: + +```python +--8<-- "examples/bayesian_study.py:run" +``` + +## Pareto analysis + +```python +--8<-- "examples/bayesian_study.py:results" +``` + +### Pareto front scatter matrix + +![Pareto front](../assets/bayesian_front.png) + +With three objectives the front is a surface; `plot_front` shows all +pairwise projections. Models with low CRPS generally have good +coverage and low RMSE, but there are trade-offs when noise is +mis-specified. + +### Parallel coordinates + +![Parallel coordinates](../assets/bayesian_parallel.png) + +Front designs cluster at moderate prior variance and noise scales +close to the true value — the region where the model is neither +overconfident nor excessively vague. + +### CRPS strip plot + +![CRPS strip plot](../assets/bayesian_crps.png) + +## Score-based stacking + +`stack_scores` finds simplex-constrained weights that minimise the +composite MSE across test points. The stacked ensemble often beats +every individual model because it smooths over prior mis-specification: + +```python +--8<-- "examples/bayesian_study.py:stacking" +``` + +## Calibration assessment + +`coverage_curve` evaluates empirical coverage at many nominal levels. +A well-calibrated model tracks the diagonal: + +```python +--8<-- "examples/bayesian_study.py:calibration" +``` + +![Calibration curve](../assets/bayesian_calibration.png) + +## Persistence + +`save_results` writes the `ResultsTable` to disk; `load_results` +reconstructs it. Useful for expensive studies where you want to +separate evaluation from analysis: + +```python +--8<-- "examples/bayesian_study.py:persistence" +``` + +## What to try next + +- Swap `method="morris"` for `method="sobol"` in `screen()` for + variance-based sensitivity indices. +- Use `Constraint` + `feasibility_filter` to enforce + `coverage_95 >= 0.90` before stacking. +- Replace `run_grid` with a two-phase `Study` that screens first + and refines around the Pareto front. +- Try `stack_bayesian()` on models that expose log-likelihood + (requires `arviz`). diff --git a/examples/bayesian_study.py b/examples/bayesian_study.py new file mode 100644 index 0000000..63005fd --- /dev/null +++ b/examples/bayesian_study.py @@ -0,0 +1,409 @@ +"""Bayesian model criticism study. + +A multi-objective trade study over prior hyperparameters for a +Bayesian linear regression. Demonstrates the package's distinctive +scoring rules, calibration assessment, and score-based stacking. + +The model uses a conjugate normal prior, so the posterior is +closed-form — no MCMC, no external sampler, only numpy. +""" + +from __future__ import annotations + +import tempfile +from typing import Any + +import numpy as np + +from trade_study import ( + Annotation, + Direction, + Factor, + FactorType, + Observable, + build_grid, + coverage_curve, + ensemble_predict, + extract_front, + load_results, + plot_calibration, + plot_front, + plot_parallel, + plot_scores, + reduce_factors, + run_grid, + save_results, + score, + screen, + stack_scores, +) + +ASSET_DIR = "docs/assets" + +# ── Ground-truth regression model ────────────────────────────────── + +# --8<-- [start:model] +# True data-generating process: y = a + b*x + eps, eps ~ N(0, sigma_true^2) +TRUE_A = 2.0 # intercept +TRUE_B = 3.0 # slope +SIGMA_TRUE = 0.5 # observation noise std + +N_TEST = 50 # test locations for scoring +RNG_SEED = 42 + +rng = np.random.default_rng(RNG_SEED) +X_TEST = np.linspace(0.0, 1.0, N_TEST) +Y_TEST = TRUE_A + TRUE_B * X_TEST + rng.normal(0.0, SIGMA_TRUE, N_TEST) +# --8<-- [end:model] + + +# ── Conjugate Bayesian regression ────────────────────────────────── + + +# --8<-- [start:posterior] +def bayesian_regression( + x_train: np.ndarray, + y_train: np.ndarray, + prior_var: float, + noise_scale: float, + n_samples: int = 500, + seed: int = RNG_SEED, +) -> tuple[np.ndarray, np.ndarray]: + """Fit a conjugate Bayesian linear regression and draw predictions. + + Prior: beta = (a, b) ~ N(0, prior_var * I) + Likelihood: y | x, beta ~ N(X @ beta, noise_scale^2 * I) + + Args: + x_train: Training inputs, shape (n,). + y_train: Training targets, shape (n,). + prior_var: Prior variance for each coefficient. + noise_scale: Assumed observation noise standard deviation. + n_samples: Number of posterior predictive draws. + seed: Random seed. + + Returns: + Tuple of (posterior_mean_at_test, predictive_samples_at_test). + posterior_mean_at_test has shape (N_TEST,). + predictive_samples_at_test has shape (N_TEST, n_samples). + """ + gen = np.random.default_rng(seed) + + # Design matrices + x_design = np.column_stack([np.ones_like(x_train), x_train]) # (n, 2) + x_star = np.column_stack([np.ones(N_TEST), X_TEST]) # (N_TEST, 2) + + sigma2 = noise_scale**2 + prior_precision = np.eye(2) / prior_var # (2, 2) + + # Conjugate posterior: beta | y ~ N(mu_n, Sigma_n) + gram = x_design.T @ x_design # X^T X + posterior_cov = np.linalg.inv(prior_precision + gram / sigma2) + posterior_mean = posterior_cov @ (x_design.T @ y_train / sigma2) + + # Posterior predictive at test locations + pred_mean = x_star @ posterior_mean # (N_TEST,) + + # Draw beta samples, then project to predictions and add noise + beta_samples = gen.multivariate_normal( + posterior_mean, + posterior_cov, + size=n_samples, + ) # (n_samples, 2) + pred_samples = (x_star @ beta_samples.T) + gen.normal( + 0.0, noise_scale, (N_TEST, n_samples) + ) + + return pred_mean, pred_samples + + +# --8<-- [end:posterior] + + +# ── Simulator and scorer ────────────────────────────────────────── + + +# --8<-- [start:world] +class BayesianRegressionSimulator: + """Generate training data and compute the Bayesian posterior. + + Each config specifies prior hyperparameters and sample size. + The "truth" is the held-out test set; "observations" are the + posterior predictive samples at those test points. + """ + + def generate(self, config: dict[str, Any]) -> tuple[Any, Any]: + """Draw training data, fit posterior, and return test predictions. + + Args: + config: Must contain prior_var, noise_scale, n_obs. + + Returns: + Tuple of (y_test, observation dict with predictive samples + and posterior mean predictions). + """ + gen = np.random.default_rng(RNG_SEED) + n_obs = round(config["n_obs"]) + + # Draw training data from the true model + x_train = gen.uniform(0.0, 1.0, n_obs) + y_train = TRUE_A + TRUE_B * x_train + gen.normal(0.0, SIGMA_TRUE, n_obs) + + # Compute conjugate posterior + pred_mean, pred_samples = bayesian_regression( + x_train, + y_train, + prior_var=config["prior_var"], + noise_scale=config["noise_scale"], + ) + + observations = { + "pred_mean": pred_mean, + "pred_samples": pred_samples, + "n_obs": n_obs, + } + return Y_TEST, observations + + +class BayesianRegressionScorer: + """Score posterior predictions with proper scoring rules.""" + + def score( + self, + truth: Any, + observations: Any, + config: dict[str, Any], + ) -> dict[str, float]: + """Compute CRPS, 95 percent coverage, and RMSE on posterior mean. + + Args: + truth: True test-set values (y_test). + observations: Dict with pred_samples and pred_mean arrays. + config: Hyperparameter dict (unused). + + Returns: + Scores for crps, coverage_95, and rmse. + """ + crps = score("crps", observations["pred_samples"], truth) + cov95 = score("coverage", observations["pred_samples"], truth, level=0.95) + rmse = score("rmse", observations["pred_mean"], truth) + return {"crps": crps, "coverage_95": cov95, "rmse": rmse} + + +# --8<-- [end:world] + + +# ── Observables and factors ──────────────────────────────────────── + +# --8<-- [start:observables] +observables = [ + Observable("crps", Direction.MINIMIZE), + Observable("coverage_95", Direction.MAXIMIZE, weight=0.5), + Observable("rmse", Direction.MINIMIZE), +] +# --8<-- [end:observables] + +# --8<-- [start:factors] +factors = [ + Factor("prior_var", FactorType.CONTINUOUS, bounds=(0.1, 10.0)), + Factor("noise_scale", FactorType.CONTINUOUS, bounds=(0.1, 2.0)), + Factor("n_obs", FactorType.CONTINUOUS, bounds=(10.0, 200.0)), +] +# --8<-- [end:factors] + + +# --8<-- [start:annotation] +compute_cost = Annotation( + name="compute_cost", + lookup=lambda n: float(n) * 0.01, + key="n_obs", +) +# --8<-- [end:annotation] + + +def _run_screening( + world: BayesianRegressionSimulator, + scorer: BayesianRegressionScorer, +) -> None: + """Run Morris screening and print factor importance.""" + + # --8<-- [start:screening] + def run_fn(config: dict[str, Any]) -> dict[str, float]: + """Compose simulator + scorer for screening. + + Returns: + Score dictionary from the scorer. + """ + config = {**config, "n_obs": round(config["n_obs"])} + truth, obs = world.generate(config) + return scorer.score(truth, obs, config) + + importance = screen(run_fn, factors, method="morris", n_trajectories=8, seed=42) + print("Morris screening (mu_star):") + for name, vals in importance.items(): + print(f" {name}: {vals}") + + reduced = reduce_factors(factors, importance, threshold=0.01) + print(f"\nImportant factors (threshold=0.01): {[f.name for f in reduced]}") + # --8<-- [end:screening] + + +def _print_front(results: Any, front_idx: Any) -> None: + """Print the Pareto front table.""" + # --8<-- [start:results] + print(f"\nPareto front: {len(front_idx)} / {results.scores.shape[0]} designs") + + print( + f"\n{'prior_var':>10s} {'noise':>6s} {'n_obs':>5s} " + f"{'CRPS':>6s} {'Cov95':>6s} {'RMSE':>6s} {'Cost':>5s}" + ) + print("-" * 58) + for i in front_idx: + cfg = results.configs[i] + crps_val, cov, rmse_val = results.scores[i] + cost = results.annotations[i, 0] if results.annotations is not None else 0.0 + print( + f"{cfg['prior_var']:10.2f} {cfg['noise_scale']:6.2f} " + f"{cfg['n_obs']:5d} " + f"{crps_val:6.3f} {cov:6.3f} {rmse_val:6.3f} {cost:5.1f}" + ) + # --8<-- [end:results] + + +def _run_stacking(results: Any, front_idx: Any, world: Any) -> None: + """Compute score-based stacking weights and print results.""" + # --8<-- [start:stacking] + # Build a score matrix from the front: per-test-point squared errors + front_predictions = [] + score_matrix_rows = [] + for i in front_idx: + cfg = results.configs[i] + _, obs = world.generate(cfg) + front_predictions.append(obs["pred_mean"]) + sq_errors = (obs["pred_mean"] - Y_TEST) ** 2 + score_matrix_rows.append(sq_errors) + + score_mat = np.array(score_matrix_rows) # (n_front, n_test) + stacking_weights = stack_scores(score_mat, maximize=False) + print("\nStacking weights (MSE-optimal):") + for idx, w in zip(front_idx, stacking_weights, strict=True): + if w > 0.01: + cfg = results.configs[idx] + print( + f" prior_var={cfg['prior_var']:.2f}, " + f"noise={cfg['noise_scale']:.2f}, " + f"n_obs={cfg['n_obs']}: w={w:.3f}" + ) + + ens_mean = ensemble_predict(front_predictions, stacking_weights) + ens_rmse = float(np.sqrt(np.mean((ens_mean - Y_TEST) ** 2))) + best_single = float(results.scores[front_idx, 2].min()) + print(f"\nBest single-model RMSE: {best_single:.4f}") + print(f"Stacked ensemble RMSE: {ens_rmse:.4f}") + # --8<-- [end:stacking] + + +def _run_calibration( + results: Any, + front_idx: Any, + world: Any, +) -> tuple[Any, Any]: + """Assess calibration for the best-CRPS model. + + Returns: + Tuple of (nominal_levels, empirical_coverage) arrays. + """ + # --8<-- [start:calibration] + best_crps_idx = front_idx[int(np.argmin(results.scores[front_idx, 0]))] + best_cfg = results.configs[best_crps_idx] + _, best_obs = world.generate(best_cfg) + nominal, empirical = coverage_curve(best_obs["pred_samples"], Y_TEST) + pv = best_cfg["prior_var"] + print(f"\nCalibration (best CRPS, prior_var={pv:.2f}):") + for level in (0.5, 0.9, 0.95): + closest = empirical[np.argmin(np.abs(nominal - level))] + print(f" Nominal {level:.0%} -> Empirical {closest:.1%}") + # --8<-- [end:calibration] + return nominal, empirical + + +def _run_persistence(results: Any) -> None: + """Round-trip save/load and verify.""" + # --8<-- [start:persistence] + with tempfile.TemporaryDirectory() as tmp: + save_results(results, tmp) + loaded = load_results(tmp) + n_saved = loaded.scores.shape[0] + print(f"\nSaved and reloaded {n_saved} results successfully.") + # --8<-- [end:persistence] + + +def _save_plots(results: Any, directions: Any, nominal: Any, empirical: Any) -> None: + """Generate and save all figures.""" + # --8<-- [start:plots] + import matplotlib.pyplot as plt + + # Pareto front scatter + fig_front, _ = plot_front(results, directions) + fig_front.savefig(f"{ASSET_DIR}/bayesian_front.png", dpi=150, bbox_inches="tight") + print("\nSaved bayesian_front.png") + plt.close(fig_front) + + # Parallel coordinates + fig_par, _ = plot_parallel(results, directions) + fig_par.savefig(f"{ASSET_DIR}/bayesian_parallel.png", dpi=150, bbox_inches="tight") + print("Saved bayesian_parallel.png") + plt.close(fig_par) + + # CRPS strip plot + fig_crps, _ = plot_scores(results, "crps", directions) + fig_crps.savefig(f"{ASSET_DIR}/bayesian_crps.png", dpi=150, bbox_inches="tight") + print("Saved bayesian_crps.png") + plt.close(fig_crps) + + # Calibration curve for best model + fig_cal, _ = plot_calibration(nominal, empirical) + fig_cal.savefig( + f"{ASSET_DIR}/bayesian_calibration.png", dpi=150, bbox_inches="tight" + ) + print("Saved bayesian_calibration.png") + plt.close(fig_cal) + # --8<-- [end:plots] + + +def main() -> None: + """Run the Bayesian model criticism study.""" + world = BayesianRegressionSimulator() + scorer = BayesianRegressionScorer() + + _run_screening(world, scorer) + + # --8<-- [start:run] + grid = build_grid(factors, method="lhs", n_samples=60, seed=42) + for cfg in grid: + cfg["n_obs"] = round(cfg["n_obs"]) + + print(f"\nLHS grid: {len(grid)} configurations") + + results = run_grid( + world=world, + scorer=scorer, + grid=grid, + observables=observables, + annotations=[compute_cost], + ) + # --8<-- [end:run] + + directions = [o.direction for o in observables] + weights = [o.weight for o in observables] + front_idx = extract_front(results.scores, directions, weights) + + _print_front(results, front_idx) + _run_stacking(results, front_idx, world) + nominal, empirical = _run_calibration(results, front_idx, world) + _run_persistence(results) + _save_plots(results, directions, nominal, empirical) + + +if __name__ == "__main__": + main() diff --git a/mkdocs.yml b/mkdocs.yml index 1ff779b..7d9c027 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -65,6 +65,7 @@ nav: - Reactor Design (CSTR): guide/cstr.md - Hyperparameter Sweep (sklearn): guide/sklearn.md - Monitoring Station Design: guide/monitoring.md + - Bayesian Model Criticism: guide/bayesian.md - API Reference: - Protocols: api/protocols.md - Design: api/design.md