diff --git a/docs/assets/bayesian_calibration.png b/docs/assets/bayesian_calibration.png
new file mode 100644
index 0000000..1601f98
Binary files /dev/null and b/docs/assets/bayesian_calibration.png differ
diff --git a/docs/assets/bayesian_crps.png b/docs/assets/bayesian_crps.png
new file mode 100644
index 0000000..3bb02a4
Binary files /dev/null and b/docs/assets/bayesian_crps.png differ
diff --git a/docs/assets/bayesian_front.png b/docs/assets/bayesian_front.png
new file mode 100644
index 0000000..fb0cda9
Binary files /dev/null and b/docs/assets/bayesian_front.png differ
diff --git a/docs/assets/bayesian_parallel.png b/docs/assets/bayesian_parallel.png
new file mode 100644
index 0000000..5acb7ed
Binary files /dev/null and b/docs/assets/bayesian_parallel.png differ
diff --git a/docs/guide/bayesian.md b/docs/guide/bayesian.md
new file mode 100644
index 0000000..6725f19
--- /dev/null
+++ b/docs/guide/bayesian.md
@@ -0,0 +1,200 @@
+# Bayesian Model Criticism
+
+This tutorial walks through a **scoring-and-stacking** workflow for
+Bayesian models.  It demonstrates the features that set `trade-study`
+apart from conventional hyperparameter optimizers: proper scoring
+rules, calibration assessment, and score-based model averaging.
+
+The underlying model is a conjugate Bayesian linear regression — no
+MCMC sampler or external probabilistic programming library needed,
+only numpy.
+
+!!! tip "Run it yourself"
+
+    The full runnable script is at
+    [`examples/bayesian_study.py`](https://github.com/jcm-sci/trade-study/blob/main/examples/bayesian_study.py).
+
+    ```bash
+    uv run --extra examples python examples/bayesian_study.py
+    ```
+
+## The problem
+
+We want to fit a Bayesian regression
+$y = a + bx + \varepsilon$
+where $\varepsilon \sim \mathcal{N}(0,\sigma_\text{true}^2)$ and the
+true coefficients are known ($a=2$, $b=3$, $\sigma_\text{true}=0.5$).
+
+The design question is: **which prior hyperparameters and sample sizes
+produce well-calibrated, accurate posteriors?**  Specifically, each
+configuration controls three knobs:
+
+| Factor | Meaning |
+|--------|---------|
+| `prior_var` | Variance of the Normal prior on $(a, b)$ |
+| `noise_scale` | Assumed observation noise $\sigma$ (may disagree with truth) |
+| `n_obs` | Number of training observations |
+
+Choosing `noise_scale` too small produces overconfident posteriors;
+too large gives conservative intervals.  A tight prior (`prior_var`
+small) biases the coefficients toward zero.  More data always helps,
+but at a computational cost.
+
+## Ground-truth model
+
+```python
+--8<-- "examples/bayesian_study.py:model"
+```
+
+## Conjugate posterior
+
+Because we use a Normal prior with known noise variance, the
+posterior over regression coefficients is available in closed form.
+Given design matrix $\mathbf{X}$ and prior precision
+$\boldsymbol{\Lambda}_0 = \sigma_\text{prior}^{-2}\mathbf{I}$:
+
+$$
+\boldsymbol{\Sigma}_n
+  = \left(\boldsymbol{\Lambda}_0
+    + \frac{\mathbf{X}^\top\mathbf{X}}{\sigma^2}\right)^{-1}
+\qquad
+\boldsymbol{\mu}_n
+  = \boldsymbol{\Sigma}_n
+    \frac{\mathbf{X}^\top\mathbf{y}}{\sigma^2}
+$$
+
+We draw posterior predictive samples by projecting coefficient draws
+through the test design matrix and adding observation noise.
+
+```python
+--8<-- "examples/bayesian_study.py:posterior"
+```
+
+## Simulator and scorer
+
+The **simulator** generates training data from the true model, fits
+the conjugate posterior, and returns posterior predictive samples at
+held-out test locations.  The **scorer** evaluates three complementary
+metrics:
+
+- **CRPS** (Continuous Ranked Probability Score) — a proper scoring
+  rule that rewards sharp, well-calibrated predictive distributions.
+- **Coverage (95 %)** — the fraction of test points whose true value
+  falls inside the 95 % posterior predictive interval.
+- **RMSE** — root mean squared error of the posterior mean.
+
+```python
+--8<-- "examples/bayesian_study.py:world"
+```
+
+## Observables and factors
+
+Coverage is given lower weight (0.5) because a model can trivially
+achieve 100 % coverage by being very wide; CRPS already penalises
+that.
+
+```python
+--8<-- "examples/bayesian_study.py:observables"
+```
+
+```python
+--8<-- "examples/bayesian_study.py:factors"
+```
+
+## Annotations
+
+An `Annotation` attaches external metadata — here, a simple linear
+cost model for the number of observations:
+
+```python
+--8<-- "examples/bayesian_study.py:annotation"
+```
+
+## Factor screening
+
+Before running the full grid, Morris screening identifies which
+factors influence the scores most.  `reduce_factors` drops any
+factor whose $\mu^*$ falls below a threshold:
+
+```python
+--8<-- "examples/bayesian_study.py:screening"
+```
+
+## Grid evaluation
+
+A 60-point Latin hypercube covers the three-factor space.  `run_grid`
+evaluates every configuration and collects scores and annotations into
+a `ResultsTable`:
+
+```python
+--8<-- "examples/bayesian_study.py:run"
+```
+
+## Pareto analysis
+
+```python
+--8<-- "examples/bayesian_study.py:results"
+```
+
+### Pareto front scatter matrix
+
+![Pareto front](../assets/bayesian_front.png)
+
+With three objectives the front is a surface; `plot_front` shows all
+pairwise projections.  Models with low CRPS generally have good
+coverage and low RMSE, but there are trade-offs when noise is
+mis-specified.
+
+### Parallel coordinates
+
+![Parallel coordinates](../assets/bayesian_parallel.png)
+
+Front designs cluster at moderate prior variance and noise scales
+close to the true value — the region where the model is neither
+overconfident nor excessively vague.
+
+### CRPS strip plot
+
+![CRPS strip plot](../assets/bayesian_crps.png)
+
+## Score-based stacking
+
+`stack_scores` finds simplex-constrained weights that minimise the
+composite MSE across test points.  The stacked ensemble often beats
+every individual model because it smooths over prior mis-specification:
+
+```python
+--8<-- "examples/bayesian_study.py:stacking"
+```
+
+## Calibration assessment
+
+`coverage_curve` evaluates empirical coverage at many nominal levels.
+A well-calibrated model tracks the diagonal:
+
+```python
+--8<-- "examples/bayesian_study.py:calibration"
+```
+
+![Calibration curve](../assets/bayesian_calibration.png)
+
+## Persistence
+
+`save_results` writes the `ResultsTable` to disk; `load_results`
+reconstructs it.  Useful for expensive studies where you want to
+separate evaluation from analysis:
+
+```python
+--8<-- "examples/bayesian_study.py:persistence"
+```
+
+## What to try next
+
+- Swap `method="morris"` for `method="sobol"` in `screen()` for
+  variance-based sensitivity indices.
+- Use `Constraint` + `feasibility_filter` to enforce
+  `coverage_95 >= 0.90` before stacking.
+- Replace `run_grid` with a two-phase `Study` that screens first
+  and refines around the Pareto front.
+- Try `stack_bayesian()` on models that expose log-likelihood
+  (requires `arviz`).
diff --git a/examples/bayesian_study.py b/examples/bayesian_study.py
new file mode 100644
index 0000000..63005fd
--- /dev/null
+++ b/examples/bayesian_study.py
@@ -0,0 +1,409 @@
+"""Bayesian model criticism study.
+
+A multi-objective trade study over prior hyperparameters for a
+Bayesian linear regression.  Demonstrates the package's distinctive
+scoring rules, calibration assessment, and score-based stacking.
+
+The model uses a conjugate normal prior, so the posterior is
+closed-form — no MCMC, no external sampler, only numpy.
+"""
+
+from __future__ import annotations
+
+import tempfile
+from typing import Any
+
+import numpy as np
+
+from trade_study import (
+    Annotation,
+    Direction,
+    Factor,
+    FactorType,
+    Observable,
+    build_grid,
+    coverage_curve,
+    ensemble_predict,
+    extract_front,
+    load_results,
+    plot_calibration,
+    plot_front,
+    plot_parallel,
+    plot_scores,
+    reduce_factors,
+    run_grid,
+    save_results,
+    score,
+    screen,
+    stack_scores,
+)
+
+ASSET_DIR = "docs/assets"
+
+# ── Ground-truth regression model ──────────────────────────────────
+
+# --8<-- [start:model]
+# True data-generating process:  y = a + b*x + eps,  eps ~ N(0, sigma_true^2)
+TRUE_A = 2.0  # intercept
+TRUE_B = 3.0  # slope
+SIGMA_TRUE = 0.5  # observation noise std
+
+N_TEST = 50  # test locations for scoring
+RNG_SEED = 42
+
+rng = np.random.default_rng(RNG_SEED)
+X_TEST = np.linspace(0.0, 1.0, N_TEST)
+Y_TEST = TRUE_A + TRUE_B * X_TEST + rng.normal(0.0, SIGMA_TRUE, N_TEST)
+# --8<-- [end:model]
+
+
+# ── Conjugate Bayesian regression ──────────────────────────────────
+
+
+# --8<-- [start:posterior]
+def bayesian_regression(
+    x_train: np.ndarray,
+    y_train: np.ndarray,
+    prior_var: float,
+    noise_scale: float,
+    n_samples: int = 500,
+    seed: int = RNG_SEED,
+) -> tuple[np.ndarray, np.ndarray]:
+    """Fit a conjugate Bayesian linear regression and draw predictions.
+
+    Prior:  beta = (a, b) ~ N(0, prior_var * I)
+    Likelihood:  y | x, beta ~ N(X @ beta, noise_scale^2 * I)
+
+    Args:
+        x_train: Training inputs, shape (n,).
+        y_train: Training targets, shape (n,).
+        prior_var: Prior variance for each coefficient.
+        noise_scale: Assumed observation noise standard deviation.
+        n_samples: Number of posterior predictive draws.
+        seed: Random seed.
+
+    Returns:
+        Tuple of (posterior_mean_at_test, predictive_samples_at_test).
+        posterior_mean_at_test has shape (N_TEST,).
+        predictive_samples_at_test has shape (N_TEST, n_samples).
+    """
+    gen = np.random.default_rng(seed)
+
+    # Design matrices
+    x_design = np.column_stack([np.ones_like(x_train), x_train])  # (n, 2)
+    x_star = np.column_stack([np.ones(N_TEST), X_TEST])  # (N_TEST, 2)
+
+    sigma2 = noise_scale**2
+    prior_precision = np.eye(2) / prior_var  # (2, 2)
+
+    # Conjugate posterior:  beta | y ~ N(mu_n, Sigma_n)
+    gram = x_design.T @ x_design  # X^T X
+    posterior_cov = np.linalg.inv(prior_precision + gram / sigma2)
+    posterior_mean = posterior_cov @ (x_design.T @ y_train / sigma2)
+
+    # Posterior predictive at test locations
+    pred_mean = x_star @ posterior_mean  # (N_TEST,)
+
+    # Draw beta samples, then project to predictions and add noise
+    beta_samples = gen.multivariate_normal(
+        posterior_mean,
+        posterior_cov,
+        size=n_samples,
+    )  # (n_samples, 2)
+    pred_samples = (x_star @ beta_samples.T) + gen.normal(
+        0.0, noise_scale, (N_TEST, n_samples)
+    )
+
+    return pred_mean, pred_samples
+
+
+# --8<-- [end:posterior]
+
+
+# ── Simulator and scorer ──────────────────────────────────────────
+
+
+# --8<-- [start:world]
+class BayesianRegressionSimulator:
+    """Generate training data and compute the Bayesian posterior.
+
+    Each config specifies prior hyperparameters and sample size.
+    The "truth" is the held-out test set; "observations" are the
+    posterior predictive samples at those test points.
+    """
+
+    def generate(self, config: dict[str, Any]) -> tuple[Any, Any]:
+        """Draw training data, fit posterior, and return test predictions.
+
+        Args:
+            config: Must contain prior_var, noise_scale, n_obs.
+
+        Returns:
+            Tuple of (y_test, observation dict with predictive samples
+            and posterior mean predictions).
+        """
+        gen = np.random.default_rng(RNG_SEED)
+        n_obs = round(config["n_obs"])
+
+        # Draw training data from the true model
+        x_train = gen.uniform(0.0, 1.0, n_obs)
+        y_train = TRUE_A + TRUE_B * x_train + gen.normal(0.0, SIGMA_TRUE, n_obs)
+
+        # Compute conjugate posterior
+        pred_mean, pred_samples = bayesian_regression(
+            x_train,
+            y_train,
+            prior_var=config["prior_var"],
+            noise_scale=config["noise_scale"],
+        )
+
+        observations = {
+            "pred_mean": pred_mean,
+            "pred_samples": pred_samples,
+            "n_obs": n_obs,
+        }
+        return Y_TEST, observations
+
+
+class BayesianRegressionScorer:
+    """Score posterior predictions with proper scoring rules."""
+
+    def score(
+        self,
+        truth: Any,
+        observations: Any,
+        config: dict[str, Any],
+    ) -> dict[str, float]:
+        """Compute CRPS, 95 percent coverage, and RMSE on posterior mean.
+
+        Args:
+            truth: True test-set values (y_test).
+            observations: Dict with pred_samples and pred_mean arrays.
+            config: Hyperparameter dict (unused).
+
+        Returns:
+            Scores for crps, coverage_95, and rmse.
+        """
+        crps = score("crps", observations["pred_samples"], truth)
+        cov95 = score("coverage", observations["pred_samples"], truth, level=0.95)
+        rmse = score("rmse", observations["pred_mean"], truth)
+        return {"crps": crps, "coverage_95": cov95, "rmse": rmse}
+
+
+# --8<-- [end:world]
+
+
+# ── Observables and factors ────────────────────────────────────────
+
+# --8<-- [start:observables]
+observables = [
+    Observable("crps", Direction.MINIMIZE),
+    Observable("coverage_95", Direction.MAXIMIZE, weight=0.5),
+    Observable("rmse", Direction.MINIMIZE),
+]
+# --8<-- [end:observables]
+
+# --8<-- [start:factors]
+factors = [
+    Factor("prior_var", FactorType.CONTINUOUS, bounds=(0.1, 10.0)),
+    Factor("noise_scale", FactorType.CONTINUOUS, bounds=(0.1, 2.0)),
+    Factor("n_obs", FactorType.CONTINUOUS, bounds=(10.0, 200.0)),
+]
+# --8<-- [end:factors]
+
+
+# --8<-- [start:annotation]
+compute_cost = Annotation(
+    name="compute_cost",
+    lookup=lambda n: float(n) * 0.01,
+    key="n_obs",
+)
+# --8<-- [end:annotation]
+
+
+def _run_screening(
+    world: BayesianRegressionSimulator,
+    scorer: BayesianRegressionScorer,
+) -> None:
+    """Run Morris screening and print factor importance."""
+
+    # --8<-- [start:screening]
+    def run_fn(config: dict[str, Any]) -> dict[str, float]:
+        """Compose simulator + scorer for screening.
+
+        Returns:
+            Score dictionary from the scorer.
+        """
+        config = {**config, "n_obs": round(config["n_obs"])}
+        truth, obs = world.generate(config)
+        return scorer.score(truth, obs, config)
+
+    importance = screen(run_fn, factors, method="morris", n_trajectories=8, seed=42)
+    print("Morris screening (mu_star):")
+    for name, vals in importance.items():
+        print(f"  {name}: {vals}")
+
+    reduced = reduce_factors(factors, importance, threshold=0.01)
+    print(f"\nImportant factors (threshold=0.01): {[f.name for f in reduced]}")
+    # --8<-- [end:screening]
+
+
+def _print_front(results: Any, front_idx: Any) -> None:
+    """Print the Pareto front table."""
+    # --8<-- [start:results]
+    print(f"\nPareto front: {len(front_idx)} / {results.scores.shape[0]} designs")
+
+    print(
+        f"\n{'prior_var':>10s}  {'noise':>6s}  {'n_obs':>5s}  "
+        f"{'CRPS':>6s}  {'Cov95':>6s}  {'RMSE':>6s}  {'Cost':>5s}"
+    )
+    print("-" * 58)
+    for i in front_idx:
+        cfg = results.configs[i]
+        crps_val, cov, rmse_val = results.scores[i]
+        cost = results.annotations[i, 0] if results.annotations is not None else 0.0
+        print(
+            f"{cfg['prior_var']:10.2f}  {cfg['noise_scale']:6.2f}  "
+            f"{cfg['n_obs']:5d}  "
+            f"{crps_val:6.3f}  {cov:6.3f}  {rmse_val:6.3f}  {cost:5.1f}"
+        )
+    # --8<-- [end:results]
+
+
+def _run_stacking(results: Any, front_idx: Any, world: Any) -> None:
+    """Compute score-based stacking weights and print results."""
+    # --8<-- [start:stacking]
+    # Build a score matrix from the front: per-test-point squared errors
+    front_predictions = []
+    score_matrix_rows = []
+    for i in front_idx:
+        cfg = results.configs[i]
+        _, obs = world.generate(cfg)
+        front_predictions.append(obs["pred_mean"])
+        sq_errors = (obs["pred_mean"] - Y_TEST) ** 2
+        score_matrix_rows.append(sq_errors)
+
+    score_mat = np.array(score_matrix_rows)  # (n_front, n_test)
+    stacking_weights = stack_scores(score_mat, maximize=False)
+    print("\nStacking weights (MSE-optimal):")
+    for idx, w in zip(front_idx, stacking_weights, strict=True):
+        if w > 0.01:
+            cfg = results.configs[idx]
+            print(
+                f"  prior_var={cfg['prior_var']:.2f}, "
+                f"noise={cfg['noise_scale']:.2f}, "
+                f"n_obs={cfg['n_obs']}: w={w:.3f}"
+            )
+
+    ens_mean = ensemble_predict(front_predictions, stacking_weights)
+    ens_rmse = float(np.sqrt(np.mean((ens_mean - Y_TEST) ** 2)))
+    best_single = float(results.scores[front_idx, 2].min())
+    print(f"\nBest single-model RMSE: {best_single:.4f}")
+    print(f"Stacked ensemble RMSE:  {ens_rmse:.4f}")
+    # --8<-- [end:stacking]
+
+
+def _run_calibration(
+    results: Any,
+    front_idx: Any,
+    world: Any,
+) -> tuple[Any, Any]:
+    """Assess calibration for the best-CRPS model.
+
+    Returns:
+        Tuple of (nominal_levels, empirical_coverage) arrays.
+    """
+    # --8<-- [start:calibration]
+    best_crps_idx = front_idx[int(np.argmin(results.scores[front_idx, 0]))]
+    best_cfg = results.configs[best_crps_idx]
+    _, best_obs = world.generate(best_cfg)
+    nominal, empirical = coverage_curve(best_obs["pred_samples"], Y_TEST)
+    pv = best_cfg["prior_var"]
+    print(f"\nCalibration (best CRPS, prior_var={pv:.2f}):")
+    for level in (0.5, 0.9, 0.95):
+        closest = empirical[np.argmin(np.abs(nominal - level))]
+        print(f"  Nominal {level:.0%}  -> Empirical {closest:.1%}")
+    # --8<-- [end:calibration]
+    return nominal, empirical
+
+
+def _run_persistence(results: Any) -> None:
+    """Round-trip save/load and verify."""
+    # --8<-- [start:persistence]
+    with tempfile.TemporaryDirectory() as tmp:
+        save_results(results, tmp)
+        loaded = load_results(tmp)
+        n_saved = loaded.scores.shape[0]
+        print(f"\nSaved and reloaded {n_saved} results successfully.")
+    # --8<-- [end:persistence]
+
+
+def _save_plots(results: Any, directions: Any, nominal: Any, empirical: Any) -> None:
+    """Generate and save all figures."""
+    # --8<-- [start:plots]
+    import matplotlib.pyplot as plt
+
+    # Pareto front scatter
+    fig_front, _ = plot_front(results, directions)
+    fig_front.savefig(f"{ASSET_DIR}/bayesian_front.png", dpi=150, bbox_inches="tight")
+    print("\nSaved bayesian_front.png")
+    plt.close(fig_front)
+
+    # Parallel coordinates
+    fig_par, _ = plot_parallel(results, directions)
+    fig_par.savefig(f"{ASSET_DIR}/bayesian_parallel.png", dpi=150, bbox_inches="tight")
+    print("Saved bayesian_parallel.png")
+    plt.close(fig_par)
+
+    # CRPS strip plot
+    fig_crps, _ = plot_scores(results, "crps", directions)
+    fig_crps.savefig(f"{ASSET_DIR}/bayesian_crps.png", dpi=150, bbox_inches="tight")
+    print("Saved bayesian_crps.png")
+    plt.close(fig_crps)
+
+    # Calibration curve for best model
+    fig_cal, _ = plot_calibration(nominal, empirical)
+    fig_cal.savefig(
+        f"{ASSET_DIR}/bayesian_calibration.png", dpi=150, bbox_inches="tight"
+    )
+    print("Saved bayesian_calibration.png")
+    plt.close(fig_cal)
+    # --8<-- [end:plots]
+
+
+def main() -> None:
+    """Run the Bayesian model criticism study."""
+    world = BayesianRegressionSimulator()
+    scorer = BayesianRegressionScorer()
+
+    _run_screening(world, scorer)
+
+    # --8<-- [start:run]
+    grid = build_grid(factors, method="lhs", n_samples=60, seed=42)
+    for cfg in grid:
+        cfg["n_obs"] = round(cfg["n_obs"])
+
+    print(f"\nLHS grid: {len(grid)} configurations")
+
+    results = run_grid(
+        world=world,
+        scorer=scorer,
+        grid=grid,
+        observables=observables,
+        annotations=[compute_cost],
+    )
+    # --8<-- [end:run]
+
+    directions = [o.direction for o in observables]
+    weights = [o.weight for o in observables]
+    front_idx = extract_front(results.scores, directions, weights)
+
+    _print_front(results, front_idx)
+    _run_stacking(results, front_idx, world)
+    nominal, empirical = _run_calibration(results, front_idx, world)
+    _run_persistence(results)
+    _save_plots(results, directions, nominal, empirical)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/mkdocs.yml b/mkdocs.yml
index 1ff779b..7d9c027 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -65,6 +65,7 @@ nav:
       - Reactor Design (CSTR): guide/cstr.md
       - Hyperparameter Sweep (sklearn): guide/sklearn.md
       - Monitoring Station Design: guide/monitoring.md
+      - Bayesian Model Criticism: guide/bayesian.md
   - API Reference:
       - Protocols: api/protocols.md
       - Design: api/design.md