From d3f07986e72ad291e10e5a11844c18752cddf7cc Mon Sep 17 00:00:00 2001 From: Koen van Greevenbroek Date: Tue, 30 Jun 2026 17:37:06 -0700 Subject: [PATCH 1/8] feat: make health module optional and decouple GBD diet anchoring Default health.enabled to false so the workflow runs end to end without the manually-downloaded IHME GBD data. Add diet.anchor_groups_to_gbd (sentinel "match_health" follows health.enabled) to control GBD anchoring of the baseline diet independently of the health burden; anchoring was previously unconditional. Gate the health data-prep rules, health stores, and solve/analysis/ plotting inputs, the cluster manifest, and cluster-dependent plots on whether health is enabled (base config or any scenario) or anchoring is on. Wrap the health.smk include accordingly, guard the anchoring-off baseline-diet path, and raise a clear startup error when the GBD data is needed but absent. Pin health.enabled / anchoring on the configs that rely on it (validation, gsa, tutorials, doc figures, calibration) to preserve their behaviour. Calibration artefacts stay the GBD-anchored ones; regenerating them for the anchoring-off default is deferred (the dual-based cost/stability steps require Gurobi). Document the option, the quantitative baseline-diet differences, the refined-grain caveat, and the calibration coupling. --- CHANGELOG.md | 21 ++++ config/calibration/cost.yaml | 8 ++ config/calibration/feed.yaml | 7 +- config/calibration/food_demand.yaml | 7 +- config/calibration/food_waste.yaml | 7 +- config/calibration/stability.yaml | 7 +- config/default.yaml | 27 ++++- config/example.yaml | 11 ++ config/gsa.yaml | 6 + config/gsa_fixed_diet.yaml | 5 + config/schemas/config.schema.yaml | 7 ++ config/tutorial/01_ghg_prices.yaml | 1 + config/tutorial/02_consumer_values.yaml | 1 + config/validation.yaml | 1 + docs/calibration.rst | 25 ++++ docs/config/doc_figures.yaml | 8 +- docs/config/doc_validation.yaml | 1 + docs/current_diets.rst | 119 +++++++++++++++++++- docs/data_sources.rst | 21 +++- docs/health.rst | 21 ++++ tests/config/test.yaml | 6 + workflow/Snakefile | 32 ++++-- workflow/rules/analysis.smk | 35 +++++- workflow/rules/common.smk | 80 +++++++++++++ workflow/rules/deviation_penalty.smk | 42 +++++-- workflow/rules/diet.smk | 14 ++- workflow/rules/model.smk | 56 +++++++-- workflow/scripts/analysis/analyze_model.py | 78 ++++++++----- workflow/scripts/build_model.py | 42 ++++--- workflow/scripts/estimate_baseline_diet.py | 58 ++++++---- workflow/scripts/solve_and_analyze_model.py | 20 +++- workflow/scripts/solve_namespace.py | 40 ++++--- 32 files changed, 681 insertions(+), 133 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 14265aea..284341c9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -15,6 +15,27 @@ introduce breaking changes to configuration and outputs. ## [Unreleased] +### Changed + +- The health module is now **disabled by default** (`health.enabled: false`). + With health off, the workflow no longer requires the manually-downloaded + IHME GBD data and runs end to end without it; a clear startup error is + raised if health (or GBD anchoring) is enabled but the data is absent. + +### Added + +- New `diet.anchor_groups_to_gbd` option that decouples GBD anchoring of the + baseline diet's risk-factor food groups from the health module. Defaults to + the sentinel `match_health` (follow `health.enabled`); set `true`/`false` to + control it independently. Previously anchoring was unconditional. See + `docs/current_diets.rst` for a quantitative description of the difference and + the refined-grain caveat. Note that the baseline diet feeds calibration, so + changing this (or `health.enabled`) requires re-running `tools/calibrate`. + The committed calibration artefacts remain the GBD-anchored ones (the + calibration configs pin anchoring on to reproduce them); regenerating them + for the anchoring-off default is deferred to a dedicated calibration change + because the dual-based cost/stability steps require Gurobi. + ## [0.1.0] - 2026-06-15 First public release of GLADE (Global Land, Agriculture, Diet and Emissions), diff --git a/config/calibration/cost.yaml b/config/calibration/cost.yaml index 76f01670..b4169819 100644 --- a/config/calibration/cost.yaml +++ b/config/calibration/cost.yaml @@ -128,6 +128,14 @@ validation: water: supply_scenario: "current_use" +# The committed calibration artefacts under data/curated/calibration/ were +# generated against the GBD-anchored baseline diet, so anchoring is pinned on +# here to reproduce them. Regenerating against the new default (anchoring off) +# is deferred to a dedicated calibration PR -- the dual-based cost/stability +# steps require Gurobi. See docs/calibration.rst. +diet: + anchor_groups_to_gbd: true + health: enabled: false diff --git a/config/calibration/feed.yaml b/config/calibration/feed.yaml index 7acaefce..994ea0d3 100644 --- a/config/calibration/feed.yaml +++ b/config/calibration/feed.yaml @@ -75,6 +75,11 @@ exogenous_feed_calibration: generate: true # Generate calibration from solved model scenario: feed_uncal +# Anchoring pinned on to reproduce the committed (anchored) calibration +# artefacts; anchor-off regeneration is deferred to a dedicated PR (see cost.yaml). +diet: + anchor_groups_to_gbd: true + health: value_per_yll: 0 @@ -83,5 +88,5 @@ emissions: solving: io_api: "direct" - solver: "gurobi" + solver: gurobi threads: 2 diff --git a/config/calibration/food_demand.yaml b/config/calibration/food_demand.yaml index aaf60fb1..48cbce7e 100644 --- a/config/calibration/food_demand.yaml +++ b/config/calibration/food_demand.yaml @@ -68,6 +68,11 @@ food_demand_calibration: generate: true # Generate calibration from solved model scenario: "food_demand_uncal" +# Anchoring pinned on to reproduce the committed (anchored) calibration +# artefacts; anchor-off regeneration is deferred to a dedicated PR (see cost.yaml). +diet: + anchor_groups_to_gbd: true + health: value_per_yll: 0 @@ -76,5 +81,5 @@ emissions: solving: io_api: "direct" - solver: "gurobi" + solver: gurobi threads: 2 diff --git a/config/calibration/food_waste.yaml b/config/calibration/food_waste.yaml index 45af8e39..8b9a453f 100644 --- a/config/calibration/food_waste.yaml +++ b/config/calibration/food_waste.yaml @@ -66,6 +66,11 @@ food_loss_waste_calibration: generate: true # Generate calibration from solved model scenario: "food_waste_uncal" +# Anchoring pinned on to reproduce the committed (anchored) calibration +# artefacts; anchor-off regeneration is deferred to a dedicated PR (see cost.yaml). +diet: + anchor_groups_to_gbd: true + health: value_per_yll: 0 @@ -74,5 +79,5 @@ emissions: solving: io_api: "direct" - solver: "gurobi" + solver: gurobi threads: 2 diff --git a/config/calibration/stability.yaml b/config/calibration/stability.yaml index 23d67ddc..654431ac 100644 --- a/config/calibration/stability.yaml +++ b/config/calibration/stability.yaml @@ -32,6 +32,11 @@ baseline_year: 2020 # consumer values reflect raw revealed preferences). emissions: ghg_price: 0 +# Anchoring pinned on to reproduce the committed (anchored) calibration +# artefacts; anchor-off regeneration is deferred to a dedicated PR (see cost.yaml). +diet: + anchor_groups_to_gbd: true + health: value_per_yll: 0 @@ -73,7 +78,7 @@ deviation_penalty: generate: true # trigger the calibrate_deviation_penalty rule solving: - solver: "gurobi" + solver: gurobi io_api: "direct" # Solves run sequentially in the iteration, so give each solve more # threads instead of running multiple solves in parallel. diff --git a/config/default.yaml b/config/default.yaml index 7d8d5fad..1497443c 100644 --- a/config/default.yaml +++ b/config/default.yaml @@ -839,6 +839,24 @@ weight_conversion: # --- section: diet --- diet: baseline_age: "All ages" + # Whether to anchor the per-country baseline-diet totals of the health + # risk-factor food groups (health.risk_factors, e.g. fruits, vegetables, + # red_meat, ...) to GBD dietary-exposure intake instead of the GDD/FAOSTAT + # estimate. Anchoring aligns the baseline diet with the same intake basis + # the GBD relative-risk curves are calibrated against, so it is required + # for attributable-burden numbers to be comparable to GBD; it is otherwise + # a modest, mostly distributional change to the baseline diet (see + # docs/current_diets.rst). The sentinel "match_health" follows health.enabled, so a + # health run anchors and a no-health run does not (and then needs no IHME + # GBD data on disk). Set explicitly to true/false to decouple the two. + # NOTE: refined "grain" is additionally adjusted by the whole-grain cereal + # residual fix whenever whole_grains is anchored -- see docs/current_diets.rst. + # CALIBRATION COUPLING: this flag changes the baseline diet, which the + # calibration artefacts under data/curated/calibration/ are fit against. + # Changing it (directly or via health.enabled) means the committed + # calibration no longer matches; rerun `tools/calibrate` with the same + # anchoring setting before trusting results (see docs/calibration.rst). + anchor_groups_to_gbd: match_health # Foods whose per-country intake is computed directly from FAOSTAT Food # Balance Sheet supply rather than disaggregated from GDD/GBD group # totals. For each listed food the override sets @@ -1217,7 +1235,14 @@ commodities: # --- section: health --- health: - enabled: true # Whether to include health costs in the objective function + # Whether to include the dietary health-burden module. When false, the + # manually-downloaded IHME GBD data is not needed (see docs/health.rst). + # IMPORTANT: this also drives diet.anchor_groups_to_gbd (sentinel + # "match_health"), so flipping it changes the baseline diet. The git-tracked + # calibration artefacts under data/curated/calibration/ are computed against a + # specific baseline diet, so after changing this (or anchoring) you must rerun + # `tools/calibrate` with matching options (see docs/calibration.rst). + enabled: false region_clusters: 30 breakpoint_rel_tol: 0.05 # Max Stage 1 PWL deviation as a fraction of each RR curve's amplitude (5%) log_rr_points: 15 diff --git a/config/example.yaml b/config/example.yaml index e292f23b..4ac27c88 100644 --- a/config/example.yaml +++ b/config/example.yaml @@ -27,6 +27,17 @@ name: "example" # emissions: # ghg_price: 200 # USD per tonne CO2-eq +# Enable the dietary health-burden module. This requires the manually- +# downloaded IHME GBD data (see docs/health.rst); when left disabled (the +# default) none of that data is needed. Enabling health also anchors the +# baseline diet to GBD intake exposure (diet.anchor_groups_to_gbd defaults to +# "match_health"); set that key explicitly to decouple the two. +# health: +# enabled: true +# value_per_yll: 50000 # USD per year of life lost (0 = evaluate post-hoc only) +# diet: +# anchor_groups_to_gbd: true # or false to keep the GDD/FAOSTAT-only diet + # Restrict to specific countries (ISO 3166-1 alpha-3 codes) # countries: # - NOR diff --git a/config/gsa.yaml b/config/gsa.yaml index fd9d279a..7bdc1210 100644 --- a/config/gsa.yaml +++ b/config/gsa.yaml @@ -30,6 +30,12 @@ # live in a companion config, gsa_l1.yaml, on the paper branch. name: "gsa" + +# default flipped health.enabled to false; the GSA sweeps value_per_yll, so it +# needs the health module (and the GBD-anchored baseline diet) enabled at base. +health: + enabled: true + scenarios: # Baseline scenario for calibrating piecewise food utility blocks. Consumer # values (dual variables) are taken from this enforced-baseline solve. diff --git a/config/gsa_fixed_diet.yaml b/config/gsa_fixed_diet.yaml index c55fec46..eddcde42 100644 --- a/config/gsa_fixed_diet.yaml +++ b/config/gsa_fixed_diet.yaml @@ -124,6 +124,11 @@ validation: # Keep post-hoc YLL evaluation disabled: diet is fixed, so health outputs # would be essentially constant across scenarios and just add solve time. +diet: + # Preserve the historical GBD-anchored baseline diet (anchoring used to be + # unconditional; it now follows health.enabled unless set explicitly). + anchor_groups_to_gbd: true + health: enabled: false diff --git a/config/schemas/config.schema.yaml b/config/schemas/config.schema.yaml index 4718542e..990ac11b 100644 --- a/config/schemas/config.schema.yaml +++ b/config/schemas/config.schema.yaml @@ -988,6 +988,7 @@ properties: type: object required: - baseline_age + - anchor_groups_to_gbd - source_basis - gdd_ia additionalProperties: false @@ -995,6 +996,12 @@ properties: baseline_age: type: string description: "Age group for baseline diet (e.g., 'All ages')" + anchor_groups_to_gbd: + oneOf: + - type: boolean + - type: string + enum: ["match_health"] + description: "Anchor risk-factor food-group baseline-diet totals to GBD intake exposure. true/false, or 'match_health' to follow health.enabled." fbs_override_foods: type: array uniqueItems: true diff --git a/config/tutorial/01_ghg_prices.yaml b/config/tutorial/01_ghg_prices.yaml index 77607c96..e774131d 100644 --- a/config/tutorial/01_ghg_prices.yaml +++ b/config/tutorial/01_ghg_prices.yaml @@ -57,6 +57,7 @@ validation: # This tutorial does not use health costs. health: + enabled: true # default flipped to false; this tutorial relies on health (and the GBD-anchored diet) value_per_yll: 0 # Compare all three scenarios in auto-generated plots. diff --git a/config/tutorial/02_consumer_values.yaml b/config/tutorial/02_consumer_values.yaml index 072ff4d3..861b2620 100644 --- a/config/tutorial/02_consumer_values.yaml +++ b/config/tutorial/02_consumer_values.yaml @@ -86,6 +86,7 @@ validation: use_actual_yields: true health: + enabled: true # default flipped to false; this tutorial relies on health (and the GBD-anchored diet) value_per_yll: 0 plotting: diff --git a/config/validation.yaml b/config/validation.yaml index 0066b3db..57abac8e 100644 --- a/config/validation.yaml +++ b/config/validation.yaml @@ -32,6 +32,7 @@ food_demand_calibration: enabled: false health: + enabled: true # default flipped to false; this config relies on health (and the GBD-anchored diet) value_per_yll: 0 emissions: diff --git a/docs/calibration.rst b/docs/calibration.rst index 4eeaa8c5..510c5dd3 100644 --- a/docs/calibration.rst +++ b/docs/calibration.rst @@ -20,6 +20,31 @@ The shipped ``default`` set is version-controlled so that ordinary builds don't need to re-solve anything. See :ref:`calibration-provenance` for how sets are tied to configs. +.. admonition:: Calibration is tied to the baseline diet (and thus to health/anchoring) + :class: warning + + Several artefacts (notably ``food_demand.csv``, ``food_waste.yaml`` + and ``deviation_penalty.yaml``) are fit against a *specific* baseline + diet. The baseline diet depends on whether the GBD risk-factor groups + are anchored to GBD intake, which is controlled by + ``diet.anchor_groups_to_gbd`` (default: follow ``health.enabled``); see + :ref:`current-diets-gbd-anchoring`. + + The **currently committed artefacts were fit against the GBD-anchored + baseline diet** (the historical behaviour). The calibration configs + under ``config/calibration/`` therefore pin ``anchor_groups_to_gbd: + true`` so that re-running ``tools/calibrate`` reproduces them. Because + the default model now runs *without* anchoring, the committed + calibration does not strictly match the default baseline diet -- a + modest, documented mismatch. Regenerating the calibration against the + anchoring-off default is deferred to a dedicated calibration change: + the dual-based ``cost`` and ``stability`` steps require Gurobi (the + open-source HiGHS solver does not return the constraint duals those + steps consume), and making the artefacts mode-aware so both diets can + be served from one checkout is the cleaner long-term fix. Any run that + changes ``diet.anchor_groups_to_gbd`` (or ``health.enabled``) away from + the committed setting should rerun ``tools/calibrate`` with Gurobi. + .. _calibration-enabled-generate-pattern: .. note:: diff --git a/docs/config/doc_figures.yaml b/docs/config/doc_figures.yaml index 2dafeae4..7ee43384 100644 --- a/docs/config/doc_figures.yaml +++ b/docs/config/doc_figures.yaml @@ -84,7 +84,13 @@ water: supply_scenario: "current_use" health: - enabled: false + # Documentation figures include health-cluster maps and the diet-attributable + # burden choropleth, which need the health data-prep outputs (clusters etc.). + # value_per_yll: 0 keeps health out of the objective (post-hoc evaluation), so + # the optimisation and the GBD-anchored baseline diet match the prior figures; + # enabling health just makes the prep outputs available. Needs IHME GBD data. + enabled: true + value_per_yll: 0 solving: solver: "gurobi" diff --git a/docs/config/doc_validation.yaml b/docs/config/doc_validation.yaml index 5f61d8b0..955384bc 100644 --- a/docs/config/doc_validation.yaml +++ b/docs/config/doc_validation.yaml @@ -37,6 +37,7 @@ food_demand_calibration: enabled: false health: + enabled: true # default flipped to false; this config relies on health (and the GBD-anchored diet) value_per_yll: 0 emissions: diff --git a/docs/current_diets.rst b/docs/current_diets.rst index 3d09900e..0ce92d29 100644 --- a/docs/current_diets.rst +++ b/docs/current_diets.rst @@ -65,10 +65,14 @@ Data Sources [Brauer2024]_ * **Coverage**: country-level mean intake (g/day) for the GBD dietary risk factors, adults 25+. - * **Role**: anchors the **risk-factor** food groups (fruits, - vegetables, whole_grains, legumes, nuts_seeds, red_meat) so the - baseline lines up with the same exposure basis the GBD relative-risk - functions were calibrated against. + * **Role**: *optionally* anchors the **risk-factor** food groups + (fruits, vegetables, whole_grains, legumes, nuts_seeds, red_meat) so + the baseline lines up with the same exposure basis the GBD + relative-risk functions were calibrated against. Controlled by + ``diet.anchor_groups_to_gbd`` (see + :ref:`current-diets-gbd-anchoring`). When anchoring is off, these + groups use the GDD-IA/FAOSTAT estimate like every other group and + this dataset is not needed on disk. **NHANES — What We Eat in America / FPED** * **Provider**: USDA ARS / CDC NHANES @@ -117,6 +121,113 @@ Units in the merged ``dietary_intake.csv`` distinguish ``g/day (fresh wt)`` from ``g/day (milk equiv)`` for dairy and ``g/day (refined sugar eq)`` for sugar. +.. _current-diets-gbd-anchoring: + +GBD Anchoring of Risk-Factor Groups (optional) +---------------------------------------------- + +For the six GBD dietary **risk-factor** groups (``fruits``, +``vegetables``, ``whole_grains``, ``legumes``, ``nuts_seeds``, +``red_meat``) the per-country food-group total can optionally be taken +from GBD dietary-exposure intake instead of the GDD-IA/FAOSTAT estimate. +The point of anchoring is **basis consistency with the health module**: +the GBD relative-risk dose-response curves are calibrated against GBD's +own intake exposure, so a GBD-anchored baseline makes the model's +attributable-burden numbers directly comparable to GBD's. + +Controlled by ``diet.anchor_groups_to_gbd``: + +* ``match_health`` (default) -- follow ``health.enabled``. A health run + anchors; a no-health run does not. +* ``true`` / ``false`` -- force anchoring on or off, decoupled from the + health module. + +When anchoring is on, the run needs the manually-downloaded IHME GBD +data (see :doc:`data_sources`); when off, it needs none of it. + +When to turn it on +~~~~~~~~~~~~~~~~~~~ + +Turn anchoring **on** when you care about compatibility with GBD health +metrics -- i.e. whenever the health module is enabled, or when comparing +the model's diet-attributable disease burden against GBD estimates. +Otherwise the GDD-IA/FAOSTAT baseline (anchoring off) is a perfectly +reasonable diet and avoids the manual GBD download entirely; this is the +default for a no-health run. + +How much does it change the baseline diet? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The switch is **not** cosmetic at the diet level. Across ~174 countries, +anchoring moves the per-country total of every risk-factor group, and in +about half of the affected country-group cells by more than 50 % +relative. The direction is systematic (values below are the mean +per-country shift, *anchored minus non-anchored*, g/day): + +.. list-table:: + :header-rows: 1 + :widths: 20 20 60 + + * - Group + - Mean shift (g/day) + - Direction + * - vegetables + - -56 + - anchoring **lowers** vegetable intake (e.g. ~172 -> ~116 g/day mean) + * - red_meat + - -35 + - lowers in ~98 % of countries + * - legumes + - -15 + - lowers in ~94 % of countries + * - nuts_seeds + - -6 + - lowers in ~95 % of countries + * - whole_grains + - -4.5 (mixed) + - large *relative* swings; GBD is often much higher where GDD-IA + reports near-zero whole grain + * - fruits + - +2.4 (mixed) + - roughly unchanged on average, country-specific either way + +In words: the GBD-anchored ("health-on") baseline has systematically +**less** vegetables, red meat, legumes and nuts than the GDD-IA/FAOSTAT +("health-off") baseline, and a markedly different whole-grain profile. + +Whether this matters for *physical* outcomes depends on the question. +From a global agricultural and land-use perspective the effect is +expected to be modest -- the shifts are partly redistributive within the +diet and are further absorbed by the production-stability calibration +(see :doc:`calibration`) -- though not strictly negligible. The reason +to anchor is primarily **health-metric compatibility**, not a large +expected change in land use or emissions. + +.. admonition:: Caveat: the option is not *exactly* "only the six risk groups" + + Anchoring ``whole_grains`` to GBD's narrow whole-grain definition can + discard cereal energy that GDD-IA reports more broadly. To preserve + each country's cereal energy budget, a **cereal residual fix** + reassigns that deficit to refined ``grain`` (see + ``apply_cereal_residual_fix`` in ``estimate_baseline_diet.py``), and + the kcal normalisation then treats refined ``grain`` as anchored too. + So enabling anchoring also adjusts refined ``grain`` even though it is + not itself a GBD risk factor. The fix only runs when ``whole_grains`` + is anchored, so it is inert when anchoring is off. + +.. admonition:: Calibration coupling + :class: warning + + The calibration artefacts under ``data/curated/calibration/`` are fit + against a *specific* baseline diet. The currently committed artefacts + were fit with anchoring **on** (the historical behaviour), so a + default anchoring-off run uses a slightly mismatched calibration -- + a documented limitation pending a Gurobi-based anchor-off refresh (see + :doc:`calibration`). Either way, changing ``diet.anchor_groups_to_gbd`` + (or ``health.enabled``, which drives it) changes the baseline diet, so + rerun ``tools/calibrate`` with the matching setting before trusting + results. + GDD-IA to Food Group Mapping ----------------------------- diff --git a/docs/data_sources.rst b/docs/data_sources.rst index 81e22c09..8c388543 100644 --- a/docs/data_sources.rst +++ b/docs/data_sources.rst @@ -17,12 +17,25 @@ Manual Download Checklist Several licensed datasets cannot be fetched automatically. While their use is free for non-commercial research purposes, these have to be downloaded manually or require API key registration. -**Required manual downloads:** +**Always required:** -1. Create an account with IHME and download GBD death rates as described in :ref:`ihme-gbd-mortality`. -2. Download the IHME 2019 relative risk workbook ``IHME_GBD_2019_RELATIVE_RISKS_Y2020M10D15.XLSX`` (:ref:`ihme-relative-risks`). +1. Obtain the **GDD-IA** intake CSVs by personal request to the Global Dietary Database team and place them as ``data/manually_downloaded/GDD-IA-intake_grams_{year}.csv`` and ``data/manually_downloaded/GDD-IA-intake_kcals_{year}.csv`` (:ref:`gdd-ia-dietary-intake`). + +**Required only for the health module / GBD diet anchoring:** + +The following IHME GBD datasets are needed **only** when the health +module is enabled (``health.enabled: true``) or when the baseline diet +anchors to GBD (``diet.anchor_groups_to_gbd``; see +:ref:`current-diets-gbd-anchoring`). With both off -- the default -- they +are not required and the workflow runs without them. If they are missing +while needed, the workflow stops at startup with an explicit message. + +2. Create an account with IHME and download GBD death rates as described in :ref:`ihme-gbd-mortality`. 3. Download the IHME 2023 dietary risk exposure estimates (two archives, ``IHME_GBD_2023_RISK_EXPOSURE_DIET_1`` and ``_2``) (:ref:`ihme-diet-risk-exposure`). -4. Obtain the **GDD-IA** intake CSVs by personal request to the Global Dietary Database team and place them as ``data/manually_downloaded/GDD-IA-intake_grams_{year}.csv`` and ``data/manually_downloaded/GDD-IA-intake_kcals_{year}.csv`` (:ref:`gdd-ia-dietary-intake`). + +**Optional (only to regenerate curated health inputs):** + +4. The IHME 2019 relative risk workbook ``IHME_GBD_2019_RELATIVE_RISKS_Y2020M10D15.XLSX`` (:ref:`ihme-relative-risks`) is only used by a standalone curation script to regenerate the git-tracked RR age-attenuation table; it is not consumed by the normal workflow. The one build-time credential is a free USDA FoodData Central key, used only to refresh the nutritional data (see :doc:`introduction`). Everything else is fetched from public downloads, the Zenodo land-cover mirror (:ref:`copernicus-land-cover`), or bundled data. diff --git a/docs/health.rst b/docs/health.rst index f5271140..600748ed 100644 --- a/docs/health.rst +++ b/docs/health.rst @@ -12,6 +12,27 @@ dietary choices. It begins with the epidemiological concepts that underpin the methodology, then explains the implementation strategy for embedding these nonlinear relationships into a linear optimisation framework. +.. admonition:: The health module is off by default + :class: note + + ``health.enabled`` defaults to ``false``. The health module relies on + manually-downloaded IHME GBD data (mortality rates and dietary + risk-exposure; see :doc:`data_sources`), which cannot be fetched + automatically. When health is disabled, none of that data is required + and the workflow builds, solves, and analyses without it -- the + YLL stores, health objective term, and health analysis/plots are + simply omitted, and a clear error is raised if the data is missing + while health is on. + + Enabling health also turns on **GBD anchoring of the baseline diet** + by default (``diet.anchor_groups_to_gbd: match_health``), so that the + model's attributable-burden numbers share the GBD intake basis. That + anchoring changes the baseline diet and is an independent switch -- see + :ref:`current-diets-gbd-anchoring`. Because the baseline diet feeds the + calibration, switching health (or anchoring) on or off means the + committed calibration no longer matches your run; rerun + ``tools/calibrate`` accordingly (see :doc:`calibration`). + Conceptual Framework -------------------- diff --git a/tests/config/test.yaml b/tests/config/test.yaml index 79ef399f..7bf5bbfc 100644 --- a/tests/config/test.yaml +++ b/tests/config/test.yaml @@ -25,6 +25,12 @@ scenarios: emissions: ghg_pricing_enabled: true +diet: + # Preserve the historical GBD-anchored baseline diet (anchoring used to be + # unconditional; it now follows health.enabled unless set explicitly). The + # health-off / anchoring-off path is exercised by the default config. + anchor_groups_to_gbd: true + health: enabled: false diff --git a/workflow/Snakefile b/workflow/Snakefile index 8b4e08cd..83e8fd06 100644 --- a/workflow/Snakefile +++ b/workflow/Snakefile @@ -66,7 +66,16 @@ include: "rules/common.smk" include: "rules/retrieve.smk" include: "rules/geography.smk" include: "rules/diet.smk" -include: "rules/health.smk" + + +# Health data-prep rules are only needed when the health module is enabled in +# the base config or any scenario. When off, none of the manually-downloaded +# IHME GBD data is required (see docs/health.rst and diet.anchor_groups_to_gbd). +if health_required(): + + include: "rules/health.smk" + + include: "rules/animals.smk" include: "rules/crops.smk" include: "rules/water.smk" @@ -116,13 +125,10 @@ ALL_RULE_TARGETS = [ "/{name}/plots/food_consumption_comparison.csv", "/{name}/plots/system_cost_comparison.pdf", "/{name}/plots/system_cost_comparison.csv", - "/{name}/plots/relative_risk_curves.pdf", # results plots - "/{name}/plots/scen-default/consumption_balance.pdf", "/{name}/plots/scen-default/objective_breakdown.pdf", "/{name}/plots/scen-default/objective_breakdown.csv", "/{name}/plots/scen-default/emissions_breakdown.pdf", - "/{name}/plots/scen-default/yll_global_by_cause.pdf", # "/{name}/plots/scen-default/health_risk_map.pdf", # "/{name}/plots/scen-default/health_risk_by_region.csv", # "/{name}/plots/scen-default/health_baseline_map.pdf", @@ -145,13 +151,25 @@ ALL_RULE_TARGETS = [ "/{name}/plots/scen-default/water_use_by_region.csv", # "/{name}/plots/scen-default/water_value_map.pdf", # Current not working "/{name}/plots/scen-default/food_consumption.pdf", - "/{name}/plots/scen-default/food_consumption_map.pdf", - "/{name}/plots/scen-default/food_consumption_baseline_map.pdf", # marginal damages analysis "/{name}/plots/scen-default/marginal_ghg_global.pdf", - "/{name}/plots/scen-default/marginal_yll_global.pdf", ] +# Plots that depend on the health module: the YLL/risk plots, the marginal-YLL +# damages plot, and the consumption maps/balance that aggregate countries into +# health clusters (produced by prepare_health_costs). These are only requested +# by `rule all` when health is enabled; the model still builds, solves, and +# produces every other output with health off and no IHME GBD data on disk. +if health_required(): + ALL_RULE_TARGETS += [ + "/{name}/plots/relative_risk_curves.pdf", + "/{name}/plots/scen-default/consumption_balance.pdf", + "/{name}/plots/scen-default/yll_global_by_cause.pdf", + "/{name}/plots/scen-default/food_consumption_map.pdf", + "/{name}/plots/scen-default/food_consumption_baseline_map.pdf", + "/{name}/plots/scen-default/marginal_yll_global.pdf", + ] + # Solve all configured scenarios without triggering plotting targets. SOLVE_ALL_SCENARIOS_TARGETS = [ f"/{{name}}/solved/model_scen-{scenario}.nc" diff --git a/workflow/rules/analysis.smk b/workflow/rules/analysis.smk index c606c619..95e59fb0 100644 --- a/workflow/rules/analysis.smk +++ b/workflow/rules/analysis.smk @@ -176,12 +176,34 @@ else: network="/{name}/solved/model_scen-{scenario}.nc", food_groups="data/curated/food_groups.csv", m49_codes="data/curated/M49-codes.csv", - risk_breakpoints="/{name}/health/risk_breakpoints.csv", - health_cluster_cause="/{name}/health/cluster_cause_baseline.csv", - health_cause_log="/{name}/health/cause_log_breakpoints.csv", - health_clusters="/{name}/health/country_clusters.csv", population="/{name}/population.csv", - tmrel="/{name}/health/tmrel.csv", + # Health processing inputs only when this scenario enables health; + # otherwise analyze_model writes empty health outputs. + risk_breakpoints=lambda w: ( + f"/{w.name}/health/risk_breakpoints.csv" + if get_effective_config(w.scenario)["health"]["enabled"] + else [] + ), + health_cluster_cause=lambda w: ( + f"/{w.name}/health/cluster_cause_baseline.csv" + if get_effective_config(w.scenario)["health"]["enabled"] + else [] + ), + health_cause_log=lambda w: ( + f"/{w.name}/health/cause_log_breakpoints.csv" + if get_effective_config(w.scenario)["health"]["enabled"] + else [] + ), + health_clusters=lambda w: ( + f"/{w.name}/health/country_clusters.csv" + if get_effective_config(w.scenario)["health"]["enabled"] + else [] + ), + tmrel=lambda w: ( + f"/{w.name}/health/tmrel.csv" + if get_effective_config(w.scenario)["health"]["enabled"] + else [] + ), analysis_scripts=_ANALYSIS_SCRIPTS, params: ghg_price=lambda w: get_effective_config(w.scenario)["emissions"][ @@ -189,6 +211,9 @@ else: ], ch4_gwp=config["emissions"]["ch4_to_co2_factor"], n2o_gwp=config["emissions"]["n2o_to_co2_factor"], + health_enabled=lambda w: get_effective_config(w.scenario)["health"][ + "enabled" + ], value_per_yll=lambda w: get_effective_config(w.scenario)["health"][ "value_per_yll" ], diff --git a/workflow/rules/common.smk b/workflow/rules/common.smk index d0bb3d30..64730a8a 100644 --- a/workflow/rules/common.smk +++ b/workflow/rules/common.smk @@ -70,6 +70,85 @@ def get_effective_config(scenario_name): return eff_config +def health_required(): + """True if the health module is enabled in the base config or any scenario. + + The build step is scenario-independent, so health data-prep rules, health + stores, and downstream health analysis/plots are needed whenever *any* + configured scenario (or the base config) enables health. health.enabled is + a solve-time-overridable key. + """ + if config["health"]["enabled"]: + return True + return any(get_effective_config(s)["health"]["enabled"] for s in list_scenarios()) + + +def gbd_anchoring_enabled(): + """Resolve diet.anchor_groups_to_gbd to a bool. + + The sentinel "match_health" follows the base config's health.enabled. + Anchoring is applied in the scenario-independent baseline-diet prep, so the + sentinel resolves against the base config (not per-scenario). + """ + value = config["diet"]["anchor_groups_to_gbd"] + if value == "match_health": + return bool(config["health"]["enabled"]) + return bool(value) + + +def gbd_data_required(): + """True if the manually-downloaded IHME GBD data is needed by this run. + + The GBD intake/mortality prep (and thus the IHME source files) is required + whenever the baseline diet is anchored to GBD or the health module is + enabled in any scenario. + """ + return gbd_anchoring_enabled() or health_required() + + +def assert_gbd_data_available(): + """Fail early with actionable guidance if GBD data is needed but absent. + + Snakemake would otherwise report a terse "missing input" deep in the DAG. + Only enforced when gbd_data_required(); a health-off, anchoring-off run + needs none of these files. + """ + if not gbd_data_required(): + return + year = config["baseline_year"] + required = { + f"data/manually_downloaded/IHME-GBD_2023-death-rates-{year}.csv": ( + "IHME GBD mortality / national-location list" + ), + "data/manually_downloaded/IHME_GBD_2023_RISK_EXPOSURE_DIET_1": ( + "IHME GBD dietary risk-exposure archive (part 1)" + ), + "data/manually_downloaded/IHME_GBD_2023_RISK_EXPOSURE_DIET_2": ( + "IHME GBD dietary risk-exposure archive (part 2)" + ), + } + missing = [(p, desc) for p, desc in required.items() if not Path(p).exists()] + if not missing: + return + reasons = [] + if health_required(): + reasons.append("health.enabled is true (base config or a scenario)") + if gbd_anchoring_enabled(): + reasons.append("diet.anchor_groups_to_gbd resolves to true") + listing = "\n".join(f" - {p} ({desc})" for p, desc in missing) + raise FileNotFoundError( + "This run needs the manually-downloaded IHME GBD data because " + + " and ".join(reasons) + + ", but the following are missing:\n" + + listing + + "\n\nEither place the files (see data/manually_downloaded/README.md " + "for the GBD Results Tool queries), or run without GBD data by setting " + "health.enabled: false and diet.anchor_groups_to_gbd: false in your " + "config. Note that disabling anchoring changes the baseline diet " + "(see docs/current_diets.rst)." + ) + + # SOLVE_TIME_CONFIG_PREFIXES, _leaf_keys, _is_solve_time_key, and # validate_scenario_overrides are imported from workflow.scripts.solve_namespace # above so cluster manifest export, in-process calibration drivers, and this @@ -78,6 +157,7 @@ def get_effective_config(scenario_name): validate_scenario_overrides(load_scenario_defs()) validate_scenario_config_schemas(config, load_scenario_defs(), Path.cwd()) +assert_gbd_data_available() def scenario_override_hash(scenario_name): diff --git a/workflow/rules/deviation_penalty.smk b/workflow/rules/deviation_penalty.smk index c9c4266b..21801c8c 100644 --- a/workflow/rules/deviation_penalty.smk +++ b/workflow/rules/deviation_penalty.smk @@ -22,13 +22,41 @@ if _dp_cal_cfg["generate"]: baseline_diet=f"/{name}/baseline_diet.csv", m49="data/curated/M49-codes.csv", food_groups="data/curated/food_groups.csv", - health_risk_breakpoints=f"/{name}/health/risk_breakpoints.csv", - health_cluster_cause=f"/{name}/health/cluster_cause_baseline.csv", - health_cause_log=f"/{name}/health/cause_log_breakpoints.csv", - health_cluster_summary=f"/{name}/health/cluster_summary.csv", - health_clusters=f"/{name}/health/country_clusters.csv", - health_tmrel=f"/{name}/health/tmrel.csv", - health_cluster_risk_baseline=f"/{name}/health/cluster_risk_baseline.csv", + # Health inputs only when health is enabled (the solve prices health + # only then); omitted otherwise so no IHME GBD data is required. + health_risk_breakpoints=( + f"/{name}/health/risk_breakpoints.csv" + if health_required() + else [] + ), + health_cluster_cause=( + f"/{name}/health/cluster_cause_baseline.csv" + if health_required() + else [] + ), + health_cause_log=( + f"/{name}/health/cause_log_breakpoints.csv" + if health_required() + else [] + ), + health_cluster_summary=( + f"/{name}/health/cluster_summary.csv" + if health_required() + else [] + ), + health_clusters=( + f"/{name}/health/country_clusters.csv" + if health_required() + else [] + ), + health_tmrel=( + f"/{name}/health/tmrel.csv" if health_required() else [] + ), + health_cluster_risk_baseline=( + f"/{name}/health/cluster_risk_baseline.csv" + if health_required() + else [] + ), nutrition="data/curated/nutrition.csv", output: calibrated_yaml=_dp_cal_cfg["calibrated_yaml"], diff --git a/workflow/rules/diet.smk b/workflow/rules/diet.smk index 6e039723..2d9b1464 100644 --- a/workflow/rules/diet.smk +++ b/workflow/rules/diet.smk @@ -317,7 +317,14 @@ rule estimate_baseline_diet: """ input: dietary_intake="/{name}/dietary_intake.csv", - gbd_exposure="/{name}/gbd_food_group_intake.csv", + # Only required when the baseline diet anchors risk-factor groups to + # GBD intake exposure; otherwise the GDD/FAOSTAT estimate is used and + # the IHME GBD data is not needed. See diet.anchor_groups_to_gbd. + gbd_exposure=( + "/{name}/gbd_food_group_intake.csv" + if gbd_anchoring_enabled() + else [] + ), kcal_target="/{name}/gdd_ia_kcal_target.csv", nutrition="data/curated/nutrition.csv", fbs_items="/{name}/faostat_fbs_items.csv", @@ -344,7 +351,10 @@ rule estimate_baseline_diet: food_groups_included=config["food_groups"]["included"], byproducts=config["byproducts"], fbs_override_foods=config["diet"]["fbs_override_foods"], - gbd_anchored_groups=config["health"]["risk_factors"], + # Empty when anchoring is off -> baseline diet uses GDD/FAOSTAT only. + gbd_anchored_groups=( + config["health"]["risk_factors"] if gbd_anchoring_enabled() else [] + ), source_basis=config["diet"]["source_basis"], weight_conversion=config["weight_conversion"], # Used to drop foods whose entire production graph is unreachable diff --git a/workflow/rules/model.smk b/workflow/rules/model.smk index ff717356..033322ab 100644 --- a/workflow/rules/model.smk +++ b/workflow/rules/model.smk @@ -149,7 +149,12 @@ rule build_model: population="/{name}/population.csv", baseline_diet="/{name}/dietary_intake.csv", baseline_diet_validation="/{name}/baseline_diet_validation.csv", - baseline_diet_risk_comparison="/{name}/baseline_diet_risk_comparison.csv", + # GBD baseline-diet consistency check; only built when health is on. + baseline_diet_risk_comparison=( + "/{name}/baseline_diet_risk_comparison.csv" + if health_required() + else [] + ), food_loss_waste="/{name}/food_loss_waste.csv", costs="/{name}/faostat_crop_costs.csv", animal_costs="/{name}/animal_costs.csv", @@ -161,9 +166,23 @@ rule build_model: faostat_pasture_area="/{name}/faostat_pasture_area.csv", current_grassland_area="/{name}/luc/current_grassland_area_by_class.csv", grazing_only_land="/{name}/land_grazing_only_by_class.csv", - health_cluster_summary="/{name}/health/cluster_summary.csv", - health_cluster_cause="/{name}/health/cluster_cause_baseline.csv", - health_clusters="/{name}/health/country_clusters.csv", + # Health-cluster stores are only added when health is enabled (in the + # base config or any scenario, since the build is scenario-independent). + health_cluster_summary=( + "/{name}/health/cluster_summary.csv" + if health_required() + else [] + ), + health_cluster_cause=( + "/{name}/health/cluster_cause_baseline.csv" + if health_required() + else [] + ), + health_clusters=( + "/{name}/health/country_clusters.csv" + if health_required() + else [] + ), build_scripts=expand( "workflow/scripts/build_model/{script}", script=[ @@ -204,6 +223,9 @@ rule build_model: validation=config["validation"], deviation_penalty=config["deviation_penalty"], netcdf=config["netcdf"], + # Add health-cluster stores when health is enabled in the base config or + # any scenario (the build is shared across scenarios). + health_enabled=health_required(), output: network="/{name}/build/model.nc", group: @@ -227,19 +249,29 @@ def solve_model_inputs(w): inputs = { "network": f"/{w.name}/build/model.nc", "m49": "data/curated/M49-codes.csv", - "health_risk_breakpoints": f"/{w.name}/health/risk_breakpoints.csv", - "health_cluster_cause": f"/{w.name}/health/cluster_cause_baseline.csv", - "health_cause_log": f"/{w.name}/health/cause_log_breakpoints.csv", - "health_cluster_summary": f"/{w.name}/health/cluster_summary.csv", - "health_clusters": f"/{w.name}/health/country_clusters.csv", - "health_tmrel": f"/{w.name}/health/tmrel.csv", - "health_cluster_risk_baseline": f"/{w.name}/health/cluster_risk_baseline.csv", "food_groups": "data/curated/food_groups.csv", "baseline_diet": f"/{w.name}/baseline_diet.csv", } - # Add food incentives input if enabled for this scenario eff_cfg = get_effective_config(w.scenario) + + # Health processing inputs are only consumed when this scenario enables + # health (add_health_objective / post-hoc evaluation). Omitted otherwise so + # the solve needs none of the GBD-derived health artefacts. + if eff_cfg["health"]["enabled"]: + inputs.update( + { + "health_risk_breakpoints": f"/{w.name}/health/risk_breakpoints.csv", + "health_cluster_cause": f"/{w.name}/health/cluster_cause_baseline.csv", + "health_cause_log": f"/{w.name}/health/cause_log_breakpoints.csv", + "health_cluster_summary": f"/{w.name}/health/cluster_summary.csv", + "health_clusters": f"/{w.name}/health/country_clusters.csv", + "health_tmrel": f"/{w.name}/health/tmrel.csv", + "health_cluster_risk_baseline": f"/{w.name}/health/cluster_risk_baseline.csv", + } + ) + + # Add food incentives input if enabled for this scenario if eff_cfg["food_incentives"]["enabled"]: sources = eff_cfg["food_incentives"]["sources"] if not sources: diff --git a/workflow/scripts/analysis/analyze_model.py b/workflow/scripts/analysis/analyze_model.py index 4b9e65ac..58af6d64 100644 --- a/workflow/scripts/analysis/analyze_model.py +++ b/workflow/scripts/analysis/analyze_model.py @@ -82,18 +82,19 @@ def run_analysis( output_paths: dict[str, str], food_groups_path: str, m49_codes_path: str, - risk_breakpoints_path: str, - health_cluster_cause_path: str, - health_cause_log_path: str, - health_clusters_path: str, population_path: str, - tmrel_path: str, ghg_price: float, ch4_gwp: float, n2o_gwp: float, value_per_yll: float, health_risk_factors: list[str], logger: logging.Logger, + health_enabled: bool, + risk_breakpoints_path: str | None = None, + health_cluster_cause_path: str | None = None, + health_cause_log_path: str | None = None, + health_clusters_path: str | None = None, + tmrel_path: str | None = None, ) -> None: """Run all analysis extractions on a solved network. @@ -164,25 +165,34 @@ def run_analysis( ghg_attribution_totals = compute_ghg_totals(ghg_attribution) # --- Health impacts --- - logger.info("Extracting health impacts...") - risk_factors = list(health_risk_factors) - health_data = load_health_data( - { - "risk_breakpoints": risk_breakpoints_path, - "health_cluster_cause": health_cluster_cause_path, - "health_cause_log": health_cause_log_path, - "health_clusters": health_clusters_path, - "population": population_path, - "tmrel": tmrel_path, - } - ) - health_marginals = compute_health_marginals(n, health_data, risk_factors) - health_marginals = add_health_monetary_value(health_marginals, value_per_yll) - health_marginals = health_marginals.sort_values( - ["country", "food_group"] - ).reset_index(drop=True) - health_totals = extract_yll_totals(n) - health_attribution = compute_health_attribution(n, health_data, risk_factors) + # Only computed when health is enabled for this scenario; otherwise the + # network carries no YLL stores and the GBD-derived health processing + # inputs are absent, so the health outputs are written empty. + if health_enabled: + logger.info("Extracting health impacts...") + risk_factors = list(health_risk_factors) + health_data = load_health_data( + { + "risk_breakpoints": risk_breakpoints_path, + "health_cluster_cause": health_cluster_cause_path, + "health_cause_log": health_cause_log_path, + "health_clusters": health_clusters_path, + "population": population_path, + "tmrel": tmrel_path, + } + ) + health_marginals = compute_health_marginals(n, health_data, risk_factors) + health_marginals = add_health_monetary_value(health_marginals, value_per_yll) + health_marginals = health_marginals.sort_values( + ["country", "food_group"] + ).reset_index(drop=True) + health_totals = extract_yll_totals(n) + health_attribution = compute_health_attribution(n, health_data, risk_factors) + else: + logger.info("Health disabled for this scenario; writing empty health outputs.") + health_marginals = pd.DataFrame() + health_totals = pd.DataFrame() + health_attribution = pd.DataFrame() # Write all outputs output_dir = Path(output_paths["crop_production"]).parent @@ -267,23 +277,33 @@ def main() -> None: and getattr(snakemake.output, attr).endswith(".parquet") } + health_enabled = bool(snakemake.params.health_enabled) run_analysis( n, output_paths=output_paths, food_groups_path=snakemake.input.food_groups, m49_codes_path=snakemake.input.m49_codes, - risk_breakpoints_path=snakemake.input.risk_breakpoints, - health_cluster_cause_path=snakemake.input.health_cluster_cause, - health_cause_log_path=snakemake.input.health_cause_log, - health_clusters_path=snakemake.input.health_clusters, population_path=snakemake.input.population, - tmrel_path=snakemake.input.tmrel, ghg_price=float(snakemake.params.ghg_price), ch4_gwp=float(snakemake.params.ch4_gwp), n2o_gwp=float(snakemake.params.n2o_gwp), value_per_yll=float(snakemake.params.value_per_yll), health_risk_factors=list(snakemake.params.health_risk_factors), logger=logger, + health_enabled=health_enabled, + risk_breakpoints_path=snakemake.input.risk_breakpoints + if health_enabled + else None, + health_cluster_cause_path=( + snakemake.input.health_cluster_cause if health_enabled else None + ), + health_cause_log_path=( + snakemake.input.health_cause_log if health_enabled else None + ), + health_clusters_path=snakemake.input.health_clusters + if health_enabled + else None, + tmrel_path=snakemake.input.tmrel if health_enabled else None, ) diff --git a/workflow/scripts/build_model.py b/workflow/scripts/build_model.py index 091a4065..6cb5a24f 100644 --- a/workflow/scripts/build_model.py +++ b/workflow/scripts/build_model.py @@ -1189,25 +1189,31 @@ def _apply_yield_corrections(corr_df: pd.DataFrame, source_label: str) -> None: hub_centers=hub_centers, ) - health.add_health_stores( - n, - snakemake.input.health_cluster_summary, - snakemake.input.health_cluster_cause, - snakemake.config["health"], - ) + # Health-cluster stores are only added when health is enabled (in the base + # config or any scenario). When disabled, the health processing inputs are + # absent and no health data is needed on disk. + if snakemake.params.health_enabled: + health.add_health_stores( + n, + snakemake.input.health_cluster_summary, + snakemake.input.health_cluster_cause, + snakemake.config["health"], + ) - # Compute and store health cluster populations from country populations - cluster_map_df = read_csv(snakemake.input.health_clusters) - cluster_lookup = ( - cluster_map_df.set_index("country_iso3")["health_cluster"].astype(int).to_dict() - ) - pop_df = population.reset_index() - pop_df.columns = ["iso3", "population"] - pop_df["cluster"] = pop_df["iso3"].str.upper().map(cluster_lookup) - pop_df = pop_df.dropna(subset=["cluster"]) - pop_df["cluster"] = pop_df["cluster"].astype(int) - cluster_pop = pop_df.groupby("cluster")["population"].sum().to_dict() - n.meta["population"]["health_cluster"] = cluster_pop + # Compute and store health cluster populations from country populations + cluster_map_df = read_csv(snakemake.input.health_clusters) + cluster_lookup = ( + cluster_map_df.set_index("country_iso3")["health_cluster"] + .astype(int) + .to_dict() + ) + pop_df = population.reset_index() + pop_df.columns = ["iso3", "population"] + pop_df["cluster"] = pop_df["iso3"].str.upper().map(cluster_lookup) + pop_df = pop_df.dropna(subset=["cluster"]) + pop_df["cluster"] = pop_df["cluster"].astype(int) + cluster_pop = pop_df.groupby("cluster")["population"].sum().to_dict() + n.meta["population"]["health_cluster"] = cluster_pop # Store build-time regional_limit in metadata so solve_model can rescale n.meta["land_regional_limit"] = float(land_cfg["regional_limit"]) diff --git a/workflow/scripts/estimate_baseline_diet.py b/workflow/scripts/estimate_baseline_diet.py index 4d72b505..acc998f5 100644 --- a/workflow/scripts/estimate_baseline_diet.py +++ b/workflow/scripts/estimate_baseline_diet.py @@ -222,20 +222,27 @@ def load_group_totals( gdd_totals = intake.set_index(["country", "food_group"])["value"] # Load GBD dietary risk exposure (aggregate duplicates if any) and - # convert to the model's basis where it differs. - gbd = pd.read_csv(gbd_exposure_path) - gbd = convert_intake( - gbd, - source="gbd", - value_column="consumption_g_per_day", - group_column="food_group", - country_column="country", - source_basis=source_basis, - source_basis_country_overrides=source_basis_country_overrides, - target_basis_by_key=group_basis, - factors=weight_conversion, - ) - gbd_totals = gbd.groupby(["country", "food_group"])["consumption_g_per_day"].mean() + # convert to the model's basis where it differs. Skipped entirely when no + # groups are anchored (anchoring off): the GBD intake file is then not an + # input and the GDD/FAOSTAT estimate is used for every group. + if gbd_anchored_groups: + gbd = pd.read_csv(gbd_exposure_path) + gbd = convert_intake( + gbd, + source="gbd", + value_column="consumption_g_per_day", + group_column="food_group", + country_column="country", + source_basis=source_basis, + source_basis_country_overrides=source_basis_country_overrides, + target_basis_by_key=group_basis, + factors=weight_conversion, + ) + gbd_totals = gbd.groupby(["country", "food_group"])[ + "consumption_g_per_day" + ].mean() + else: + gbd_totals = pd.Series(dtype=float) # Build combined group totals: GBD-anchored groups prefer GBD when # available, else fall back to GDD/FAOSTAT; non-anchored groups @@ -261,7 +268,11 @@ def load_group_totals( } ) - log_gdd_gbd_agreement(gdd_totals, gbd_totals, gbd_anchored_groups) + # Cross-validation logging only applies when groups are anchored to GBD; + # with anchoring off gbd_totals is empty (no MultiIndex) and there is + # nothing to compare. + if gbd_anchored_groups: + log_gdd_gbd_agreement(gdd_totals, gbd_totals, gbd_anchored_groups) return pd.DataFrame(results) @@ -1547,13 +1558,18 @@ def main(): # gdd_ia_kcal_target.csv. kcal_target_df = pd.read_csv(kcal_target_path) nutrition_df = pd.read_csv(nutrition_path) - kcal_per_g_group = build_kcal_per_g_group(food_groups_df, nutrition_df) kcal_per_g_food = build_kcal_per_g_food(nutrition_df) - group_totals = apply_cereal_residual_fix( - group_totals, - kcal_target_df, - kcal_per_g_group, - ) + # The cereal residual fix only compensates for energy lost to GBD's narrow + # whole_grain anchor, so it applies only when whole_grains is anchored. + # With anchoring off the GDD/FAOSTAT whole_grains total already carries the + # broad cereal energy and no reallocation to refined grain is needed. + if WHOLE_GRAINS_GROUP in gbd_anchored_groups: + kcal_per_g_group = build_kcal_per_g_group(food_groups_df, nutrition_df) + group_totals = apply_cereal_residual_fix( + group_totals, + kcal_target_df, + kcal_per_g_group, + ) # Step 1c is performed after per-food disaggregation (Step 3) so the # kcal accounting uses food-level energy densities rather than a diff --git a/workflow/scripts/solve_and_analyze_model.py b/workflow/scripts/solve_and_analyze_model.py index 86991842..41545063 100644 --- a/workflow/scripts/solve_and_analyze_model.py +++ b/workflow/scripts/solve_and_analyze_model.py @@ -39,23 +39,33 @@ def main() -> None: and getattr(snakemake.output, attr).endswith(".parquet") } + health_enabled = bool(snakemake.params.health_enabled) run_analysis( n, output_paths=output_paths, food_groups_path=snakemake.input.food_groups, m49_codes_path=snakemake.input.m49, - risk_breakpoints_path=snakemake.input.health_risk_breakpoints, - health_cluster_cause_path=snakemake.input.health_cluster_cause, - health_cause_log_path=snakemake.input.health_cause_log, - health_clusters_path=snakemake.input.health_clusters, population_path=snakemake.input.population, - tmrel_path=snakemake.input.health_tmrel, ghg_price=float(snakemake.params.ghg_price), ch4_gwp=float(snakemake.params.ch4_gwp), n2o_gwp=float(snakemake.params.n2o_gwp), value_per_yll=float(snakemake.params.health_value_per_yll), health_risk_factors=list(snakemake.params.health_risk_factors), logger=logger, + health_enabled=health_enabled, + risk_breakpoints_path=( + snakemake.input.health_risk_breakpoints if health_enabled else None + ), + health_cluster_cause_path=( + snakemake.input.health_cluster_cause if health_enabled else None + ), + health_cause_log_path=( + snakemake.input.health_cause_log if health_enabled else None + ), + health_clusters_path=( + snakemake.input.health_clusters if health_enabled else None + ), + tmrel_path=snakemake.input.health_tmrel if health_enabled else None, ) logger.info("Solve-and-analyze complete.") diff --git a/workflow/scripts/solve_namespace.py b/workflow/scripts/solve_namespace.py index 7b82a3e0..5f952196 100644 --- a/workflow/scripts/solve_namespace.py +++ b/workflow/scripts/solve_namespace.py @@ -319,23 +319,37 @@ def rp(path: str) -> str: inputs: dict = { "network": rp("/{name}/build/model.nc"), "m49": "data/curated/M49-codes.csv", - "health_risk_breakpoints": rp( - "/{name}/health/risk_breakpoints.csv" - ), - "health_cluster_cause": rp( - "/{name}/health/cluster_cause_baseline.csv" - ), - "health_cause_log": rp("/{name}/health/cause_log_breakpoints.csv"), - "health_cluster_summary": rp("/{name}/health/cluster_summary.csv"), - "health_clusters": rp("/{name}/health/country_clusters.csv"), - "health_tmrel": rp("/{name}/health/tmrel.csv"), - "health_cluster_risk_baseline": rp( - "/{name}/health/cluster_risk_baseline.csv" - ), "food_groups": "data/curated/food_groups.csv", "baseline_diet": rp("/{name}/baseline_diet.csv"), } + # Health processing inputs only when this scenario enables health (mirrors + # solve_model_inputs in workflow/rules/model.smk). + if eff["health"]["enabled"]: + inputs.update( + { + "health_risk_breakpoints": rp( + "/{name}/health/risk_breakpoints.csv" + ), + "health_cluster_cause": rp( + "/{name}/health/cluster_cause_baseline.csv" + ), + "health_cause_log": rp( + "/{name}/health/cause_log_breakpoints.csv" + ), + "health_cluster_summary": rp( + "/{name}/health/cluster_summary.csv" + ), + "health_clusters": rp( + "/{name}/health/country_clusters.csv" + ), + "health_tmrel": rp("/{name}/health/tmrel.csv"), + "health_cluster_risk_baseline": rp( + "/{name}/health/cluster_risk_baseline.csv" + ), + } + ) + if eff["food_incentives"]["enabled"]: sources = eff["food_incentives"]["sources"] if not sources: From 0c8356e7c132e16c9fa1e9861cffbda7891250ff Mon Sep 17 00:00:00 2001 From: Koen van Greevenbroek Date: Wed, 1 Jul 2026 15:36:04 -0700 Subject: [PATCH 2/8] fix: record resolved GBD anchoring in calibration provenance and pin it across calibration steps The provenance snapshot stored diet.anchor_groups_to_gbd verbatim, so the sentinel "match_health" -- which resolves through the solve-time-exempt health.enabled -- let two configs with different resolved anchoring (and thus different baseline diets) stamp identically. Snapshot the resolved boolean instead, via a shared resolve_gbd_anchoring() helper. The calibration step configs previously pinned anchoring on to reproduce the committed anchored artefacts, while the stamp is computed from the base config alone; the step configs also override health.enabled for solver performance, which would re-resolve the sentinel per step and mix baseline diets within one chain. Drop the pins and have tools/calibrate resolve anchoring from the base config once, pinning it in every step via a config overlay. --- config/calibration/cost.yaml | 12 ++++-------- config/calibration/feed.yaml | 5 ----- config/calibration/food_demand.yaml | 5 ----- config/calibration/food_waste.yaml | 5 ----- config/calibration/stability.yaml | 5 ----- tools/calibrate | 16 ++++++++++++++-- workflow/rules/common.smk | 13 +++---------- workflow/scripts/solve_namespace.py | 15 +++++++++++++++ .../scripts/write_calibration_provenance.py | 18 +++++++++++++++++- workflow/validation/calibration_provenance.py | 9 ++++++++- 10 files changed, 61 insertions(+), 42 deletions(-) diff --git a/config/calibration/cost.yaml b/config/calibration/cost.yaml index b4169819..df1d2529 100644 --- a/config/calibration/cost.yaml +++ b/config/calibration/cost.yaml @@ -128,14 +128,10 @@ validation: water: supply_scenario: "current_use" -# The committed calibration artefacts under data/curated/calibration/ were -# generated against the GBD-anchored baseline diet, so anchoring is pinned on -# here to reproduce them. Regenerating against the new default (anchoring off) -# is deferred to a dedicated calibration PR -- the dual-based cost/stability -# steps require Gurobi. See docs/calibration.rst. -diet: - anchor_groups_to_gbd: true - +# NB: diet.anchor_groups_to_gbd is pinned to the base config's resolved value +# by tools/calibrate (via a config overlay), so disabling health here for +# solver performance does not change the baseline diet the artefacts are fit +# against. health: enabled: false diff --git a/config/calibration/feed.yaml b/config/calibration/feed.yaml index 994ea0d3..abcf40b6 100644 --- a/config/calibration/feed.yaml +++ b/config/calibration/feed.yaml @@ -75,11 +75,6 @@ exogenous_feed_calibration: generate: true # Generate calibration from solved model scenario: feed_uncal -# Anchoring pinned on to reproduce the committed (anchored) calibration -# artefacts; anchor-off regeneration is deferred to a dedicated PR (see cost.yaml). -diet: - anchor_groups_to_gbd: true - health: value_per_yll: 0 diff --git a/config/calibration/food_demand.yaml b/config/calibration/food_demand.yaml index 48cbce7e..a140a4a0 100644 --- a/config/calibration/food_demand.yaml +++ b/config/calibration/food_demand.yaml @@ -68,11 +68,6 @@ food_demand_calibration: generate: true # Generate calibration from solved model scenario: "food_demand_uncal" -# Anchoring pinned on to reproduce the committed (anchored) calibration -# artefacts; anchor-off regeneration is deferred to a dedicated PR (see cost.yaml). -diet: - anchor_groups_to_gbd: true - health: value_per_yll: 0 diff --git a/config/calibration/food_waste.yaml b/config/calibration/food_waste.yaml index 8b9a453f..d981bcd0 100644 --- a/config/calibration/food_waste.yaml +++ b/config/calibration/food_waste.yaml @@ -66,11 +66,6 @@ food_loss_waste_calibration: generate: true # Generate calibration from solved model scenario: "food_waste_uncal" -# Anchoring pinned on to reproduce the committed (anchored) calibration -# artefacts; anchor-off regeneration is deferred to a dedicated PR (see cost.yaml). -diet: - anchor_groups_to_gbd: true - health: value_per_yll: 0 diff --git a/config/calibration/stability.yaml b/config/calibration/stability.yaml index 654431ac..12bdb0c0 100644 --- a/config/calibration/stability.yaml +++ b/config/calibration/stability.yaml @@ -32,11 +32,6 @@ baseline_year: 2020 # consumer values reflect raw revealed preferences). emissions: ghg_price: 0 -# Anchoring pinned on to reproduce the committed (anchored) calibration -# artefacts; anchor-off regeneration is deferred to a dedicated PR (see cost.yaml). -diet: - anchor_groups_to_gbd: true - health: value_per_yll: 0 diff --git a/tools/calibrate b/tools/calibrate index d1f4cbc2..35169f03 100755 --- a/tools/calibrate +++ b/tools/calibrate @@ -84,6 +84,18 @@ fi CAL_DIR="data/curated/calibration/${SOURCE}" echo "[calibrate] artefact set: ${SOURCE} (${CAL_DIR})" +# The artefacts are fit against the base config's baseline diet, which depends +# on the resolved diet.anchor_groups_to_gbd (the "match_health" sentinel +# follows health.enabled). The calibration step configs override health.enabled +# for solver performance, which would otherwise re-resolve the sentinel per +# step and mix baseline diets within one chain. Resolve anchoring from the +# base config once and pin it explicitly in every step via a config overlay. +ANCHORING=$(pixi run python "$PROVENANCE_SCRIPT" "${BASE_ARGS[@]}" --print-anchoring | tail -n1) +ANCHOR_OVERLAY=$(mktemp -t calibrate-anchoring-XXXXXX.yaml) +trap 'rm -f "$ANCHOR_OVERLAY"' EXIT +printf 'diet:\n anchor_groups_to_gbd: %s\n' "$ANCHORING" > "$ANCHOR_OVERLAY" +echo "[calibrate] baseline-diet GBD anchoring: ${ANCHORING}" + # Non-default sources get their own processing/results tree so repeated # calibrations against different bases don't thrash each other's builds. NAME_OVERRIDE=() @@ -169,7 +181,7 @@ run_step() { inject_env local cfgfiles=() [[ -n "$BASE_CONFIG" ]] && cfgfiles+=("$BASE_CONFIG") - cfgfiles+=("$configfile") + cfgfiles+=("$configfile" "$ANCHOR_OVERLAY") echo "[calibrate] $label -- tools/smk --configfile ${cfgfiles[*]} ${NAME_OVERRIDE[*]} ${PASSTHROUGH_ARR[*]} -- ${TARGETS_ARR[*]}" tools/smk --configfile "${cfgfiles[@]}" "${NAME_OVERRIDE[@]}" "${PASSTHROUGH_ARR[@]}" -- "${TARGETS_ARR[@]}" } @@ -181,7 +193,7 @@ check_step() { inject_env local cfgfiles=() [[ -n "$BASE_CONFIG" ]] && cfgfiles+=("$BASE_CONFIG") - cfgfiles+=("$configfile") + cfgfiles+=("$configfile" "$ANCHOR_OVERLAY") local out if ! out=$(tools/smk --configfile "${cfgfiles[@]}" "${NAME_OVERRIDE[@]}" "${PASSTHROUGH_ARR[@]}" --dry-run --quiet -- "${TARGETS_ARR[@]}" 2>&1); then printf ' [error] %-10s (dry-run failed; see output below)\n' "$label" diff --git a/workflow/rules/common.smk b/workflow/rules/common.smk index 64730a8a..f4ba0629 100644 --- a/workflow/rules/common.smk +++ b/workflow/rules/common.smk @@ -20,6 +20,7 @@ from workflow.scripts.solve_namespace import ( _is_solve_time_key, _leaf_keys, deviation_penalty_uses_calibrated, + resolve_gbd_anchoring, validate_scenario_config_schemas, validate_scenario_overrides, ) @@ -84,16 +85,8 @@ def health_required(): def gbd_anchoring_enabled(): - """Resolve diet.anchor_groups_to_gbd to a bool. - - The sentinel "match_health" follows the base config's health.enabled. - Anchoring is applied in the scenario-independent baseline-diet prep, so the - sentinel resolves against the base config (not per-scenario). - """ - value = config["diet"]["anchor_groups_to_gbd"] - if value == "match_health": - return bool(config["health"]["enabled"]) - return bool(value) + """Resolve diet.anchor_groups_to_gbd for this run (see resolve_gbd_anchoring).""" + return resolve_gbd_anchoring(config) def gbd_data_required(): diff --git a/workflow/scripts/solve_namespace.py b/workflow/scripts/solve_namespace.py index 5f952196..95968390 100644 --- a/workflow/scripts/solve_namespace.py +++ b/workflow/scripts/solve_namespace.py @@ -72,6 +72,21 @@ def _is_solve_time_key(key): return any(key == p or key.startswith(p + ".") for p in SOLVE_TIME_CONFIG_PREFIXES) +def resolve_gbd_anchoring(config: dict) -> bool: + """Resolve diet.anchor_groups_to_gbd to a bool. + + The sentinel "match_health" follows health.enabled. Anchoring is applied + in the scenario-independent baseline-diet prep, so it always resolves + against the base config (never per-scenario). Single source of truth + shared by the Snakemake rules (workflow/rules/common.smk), the calibration + provenance snapshot, and tools/calibrate. + """ + value = config["diet"]["anchor_groups_to_gbd"] + if value == "match_health": + return bool(config["health"]["enabled"]) + return bool(value) + + def validate_scenario_overrides(scenario_defs: dict) -> None: """Raise if any scenario in scenario_defs overrides a structural config key. diff --git a/workflow/scripts/write_calibration_provenance.py b/workflow/scripts/write_calibration_provenance.py index 2cc5aebb..b1cf54b8 100644 --- a/workflow/scripts/write_calibration_provenance.py +++ b/workflow/scripts/write_calibration_provenance.py @@ -25,7 +25,10 @@ PROJECT_ROOT = Path(__file__).resolve().parents[2] sys.path.insert(0, str(PROJECT_ROOT)) -from workflow.scripts.solve_namespace import load_merged_config # noqa: E402 +from workflow.scripts.solve_namespace import ( # noqa: E402 + load_merged_config, + resolve_gbd_anchoring, +) from workflow.validation.calibration_provenance import ( # noqa: E402 diff_snapshots, load_provenance, @@ -63,6 +66,15 @@ def main() -> int: action="store_true", help="Print the resolved calibration.source and exit (used by tools/calibrate)", ) + parser.add_argument( + "--print-anchoring", + action="store_true", + help=( + "Print the base config's resolved diet.anchor_groups_to_gbd " + "('true'/'false') and exit (used by tools/calibrate to pin every " + "calibration step to the base config's baseline diet)" + ), + ) args = parser.parse_args() configfiles = [PROJECT_ROOT / "config" / "default.yaml"] @@ -75,6 +87,10 @@ def main() -> int: print(source) return 0 + if args.print_anchoring: + print("true" if resolve_gbd_anchoring(config) else "false") + return 0 + snapshot = structural_snapshot(config) if args.check: diff --git a/workflow/validation/calibration_provenance.py b/workflow/validation/calibration_provenance.py index 241937ad..b6ca7287 100644 --- a/workflow/validation/calibration_provenance.py +++ b/workflow/validation/calibration_provenance.py @@ -24,7 +24,7 @@ from snakemake.logging import logger import yaml -from workflow.scripts.solve_namespace import _is_solve_time_key +from workflow.scripts.solve_namespace import _is_solve_time_key, resolve_gbd_anchoring CALIBRATION_DIR = Path("data/curated/calibration") PROVENANCE_FILENAME = "provenance.yaml" @@ -95,6 +95,13 @@ def walk(node: dict, prefix: str) -> None: snapshot[full] = v walk(config, "") + # diet.anchor_groups_to_gbd may hold the sentinel "match_health", which + # resolves through health.enabled -- a solve-time (and therefore exempt) + # key. Snapshot the resolved boolean so two configs with different + # resolved anchoring (and thus different baseline diets) never stamp + # identically. + if "diet.anchor_groups_to_gbd" in snapshot: + snapshot["diet.anchor_groups_to_gbd"] = resolve_gbd_anchoring(config) return snapshot From eee00ed4d0765ce994e35a1f9498b7c5b1471eb5 Mon Sep 17 00:00:00 2001 From: Koen van Greevenbroek Date: Wed, 1 Jul 2026 15:36:21 -0700 Subject: [PATCH 3/8] fix: engage deviation-penalty warm start and keep the best Broyden iterate The warm-start param pointed at the calibrated_yaml output itself, which Snakemake deletes before the job runs, so calibration always cold-started from the seeds. Point it at a side copy written next to the trace instead. When the iteration hits max_iter without converging, return the iterate with the smallest residual rather than the last one: with a discontinuous deviation response (LP basis switches near the target), the final Broyden step can land worse than an earlier iterate. --- workflow/rules/common.smk | 1 + workflow/rules/deviation_penalty.smk | 16 +++++++++--- .../scripts/calibrate_deviation_penalty.py | 25 +++++++++++++++++-- 3 files changed, 36 insertions(+), 6 deletions(-) diff --git a/workflow/rules/common.smk b/workflow/rules/common.smk index f4ba0629..080f616f 100644 --- a/workflow/rules/common.smk +++ b/workflow/rules/common.smk @@ -21,6 +21,7 @@ from workflow.scripts.solve_namespace import ( _leaf_keys, deviation_penalty_uses_calibrated, resolve_gbd_anchoring, + resolve_pathvars, validate_scenario_config_schemas, validate_scenario_overrides, ) diff --git a/workflow/rules/deviation_penalty.smk b/workflow/rules/deviation_penalty.smk index 21801c8c..dd08e9fd 100644 --- a/workflow/rules/deviation_penalty.smk +++ b/workflow/rules/deviation_penalty.smk @@ -68,10 +68,18 @@ if _dp_cal_cfg["generate"]: tolerance=_dp_cal_cfg["tolerance"], max_iter=_dp_cal_cfg["max_iter"], trust_region_log=_dp_cal_cfg["trust_region_log"], - # Warm-start path is passed as a param (not an input) so Snakemake - # doesn't create a self-loop on calibrated_yaml. The script loads - # it iff the file exists on disk at run time. - previous_yaml=_dp_cal_cfg["calibrated_yaml"], + # Warm-start seed: a side copy of the previous calibrated yaml, + # written by the script next to the trace. It must not be the + # calibrated_yaml output itself (Snakemake deletes outputs before + # the job runs, so that warm start would never engage) nor a + # declared input/output (self-loop / same deletion). Pathvars are + # resolved explicitly since params are not path-expanded. The + # script loads it iff the file exists on disk at run time. + previous_yaml=str( + Path(resolve_pathvars(_trace_csv, PATH_ROOTS)).with_name( + "deviation_penalty_warm.yaml" + ) + ), name=name, resources: # Per-iteration solves use mem_mb / runtime configured in the diff --git a/workflow/scripts/calibrate_deviation_penalty.py b/workflow/scripts/calibrate_deviation_penalty.py index 2ebb801d..872794ff 100644 --- a/workflow/scripts/calibrate_deviation_penalty.py +++ b/workflow/scripts/calibrate_deviation_penalty.py @@ -335,7 +335,20 @@ def _record(iter_id: int, x_vec: np.ndarray, d_vec: np.ndarray, r_vec: np.ndarra x = x_new r = r_new - return x, trace + # Out of iterations: return the best iterate seen rather than the last. + # With a discontinuous deviation response (LP basis switches near the + # target), the final Broyden step can land worse than an earlier iterate. + best = min(trace, key=lambda row: row["resid_inf"]) + if best["iter"] != trace[-1]["iter"]: + logger.info( + "Not converged; returning best iterate %d (|r|_inf=%.4f) " + "instead of last (%.4f)", + best["iter"], + best["resid_inf"], + trace[-1]["resid_inf"], + ) + x_best = np.array([np.log(best[f"lambda_{c}"]) for c in components]) + return x_best, trace def main() -> None: @@ -407,7 +420,9 @@ def evaluate_fn(lambdas: dict[str, float], *, iter_id: int) -> dict[str, float]: components=components, ) l1_costs = {c: float(np.exp(x_final[i])) for i, c in enumerate(components)} - final_resid = trace[-1]["resid_inf"] + # The residual of the returned iterate: the converged row on success, + # otherwise the best iterate that broyden_iterate falls back to. + final_resid = min(row["resid_inf"] for row in trace) converged = final_resid < tol if not converged: @@ -449,6 +464,12 @@ def evaluate_fn(lambdas: dict[str, float], *, iter_id: int) -> dict[str, float]: sort_keys=False, ) out_path.write_text(header + body) + # Side copy for warm-starting the next calibration: Snakemake deletes the + # calibrated_yaml output before rerunning this rule, so the warm-start + # seed must live outside the rule's outputs. + prev_path = Path(smk.params.previous_yaml) + prev_path.parent.mkdir(parents=True, exist_ok=True) + prev_path.write_text(header + body) logger.info( "Wrote calibrated L1 costs to %s (%s, iters=%d, converged=%s)", out_path, From 13dc68210f513c04cbaebf22511b64ea374b9d6f Mon Sep 17 00:00:00 2001 From: Koen van Greevenbroek Date: Wed, 1 Jul 2026 15:37:01 -0700 Subject: [PATCH 4/8] fix: rescale refined grain in kcal normalisation when anchoring is off apply_kcal_normalisation unconditionally excluded the grain group from rescaling, justified by the cereal residual fix -- which only runs when whole_grains is GBD-anchored. With anchoring off, raw GDD-IA refined grain (the largest kcal group) stayed pinned while the entire kcal correction was forced onto the rest of the diet. In validation solves this produced a 246 Mt global dairy excess (demand squeezed to 585 Mt vs FAOSTAT supply ~919 Mt) and a 30 Mt rice-white shortage (demand pinned 38% above supply), and the distorted diet made the production-stability L1 calibration oscillate around a discontinuity without converging. Pin grain only when whole_grains is anchored, mirroring the residual-fix gating (and matching what docs/current_diets.rst already claimed). The anchored diet is unaffected. With the fix, the anchor-off grain composition lands within 1% of the anchored one, the dairy gap shrinks to the genuine GDD-vs-FBS disagreement (~176 Mt, absorbed by food_demand multipliers 1.23/1.31), rice-white needs no multiplier at all, and the stability calibration converges in 3 iterations. --- tests/test_baseline_diet.py | 57 ++++++++++++++++++---- workflow/scripts/estimate_baseline_diet.py | 14 ++++-- 2 files changed, 58 insertions(+), 13 deletions(-) diff --git a/tests/test_baseline_diet.py b/tests/test_baseline_diet.py index bedb17fb..9bc35c68 100644 --- a/tests/test_baseline_diet.py +++ b/tests/test_baseline_diet.py @@ -791,25 +791,30 @@ def test_food_level_kpg_hits_target_for_heterogeneous_group(self): ), f"country {c} kcal={kcal:.2f} != target 600" def test_anchored_groups_are_not_rescaled(self): - """Foods in GBD-anchored groups (and the refined-grain residual) - must keep their consumption unchanged; only unanchored foods get - scaled to close the residual. + """Foods in GBD-anchored groups (and, when whole_grains is anchored, + the refined-grain residual) must keep their consumption unchanged; + only unanchored foods get scaled to close the residual. """ baseline_diet = pd.DataFrame( { - "country": ["A", "A", "A"], - "food": ["red-meat-item", "cassava", "rice-white"], - "food_group": ["red_meat", "starchy_vegetable", "grain"], - "consumption_g_per_day": [50.0, 500.0, 100.0], + "country": ["A", "A", "A", "A"], + "food": ["red-meat-item", "cassava", "rice-white", "oat"], + "food_group": [ + "red_meat", + "starchy_vegetable", + "grain", + "whole_grains", + ], + "consumption_g_per_day": [50.0, 500.0, 100.0, 0.0], } ) - kpg = {"red-meat-item": 2.5, "cassava": 1.59, "rice-white": 3.65} + kpg = {"red-meat-item": 2.5, "cassava": 1.59, "rice-white": 3.65, "oat": 3.0} target_df = pd.DataFrame({"country": ["A"], "kcal_target_modelled": [1000.0]}) result = apply_kcal_normalisation( baseline_diet, target_df, - gbd_anchored_groups={"red_meat"}, + gbd_anchored_groups={"red_meat", "whole_grains"}, kcal_per_g_food=kpg, ) @@ -832,6 +837,40 @@ def test_anchored_groups_are_not_rescaled(self): # factor = 510/795 = 0.6415. post: 500 * 0.6415 = 320.75. assert cassava_g == pytest.approx(320.75, rel=1e-3) + def test_grain_rescaled_when_whole_grains_not_anchored(self): + """Without the whole-grain anchor the cereal residual fix does not + run, so refined grain is an ordinary estimate and must be rescaled + along with every other unanchored food (uniformly, here, since no + group is anchored at all). + """ + baseline_diet = pd.DataFrame( + { + "country": ["A", "A"], + "food": ["cassava", "rice-white"], + "food_group": ["starchy_vegetable", "grain"], + "consumption_g_per_day": [500.0, 100.0], + } + ) + kpg = {"cassava": 1.59, "rice-white": 3.65} + # Pre-scaling kcal = 500*1.59 + 100*3.65 = 1160; target 580 -> factor 0.5. + target_df = pd.DataFrame({"country": ["A"], "kcal_target_modelled": [580.0]}) + + result = apply_kcal_normalisation( + baseline_diet, + target_df, + gbd_anchored_groups=set(), + kcal_per_g_food=kpg, + ) + + grain_g = float( + result.loc[result["food"] == "rice-white", "consumption_g_per_day"].iloc[0] + ) + cassava_g = float( + result.loc[result["food"] == "cassava", "consumption_g_per_day"].iloc[0] + ) + assert grain_g == pytest.approx(50.0) + assert cassava_g == pytest.approx(250.0) + def test_missing_food_raises(self): """Foods without a kcal/g entry should raise; the previous code silently substituted zero. diff --git a/workflow/scripts/estimate_baseline_diet.py b/workflow/scripts/estimate_baseline_diet.py index acc998f5..dcdd5876 100644 --- a/workflow/scripts/estimate_baseline_diet.py +++ b/workflow/scripts/estimate_baseline_diet.py @@ -366,10 +366,16 @@ def apply_kcal_normalisation( otherwise overshoot the GDD-IA target by 20-35% per country and inflate Sub-Saharan Africa's per-capita intake by ~170 kcal/day. - Anchored groups: those in ``gbd_anchored_groups`` plus ``grain`` - (which is set by the cereal residual fix and shouldn't be rescaled). + Anchored groups: those in ``gbd_anchored_groups``, plus ``grain`` when + the cereal residual fix ran (i.e. when whole_grains is anchored): the + fix sets refined grain from the cereal energy budget, so rescaling it + would undo that. With anchoring off, grain is an ordinary GDD/FAOSTAT + estimate and is rescaled like every other group -- pinning it would + force the (largest) grain kcal share onto the rest of the diet. """ - anchored = set(gbd_anchored_groups) | {GRAIN_GROUP} + anchored = set(gbd_anchored_groups) + if WHOLE_GRAINS_GROUP in gbd_anchored_groups: + anchored |= {GRAIN_GROUP} targets = kcal_target_df.set_index("country")["kcal_target_modelled"] df = baseline_diet.copy() @@ -1165,7 +1171,7 @@ def _within_bucket_split(bucket_foods: list[str]) -> dict[str, float]: weights = {f: w / total for f, w in weights.items()} return weights equal = 1.0 / len(bucket_foods) - return {f: equal for f in bucket_foods} + return dict.fromkeys(bucket_foods, equal) shares: dict[str, float] = {} if total_production > 0: From 1e55125ba70891b522023987f00ce99d390ff860 Mon Sep 17 00:00:00 2001 From: Koen van Greevenbroek Date: Wed, 1 Jul 2026 15:37:25 -0700 Subject: [PATCH 5/8] feat: recalibrate the default artefact set anchor-off and add a gbd-anchored set The committed calibration artefacts were fit against the GBD-anchored baseline diet, while the default config now runs with anchoring off -- the regeneration deferred when health was made optional. Resolve the tangle by splitting the artefacts into two git-tracked sets: - default: regenerated with tools/calibrate against the anchoring-off baseline diet (with the kcal-normalisation grain fix in place). The production-stability calibration converges in 3 iterations and the L1 centre stays close to the anchored one (cropland 1.47, grassland 0.163, feed 0.079), confirming no hidden supply/demand mismatch. - gbd-anchored: the previous artefacts, preserved bit-for-bit and stamped against the anchored structural config. The health-enabled configs (validation, gsa, gsa_fixed_diet, doc figures/validation) consume it via calibration.source, so the paper/GSA pipeline keeps the exact artefacts it was run with. Update the calibration/current-diets docs and CHANGELOG from the "regeneration deferred" story to the two-set layout. --- .claude/skills/model-calibration/SKILL.md | 14 +- AGENTS.md | 8 +- CHANGELOG.md | 13 +- config/default.yaml | 7 +- config/example.yaml | 4 + config/gsa.yaml | 6 + config/gsa_fixed_diet.yaml | 5 + config/validation.yaml | 5 + .../calibration/default/animal_cost.csv | 2008 ++-- .../curated/calibration/default/crop_cost.csv | 8913 +++++++++-------- .../default/deviation_penalty.yaml | 8 +- .../calibration/default/exogenous_feed.csv | 64 +- .../calibration/default/exogenous_forage.csv | 16 +- .../calibration/default/fodder_conversion.csv | 10 +- .../calibration/default/food_demand.csv | 106 +- .../calibration/default/food_waste.yaml | 30 +- .../calibration/default/grassland_cost.csv | 202 +- .../calibration/default/grassland_yield.csv | 80 +- .../calibration/default/provenance.yaml | 5 +- .../calibration/gbd-anchored/animal_cost.csv | 1226 +++ .../calibration/gbd-anchored/crop_cost.csv | 6152 ++++++++++++ .../gbd-anchored/deviation_penalty.yaml | 22 + .../gbd-anchored/exogenous_feed.csv | 99 + .../gbd-anchored/exogenous_forage.csv | 178 + .../gbd-anchored/fodder_conversion.csv | 178 + .../calibration/gbd-anchored/food_demand.csv | 59 + .../calibration/gbd-anchored/food_waste.yaml | 35 + .../gbd-anchored/grassland_cost.csv | 175 + .../gbd-anchored/grassland_yield.csv | 178 + .../calibration/gbd-anchored/provenance.yaml | 854 ++ docs/calibration.rst | 27 +- docs/config/doc_figures.yaml | 5 + docs/config/doc_validation.yaml | 5 + docs/current_diets.rst | 15 +- 34 files changed, 14954 insertions(+), 5758 deletions(-) create mode 100644 data/curated/calibration/gbd-anchored/animal_cost.csv create mode 100644 data/curated/calibration/gbd-anchored/crop_cost.csv create mode 100644 data/curated/calibration/gbd-anchored/deviation_penalty.yaml create mode 100644 data/curated/calibration/gbd-anchored/exogenous_feed.csv create mode 100644 data/curated/calibration/gbd-anchored/exogenous_forage.csv create mode 100644 data/curated/calibration/gbd-anchored/fodder_conversion.csv create mode 100644 data/curated/calibration/gbd-anchored/food_demand.csv create mode 100644 data/curated/calibration/gbd-anchored/food_waste.yaml create mode 100644 data/curated/calibration/gbd-anchored/grassland_cost.csv create mode 100644 data/curated/calibration/gbd-anchored/grassland_yield.csv create mode 100644 data/curated/calibration/gbd-anchored/provenance.yaml diff --git a/.claude/skills/model-calibration/SKILL.md b/.claude/skills/model-calibration/SKILL.md index e56e5754..e4830215 100644 --- a/.claude/skills/model-calibration/SKILL.md +++ b/.claude/skills/model-calibration/SKILL.md @@ -1,6 +1,6 @@ --- name: model-calibration -description: Run, refresh, or diagnose the model's calibration pipeline (feed -> food_waste -> food_demand -> cost -> stability) that produces the per-config artefact sets under `data/curated/calibration//` (the default set is git-tracked). Covers the dependency order, the `tools/calibrate` wrapper, realistic runtime expectations, when each kind of upstream change forces a re-run, and how to diagnose the most common failure mode: a hidden supply/demand mismatch that inflates the production-stability L1 cost. Use whenever calibration is relevant -- the user touches inputs/build logic that feed the calibration solves, calibration artefacts look off, or a refresh of the artefacts is needed after a model/data change. +description: Run, refresh, or diagnose the model's calibration pipeline (feed -> food_waste -> food_demand -> cost -> stability) that produces the per-config artefact sets under `data/curated/calibration//` (the `default` and `gbd-anchored` sets are git-tracked). Covers the dependency order, the `tools/calibrate` wrapper, realistic runtime expectations, when each kind of upstream change forces a re-run, and how to diagnose the most common failure mode: a hidden supply/demand mismatch that inflates the production-stability L1 cost. Use whenever calibration is relevant -- the user touches inputs/build logic that feed the calibration solves, calibration artefacts look off, or a refresh of the artefacts is needed after a model/data change. ---