"""Rebuild every clean dataset from raw UKDS files held on GCS.

Pipeline per job: download raw tab files from gs://policyengine-uk-microdata/ukds/
→ run the Rust extraction → upload clean CSVs to gs://policyengine-uk-microdata/<dataset>/<year>/.

Assumes:
  - `gcloud storage` CLI is authenticated and can read/write the bucket.
  - `cargo` is on PATH and the workspace builds cleanly.

Usage:
  python scripts/rebuild_all.py                    # rebuild everything
  python scripts/rebuild_all.py --only lcfs        # rebuild just LCFS years
  python scripts/rebuild_all.py --only frs --year 2023
  python scripts/rebuild_all.py --only efrs        # rebuild EFRS for all FRS years we have
  python scripts/rebuild_all.py --work-dir /tmp/pe # use a fixed working dir (cached)
  python scripts/rebuild_all.py --keep             # keep the working dir after running
"""
| 18 | + |
| 19 | +from __future__ import annotations |
| 20 | + |
| 21 | +import argparse |
| 22 | +import os |
| 23 | +import shutil |
| 24 | +import subprocess |
| 25 | +import sys |
| 26 | +import tempfile |
| 27 | +from dataclasses import dataclass |
| 28 | +from pathlib import Path |
| 29 | + |
BUCKET = "gs://policyengine-uk-microdata"
RAW_PREFIX = f"{BUCKET}/ukds"
# scripts/ lives one level below the repo root, so go up twice from this file.
REPO_ROOT = Path(__file__).resolve().parent.parent

# Extra search paths for gcloud/cargo that might not be on the default subprocess PATH.
_EXTRA_PATHS = [
    Path.home() / ".cargo" / "bin",
    Path.home() / "Downloads" / "google-cloud-sdk" / "bin",
    Path("/opt/homebrew/bin"),
    Path("/usr/local/bin"),
]
for _p in _EXTRA_PATHS:
    # Compare against the individual PATH entries rather than doing a raw
    # substring match: a substring test would wrongly skip e.g. /usr/local/bin
    # when PATH merely contains /usr/local/binaries. Using os.pathsep keeps
    # the split/join portable instead of hard-coding ":".
    _entries = os.environ.get("PATH", "").split(os.pathsep)
    if _p.is_dir() and str(_p) not in _entries:
        os.environ["PATH"] = os.pathsep.join([str(_p), *_entries])
| 44 | + |
| 45 | + |
def _require(tool: str) -> None:
    """Abort with a helpful message unless *tool* is resolvable on PATH."""
    if shutil.which(tool) is not None:
        return
    raise SystemExit(
        f"{tool!r} not found on PATH. Install it or add it to PATH before running."
    )
| 51 | + |
| 52 | + |
@dataclass
class ExtractJob:
    """Describes a single raw survey → clean CSV extraction.

    Attributes:
        dataset: Dataset family — frs | lcfs | spi | was.
        year: Target fiscal year for the clean output directory.
        raw_ref: Path under ukds/ (e.g. "frs/2023", "was/round_7").
        rust_flag: Flag handed to the Rust extractor — --frs | --lcfs | --spi | --was.
    """

    dataset: str
    year: int
    raw_ref: str
    rust_flag: str
| 60 | + |
| 61 | + |
# Manifest of everything we can rebuild. Extend as new raw years arrive on the bucket.
JOBS: list[ExtractJob] = [
    ExtractJob(dataset, year, raw_ref, flag)
    for dataset, year, raw_ref, flag in [
        ("frs", 2022, "frs/2022", "--frs"),
        ("frs", 2023, "frs/2023", "--frs"),
        ("lcfs", 2019, "lcfs/2019", "--lcfs"),
        ("lcfs", 2021, "lcfs/2021", "--lcfs"),
        ("lcfs", 2022, "lcfs/2022", "--lcfs"),
        ("spi", 2021, "spi/2021", "--spi"),
        ("spi", 2022, "spi/2022", "--spi"),
        ("was", 2020, "was/round_7", "--was"),
        ("was", 2022, "was/round_8", "--was"),
    ]
]

# EFRS pipeline: (fiscal_year, frs_year, was_ref, lcfs_ref).
# Each tuple names the raw references the enhanced FRS is composed from.
EFRS_JOBS: list[tuple[int, int, str, str]] = [
    (2023, 2023, "was/round_7", "lcfs/2021"),
]
| 80 | + |
| 81 | + |
| 82 | +def run(cmd: list, cwd: Path | None = None) -> None: |
| 83 | + print(f" $ {' '.join(str(c) for c in cmd)}", flush=True) |
| 84 | + subprocess.run([str(c) for c in cmd], cwd=cwd, check=True) |
| 85 | + |
| 86 | + |
def gcs_copy_in(ref: str, dest: Path) -> None:
    """Download everything under ukds/<ref>/ into dest/."""
    dest.mkdir(parents=True, exist_ok=True)
    # `gcloud storage cp -r` copies the listed objects verbatim.
    cmd = ["gcloud", "storage", "cp", "-r", f"{RAW_PREFIX}/{ref}/*", str(dest)]
    run(cmd)
| 92 | + |
| 93 | + |
def gcs_copy_out(local_dir: Path, dataset: str, year: int) -> None:
    """Upload the clean CSVs in local_dir to <bucket>/<dataset>/<year>/."""
    # Only CSVs are uploaded; any stray files in local_dir are ignored.
    csvs = sorted(local_dir.glob("*.csv"))
    if not csvs:
        raise SystemExit(f"No CSV files in {local_dir}; extraction probably failed")
    target = f"{BUCKET}/{dataset}/{year}/"
    run(["gcloud", "storage", "cp", *[str(p) for p in csvs], target])
| 101 | + |
| 102 | + |
def ensure_raw(ref: str, work: Path) -> Path:
    """Download raw ukds/<ref> to work/raw/<ref>, caching if already present."""
    raw_dir = work / "raw" / ref
    already_there = raw_dir.is_dir() and any(raw_dir.iterdir())
    if already_there:
        print(f"  (cached) {raw_dir}")
    else:
        gcs_copy_in(ref, raw_dir)
    return raw_dir
| 111 | + |
| 112 | + |
def extract_one(job: ExtractJob, work: Path) -> Path:
    """Run the Rust extractor for one job and upload the clean CSVs.

    Returns the local directory that holds the extracted CSVs.
    """
    print(f"\n=== {job.dataset.upper()} {job.year} ===")
    raw_dir = ensure_raw(job.raw_ref, work)
    clean_dir = work / "clean" / job.dataset / str(job.year)
    clean_dir.mkdir(parents=True, exist_ok=True)
    cargo_cmd = [
        "cargo", "run", "--release", "--quiet", "--",
        job.rust_flag, str(raw_dir),
        "--year", str(job.year),
        "--extract", str(clean_dir),
    ]
    run(cargo_cmd, cwd=REPO_ROOT)
    gcs_copy_out(clean_dir, job.dataset, job.year)
    return clean_dir
| 129 | + |
| 130 | + |
def extract_efrs(fiscal_year: int, frs_year: int, was_ref: str, lcfs_ref: str, work: Path) -> None:
    """Build and upload the enhanced FRS for *fiscal_year*.

    Composes the clean FRS for *frs_year* with the raw WAS and LCFS
    references, then uploads the result to the bucket under efrs/<year>/.
    """
    print(f"\n=== EFRS {fiscal_year} (from FRS {frs_year}, {was_ref}, {lcfs_ref}) ===")

    # The clean FRS is the base input. If an earlier step of this run already
    # extracted it, it is on disk under work/clean/frs/<year>/; otherwise pull
    # the clean files down from the bucket.
    frs_clean = work / "clean" / "frs" / str(frs_year)
    have_frs = frs_clean.is_dir() and (frs_clean / "households.csv").exists()
    if not have_frs:
        frs_clean.mkdir(parents=True, exist_ok=True)
        sources = [
            f"{BUCKET}/frs/{frs_year}/{name}.csv"
            for name in ("persons", "benunits", "households")
        ]
        run(["gcloud", "storage", "cp", *sources, str(frs_clean) + "/"])

    frs_base = work / "clean" / "frs"  # parent dir with YYYY/ subdirs
    was_raw = ensure_raw(was_ref, work)
    lcfs_raw = ensure_raw(lcfs_ref, work)

    efrs_out = work / "clean" / "efrs" / str(fiscal_year)
    efrs_out.mkdir(parents=True, exist_ok=True)
    cargo_cmd = [
        "cargo", "run", "--release", "--quiet", "--",
        "--extract-efrs", str(efrs_out),
        "--data", str(frs_base),
        "--year", str(fiscal_year),
        "--was-dir", str(was_raw),
        "--lcfs-dir", str(lcfs_raw),
    ]
    run(cargo_cmd, cwd=REPO_ROOT)
    gcs_copy_out(efrs_out, "efrs", fiscal_year)
| 165 | + |
| 166 | + |
def main() -> None:
    """Parse CLI options and run the selected rebuild jobs.

    Raises SystemExit on bad usage (argparse) or missing tools (_require).
    The working directory is removed afterwards unless --keep or --work-dir
    was given; cleanup runs even if a job fails part-way.
    """
    parser = argparse.ArgumentParser(description=__doc__ or "")
    parser.add_argument(
        "--only",
        choices=["frs", "lcfs", "spi", "was", "efrs"],
        help="Only rebuild one dataset family",
    )
    parser.add_argument("--year", type=int, help="Only rebuild this fiscal year")
    parser.add_argument(
        "--work-dir",
        type=Path,
        help="Use this directory instead of a temp dir (enables caching)",
    )
    parser.add_argument(
        "--keep",
        action="store_true",
        help="Keep the working dir after running (ignored with --work-dir)",
    )
    args = parser.parse_args()

    _require("gcloud")
    _require("cargo")

    if args.work_dir:
        work = args.work_dir.resolve()
        work.mkdir(parents=True, exist_ok=True)
        cleanup = False  # never delete a user-supplied directory
    else:
        work = Path(tempfile.mkdtemp(prefix="pe-uk-rebuild-"))
        cleanup = not args.keep
    print(f"Working directory: {work}")

    selected_jobs = JOBS
    if args.only and args.only != "efrs":
        selected_jobs = [j for j in JOBS if j.dataset == args.only]
    if args.year is not None:
        selected_jobs = [j for j in selected_jobs if j.year == args.year]

    # EFRS runs by default and when explicitly requested, but not when the
    # user restricted the run to another family.
    run_efrs = args.only in (None, "efrs")

    try:
        if args.only != "efrs":
            if not selected_jobs:
                # Previously this case was a silent no-op; warn so a typo'd
                # --only/--year combination is obvious.
                print(f"Warning: no jobs match --only={args.only} --year={args.year}")
            for job in selected_jobs:
                extract_one(job, work)

        if run_efrs:
            matched = False
            for fiscal_year, frs_year, was_ref, lcfs_ref in EFRS_JOBS:
                if args.year is not None and fiscal_year != args.year:
                    continue
                matched = True
                extract_efrs(fiscal_year, frs_year, was_ref, lcfs_ref, work)
            if not matched and args.only == "efrs":
                print(f"Warning: no EFRS job for year {args.year}")
    finally:
        if cleanup:
            shutil.rmtree(work, ignore_errors=True)

    print("\nAll done.")
| 222 | + |
| 223 | + |
if __name__ == "__main__":
    # main() returns None, so this exits with status 0 on success —
    # identical to sys.exit(main()).
    raise SystemExit(main())
0 commit comments