From 2cf54286d19a5547fdc35a1863739613afeba909 Mon Sep 17 00:00:00 2001
From: Thorsten Fuchs <thorfuchs@outlook.com>
Date: Tue, 12 May 2026 00:07:31 +0200
Subject: [PATCH 1/2] seo(sitemap): split single sitemap into sitemap index + 6
 child sitemaps
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Before: a single /sitemap.xml urlset emitted 8,864 URLs, with the 7,800+
auto-generated registry/bibliography/monograph/results-facet/prediction
pages competing for the same crawl-budget signal as the ~1,000
human-authored canonical L0-L4 pages. With everything in one file,
Google Search Console's sitemap report could not separate indexing
progress on canonical content from indexing progress on bulk
programmatic content.

After: /sitemap.xml is a sitemap INDEX referencing six mutually-
exclusive child sitemaps, classified by URL prefix in the single
source of truth `_includes/sitemap-bucket.liquid`:

  /sitemap-core.xml          1,018 human-authored L0-L4 pages
  /sitemap-registry.xml      4,570 /registry/* registry objects
  /sitemap-bibliography.xml  1,136 /bibliography/* references
  /sitemap-corpus-bulk.xml   1,134 /corpus/monographs/* + /corpus/taulib/*
  /sitemap-results-bulk.xml    939 /results/{additional-noteworthy-results,
                                    problem,physics,life,metaphysics,
                                    mathematics,calibration-cascade,
                                    falsifications,predictions}/*
  /sitemap-predictions.xml      67 /predictions/*
  -------------------------------------------------------
  TOTAL                      8,864  (matches single-file v1 exactly)
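
The total in the table can be re-derived with plain shell arithmetic.
A minimal sketch; `sum_counts` is a hypothetical helper name and is not
part of this patch:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the patch): add up integer arguments.
sum_counts() {
  local total=0 n
  for n in "$@"; do
    total=$((total + n))
  done
  echo "$total"
}

# Counts copied from the table above, in order: core, registry,
# bibliography, corpus-bulk, results-bulk, predictions.
sum_counts 1018 4570 1136 1134 939 67   # prints 8864
```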

### Why this helps

1. GSC can now report per-sitemap indexing status. The user can see
   whether the indexing rate of the 4,570 registry pages diverges from
   that of the 1,018 canonical-content pages, and prioritize fixes
   accordingly.

2. Each child sitemap is fetched and reported on its own. A small,
   focused sitemap-core (1,018 pages, 175KB) should be easier to crawl
   completely than a single 1.2MB file mixing canonical content with
   bulk programmatic content; treat this as an expectation to confirm
   in GSC rather than a guarantee.

3. Crawl budget for the registry/bibliography/monograph bulk content
   can now scale independently from canonical content over time, as
   Google decides how much of each templated collection to index.

### Implementation

- `_includes/sitemap-bucket.liquid`: deterministic URL-prefix
  classifier that assigns each item to exactly one of six buckets
  (verified mutually exclusive: 0 overlapping URLs across children).
- `sitemap.xml`: rewritten from urlset to sitemapindex listing the
  six children, each with
  `<lastmod>{{ site.time | date_to_xmlschema }}</lastmod>` (build
  time, since any source change triggers a full rebuild).
- Six new child sitemap templates, each filtering items by
  `_bucket == "<name>"` after redirect and `sitemap: false` exclusion.
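
For checking individual URLs outside of a Jekyll build, the prefix rule
can be mirrored in shell. This is only an illustrative sketch: the
function name `classify_url` is hypothetical, and the Liquid include
remains the single source of truth.

```shell
#!/usr/bin/env bash
# Illustrative mirror of the URL-prefix rule in
# _includes/sitemap-bucket.liquid. `classify_url` is a hypothetical
# name; it is not part of the patch.
classify_url() {
  local url="$1" rest top sub
  rest="${url#/}"            # drop the leading slash
  top="${rest%%/*}"          # first path segment
  rest="${rest#"$top"}"      # remove the first segment ...
  rest="${rest#/}"           # ... and the slash that followed it
  sub="${rest%%/*}"          # second path segment (may be empty)

  case "$top" in
    registry|bibliography|predictions)
      echo "$top" ;;
    corpus)
      case "$sub" in
        monographs|taulib) echo "corpus-bulk" ;;
        *)                 echo "core" ;;
      esac ;;
    results)
      case "$sub" in
        additional-noteworthy-results|problem|physics|life|metaphysics|mathematics|calibration-cascade|falsifications|predictions)
          echo "results-bulk" ;;
        *)
          echo "core" ;;
      esac ;;
    *)
      echo "core" ;;
  esac
}

classify_url /results/predictions/browse/   # prints results-bulk
classify_url /results/                      # prints core
```

Every URL falls through to `core` unless a prefix matches, which is
what makes the six buckets mutually exclusive by construction.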

### Verified locally

- All six child sitemaps build correctly via `bundle exec jekyll build`.
- Total URLs across children = 8,864 (matches the previous single-file
  count exactly; no URL lost or duplicated).
- Mutual-exclusivity check: 0 URLs appear in two or more children.
- All 18 canonical L1/L2/L3 spot-check URLs land in sitemap-core
  (homepage, all 11 lane roots, Wolfram comparison, construction-spine
  steps, substrate-non-deferral, release-manifest, anchor docs WP000
  and WP004). Browse aggregator pages
  (/results/{predictions,falsifications}/browse/) land in
  sitemap-results-bulk by URL-prefix rule, which is semantically
  correct (they aggregate templated content).
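
A mutual-exclusivity check of this kind can be sketched with standard
tools. The helper name `duplicate_locs` is hypothetical, and the sketch
assumes each `<loc>` element sits on a single line, as the templates in
this patch emit it:

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of the patch): print every <loc> value
# that occurs more than once across the given sitemap files.
duplicate_locs() {
  grep -ho '<loc>[^<]*</loc>' "$@" 2>/dev/null \
    | sed -e 's/^<loc>//' -e 's/<\/loc>$//' \
    | sort \
    | uniq -d
}
```

Run against the build output, for example
`duplicate_locs _site/sitemap-*.xml` (assuming Jekyll's default `_site`
destination), empty output means no URL appears in two children. The
glob requires a hyphen after `sitemap`, so the index itself is not
included.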

### robots.txt and external references

- robots.txt continues to reference `/sitemap.xml`. Google
  auto-discovers child sitemaps from the index — no change required.
- No internal Liquid/include code references the old sitemap structure
  (verified via grep of `_includes/`, `_layouts/`, `_config.yml`).
- GSC will need to re-fetch the sitemap to see the index structure;
  this happens automatically on next crawl, typically within hours.
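
Because crawlers reach the children only through the index, it can be
useful to confirm that every `<loc>` in the built index corresponds to
a file in the build output. A minimal sketch; `index_children` and
`missing_children` are hypothetical helper names, and the sketch
assumes the children are written to the site root, as their permalinks
in this patch specify:

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the patch).

# Print the URL inside each <loc> element of a sitemap index file.
index_children() {
  sed -n 's:.*<loc>\([^<]*\)</loc>.*:\1:p' "$1"
}

# Print the file name of every child listed in the index ($1) that
# does not exist in the build output directory ($2).
missing_children() {
  local index="$1" site="$2" url
  index_children "$index" | while IFS= read -r url; do
    [ -f "$site/${url##*/}" ] || echo "${url##*/}"
  done
}
```

For example, `missing_children _site/sitemap.xml _site` prints nothing
when all listed children were built.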

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 _includes/sitemap-bucket.liquid | 36 +++++++++++++++
 sitemap-bibliography.xml        | 27 ++++++++++++
 sitemap-core.xml                | 31 +++++++++++++
 sitemap-corpus-bulk.xml         | 27 ++++++++++++
 sitemap-predictions.xml         | 29 ++++++++++++
 sitemap-registry.xml            | 29 ++++++++++++
 sitemap-results-bulk.xml        | 32 ++++++++++++++
 sitemap.xml                     | 78 +++++++++++++++++++--------------
 8 files changed, 255 insertions(+), 34 deletions(-)
 create mode 100644 _includes/sitemap-bucket.liquid
 create mode 100644 sitemap-bibliography.xml
 create mode 100644 sitemap-core.xml
 create mode 100644 sitemap-corpus-bulk.xml
 create mode 100644 sitemap-predictions.xml
 create mode 100644 sitemap-registry.xml
 create mode 100644 sitemap-results-bulk.xml
diff --git a/_includes/sitemap-bucket.liquid b/_includes/sitemap-bucket.liquid
new file mode 100644
index 000000000..f8d91d575
--- /dev/null
+++ b/_includes/sitemap-bucket.liquid
@@ -0,0 +1,36 @@
+{%- comment -%}
+sitemap-bucket.liquid — single source of truth for child-sitemap classification.
+
+Sets `_bucket` based on `include.url`. Buckets (mutually exclusive):
+
+  * core           — human-authored L0-L4 pages (default)
+  * registry       — /registry/* (4,570 auto-generated registry objects)
+  * bibliography   — /bibliography/* (1,136 auto-generated references)
+  * corpus-bulk    — /corpus/monographs/*, /corpus/taulib/* (1,134 auto-generated)
+  * results-bulk   — /results/{additional-noteworthy-results,problem,physics,life,
+                     metaphysics,mathematics,calibration-cascade,falsifications,
+                     predictions}/* (939 auto-generated facet/path pages)
+  * predictions    — /predictions/* (~67 auto-generated)
+
+The split is by URL prefix (deterministic, no frontmatter inspection needed) so
+it works identically across all six child sitemaps and the sitemap index.
+
+Usage:
+  {% include sitemap-bucket.liquid url=item.url %}
+  {% if _bucket == "core" %} ... {% endif %}
+{%- endcomment -%}
+{%- assign _parts = include.url | split: "/" -%}
+{%- assign _t = _parts[1] -%}
+{%- assign _s = _parts[2] -%}
+{%- assign _bucket = "core" -%}
+{%- if _t == "registry" -%}{%- assign _bucket = "registry" -%}{%- endif -%}
+{%- if _t == "bibliography" -%}{%- assign _bucket = "bibliography" -%}{%- endif -%}
+{%- if _t == "predictions" -%}{%- assign _bucket = "predictions" -%}{%- endif -%}
+{%- if _t == "corpus" -%}
+  {%- if _s == "monographs" -%}{%- assign _bucket = "corpus-bulk" -%}{%- endif -%}
+  {%- if _s == "taulib" -%}{%- assign _bucket = "corpus-bulk" -%}{%- endif -%}
+{%- endif -%}
+{%- if _t == "results" -%}
+  {%- assign _rb_list = "additional-noteworthy-results,problem,physics,life,metaphysics,mathematics,calibration-cascade,falsifications,predictions" | split: "," -%}
+  {%- if _rb_list contains _s -%}{%- assign _bucket = "results-bulk" -%}{%- endif -%}
+{%- endif -%}
diff --git a/sitemap-bibliography.xml b/sitemap-bibliography.xml
new file mode 100644
index 000000000..9fcf7eff2
--- /dev/null
+++ b/sitemap-bibliography.xml
@@ -0,0 +1,27 @@
+---
+# sitemap-bibliography.xml — /bibliography/* references (~1,149)
+#
+# Auto-generated bibliography entries.
+#
+# Inclusion rule: classifier `_bucket == "bibliography"` from
+# `_includes/sitemap-bucket.liquid`.
+layout: null
+permalink: /sitemap-bibliography.xml
+sitemap: false
+---
+{%- assign all_items = site.html_pages | concat: site.documents -%}
+<?xml version="1.0" encoding="UTF-8"?>
+<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+{%- for item in all_items -%}
+  {%- if item.sitemap == false -%}{%- continue -%}{%- endif -%}
+  {%- if item.layout == "redirect" -%}{%- continue -%}{%- endif -%}
+  {%- include sitemap-bucket.liquid url=item.url -%}
+  {%- unless _bucket == "bibliography" -%}{%- continue -%}{%- endunless -%}
+  {%- assign _mod = item.last_modified_at | default: item.last_updated | default: item.date | default: site.time -%}
+  <url>
+    <loc>{{ item.url | absolute_url | xml_escape }}</loc>
+    <lastmod>{{ _mod | date_to_xmlschema }}</lastmod>
+  </url>
+{%- endfor -%}
+
+</urlset>
diff --git a/sitemap-core.xml b/sitemap-core.xml
new file mode 100644
index 000000000..56dbe7f88
--- /dev/null
+++ b/sitemap-core.xml
@@ -0,0 +1,31 @@
+---
+# sitemap-core.xml — human-authored L0-L4 pages
+#
+# Highest crawl-priority sitemap: ~1,047 manually-authored pages including
+# all top-level lanes (program, agenda, corpus root + construction-spine +
+# foundational-hinges, verify, publications, impact, engage, discover,
+# media, results overview pages, etc.) but excluding the 7,800+ auto-
+# generated programmatic pages routed via the other five child sitemaps.
+#
+# Inclusion rule: classifier `_bucket == "core"` from
+# `_includes/sitemap-bucket.liquid`.
+layout: null
+permalink: /sitemap-core.xml
+sitemap: false
+---
+{%- assign all_items = site.html_pages | concat: site.documents -%}
+<?xml version="1.0" encoding="UTF-8"?>
+<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+{%- for item in all_items -%}
+  {%- if item.sitemap == false -%}{%- continue -%}{%- endif -%}
+  {%- if item.layout == "redirect" -%}{%- continue -%}{%- endif -%}
+  {%- include sitemap-bucket.liquid url=item.url -%}
+  {%- unless _bucket == "core" -%}{%- continue -%}{%- endunless -%}
+  {%- assign _mod = item.last_modified_at | default: item.last_updated | default: item.date | default: site.time -%}
+  <url>
+    <loc>{{ item.url | absolute_url | xml_escape }}</loc>
+    <lastmod>{{ _mod | date_to_xmlschema }}</lastmod>
+  </url>
+{%- endfor -%}
+
+</urlset>
diff --git a/sitemap-corpus-bulk.xml b/sitemap-corpus-bulk.xml
new file mode 100644
index 000000000..a30d4fd1e
--- /dev/null
+++ b/sitemap-corpus-bulk.xml
@@ -0,0 +1,27 @@
+---
+# sitemap-corpus-bulk.xml — /corpus/monographs/* + /corpus/taulib/* (~1,129)
+#
+# Auto-generated TauLib-derived monograph and library reference pages.
+#
+# Inclusion rule: classifier `_bucket == "corpus-bulk"` from
+# `_includes/sitemap-bucket.liquid`.
+layout: null
+permalink: /sitemap-corpus-bulk.xml
+sitemap: false
+---
+{%- assign all_items = site.html_pages | concat: site.documents -%}
+<?xml version="1.0" encoding="UTF-8"?>
+<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+{%- for item in all_items -%}
+  {%- if item.sitemap == false -%}{%- continue -%}{%- endif -%}
+  {%- if item.layout == "redirect" -%}{%- continue -%}{%- endif -%}
+  {%- include sitemap-bucket.liquid url=item.url -%}
+  {%- unless _bucket == "corpus-bulk" -%}{%- continue -%}{%- endunless -%}
+  {%- assign _mod = item.last_modified_at | default: item.last_updated | default: item.date | default: site.time -%}
+  <url>
+    <loc>{{ item.url | absolute_url | xml_escape }}</loc>
+    <lastmod>{{ _mod | date_to_xmlschema }}</lastmod>
+  </url>
+{%- endfor -%}
+
+</urlset>
diff --git a/sitemap-predictions.xml b/sitemap-predictions.xml
new file mode 100644
index 000000000..0ea408a02
--- /dev/null
+++ b/sitemap-predictions.xml
@@ -0,0 +1,29 @@
+---
+# sitemap-predictions.xml — /predictions/* prediction pages (~67)
+#
+# Auto-generated prediction record pages (separate from the
+# /results/predictions/ facet aggregator pages, which live in
+# sitemap-results-bulk.xml).
+#
+# Inclusion rule: classifier `_bucket == "predictions"` from
+# `_includes/sitemap-bucket.liquid`.
+layout: null
+permalink: /sitemap-predictions.xml
+sitemap: false
+---
+{%- assign all_items = site.html_pages | concat: site.documents -%}
+<?xml version="1.0" encoding="UTF-8"?>
+<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+{%- for item in all_items -%}
+  {%- if item.sitemap == false -%}{%- continue -%}{%- endif -%}
+  {%- if item.layout == "redirect" -%}{%- continue -%}{%- endif -%}
+  {%- include sitemap-bucket.liquid url=item.url -%}
+  {%- unless _bucket == "predictions" -%}{%- continue -%}{%- endunless -%}
+  {%- assign _mod = item.last_modified_at | default: item.last_updated | default: item.date | default: site.time -%}
+  <url>
+    <loc>{{ item.url | absolute_url | xml_escape }}</loc>
+    <lastmod>{{ _mod | date_to_xmlschema }}</lastmod>
+  </url>
+{%- endfor -%}
+
+</urlset>
diff --git a/sitemap-registry.xml b/sitemap-registry.xml
new file mode 100644
index 000000000..7c1bd32b2
--- /dev/null
+++ b/sitemap-registry.xml
@@ -0,0 +1,29 @@
+---
+# sitemap-registry.xml — /registry/* auto-generated registry objects (~4,570)
+#
+# These are programmatically generated from the canonical registry corpus.
+# Pulled into a separate sitemap so GSC can report indexing progress on
+# this bulk content type independently from human-authored core pages.
+#
+# Inclusion rule: classifier `_bucket == "registry"` from
+# `_includes/sitemap-bucket.liquid`.
+layout: null
+permalink: /sitemap-registry.xml
+sitemap: false
+---
+{%- assign all_items = site.html_pages | concat: site.documents -%}
+<?xml version="1.0" encoding="UTF-8"?>
+<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+{%- for item in all_items -%}
+  {%- if item.sitemap == false -%}{%- continue -%}{%- endif -%}
+  {%- if item.layout == "redirect" -%}{%- continue -%}{%- endif -%}
+  {%- include sitemap-bucket.liquid url=item.url -%}
+  {%- unless _bucket == "registry" -%}{%- continue -%}{%- endunless -%}
+  {%- assign _mod = item.last_modified_at | default: item.last_updated | default: item.date | default: site.time -%}
+  <url>
+    <loc>{{ item.url | absolute_url | xml_escape }}</loc>
+    <lastmod>{{ _mod | date_to_xmlschema }}</lastmod>
+  </url>
+{%- endfor -%}
+
+</urlset>
diff --git a/sitemap-results-bulk.xml b/sitemap-results-bulk.xml
new file mode 100644
index 000000000..c1ab9cb98
--- /dev/null
+++ b/sitemap-results-bulk.xml
@@ -0,0 +1,32 @@
+---
+# sitemap-results-bulk.xml — auto-generated /results/* facet/path pages (~925)
+#
+# Covers /results/{additional-noteworthy-results, problem, physics, life,
+# metaphysics, mathematics, calibration-cascade, falsifications,
+# predictions}/* — the templated facet structure under the Results lane.
+#
+# Other /results/ pages (e.g. /results/ itself) stay in sitemap-core.xml;
+# /results/{predictions,falsifications}/browse/ match the prefix rule and land here.
+#
+# Inclusion rule: classifier `_bucket == "results-bulk"` from
+# `_includes/sitemap-bucket.liquid`.
+layout: null
+permalink: /sitemap-results-bulk.xml
+sitemap: false
+---
+{%- assign all_items = site.html_pages | concat: site.documents -%}
+<?xml version="1.0" encoding="UTF-8"?>
+<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+{%- for item in all_items -%}
+  {%- if item.sitemap == false -%}{%- continue -%}{%- endif -%}
+  {%- if item.layout == "redirect" -%}{%- continue -%}{%- endif -%}
+  {%- include sitemap-bucket.liquid url=item.url -%}
+  {%- unless _bucket == "results-bulk" -%}{%- continue -%}{%- endunless -%}
+  {%- assign _mod = item.last_modified_at | default: item.last_updated | default: item.date | default: site.time -%}
+  <url>
+    <loc>{{ item.url | absolute_url | xml_escape }}</loc>
+    <lastmod>{{ _mod | date_to_xmlschema }}</lastmod>
+  </url>
+{%- endfor -%}
+
+</urlset>
diff --git a/sitemap.xml b/sitemap.xml
index 3e06c42f5..0ea5ae265 100644
--- a/sitemap.xml
+++ b/sitemap.xml
@@ -1,44 +1,54 @@
 ---
-# sitemap.xml — custom Jekyll sitemap that overrides jekyll-sitemap
+# sitemap.xml — sitemap INDEX
 #
-# When a sitemap.xml file exists at site source root, jekyll-sitemap defers to
-# it and skips its own generation (per the plugin's `Site#generate` guard). This
-# template emits one <url> per published page/document with a non-empty <lastmod>
-# for every URL — including the 86% of pages (registry objects, prediction
-# pages, falsification packs, results, ...) where the upstream plugin emitted
-# no <lastmod> because they have no `last_modified_at` frontmatter.
+# As of May 2026, the sitemap is split into six child sitemaps so Google
+# Search Console can report indexing progress per content type, and so the
+# crawler can prioritize human-authored canonical pages over the 7,800+
+# auto-generated registry/bibliography/monograph/results-facet pages.
 #
-# lastmod fallback chain (in order):
-#   1. page.last_modified_at   — jekyll-last-modified-at convention (not used yet)
-#   2. page.last_updated       — this project's canonical "page version" field
-#   3. page.date               — Jekyll-auto for posts; explicit on some pages
-#   4. site.time               — build time (always available; never null)
+# Child sitemaps (mutually exclusive, classification by URL prefix in
+# `_includes/sitemap-bucket.liquid`):
 #
-# Exclusion rules:
-#   - sitemap: false           — page-level opt-out
-#   - layout: redirect         — legacy redirect stubs (already noindex'd; the
-#                                jekyll-sitemap convention is to exclude these)
-#   - .html files with no `output: true` — Jekyll handles this automatically;
-#                                          unpublished docs do not appear in
-#                                          site.documents.
+#   /sitemap-core.xml            ~1,018 human-authored L0-L4 pages
+#   /sitemap-registry.xml        ~4,570 /registry/* registry objects
+#   /sitemap-bibliography.xml    ~1,136 /bibliography/* references
+#   /sitemap-corpus-bulk.xml     ~1,134 /corpus/monographs/* + /corpus/taulib/*
+#   /sitemap-results-bulk.xml      ~939 /results/* facet + predictions/falsifications
+#   /sitemap-predictions.xml       ~67  /predictions/* prediction pages
 #
-# This file uses no Jekyll layout (`layout: null`) and is itself opted out of
-# its own sitemap (`sitemap: false`).
+# Total URLs across children equal the single-file v1 count (~8,864).
+#
+# robots.txt references /sitemap.xml — unchanged. Google auto-discovers
+# child sitemaps from the index. Last-modified for each child is the build
+# time, which is correct: any change in source triggers a full rebuild.
 layout: null
 permalink: /sitemap.xml
 sitemap: false
 ---
-{%- assign all_items = site.html_pages | concat: site.documents -%}
 <?xml version="1.0" encoding="UTF-8"?>
-<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
-{%- for item in all_items -%}
-{%- if item.sitemap == false -%}{%- continue -%}{%- endif -%}
-{%- if item.layout == "redirect" -%}{%- continue -%}{%- endif -%}
-{%- assign _mod = item.last_modified_at | default: item.last_updated | default: item.date | default: site.time -%}
-  <url>
-    <loc>{{ item.url | absolute_url | xml_escape }}</loc>
-    <lastmod>{{ _mod | date_to_xmlschema }}</lastmod>
-  </url>
-{%- endfor -%}
-
-</urlset>
+<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+  <sitemap>
+    <loc>{{ "/sitemap-core.xml" | absolute_url }}</loc>
+    <lastmod>{{ site.time | date_to_xmlschema }}</lastmod>
+  </sitemap>
+  <sitemap>
+    <loc>{{ "/sitemap-registry.xml" | absolute_url }}</loc>
+    <lastmod>{{ site.time | date_to_xmlschema }}</lastmod>
+  </sitemap>
+  <sitemap>
+    <loc>{{ "/sitemap-bibliography.xml" | absolute_url }}</loc>
+    <lastmod>{{ site.time | date_to_xmlschema }}</lastmod>
+  </sitemap>
+  <sitemap>
+    <loc>{{ "/sitemap-corpus-bulk.xml" | absolute_url }}</loc>
+    <lastmod>{{ site.time | date_to_xmlschema }}</lastmod>
+  </sitemap>
+  <sitemap>
+    <loc>{{ "/sitemap-results-bulk.xml" | absolute_url }}</loc>
+    <lastmod>{{ site.time | date_to_xmlschema }}</lastmod>
+  </sitemap>
+  <sitemap>
+    <loc>{{ "/sitemap-predictions.xml" | absolute_url }}</loc>
+    <lastmod>{{ site.time | date_to_xmlschema }}</lastmod>
+  </sitemap>
+</sitemapindex>

From 6d73ed8bc4ade357fc7394dd92f9ab421e5c52a7 Mon Sep 17 00:00:00 2001
From: Thorsten Fuchs <thorfuchs@outlook.com>
Date: Tue, 12 May 2026 00:16:00 +0200
Subject: [PATCH 2/2] ci(smoke-test): update sitemap integrity check for
 sitemap index
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

After the sitemap split (this same PR), /sitemap.xml is a
<sitemapindex> referencing six child sitemaps — so the old
"grep -c '<loc>' sitemap.xml" check finds only 6 entries
(one per child) and fails the ≥100 URL threshold.

Replace with:

  1. Assert /sitemap.xml is a <sitemapindex>
  2. Assert all six child sitemap files exist
  3. Assert each child has ≥ its expected minimum URL count
  4. Assert total URLs across children ≥ 5000 (canonical ~8,864)

Per-child minimums are set well below the canonical counts (4,570
registry, 1,136 bibliography, 1,134 corpus-bulk, 1,018 core, 939
results-bulk, 67 predictions) so legitimate content growth or
small re-classifications do not flake the check.
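
The per-child check reduces to counting `<loc>` lines and comparing the
count against a floor. A minimal sketch with hypothetical helper names
(`count_locs`, `meets_minimum`); it is not the code in this patch, and
it assumes one `<loc>` per line:

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the patch).

# Print the number of lines containing <loc> in a file, or 0 if the
# file is missing or has no matches.
count_locs() {
  local n
  # grep -c prints 0 but exits non-zero when nothing matches, so set
  # the fallback explicitly rather than echoing a second value.
  n=$(grep -c '<loc>' "$1" 2>/dev/null) || n=0
  echo "${n:-0}"
}

# Succeed when the file ($1) has at least the given number ($2) of
# <loc> lines.
meets_minimum() {
  [ "$(count_locs "$1")" -ge "$2" ]
}
```

With these, a floor such as `meets_minimum _site/sitemap-core.xml 500`
stays green while the count remains anywhere above the minimum, which
is the headroom the paragraph above describes.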

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .github/scripts/smoke-test.sh | 46 ++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/.github/scripts/smoke-test.sh b/.github/scripts/smoke-test.sh
index 3d9a9b02f..a940fe79e 100755
--- a/.github/scripts/smoke-test.sh
+++ b/.github/scripts/smoke-test.sh
@@ -171,11 +171,49 @@ echo "── robots.txt + sitemap.xml integrity ──────────
 file_contains "/robots.txt" "Content-Signal"                        "Content-Signal directive present"
 file_contains "/robots.txt" "Sitemap: https://panta-rhei.site"      "sitemap reference present"
 
-loc_count=$(grep -c '<loc>' "$SITE/sitemap.xml" 2>/dev/null || echo "0")
-if [ "$loc_count" -ge 100 ]; then
-  pass "sitemap has ${loc_count} URLs (expected ≥100)"
+# /sitemap.xml is a sitemap INDEX referencing six child sitemaps (see
+# _includes/sitemap-bucket.liquid). Validate:
+#   1. /sitemap.xml is a <sitemapindex> (not a <urlset>)
+#   2. All six child sitemap files exist
+#   3. Each child contains a non-trivial number of <loc> URLs
+#   4. Total URLs across children ≥ 5000 (canonical ~8,864 on prod)
+if grep -q '<sitemapindex' "$SITE/sitemap.xml" 2>/dev/null; then
+  pass "sitemap.xml is a <sitemapindex>"
 else
-  fail "sitemap has only ${loc_count} URLs (expected ≥100)"
+  fail "sitemap.xml is not a <sitemapindex> — expected sitemap index format"
+fi
+CHECK_COUNT=$((CHECK_COUNT+1))
+
+total_locs=0
+declare -A child_min=(
+  ["sitemap-core.xml"]=500
+  ["sitemap-registry.xml"]=4000
+  ["sitemap-bibliography.xml"]=1000
+  ["sitemap-corpus-bulk.xml"]=1000
+  ["sitemap-results-bulk.xml"]=500
+  ["sitemap-predictions.xml"]=20
+)
+for child in sitemap-core.xml sitemap-registry.xml sitemap-bibliography.xml sitemap-corpus-bulk.xml sitemap-results-bulk.xml sitemap-predictions.xml; do
+  if [ ! -f "$SITE/$child" ]; then
+    fail "MISSING /$child"
+    CHECK_COUNT=$((CHECK_COUNT+1))
+    continue
+  fi
+  child_locs=$(grep -c '<loc>' "$SITE/$child" 2>/dev/null) || child_locs=0
+  total_locs=$((total_locs + child_locs))
+  min_expected=${child_min[$child]}
+  if [ "$child_locs" -ge "$min_expected" ]; then
+    pass "/$child has ${child_locs} URLs (≥${min_expected} expected)"
+  else
+    fail "/$child has only ${child_locs} URLs (expected ≥${min_expected})"
+  fi
+  CHECK_COUNT=$((CHECK_COUNT+1))
+done
+
+if [ "$total_locs" -ge 5000 ]; then
+  pass "sitemap total URLs across children: ${total_locs} (≥5000 expected)"
+else
+  fail "sitemap total URLs across children: only ${total_locs} (expected ≥5000)"
 fi
 CHECK_COUNT=$((CHECK_COUNT+1))