seo(sitemap): split single sitemap into sitemap index + 6 child sitemaps by ThorFuchs · Pull Request #209 · Panta-Rhei-Research/site

ThorFuchs · 2026-05-11T22:07:57Z

Summary

The single /sitemap.xml emitted all ~8,875 URLs in one urlset, so the 7,800+ auto-generated registry/bibliography/monograph/results-facet/prediction pages competed with the ~1,000 human-authored canonical L0–L4 pages for the same crawl-budget signal. As a result, the report aggregator in Google Search Console could not distinguish indexing progress on canonical content from indexing progress on bulk programmatic content (current indexing rate ~42%, dominated by the 5,700+ templated pages).

This PR turns /sitemap.xml into a sitemap index referencing six mutually-exclusive child sitemaps, classified by URL prefix in a single source of truth (_includes/sitemap-bucket.liquid).

Six child sitemaps

| Child sitemap | URL filter | URLs |
| --- | --- | --- |
| `/sitemap-core.xml` | everything not matched below | 1,018 human-authored L0–L4 |
| `/sitemap-registry.xml` | `/registry/*` | 4,570 |
| `/sitemap-bibliography.xml` | `/bibliography/*` | 1,136 |
| `/sitemap-corpus-bulk.xml` | `/corpus/monographs/`, `/corpus/taulib/` | 1,134 |
| `/sitemap-results-bulk.xml` | `/results/{additional-noteworthy-results,problem,physics,life,metaphysics,mathematics,calibration-cascade,falsifications,predictions}/*` | 939 |
| `/sitemap-predictions.xml` | `/predictions/*` | 67 |
| Total across children | | 8,864 (= single-file v1 exactly) |

Why this helps SEO indexing

GSC per-sitemap reporting. You can see whether the indexing rate of the 4,570 registry pages diverges from that of the 1,018 canonical-content pages, and prioritize fixes accordingly.
Sitemap-as-signal. Google treats sitemap files as distinct prioritization signals. A focused 175KB sitemap-core.xml (1,018 pages) is far more likely to be fully crawled than the previous 1.2MB single file mixing canonical with bulk programmatic content.
Independent scaling. Crawl budget for registry/bibliography/monograph/results bulk content can now scale separately from canonical content over months as Google decides how much of each templated collection to index.

Implementation

_includes/sitemap-bucket.liquid — deterministic URL-prefix classifier assigning each item to exactly one of 6 buckets (verified mutually exclusive: 0 overlapping URLs across children).
sitemap.xml — rewritten from <urlset> to <sitemapindex> listing the 6 children with <lastmod>{{ site.time }}</lastmod> (any source change triggers a full rebuild).
6 new child sitemap templates — each filtering items by _bucket == "<name>" after the existing redirect and sitemap: false exclusion logic.
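
For readers who want to reason about the bucket rules without reading Liquid, here is a minimal Python sketch of a first-match URL-prefix classifier. It only mirrors the filters listed in the table above; the actual logic lives in `_includes/sitemap-bucket.liquid`, which is not reproduced here, so the names `PREFIX_RULES` and `bucket_for` and the rule ordering are assumptions for illustration.

```python
# Hypothetical Python mirror of the URL-prefix rules from the PR table.
# The real classifier is _includes/sitemap-bucket.liquid; names here are illustrative.

RESULTS_BULK_FACETS = (
    "additional-noteworthy-results", "problem", "physics", "life",
    "metaphysics", "mathematics", "calibration-cascade",
    "falsifications", "predictions",
)

# Ordered (prefix, bucket) pairs. The first matching prefix wins.
PREFIX_RULES = (
    [
        ("/registry/", "registry"),
        ("/bibliography/", "bibliography"),
        ("/corpus/monographs/", "corpus-bulk"),
        ("/corpus/taulib/", "corpus-bulk"),
    ]
    + [(f"/results/{facet}/", "results-bulk") for facet in RESULTS_BULK_FACETS]
    + [("/predictions/", "predictions")]
)


def bucket_for(url_path: str) -> str:
    """Return exactly one bucket name for a site-relative URL path."""
    for prefix, bucket in PREFIX_RULES:
        if url_path.startswith(prefix):
            return bucket
    # Anything not matched above falls through to the core sitemap.
    return "core"
```

Because the function returns a single value per path, no URL can land in two buckets. Under these rules a path such as `/results/predictions/browse/` resolves to `results-bulk`, not `predictions`, which matches the browse-page behaviour described in this PR.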

Verified locally

✅ All 6 child sitemaps build correctly
✅ Total URLs across children = 8,864 (matches the previous single-file count exactly)
✅ Mutual-exclusivity check: 0 URLs appear in two or more children
✅ All 18 canonical L1/L2/L3 spot-check URLs land in sitemap-core.xml:
- Homepage, all 11 lane roots
- Wolfram comparison page, construction-spine Steps 1/2/4, substrate-non-deferral, release-manifest
- WP000 + WP004 anchor documents
✅ Two browse aggregator pages (/results/{predictions,falsifications}/browse/) land in sitemap-results-bulk by URL-prefix rule — semantically correct since they aggregate templated content
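
The total-count and mutual-exclusivity checks above can be reproduced with a short script. This is a sketch using only the Python standard library, not the procedure actually used for this PR; the helper names `locs` and `check_children` are hypothetical, and the caller is assumed to have already read each built child sitemap into a string.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def locs(xml_text: str) -> list[str]:
    """Extract every <loc> value from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]


def check_children(children: dict[str, str]) -> tuple[int, list[str]]:
    """Return (total URL count, sorted URLs that appear in two or more children).

    `children` maps a child sitemap name to its XML text.
    """
    seen = Counter()
    total = 0
    for xml_text in children.values():
        urls = locs(xml_text)
        total += len(urls)
        seen.update(set(urls))  # count each URL at most once per child
    duplicates = sorted(url for url, n in seen.items() if n > 1)
    return total, duplicates
```

For the build described here, the expected result would be a total of 8,864 and an empty duplicate list.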

robots.txt and external references

robots.txt continues to reference /sitemap.xml — Google auto-discovers child sitemaps from the index. No change required.
No internal Liquid/include code references the old sitemap structure (verified via grep of _includes/, _layouts/, _config.yml).
GSC will need to re-fetch the sitemap to see the index structure; this happens automatically on next crawl, typically within hours of deploy.

Test plan

CI passes
On the deployed site:
- https://panta-rhei.site/sitemap.xml returns a <sitemapindex> listing 6 children
- All 6 child sitemap URLs return valid <urlset> XML
- Spot-check: sitemap-core.xml contains the Wolfram page, anchor docs, construction-spine steps
- Spot-check: sitemap-registry.xml contains 4,570 entries, all under /registry/*
Re-submit /sitemap.xml in GSC → confirm it now appears as "Sitemap index"
After 24–48h, GSC Sitemaps tab should show per-child URL counts roughly matching the table above

🤖 Generated with Claude Code

Before: a single /sitemap.xml urlset emitted ~8,875 URLs, with the 7,800+ auto-generated registry/bibliography/monograph/results-facet/prediction pages competing for the same crawl-budget signal as the ~1,000 human-authored canonical L0-L4 pages. In Google Search Console, the report aggregator therefore could not distinguish indexing progress on canonical content from indexing progress on bulk programmatic content.

After: /sitemap.xml is a sitemap INDEX referencing six mutually exclusive child sitemaps, classified by URL prefix in the single source of truth `_includes/sitemap-bucket.liquid`:

    /sitemap-core.xml           1,018  human-authored L0-L4 pages
    /sitemap-registry.xml       4,570  /registry/* registry objects
    /sitemap-bibliography.xml   1,136  /bibliography/* references
    /sitemap-corpus-bulk.xml    1,134  /corpus/monographs/* + /corpus/taulib/*
    /sitemap-results-bulk.xml     939  /results/{additional-noteworthy-results,
                                       problem,physics,life,metaphysics,
                                       mathematics,calibration-cascade,
                                       falsifications,predictions}/*
    /sitemap-predictions.xml       67  /predictions/*
    -------------------------------------------------------
    TOTAL                       8,864  (matches single-file v1 exactly)

### Why this helps

1. GSC can now report per-sitemap indexing status. The user can see whether the 4,570 registry pages indexing rate diverges from the 1,018 canonical-content pages indexing rate, and prioritize fixes accordingly.
2. Google's crawler treats sitemap files as distinct signals. A small focused sitemap-core (1,018 pages, 175KB) is far more likely to be fully crawled and indexed than a single 1.2MB file mixing canonical content with bulk programmatic content.
3. Crawl budget for the registry/bibliography/monograph bulk content can now scale independently from canonical content over time, as Google decides how much of each templated collection to index.

### Implementation

- `_includes/sitemap-bucket.liquid`: deterministic URL-prefix classifier that assigns each item to exactly one of six buckets (verified mutually exclusive: 0 overlapping URLs across children).
- `sitemap.xml`: rewritten from urlset to sitemapindex listing the six children with `<lastmod>{{ site.time }}</lastmod>` (any source change triggers a full rebuild).
- Six new child sitemap templates, each filtering items by `_bucket == "<name>"` after redirect and `sitemap: false` exclusion.

### Verified locally

- All six child sitemaps build correctly via `bundle exec jekyll build`.
- Total URLs across children = 8,864 (matches the previous single-file count exactly; no URL lost or duplicated).
- Mutual-exclusivity check: 0 URLs appear in two or more children.
- All 18 canonical L1/L2/L3 spot-check URLs land in sitemap-core (homepage, all 11 lane roots, Wolfram comparison, construction-spine steps, substrate-non-deferral, release-manifest, anchor docs WP000 and WP004). Browse aggregator pages (/results/{predictions,falsifications}/browse/) land in sitemap-results-bulk by URL-prefix rule, which is semantically correct (they aggregate templated content).

### robots.txt and external references

- robots.txt continues to reference `/sitemap.xml`. Google auto-discovers child sitemaps from the index; no change required.
- No internal Liquid/include code references the old sitemap structure (verified via grep of `_includes/`, `_layouts/`, `_config.yml`).
- GSC will need to re-fetch the sitemap to see the index structure; this happens automatically on next crawl, typically within hours.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After the sitemap split (this same PR), /sitemap.xml is a <sitemapindex> referencing six child sitemaps, so the old "grep -c '<loc>' sitemap.xml" check finds only 6 entries (one per child) and fails the ≥100 URL threshold. Replace with:

1. Assert /sitemap.xml is a <sitemapindex>
2. Assert all six child sitemap files exist
3. Assert each child has ≥ its expected minimum URL count
4. Assert total URLs across children ≥ 5000 (canonical ~8,864)

Per-child minimums are set well below the canonical counts (4,570 registry, 1,136 bibliography, 1,134 corpus-bulk, 1,018 core, 939 results-bulk, 67 predictions) so legitimate content growth or small re-classifications do not flake the check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
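
The four assertions in this commit message could look roughly like the following Python sketch. It is not the CI script from this repository. The per-child minimums in `CHILD_MINIMUMS` are placeholder values chosen only to sit below the canonical counts (the commit message does not state the real thresholds), and the function name `check_sitemaps` and the idea of pointing it at a build output directory are assumptions.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Placeholder minimums (assumed, not taken from the PR); each is well below
# the canonical count quoted in the commit message.
CHILD_MINIMUMS = {
    "sitemap-core.xml": 500,
    "sitemap-registry.xml": 2000,
    "sitemap-bibliography.xml": 500,
    "sitemap-corpus-bulk.xml": 500,
    "sitemap-results-bulk.xml": 400,
    "sitemap-predictions.xml": 30,
}
TOTAL_MINIMUM = 5000  # threshold stated in the commit message


def check_sitemaps(site_dir, minimums=CHILD_MINIMUMS, total_minimum=TOTAL_MINIMUM):
    """Return a list of failure messages; an empty list means all checks pass."""
    site = Path(site_dir)
    index_path = site / "sitemap.xml"
    if not index_path.is_file():
        return ["sitemap.xml is missing"]

    errors = []
    # 1. The top-level file must be a sitemap index, not a urlset.
    if ET.parse(index_path).getroot().tag != f"{NS}sitemapindex":
        errors.append("sitemap.xml is not a <sitemapindex>")

    total = 0
    for name, minimum in minimums.items():
        child = site / name
        # 2. Every expected child file must exist.
        if not child.is_file():
            errors.append(f"{name} is missing")
            continue
        # 3. Each child must meet its own minimum URL count.
        count = sum(1 for _ in ET.parse(child).getroot().iter(f"{NS}loc"))
        total += count
        if count < minimum:
            errors.append(f"{name} has {count} URLs, expected >= {minimum}")

    # 4. The combined count must meet the overall threshold.
    if total < total_minimum:
        errors.append(f"total is {total} URLs, expected >= {total_minimum}")
    return errors
```

A CI step could call `check_sitemaps("_site")` (assuming Jekyll's default output directory) and fail the job if the returned list is non-empty, printing each message.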

ThorFuchs requested a review from AnSoFuchs as a code owner May 11, 2026 22:07

ThorFuchs merged commit 97112b3 into main May 11, 2026
5 checks passed

ThorFuchs deleted the seo/sitemap-index-with-six-children branch May 11, 2026 22:22

ThorFuchs merged 2 commits into main from seo/sitemap-index-with-six-children
