seo(sitemap): split single sitemap into sitemap index + 6 child sitemaps #209
Merged
Before: a single /sitemap.xml urlset emitted ~8,875 URLs, with the 7,800+
auto-generated registry/bibliography/monograph/results-facet/prediction
pages competing for the same crawl-budget signal as the ~1,000
human-authored canonical L0-L4 pages. In Google Search Console, the
report aggregator therefore could not distinguish indexing progress on
canonical content from indexing progress on bulk programmatic content.
After: /sitemap.xml is a sitemap INDEX referencing six mutually-
exclusive child sitemaps, classified by URL prefix in the single
source of truth `_includes/sitemap-bucket.liquid`:
    /sitemap-core.xml            1,018  human-authored L0-L4 pages
    /sitemap-registry.xml        4,570  /registry/* registry objects
    /sitemap-bibliography.xml    1,136  /bibliography/* references
    /sitemap-corpus-bulk.xml     1,134  /corpus/monographs/* + /corpus/taulib/*
    /sitemap-results-bulk.xml      939  /results/{additional-noteworthy-results,
                                        problem,physics,life,metaphysics,
                                        mathematics,calibration-cascade,
                                        falsifications,predictions}/*
    /sitemap-predictions.xml        67  /predictions/*
    -------------------------------------------------------
    TOTAL                        8,864  (matches single-file v1 exactly)
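
The bucket table above can be modelled outside Liquid to reason about the classification rules. The following is a minimal Python sketch, not the contents of `_includes/sitemap-bucket.liquid`: the names `PREFIX_RULES` and `classify` are hypothetical, and two behaviours are assumptions rather than facts from this PR, namely that `core` is the fallback for any unmatched path and that a prefix root such as `/registry/` itself is not counted as being under `/registry/*`.

```python
# Facet names copied from the /sitemap-results-bulk.xml row of the table.
RESULTS_FACETS = (
    "additional-noteworthy-results", "problem", "physics", "life",
    "metaphysics", "mathematics", "calibration-cascade",
    "falsifications", "predictions",
)

# (prefix, bucket) pairs. No prefix is a prefix of another, so rule order
# does not affect the result.
PREFIX_RULES = [
    ("/registry/", "registry"),
    ("/bibliography/", "bibliography"),
    ("/corpus/monographs/", "corpus-bulk"),
    ("/corpus/taulib/", "corpus-bulk"),
    ("/predictions/", "predictions"),
] + [(f"/results/{facet}/", "results-bulk") for facet in RESULTS_FACETS]


def classify(url_path: str) -> str:
    """Return exactly one bucket name for a site-relative URL path."""
    for prefix, bucket in PREFIX_RULES:
        # Assumption: only paths strictly below the prefix match "prefix/*".
        if url_path.startswith(prefix) and len(url_path) > len(prefix):
            return bucket
    return "core"  # assumption: unmatched paths are canonical content
```

Because the function returns a single value per path, every URL lands in exactly one bucket by construction; for example, `classify("/results/predictions/browse/")` yields `"results-bulk"` while `classify("/predictions/p-001/")` (a made-up path) yields `"predictions"`.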
### Why this helps
1. GSC can now report per-sitemap indexing status. The user can see
whether the indexing rate of the 4,570 registry pages diverges from
that of the 1,018 canonical-content pages, and prioritize fixes
accordingly.
2. Google's crawler treats sitemap files as distinct signals. A small
focused sitemap-core (1,018 pages, 175KB) is far more likely to be
fully crawled and indexed than a single 1.2MB file mixing canonical
content with bulk programmatic content.
3. Crawl budget for the registry/bibliography/monograph bulk content
can now scale independently from canonical content over time, as
Google decides how much of each templated collection to index.
### Implementation
- `_includes/sitemap-bucket.liquid`: deterministic URL-prefix
classifier that assigns each item to exactly one of six buckets
(verified mutually exclusive: 0 overlapping URLs across children).
- `sitemap.xml`: rewritten from urlset to sitemapindex listing the
six children with `<lastmod>{{ site.time }}</lastmod>` (any source
change triggers a full rebuild).
- Six new child sitemap templates, each filtering items by
`_bucket == "<name>"` after redirect and `sitemap: false` exclusion.
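
To make the shape of the new index concrete, here is a small Python sketch that renders an equivalent `<sitemapindex>` document. It is an illustration only, since the real file is a Jekyll template: `build_sitemap_index` and `CHILDREN` are hypothetical names, the host is the one that appears in this PR's test plan, and the `build_time` argument stands in for Jekyll's `site.time`.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
BASE_URL = "https://panta-rhei.site"  # host as given in the PR's test plan
CHILDREN = ["core", "registry", "bibliography",
            "corpus-bulk", "results-bulk", "predictions"]


def build_sitemap_index(build_time: datetime) -> str:
    """Render a <sitemapindex> with one <sitemap> entry per child file."""
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default xmlns
    root = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for name in CHILDREN:
        entry = ET.SubElement(root, f"{{{SITEMAP_NS}}}sitemap")
        loc = ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc")
        loc.text = f"{BASE_URL}/sitemap-{name}.xml"
        lastmod = ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod")
        lastmod.text = build_time.isoformat()  # same stamp for every child
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap_index(datetime.now(timezone.utc)))
```

Every child carries the same `<lastmod>` because the value comes from the build time rather than from per-page modification dates, which mirrors the `{{ site.time }}` choice described above.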
### Verified locally
- All six child sitemaps build correctly via `bundle exec jekyll build`.
- Total URLs across children = 8,864 (matches the previous single-file
count exactly; no URL lost or duplicated).
- Mutual-exclusivity check: 0 URLs appear in two or more children.
- All 18 canonical L1/L2/L3 spot-check URLs land in sitemap-core
(homepage, all 11 lane roots, Wolfram comparison, construction-spine
steps, substrate-non-deferral, release-manifest, anchor docs WP000
and WP004). Browse aggregator pages
(/results/{predictions,falsifications}/browse/) land in
sitemap-results-bulk by URL-prefix rule, which is semantically
correct (they aggregate templated content).
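
The total-count and mutual-exclusivity checks listed above could be reproduced with a short script along these lines. This is a sketch of one way to do it, not the procedure actually used for this PR; `extract_locs` and `check_partition` are hypothetical helper names, and the input is assumed to be the text of each built child sitemap keyed by bucket name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def extract_locs(urlset_xml: str) -> list[str]:
    """Return every <loc> value found in one child sitemap document."""
    root = ET.fromstring(urlset_xml)
    return [el.text.strip() for el in root.iter(LOC_TAG) if el.text]


def check_partition(children: dict[str, str]) -> tuple[int, list[str]]:
    """Return (total URL count, URLs listed more than once in the set)."""
    counts = Counter(
        loc for xml_text in children.values() for loc in extract_locs(xml_text)
    )
    duplicates = sorted(url for url, n in counts.items() if n > 1)
    return sum(counts.values()), duplicates
```

With the six built files loaded into `children`, the figures reported above correspond to a returned total of 8,864 and an empty duplicates list.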
### robots.txt and external references
- robots.txt continues to reference `/sitemap.xml`. Google
auto-discovers child sitemaps from the index — no change required.
- No internal Liquid/include code references the old sitemap structure
(verified via grep of `_includes/`, `_layouts/`, `_config.yml`).
- GSC will need to re-fetch the sitemap to see the index structure;
this happens automatically on next crawl, typically within hours.
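
As a quick sanity check that `robots.txt` still advertises the index, the standard-library robots parser can list the declared sitemaps. The sketch below assumes Python 3.8 or later (for `site_maps()`); `sitemap_urls` is a hypothetical helper, and the sample text in the usage note is illustrative rather than the site's actual file.

```python
from urllib.robotparser import RobotFileParser


def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the URLs declared on Sitemap: lines of a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []
```

For instance, passing a body containing the line `Sitemap: https://panta-rhei.site/sitemap.xml` returns a one-element list with that URL, which is all a crawler needs in order to find the six children through the index.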
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the sitemap split (this same PR), /sitemap.xml is a <sitemapindex>
referencing six child sitemaps, so the old "grep -c '<loc>' sitemap.xml"
check finds only 6 entries (one per child) and fails the ≥100 URL
threshold. Replace with:
1. Assert /sitemap.xml is a <sitemapindex>
2. Assert all six child sitemap files exist
3. Assert each child has ≥ its expected minimum URL count
4. Assert total URLs across children ≥ 5000 (canonical ~8,864)
Per-child minimums are set well below the canonical counts (4,570
registry, 1,136 bibliography, 1,134 corpus-bulk, 1,018 core, 939
results-bulk, 67 predictions) so legitimate content growth or small
re-classifications do not flake the check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
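
The four assertions in this commit message can be sketched as a standalone Python check over the build output directory. This is an illustrative version under stated assumptions, not the check that was committed: `check_sitemaps` is a hypothetical name, and the per-child floors in `MIN_URLS` are invented placeholders, because the commit message says only that the real minimums sit well below the canonical counts. The 5000 total floor is the one quoted above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Placeholder floors (assumption): NOT the values used by the real check.
MIN_URLS = {
    "core": 500, "registry": 2000, "bibliography": 500,
    "corpus-bulk": 500, "results-bulk": 400, "predictions": 30,
}
MIN_TOTAL = 5000  # total floor stated in the commit message


def check_sitemaps(site_dir: Path, min_urls=MIN_URLS, min_total=MIN_TOTAL):
    """Return a list of failure messages; an empty list means success."""
    index_path = site_dir / "sitemap.xml"
    if not index_path.is_file():
        return ["sitemap.xml is missing"]
    failures = []
    # 1. The top-level file must be an index, not a urlset.
    root_tag = ET.parse(index_path).getroot().tag
    if root_tag != f"{NS}sitemapindex":
        failures.append(f"sitemap.xml root is {root_tag}, expected sitemapindex")
    total = 0
    for name, minimum in min_urls.items():
        child = site_dir / f"sitemap-{name}.xml"
        # 2. Every child file must exist.
        if not child.is_file():
            failures.append(f"{child.name} is missing")
            continue
        # 3. Each child must meet its own floor.
        count = sum(1 for _ in ET.parse(child).getroot().iter(f"{NS}loc"))
        total += count
        if count < minimum:
            failures.append(f"{child.name}: {count} URLs < minimum {minimum}")
    # 4. The combined count must meet the overall floor.
    if total < min_total:
        failures.append(f"total {total} URLs < minimum {min_total}")
    return failures
```

Returning a list of messages rather than raising on the first problem means a single run reports every failing child, which is convenient when a re-classification moves URLs between buckets.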
### Summary
The single `/sitemap.xml` was emitting all ~8,875 URLs in one urlset, meaning the 7,800+ auto-generated registry/bibliography/monograph/results-facet/prediction pages were competing with the ~1,000 human-authored canonical L0–L4 pages for the same crawl-budget signal. In Google Search Console, the report aggregator could not distinguish indexing progress on canonical content from indexing progress on bulk programmatic content (current indexing rate ~42%, dominated by the 5,700+ templated pages).

This PR turns `/sitemap.xml` into a sitemap index referencing six mutually-exclusive child sitemaps, classified by URL prefix in a single source of truth (`_includes/sitemap-bucket.liquid`).

### Six child sitemaps
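| Sitemap | URLs | Contents |
|---|---|---|
| `/sitemap-core.xml` | 1,018 | human-authored L0–L4 pages |
| `/sitemap-registry.xml` | 4,570 | `/registry/*` |
| `/sitemap-bibliography.xml` | 1,136 | `/bibliography/*` |
| `/sitemap-corpus-bulk.xml` | 1,134 | `/corpus/monographs/*`, `/corpus/taulib/*` |
| `/sitemap-results-bulk.xml` | 939 | `/results/{additional-noteworthy-results,problem,physics,life,metaphysics,mathematics,calibration-cascade,falsifications,predictions}/*` |
| `/sitemap-predictions.xml` | 67 | `/predictions/*` |

### Why this helps SEO indexing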
1. GSC per-sitemap reporting. You'll see whether the indexing rate of the 4,570 registry pages diverges from that of the 1,018 canonical-content pages, and can prioritize fixes accordingly.
2. Sitemap-as-signal. Google treats sitemap files as distinct prioritization signals. A focused 175KB `sitemap-core.xml` (1,018 pages) is far more likely to be fully crawled than the previous 1.2MB single file mixing canonical with bulk programmatic content.
3. Independent scaling. Crawl budget for registry/bibliography/monograph/results bulk content can now scale separately from canonical content over months as Google decides how much of each templated collection to index.
### Implementation
- `_includes/sitemap-bucket.liquid`: deterministic URL-prefix classifier assigning each item to exactly one of 6 buckets (verified mutually exclusive: 0 overlapping URLs across children).
- `sitemap.xml`: rewritten from `<urlset>` to `<sitemapindex>` listing the 6 children with `<lastmod>{{ site.time }}</lastmod>` (any source change triggers a full rebuild).
- Six new child sitemap templates, each filtering items by `_bucket == "<name>"` after the existing redirect and `sitemap: false` exclusion logic.

### Verified locally
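- All 18 canonical L1/L2/L3 spot-check URLs land in `sitemap-core.xml`: homepage, all 11 lane roots, Wolfram comparison, construction-spine steps, substrate-non-deferral, release-manifest, anchor docs WP000 and WP004.
- Browse aggregator pages (`/results/{predictions,falsifications}/browse/`) land in `sitemap-results-bulk` by URL-prefix rule, which is semantically correct since they aggregate templated content.

### robots.txt and external references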
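- `robots.txt` continues to reference `/sitemap.xml`. Google auto-discovers child sitemaps from the index, so no change is required.
- No internal Liquid/include code references the old sitemap structure (verified via grep of `_includes/`, `_layouts/`, `_config.yml`).

### Test plan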
- `https://panta-rhei.site/sitemap.xml` returns a `<sitemapindex>` listing 6 children
- […] `<urlset>` XML
- `sitemap-core.xml` contains the Wolfram page, anchor docs, construction-spine steps
- `sitemap-registry.xml` contains 4,570 entries, all under `/registry/*`
- […] `/sitemap.xml` in GSC → confirm it now appears as "Sitemap index"

🤖 Generated with Claude Code