seo(sitemap): split single sitemap into sitemap index + 6 child sitemaps #209

Merged
ThorFuchs merged 2 commits into main from seo/sitemap-index-with-six-children
May 11, 2026

@ThorFuchs
Collaborator

Summary

The single /sitemap.xml emitted all ~8,875 URLs in one urlset, so the 7,800+ auto-generated registry/bibliography/monograph/results-facet/prediction pages competed with the ~1,000 human-authored canonical L0–L4 pages for the same crawl-budget signal. As a result, the report aggregator in Google Search Console (GSC) could not distinguish indexing progress on canonical content from indexing progress on bulk programmatic content (current indexing rate ~42%, dominated by the 5,700+ templated pages).

This PR turns /sitemap.xml into a sitemap index referencing six mutually-exclusive child sitemaps, classified by URL prefix in a single source of truth (_includes/sitemap-bucket.liquid).

Six child sitemaps

| Child sitemap | URL filter | URLs |
|---|---|---|
| /sitemap-core.xml | everything not matched below | 1,018 (human-authored L0–L4) |
| /sitemap-registry.xml | /registry/* | 4,570 |
| /sitemap-bibliography.xml | /bibliography/* | 1,136 |
| /sitemap-corpus-bulk.xml | /corpus/monographs/*, /corpus/taulib/* | 1,134 |
| /sitemap-results-bulk.xml | /results/{additional-noteworthy-results,problem,physics,life,metaphysics,mathematics,calibration-cascade,falsifications,predictions}/* | 939 |
| /sitemap-predictions.xml | /predictions/* | 67 |
| Total across children | | 8,864 (= single-file v1 exactly) |

Why this helps SEO indexing

  1. GSC per-sitemap reporting. You'll see whether the indexing rate of the 4,570 registry pages diverges from that of the 1,018 canonical-content pages, and can prioritize fixes accordingly.

  2. Sitemap-as-signal. Google treats sitemap files as distinct prioritization signals. A focused 175KB sitemap-core.xml (1,018 pages) is far more likely to be fully crawled than the previous 1.2MB single file mixing canonical with bulk programmatic content.

  3. Independent scaling. Crawl budget for registry/bibliography/monograph/results bulk content can now scale separately from canonical content over months as Google decides how much of each templated collection to index.

Implementation

  • _includes/sitemap-bucket.liquid — deterministic URL-prefix classifier assigning each item to exactly one of 6 buckets (verified mutually exclusive: 0 overlapping URLs across children).
  • sitemap.xml — rewritten from <urlset> to <sitemapindex> listing the 6 children with <lastmod>{{ site.time }}</lastmod> (any source change triggers a full rebuild).
  • 6 new child sitemap templates — each filtering items by _bucket == "<name>" after the existing redirect and sitemap: false exclusion logic.
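
For readers who do not read Liquid, the prefix rules from the table can be sketched in Python. This is an illustrative mirror only: the real classifier is `_includes/sitemap-bucket.liquid` and may differ in detail. The names `RULES`, `RESULTS_BULK_FACETS`, and `classify` are hypothetical, and treating a bare prefix root such as `/registry/` as part of its bucket is an assumption.

```python
# Hypothetical Python mirror of the URL-prefix rules listed in this PR.
# The actual logic lives in _includes/sitemap-bucket.liquid.

RESULTS_BULK_FACETS = (
    "additional-noteworthy-results", "problem", "physics", "life",
    "metaphysics", "mathematics", "calibration-cascade",
    "falsifications", "predictions",
)

# Ordered (bucket, prefixes) rules. The first match wins, so every URL
# is assigned to exactly one bucket.
RULES = [
    ("registry", ("/registry/",)),
    ("bibliography", ("/bibliography/",)),
    ("corpus-bulk", ("/corpus/monographs/", "/corpus/taulib/")),
    ("results-bulk", tuple(f"/results/{facet}/" for facet in RESULTS_BULK_FACETS)),
    ("predictions", ("/predictions/",)),
]


def classify(url: str) -> str:
    """Return the bucket name for a site-relative URL path."""
    for bucket, prefixes in RULES:
        # str.startswith accepts a tuple and matches any of its prefixes.
        if url.startswith(prefixes):
            return bucket
    # Anything not matched by a bulk prefix falls through to core.
    return "core"
```

Because unmatched URLs fall through to `core`, a new top-level section would land in the canonical sitemap by default unless a rule is added for it.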

Verified locally

  • ✅ All 6 child sitemaps build correctly
  • ✅ Total URLs across children = 8,864 (matches the previous single-file count exactly)
  • ✅ Mutual-exclusivity check: 0 URLs appear in two or more children
  • ✅ All 18 canonical L1/L2/L3 spot-check URLs land in sitemap-core.xml:
    • Homepage, all 11 lane roots
    • Wolfram comparison page, construction-spine Steps 1/2/4, substrate-non-deferral, release-manifest
    • WP000 + WP004 anchor documents
  • ✅ Two browse aggregator pages (/results/{predictions,falsifications}/browse/) land in sitemap-results-bulk by URL-prefix rule — semantically correct since they aggregate templated content
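
The total-count and mutual-exclusivity checks can be reproduced with a short script along these lines. It is a sketch, not the check that was actually run for this PR; the helper names `locs` and `overlap_report` are hypothetical, and the input is assumed to be the XML text of each built child sitemap.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def locs(xml_text: str) -> list[str]:
    """Return every <loc> value found in one sitemap document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.findall(".//sm:loc", NS) if el.text]


def overlap_report(children: dict[str, str]) -> tuple[int, list[str]]:
    """Return (total URL count, URLs that appear in two or more children).

    `children` maps a child sitemap name to its XML text.
    """
    total = 0
    seen_in = Counter()
    for xml_text in children.values():
        urls = locs(xml_text)
        total += len(urls)
        # Count each URL once per child so in-file repeats are not
        # mistaken for cross-child overlap.
        seen_in.update(set(urls))
    duplicated = sorted(url for url, n in seen_in.items() if n > 1)
    return total, duplicated
```

With the six built files loaded into `children`, the figures reported above would correspond to a result of `(8864, [])`.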

robots.txt and external references

  • robots.txt continues to reference /sitemap.xml — Google auto-discovers child sitemaps from the index. No change required.
  • No internal Liquid/include code references the old sitemap structure (verified via grep of _includes/, _layouts/, _config.yml).
  • GSC will need to re-fetch the sitemap to see the index structure; this happens automatically on next crawl, typically within hours of deploy.

Test plan

  • CI passes
  • On the deployed site:
    • https://panta-rhei.site/sitemap.xml returns a <sitemapindex> listing 6 children
    • All 6 child sitemap URLs return valid <urlset> XML
    • Spot-check: sitemap-core.xml contains the Wolfram page, anchor docs, construction-spine steps
    • Spot-check: sitemap-registry.xml contains 4,570 entries, all under /registry/*
  • Re-submit /sitemap.xml in GSC → confirm it now appears as "Sitemap index"
  • After 24–48h, GSC Sitemaps tab should show per-child URL counts roughly matching the table above
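
The "all under /registry/*" spot-check could be scripted roughly as follows, assuming the child sitemap has already been downloaded as text. The function name `paths_outside_prefix` is hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Clark-notation namespace prefix for sitemaps.org elements.
SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def paths_outside_prefix(urlset_xml: str, prefix: str) -> list[str]:
    """Return the URL paths in a <urlset> that do not start with `prefix`."""
    root = ET.fromstring(urlset_xml)
    paths = [
        urlparse(loc.text.strip()).path
        for loc in root.iter(SM + "loc")
        if loc.text
    ]
    return [p for p in paths if not p.startswith(prefix)]
```

An empty list for `paths_outside_prefix(registry_xml, "/registry/")` means every entry sits under the expected prefix.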

🤖 Generated with Claude Code

Before: a single /sitemap.xml urlset emitted ~8,875 URLs, with the 7,800+
auto-generated registry/bibliography/monograph/results-facet/prediction
pages competing for the same crawl-budget signal as the ~1,000
human-authored canonical L0-L4 pages. In Google Search Console, the
report aggregator therefore could not distinguish indexing progress on
canonical content from indexing progress on bulk programmatic content.

After: /sitemap.xml is a sitemap INDEX referencing six mutually-
exclusive child sitemaps, classified by URL prefix in the single
source of truth `_includes/sitemap-bucket.liquid`:

  /sitemap-core.xml          1,018 human-authored L0-L4 pages
  /sitemap-registry.xml      4,570 /registry/* registry objects
  /sitemap-bibliography.xml  1,136 /bibliography/* references
  /sitemap-corpus-bulk.xml   1,134 /corpus/monographs/* + /corpus/taulib/*
  /sitemap-results-bulk.xml    939 /results/{additional-noteworthy-results,
                                    problem,physics,life,metaphysics,
                                    mathematics,calibration-cascade,
                                    falsifications,predictions}/*
  /sitemap-predictions.xml      67 /predictions/*
  -------------------------------------------------------
  TOTAL                      8,864  (matches single-file v1 exactly)

### Why this helps

1. GSC can now report per-sitemap indexing status. The user can see
   whether the indexing rate of the 4,570 registry pages diverges from
   that of the 1,018 canonical-content pages, and prioritize fixes
   accordingly.

2. Google's crawler treats sitemap files as distinct signals. A small
   focused sitemap-core (1,018 pages, 175KB) is far more likely to be
   fully crawled and indexed than a single 1.2MB file mixing canonical
   content with bulk programmatic content.

3. Crawl budget for the registry/bibliography/monograph bulk content
   can now scale independently from canonical content over time, as
   Google decides how much of each templated collection to index.

### Implementation

- `_includes/sitemap-bucket.liquid`: deterministic URL-prefix
  classifier that assigns each item to exactly one of six buckets
  (verified mutually exclusive: 0 overlapping URLs across children).
- `sitemap.xml`: rewritten from urlset to sitemapindex listing the
  six children with `<lastmod>{{ site.time }}</lastmod>` (any source
  change triggers a full rebuild).
- Six new child sitemap templates, each filtering items by
  `_bucket == "<name>"` after redirect and `sitemap: false` exclusion.

### Verified locally

- All six child sitemaps build correctly via `bundle exec jekyll build`.
- Total URLs across children = 8,864 (matches the previous single-file
  count exactly; no URL lost or duplicated).
- Mutual-exclusivity check: 0 URLs appear in two or more children.
- All 18 canonical L1/L2/L3 spot-check URLs land in sitemap-core
  (homepage, all 11 lane roots, Wolfram comparison, construction-spine
  steps, substrate-non-deferral, release-manifest, anchor docs WP000
  and WP004). Browse aggregator pages
  (/results/{predictions,falsifications}/browse/) land in
  sitemap-results-bulk by URL-prefix rule, which is semantically
  correct (they aggregate templated content).

### robots.txt and external references

- robots.txt continues to reference `/sitemap.xml`. Google
  auto-discovers child sitemaps from the index — no change required.
- No internal Liquid/include code references the old sitemap structure
  (verified via grep of `_includes/`, `_layouts/`, `_config.yml`).
- GSC will need to re-fetch the sitemap to see the index structure;
  this happens automatically on next crawl, typically within hours.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ThorFuchs ThorFuchs requested a review from AnSoFuchs as a code owner May 11, 2026 22:07
After the sitemap split (this same PR), /sitemap.xml is a
<sitemapindex> referencing six child sitemaps — so the old
"grep -c '<loc>' sitemap.xml" check finds only 6 entries
(one per child) and fails the ≥100 URL threshold.

Replace with:

  1. Assert /sitemap.xml is a <sitemapindex>
  2. Assert all six child sitemap files exist
  3. Assert each child has ≥ its expected minimum URL count
  4. Assert total URLs across children ≥ 5000 (canonical ~8,864)

Per-child minimums are set well below the canonical counts (4,570
registry, 1,136 bibliography, 1,134 corpus-bulk, 1,018 core, 939
results-bulk, 67 predictions) so legitimate content growth or
small re-classifications do not flake the check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
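
The four assertions in the commit message above can be sketched as follows. The CI check in this PR is not shown here and, judging by the `grep` reference, is probably shell, so Python is a substitute used purely for illustration. The per-child values in `MIN_URLS` are hypothetical placeholders (the real thresholds are not stated), the built-site directory is passed in by the caller, and `check_site` is an invented name.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Clark-notation namespace prefix for sitemaps.org elements.
SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical minimums, set well below the canonical counts.
# The thresholds actually used in CI are not given in this PR.
MIN_URLS = {
    "sitemap-core.xml": 500,
    "sitemap-registry.xml": 2000,
    "sitemap-bibliography.xml": 500,
    "sitemap-corpus-bulk.xml": 500,
    "sitemap-results-bulk.xml": 400,
    "sitemap-predictions.xml": 30,
}
MIN_TOTAL = 5000  # stated in the commit message


def check_site(site_dir, minimums=MIN_URLS, min_total=MIN_TOTAL) -> list[str]:
    """Return a list of failure messages; an empty list means all checks pass."""
    site = Path(site_dir)
    index = site / "sitemap.xml"
    if not index.is_file():
        return ["sitemap.xml is missing"]

    errors = []
    # 1. The root element must be <sitemapindex>, not <urlset>.
    if ET.parse(index).getroot().tag != SM + "sitemapindex":
        errors.append("sitemap.xml is not a <sitemapindex>")

    total = 0
    for name, minimum in minimums.items():
        child = site / name
        # 2. Every child sitemap file must exist.
        if not child.is_file():
            errors.append(f"{name} is missing")
            continue
        count = len(ET.parse(child).getroot().findall(f"{SM}url/{SM}loc"))
        total += count
        # 3. Each child must meet its own minimum.
        if count < minimum:
            errors.append(f"{name} has {count} URLs, expected at least {minimum}")

    # 4. The combined total must meet the overall threshold.
    if total < min_total:
        errors.append(f"total is {total} URLs, expected at least {min_total}")
    return errors
```

A CI step could call `check_site` on the build output directory and fail the job when the returned list is non-empty.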
@ThorFuchs ThorFuchs merged commit 97112b3 into main May 11, 2026
5 checks passed
@ThorFuchs ThorFuchs deleted the seo/sitemap-index-with-six-children branch May 11, 2026 22:22