A CI/CD-integrated SEO audit tool for static websites. Validates a freshly built site against 22 audit sections before the deploy goes live. Designed to be dropped into a build pipeline as a post-build, pre-traffic gate.
Originally built for speytech.com's atomic deploy pipeline (Astro static build, nginx, atomic dist swap). Generalisable to any static site that publishes a sitemap and serves over HTTPS.
The validator emulates Google's crawl behaviour and checks for the issues that most commonly degrade search visibility and break crawler trust:
- HTTP response codes for every URL in the sitemap
- Title, h1, meta description presence and length
- Canonical tag presence and correctness
- Trailing slash consistency (catches GSC "alternate page with canonical")
- Redirect chain detection
- Meta robots / noindex misconfiguration
- Open Graph tags (
og:title,og:image,og:description) - Viewport meta and favicon
- Heading hierarchy (scoped to
<main>/<article>content) - Image alt text coverage
- Internal broken links (
--thorough) - Orphan pages — in sitemap but not linked anywhere (
--thorough) - Duplicate titles and descriptions
- Mixed content (http resources on https pages)
- Soft 404 detection
robots.txt,llms.txt,llms-full.txtpresence- JSON-LD structured data
- Image-asset redirect verification (nginx-config-driven)
- Generic redirect-set verification (every exact-match nginx redirect)
- Orphan image audit (dist assets never referenced from HTML)
The last three are the v7.3 additions; see CHANGELOG-v7.3.0.md
for the motivation.
pip install requests beautifulsoup4 lxml
./seo_validator.py --domain example.comThis runs a sampled audit. For a full audit (every sitemap URL, every redirect verified):
./seo_validator.py --domain example.com --fullFor the full set including broken-link and orphan-page checks (adds HTTP requests, ~1–2 minutes on a typical site):
./seo_validator.py --domain example.com --full --thoroughExit code is 0 if all hard checks pass, 1 if any hard failure is detected.
INFO findings and warnings do not affect the exit code unless --strict
is passed.
| Flag | Description |
|---|---|
-d, --domain DOMAIN |
Domain to audit (required). |
-s, --sitemap URL |
Custom sitemap URL. Defaults to https://<domain>/sitemap-0.xml. |
-t, --timeout SECONDS |
Request timeout (default: 10). |
-w, --workers N |
Concurrent workers for URL checks (default: 10). |
--full |
Test every sitemap URL instead of sampling. |
--thorough |
Include broken-link and orphan-page checks. |
--strict |
Treat title/description length and orphan-image findings as failures. |
--skip-og |
Skip Open Graph audit. |
--skip-images |
Skip image alt text audit. |
--skip-image-redirects |
Skip the image-asset redirect audit (section 20). |
--skip-redirect-set |
Skip the generic redirect-set verification (section 21). |
--skip-orphan-images |
Skip the orphan image audit (section 22). |
--nginx-config PATH |
Path to nginx config for redirect discovery. Default: /etc/nginx/sites-available/speytech.com. |
--dist-path PATH |
Path to the built static site root for the orphan image audit. Default: ./dist. |
--image-dirs DIR[,DIR...] |
Comma-separated subdirectories under dist-path to scan for orphans. Default: images,og. |
-v, --version |
Print version and exit. |
-h, --help |
Print help and exit. |
- Sitemap fetch
- Basic SEO (status + core elements)
- Title and description length
- Meta robots (noindex detection)
- Open Graph tags (presence + og:image URL resolves to HTTP 2xx)
- Viewport meta (mobile-friendliness)
- Favicon
- Heading hierarchy
- Image alt text
- Canonical tags
- Trailing slash canonical (GSC "alternate page" check)
- Redirect chains
- Duplicate content (titles and descriptions)
- Mixed content (http on https)
- Soft 404 detection
- Internal broken links (
--thorough) - Orphan pages (
--thorough) - Critical files (
robots.txt,llms.txt,llms-full.txt) - JSON-LD schema validation
- Image-asset redirect audit
- Redirect set verification (with duplicate-source detection)
- Orphan image audit (walks
dist/images/anddist/og/)
============================================================
SEO Validator v7.3.0 - Comprehensive Site Audit
Domain: example.com
Mode: Sampling
============================================================
Fetching sitemap from: https://example.com/sitemap-0.xml ... OK (85 URLs)
Collecting page data for 85 URLs...
Collected data for 85 pages.
=== 2. Basic SEO Audit (Status + Core Elements) ===
✓ https://example.com/ -> 200 (title, h1, description)
...
=== 20. Image-Asset Redirect Audit ===
Discovered 5 image redirect rule(s) from /etc/nginx/sites-available/example.com. Verifying...
✓ /images/old-name-a.svg -> 301 -> /images/new-name-a.svg -> 200
✓ /images/old-name-b.svg -> 301 -> /images/new-name-b.svg -> 200
...
Image-asset redirect summary: 0 issues found
=== 21. Redirect Set Verification ===
Discovered 17 exact-match redirect rule(s) from /etc/nginx/sites-available/example.com. Verifying...
✓ /contact-us/ -> 301 -> /contact/ -> 200
✓ /home-temp/ -> 410
...
Redirect set summary: 0 issues found
=== 22. Orphan Image Audit ===
67 image asset(s) in dist, 67 referenced from rendered HTML or redirect targets.
Orphan image summary: 0 issues found
============================================================
AUDIT PASSED. All systems nominal.
============================================================
When something fails, the section reports the issue clearly and the final summary turns red:
=== 20. Image-Asset Redirect Audit ===
Discovered 5 image redirect rule(s) from /etc/nginx/sites-available/example.com. Verifying...
✓ /images/old-name-a.svg -> 301 -> /images/new-name-a.svg -> 200
✗ /images/old-name-b.svg -> 301 -> /images/new-name-b.svg -> 404 (target broken)
└─ Redirect target is broken. Either restore the target or remove the redirect.
...
Image-asset redirect summary: 1 issue found
============================================================
AUDIT FAILED. See errors above.
============================================================
| Code | Meaning |
|---|---|
| 0 | All hard checks passed. |
| 1 | One or more hard failures detected. |
INFO findings and warnings never affect the exit code unless --strict
is passed, in which case INFO findings escalate to failures.
When wiring into CI, run the validator without piping if you need the
exit code. A trailing | tee or | grep will mask the validator's
exit code with the pipe-tail's. Use ${PIPESTATUS[0]} in bash if you
must pipe:
# Correct: exit code captured cleanly
python3 seo_validator.py --domain example.com
echo $?
# Also correct (bash): exit code via PIPESTATUS
python3 seo_validator.py --domain example.com 2>&1 | tee /tmp/audit.log
echo "${PIPESTATUS[0]}"
# Incorrect: $? reports tee's exit code, not the validator's
python3 seo_validator.py --domain example.com | tee /tmp/audit.log
echo $? # always 0 if tee writes successfullyThe validator reads two filesystem paths:
-
nginx config (
--nginx-config): used by sections 20 and 21 to discover redirect rules. The validator extracts everylocation = X { return 301 Y; }andlocation = X { return 410; }rule. If the file is unreadable, a small inlineKNOWN_REDIRECTSfallback is used and a warning is emitted. -
dist path (
--dist-path): used by section 22 to walk the built site's image assets. Defaults to./dist. If the path is not a directory, section 22 skips with a warning.
Both paths are read-only. The validator never writes to disk.
- Python 3.8+
requestsbeautifulsoup4lxml(for XML sitemap parsing)
pip install requests beautifulsoup4 lxmlNo other runtime dependencies.
Designed to run after build, before traffic. The reference integration
in build.sh for speytech.com:
# Build to a staging directory
npx astro build --outDir ./dist.new
# Atomic swap
sudo mv dist dist.old
sudo mv dist.new dist
# Validate the live site post-swap
if ! python3 scripts/seo_validator.py --domain example.com; then
# Roll back to the previous build
sudo mv dist dist.failed
sudo mv dist.old dist
exit 1
fiThe validator hits the live HTTPS site, so it sees what real crawlers will see, including nginx redirects, security headers, and the full response chain.
MIT. See LICENSE.
Built for speytech.com's CI pipeline. The repo
sits inside the site repo as a deployment-time tool, but the script
itself is generic — it makes no speytech.com-specific assumptions beyond
the default --nginx-config path, which is overridable. Drop it into any
static-site CI that wants a pre-traffic SEO gate.