A CLI-friendly crawler that can optionally authenticate, crawl a website, and mirror pages and static assets into a local directory so the result can be served by a static web server.
The project targets modern SPAs and Next.js-style applications where content is rendered dynamically and traditional tools like `wget` or `curl` often fail to capture fully working pages.
- Optional authentication flow
  - Fills in login/password inputs
  - Submits the form
  - Waits for redirect after successful login
- Playwright-based rendering
  - Supports SPAs, hydration, and client-side routing
  - Handles dynamically loaded content
- Mirrors HTML pages
  - Saved to `out/pages/**/index.html`
- Mirrors many static assets
  - Examples: `/_next/**`, `*.css`, `*.js`, images, fonts, etc.
  - Mirrors same-origin non-HTML document payloads as assets when frameworks use them for data transport
  - Saved to `out/assets/**` and `out/assets_q/**`
- Single browser session / session pool
  - Designed to improve reliability during authenticated crawling
- Additional URL discovery
  - Extracts candidate links from the rendered DOM
  - Reads Next.js `__NEXT_DATA__` from the page
  - Parses `/_next/data/**.json` payloads from intercepted responses
  - Helps discover routes referenced in JSON/JS, not only in `<a>` tags
- Redirect behavior capture (hybrid)
  - Collects HTTP redirect edges from observed 3xx chains
  - Captures client-side redirects when the loaded URL changes in the browser
  - Exports high-confidence Caddy redirect rules to `out/redirects.caddy`
  - Creates HTML redirect pages for missing source pages as a static-hosting fallback
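The `__NEXT_DATA__`/JSON-based URL discovery described above can be sketched as a recursive walk over the parsed payload. This is a minimal illustration, not the crawler's actual extraction logic; the path-like-string heuristic and the `extract_candidate_routes` name are assumptions.

```python
import json

def extract_candidate_routes(next_data_json: str) -> set:
    """Walk a __NEXT_DATA__-style payload and collect path-like strings.

    Illustrative heuristic: any string that starts with "/" and
    contains no spaces is treated as a candidate route.
    """
    candidates = set()

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str):
            if node.startswith("/") and " " not in node:
                candidates.add(node)

    walk(json.loads(next_data_json))
    return candidates

payload = '{"props": {"pages": [{"href": "/about"}, {"href": "/blog/post-1"}]}}'
print(sorted(extract_candidate_routes(payload)))
# ['/about', '/blog/post-1']
```

A real implementation would also need to resolve relative URLs and deduplicate against already-visited routes.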
```
out/
  redirects.caddy
  pages/
    index.html
    nested_page/index.html
    ...
  pages_q/
    search/
      page=2/index.html
    ...
  assets/
    _next/static/...
    logo.svg
    favicon.ico
    ...
  assets_q/
    _next/static/chunk.js/
      v=123
    ...
```
Typical serving layout:
- `out/pages` → HTML root
- `out/pages_q` → query HTML variants (e.g. `/search?page=2`)
- `out/assets` → static files root (or mounted under `/`, depending on server configuration)
- `out/assets_q` → query-based static variants (e.g. `/app.js?v=123`)
- `out/redirects.caddy` → generated Caddy `redir` rules from observed redirects
- `out/pages` and `out/pages_q` may include generated HTML redirect pages for missing sources
- Install uv
- Install dependencies:

  ```shell
  make install-deps
  ```
The crawler is implemented as:

- An async Python function `crawl(config)`
- A Typer CLI wrapper

Basic flow:

```shell
make help
```
Then review these files for practical usage examples and deployment templates:
- `Makefile`
- `Dockerfile.spa-crawler`
- `Dockerfile.spa`
- `docker-compose.spa.yml`
- `Caddyfile`
On every push to main, GitHub Actions publishes the crawler image to GHCR:
- `ghcr.io/hu553in/spa-crawler:latest`
- `ghcr.io/hu553in/spa-crawler:sha-<commit>`
Build source: `Dockerfile.spa-crawler`
The image expects crawler arguments at runtime. In practice you usually want to mount `out/` so mirrored files remain on the host after the container exits.
Minimal example:

```shell
docker run --rm \
  -v "$(pwd)/out:/app/out" \
  ghcr.io/hu553in/spa-crawler:latest \
  --base-url https://example.com \
  --no-login-required
```

Authenticated example:

```shell
docker run --rm \
  -v "$(pwd)/out:/app/out" \
  -e SPA_CRAWLER_LOGIN="$SPA_CRAWLER_LOGIN" \
  -e SPA_CRAWLER_PASSWORD="$SPA_CRAWLER_PASSWORD" \
  -e CRAWLEE_MEMORY_MBYTES=20000 \
  -e CRAWLEE_MAX_USED_MEMORY_RATIO=0.95 \
  ghcr.io/hu553in/spa-crawler:latest \
  --base-url https://example.com \
  --login-required \
  --login-path /login \
  --login-input-selector "input[name='login']:visible" \
  --password-input-selector "input[name='password']:visible"
```

Notes:
- `SPA_CRAWLER_LOGIN` and `SPA_CRAWLER_PASSWORD` are read from the environment.
- `--base-url` still has to be passed explicitly; otherwise Typer prompts for it.
- The published Docker image supports only headless mode. `--no-headless` is rejected in containers.
- Mounting `/app/out` is strongly recommended. Without it, crawl output stays only inside the container filesystem.
- `/app/storage` is declared as a volume for Crawlee runtime state. Mount it too if you want to inspect or persist that state across runs.
- Include links: `{base_url}/**` when no include filters are provided
- Exclude links: login regex only (`.*{login_path}.*`) when `--login-required` is set
- API path prefixes: empty by default; add `--api-path-prefix` values if you want API routes excluded from page discovery, asset mirroring, and redirect collection
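Under these defaults, link filtering can be approximated as below. The `should_crawl` function and its glob/regex translation are hypothetical, shown only to illustrate the include/exclude semantics, not the crawler's actual implementation.

```python
import fnmatch
import re

def should_crawl(url: str, base_url: str, login_path) -> bool:
    # Include rule: {base_url}/** when no include filters are provided.
    if not fnmatch.fnmatch(url, f"{base_url}/**"):
        return False
    # Exclude rule: login regex (.*{login_path}.*) when --login-required is set.
    if login_path and re.search(f".*{re.escape(login_path)}.*", url):
        return False
    return True

print(should_crawl("https://example.com/about", "https://example.com", None))
# True
print(should_crawl("https://example.com/login", "https://example.com", "/login"))
# False
```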
This project only produces a mirrored static copy of a website. You are responsible for deciding how and where to deploy or serve it.
Example deployment stack included:

- `Dockerfile.spa`
- `docker-compose.spa.yml`
- `Caddyfile`
- Environment configuration via `.env`
`Caddyfile` imports `/srv/redirects.caddy`; `Dockerfile.spa` creates a no-op placeholder for this file when it is absent. `Caddyfile` also normalizes non-`GET`/`HEAD` methods by redirecting them to `GET` with `303` on the same URI, to avoid `405 Method Not Allowed` errors on static mirrors.
To use HTTP basic authentication with Caddy, generate a password hash:

```shell
caddy hash-password
```

Then set the environment variables used by `Caddyfile`:

- `ENABLE_BASIC_AUTH=true`
- `BASIC_AUTH_USER=<username>`
- `BASIC_AUTH_PASSWORD_HASH=<output from previous command>`
The repository ships only a Caddy serving configuration. For any other server, you must reimplement the same URL-to-filesystem lookup behavior.
What must be ported from the Caddyfile logic:

- Page lookup without query: `/pages{path}` → `/pages{path}/index.html` → `/pages{path}.html`
- Page lookup with query: `/pages_q{path}/{query}` → `/pages_q{path}/{query}/index.html` → `/pages_q{path}/{query}.html` (with fallback to non-query pages)
- Asset lookup without query: `/assets{path}` → `/assets{path}.*` → `/assets{path}.bin`
- Asset lookup with query: `/assets_q{path}/{query}` (with fallback to non-query assets)
- Header policy: immutable cache for `/_next/*`, no-cache for mirrored HTML pages
- Method policy: non-`GET`/`HEAD` requests are redirected with `303` to the same URI before static lookup
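The page fallback chains can be expressed as an ordered candidate list that a server tries top to bottom. This is a sketch under the stated lookup rules; the `page_candidates` function name is made up for illustration.

```python
def page_candidates(path: str, query) -> list:
    """Return filesystem lookup candidates in the order a server
    should try them, mirroring the page fallback chains above."""
    if query:
        q = f"/pages_q{path}/{query}"
        # Query variants first, then fall back to the non-query pages.
        return [q, f"{q}/index.html", f"{q}.html"] + page_candidates(path, None)
    p = f"/pages{path}"
    return [p, f"{p}/index.html", f"{p}.html"]

print(page_candidates("/search", "page=2"))
# ['/pages_q/search/page=2', '/pages_q/search/page=2/index.html',
#  '/pages_q/search/page=2.html', '/pages/search',
#  '/pages/search/index.html', '/pages/search.html']
```

Asset lookup would follow the same pattern with the `/assets` and `/assets_q` prefixes and the `.*`/`.bin` suffix fallbacks.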
Redirect support must also be ported:
- Current export is Caddy-specific (`out/redirects.caddy` with `redir` directives)
- For another server, add a converter step (from observed redirects to that server's syntax) or implement a new Python exporter
- HTML redirect pages in `out/pages` and `out/pages_q` are server-agnostic fallbacks and should still work if lookup is ported correctly
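A converter step could look like the minimal sketch below. It assumes each exported line is a `redir <from> <to> [code]` directive; the real exported format may differ, so verify against an actual `out/redirects.caddy` before relying on it.

```python
def caddy_redir_to_nginx(line: str):
    """Translate a Caddy `redir` directive into an Nginx location block.

    Assumes lines of the form: redir <from> <to> [code]
    (an assumption about the export format -- verify before use).
    Returns None for lines that are not redir directives.
    """
    parts = line.split()
    if len(parts) < 3 or parts[0] != "redir":
        return None
    src, dst = parts[1], parts[2]
    code = parts[3] if len(parts) > 3 else "302"
    return f"location = {src} {{ return {code} {dst}; }}"

print(caddy_redir_to_nginx("redir /old-page /new-page 301"))
# location = /old-page { return 301 /new-page; }
```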
For Nginx specifically, reproducing query-based lookup (`{query}` in the filesystem path) and fallback chains usually requires njs or careful `map` + `try_files` composition.
This is a hobby / experimental project. It aims to handle modern SPAs reasonably well but is not a fully robust site-mirroring solution.
Session behavior is currently hardcoded. There are no CLI arguments to tune session pool settings or advanced browser session parameters.
Authenticated crawling may require manual code adjustments.
At high concurrency levels the crawler may:
- Consume large amounts of RAM
- Trigger repeated warnings about memory limits
- Become unstable or slower
Recommended approach:
- Use low concurrency
- For authenticated crawling, use `concurrency = 1`
You can tune Crawlee memory behavior via environment variables:
- `CRAWLEE_MEMORY_MBYTES`: absolute memory limit (in MB) used by Crawlee autoscaling
- `CRAWLEE_MAX_USED_MEMORY_RATIO`: fraction of that limit that can be used before throttling

Example `.env` values:

```
CRAWLEE_MEMORY_MBYTES=20000
CRAWLEE_MAX_USED_MEMORY_RATIO=0.95
```
Tuning guidance:
- Lower values can reduce OOM risk on smaller machines
- Higher values can improve throughput on larger machines, but may increase RAM pressure
During crawling you may see large amounts of:
- 404 responses
- Failed asset requests
- Transient navigation errors
This is expected behavior for modern SPAs and does not necessarily indicate crawler failure.
The crawler intentionally prioritizes successful page mirroring over eliminating every failed request.
The crawler downloads many static assets but cannot guarantee complete asset capture.
Some resources may be skipped due to:
- Streaming or opaque responses
- Dynamically generated URLs
- Authentication-protected resources
- Browser caching behavior
- Implementation complexity
- Unsafe or ambiguous query strings for static-server mapping
The mirrored site may occasionally require manual fixes.
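The "unsafe or ambiguous query strings" point can be illustrated with a simple safety check before a query string is turned into a directory name. The rules below (character allowlist, length cap) are hypothetical, not the crawler's actual policy.

```python
import string

# Illustrative allowlist: characters that are safe in a directory
# name on common filesystems and unambiguous for static lookup.
SAFE_QUERY_CHARS = set(string.ascii_letters + string.digits + "=-_.&")

def query_is_mappable(query: str) -> bool:
    """Return True if the query string can safely become part of a
    filesystem path (hypothetical policy for illustration)."""
    if not query or len(query) > 200:
        return False
    return all(ch in SAFE_QUERY_CHARS for ch in query)

print(query_is_mappable("page=2"))          # True
print(query_is_mappable("redirect=%2F.."))  # False: '%' not in allowlist
```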
The crawler attempts to discover routes using:
- DOM extraction
- `__NEXT_DATA__` parsing
- `/_next/data/**.json` parsing
However, if a route is only accessible via complex client logic or hidden interactions, it may never be discovered automatically.
Manual entrypoints may be required.
`out/redirects.caddy` and generated HTML redirect pages are based only on redirects observed during the crawl.
This means:
- Paths never visited during the crawl will not have redirect rules
- Ambiguous source URLs may be ignored if confidence is below threshold
- Only one best target per source URL is exported
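"Only one best target per source URL" can be modeled as keeping the most frequently observed target when it dominates the observations for that source. The counting scheme and threshold below are illustrative assumptions, not the project's actual confidence logic.

```python
from collections import Counter

def best_targets(edges, min_ratio=0.8):
    """For each source URL, keep the single most common target, but
    only if it accounts for at least `min_ratio` of that source's
    observations; ambiguous sources are dropped entirely.
    (Illustrative scheme, not the project's actual algorithm.)"""
    by_source = {}
    for src, dst in edges:
        by_source.setdefault(src, Counter())[dst] += 1
    result = {}
    for src, counts in by_source.items():
        dst, n = counts.most_common(1)[0]
        if n / sum(counts.values()) >= min_ratio:
            result[src] = dst
    return result

edges = [("/old", "/new"), ("/old", "/new"), ("/flaky", "/a"), ("/flaky", "/b")]
print(best_targets(edges))
# {'/old': '/new'}  -- '/flaky' is ambiguous (50/50) and dropped
```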
The project intentionally favors:
- Simplicity
- Maintainability
- Ease of experimentation
over:
- Perfect site replication
- Exhaustive browser instrumentation
Some SPAs rerender login forms during hydration.
Increase the rerender timeout to allow DOM stabilization.
Common causes:
- Routes exposed only via buttons or JS logic
- Routes hidden in JSON menus
- Conditional client routing
Possible fixes:
- Add include globs/regexes
- Add manual entrypoints via `--additional-crawl-entrypoint-url`
- Extend URL extraction logic for project-specific patterns
Assets are mirrored using Playwright request interception.
Some resource types cannot be reliably captured and will be skipped.
HTML document responses are intentionally stored from DOM snapshots in `out/pages/**` instead of being mirrored from raw route interception responses.
Recommended configuration:
- `concurrency = 1`
- Single session pool
- No session rotation
This project is:
- Experimental
- Evolving
- Intentionally pragmatic rather than complete
It is useful for:
- Offline mirrors
- Testing mirrored SPAs
- Migration experiments
- Static hosting tests
It is not intended as a universal or production-grade website archiving solution.
Only crawl content you are authorized to access and store.
Respect:
- Website terms of service
- Privacy rules
- Copyright and licensing restrictions
Do not use this tool to extract or redistribute restricted data without permission.