SPA crawler

A CLI-friendly crawler that can optionally authenticate, crawl a website, and mirror pages and static assets into a local directory so the result can be served by a static web server.

The project targets modern SPAs and Next.js-style applications where content is rendered dynamically and traditional tools like wget or curl often fail to capture fully working pages.

Features

Optional authentication flow
- Fills in login/password inputs
- Submits the form
- Waits for redirect after successful login
Playwright-based rendering
- Supports SPAs, hydration, and client-side routing
- Handles dynamically loaded content
Mirrors HTML pages
- Saved to out/pages/**/index.html
Mirrors many static assets
- Examples: /_next/**, *.css, *.js, images, fonts, etc.
- Mirrors same-origin non-HTML document payloads as assets when frameworks use them for data transport
- Saved to out/assets/** and out/assets_q/**
Single browser session / session pool
- Designed to improve reliability during authenticated crawling
Additional URL discovery
- Extracts candidate links from the rendered DOM
- Reads Next.js __NEXT_DATA__ from the page
- Parses /_next/data/**.json payloads from intercepted responses
- Helps discover routes referenced in JSON/JS, not only in <a> tags
Redirect behavior capture (hybrid)
- Collects HTTP redirect edges from observed 3xx chains
- Captures client-side redirects when the loaded URL changes in the browser
- Exports high-confidence Caddy redirect rules to out/redirects.caddy
- Creates HTML redirect pages for missing source pages as a static-hosting fallback

Output structure

out/
  redirects.caddy
  pages/
    index.html
    nested_page/index.html
    ...
  pages_q/
    search/
      page=2/index.html
    ...
  assets/
    _next/static/...
    logo.svg
    favicon.ico
    ...
  assets_q/
    _next/static/chunk.js/
      v=123
    ...

Typical serving layout:

out/pages → HTML root
out/pages_q → query HTML variants (e.g. /search?page=2)
out/assets → static files root (or mounted under /, depending on server configuration)
out/assets_q → query-based static variants (e.g. /app.js?v=123)
out/redirects.caddy → generated Caddy redir rules from observed redirects
out/pages and out/pages_q may include generated HTML redirect pages for missing sources

Install

Install uv
Install dependencies:
```
make install-deps
```

Usage

The crawler is implemented as:

An async Python function crawl(config)
A Typer CLI wrapper

Basic flow:

make help

Then review these files for practical usage examples and deployment templates:

Makefile
Dockerfile.spa-crawler
Dockerfile.spa
docker-compose.spa.yml
Caddyfile

Published crawler image

On every push to main, GitHub Actions publishes the crawler image to GHCR:

ghcr.io/hu553in/spa-crawler:latest
ghcr.io/hu553in/spa-crawler:sha-<commit>

Build source:

Dockerfile.spa-crawler

The image expects crawler arguments at runtime. In practice you usually want to mount out/ so mirrored files remain on the host after the container exits.

Minimal example:

docker run --rm \
  -v "$(pwd)/out:/app/out" \
  ghcr.io/hu553in/spa-crawler:latest \
  --base-url https://example.com \
  --no-login-required

Authenticated example:

docker run --rm \
  -v "$(pwd)/out:/app/out" \
  -e SPA_CRAWLER_LOGIN="$SPA_CRAWLER_LOGIN" \
  -e SPA_CRAWLER_PASSWORD="$SPA_CRAWLER_PASSWORD" \
  -e CRAWLEE_MEMORY_MBYTES=20000 \
  -e CRAWLEE_MAX_USED_MEMORY_RATIO=0.95 \
  ghcr.io/hu553in/spa-crawler:latest \
  --base-url https://example.com \
  --login-required \
  --login-path /login \
  --login-input-selector "input[name='login']:visible" \
  --password-input-selector "input[name='password']:visible"

Notes:

SPA_CRAWLER_LOGIN and SPA_CRAWLER_PASSWORD are read from the environment.
--base-url still has to be passed explicitly; otherwise Typer prompts for it.
The published Docker image supports only headless mode. --no-headless is rejected in containers.
Mounting /app/out is strongly recommended. Without it, crawl output stays only inside the container filesystem.
/app/storage is declared as a volume for Crawlee runtime state. Mount it too if you want to inspect or persist that state across runs.

CLI filtering defaults

Include links: {base_url}/** when no include filters are provided
Exclude links: login regex only (.*{login_path}.*) when --login-required is set
API path prefixes: empty by default; add --api-path-prefix values if you want API routes excluded from page discovery, asset mirroring, and redirect collection

Deployment of mirrored site

This project only produces a mirrored static copy of a website. You are responsible for deciding how and where to deploy or serve it.

Example deployment stack included:

Dockerfile.spa
docker-compose.spa.yml
Caddyfile
Environment configuration via .env

Caddyfile imports /srv/redirects.caddy. Dockerfile.spa creates a no-op placeholder for this file when it is absent. Caddyfile also normalizes non-GET/HEAD methods by redirecting them to GET with 303 on the same URI (to avoid 405 Method Not Allowed errors on static mirrors).

To use HTTP basic authentication with Caddy, generate a password hash:

caddy hash-password

Then set the environment variables used by Caddyfile:

ENABLE_BASIC_AUTH=true
BASIC_AUTH_USER=<username>
BASIC_AUTH_PASSWORD_HASH=<output from previous command>

If you want a server other than Caddy

The repository ships only a Caddy serving configuration. For any other server, you must reimplement the same URL-to-filesystem lookup behavior.

What must be ported from the Caddyfile logic:

Page lookup without query: /pages{path} → /pages{path}/index.html → /pages{path}.html
Page lookup with query: /pages_q{path}/{query} → /pages_q{path}/{query}/index.html → /pages_q{path}/{query}.html (with fallback to non-query pages)
Asset lookup without query: /assets{path} → /assets{path}.* → /assets{path}.bin
Asset lookup with query: /assets_q{path}/{query} (with fallback to non-query assets)
Header policy: immutable cache for /_next/*, no-cache for mirrored HTML pages
Method policy: non-GET/HEAD requests are redirected with 303 to the same URI before static lookup

Redirect support must also be ported:

Current export is Caddy-specific (out/redirects.caddy with redir directives)
For another server, add a converter step (from observed redirects to that server's syntax) or implement a new Python exporter
HTML redirect pages in out/pages and out/pages_q are server-agnostic fallbacks and should still work if lookup is ported correctly

For Nginx specifically, reproducing query-based lookup ({query} in the filesystem path) and fallback chains usually requires njs or careful map + try_files composition.

Limitations

This is a hobby / experimental project. It aims to handle modern SPAs reasonably well but is not a fully robust site-mirroring solution.

Session configuration

Session behavior is currently hardcoded. There are no CLI arguments to tune session pool settings or advanced browser session parameters.

Authenticated crawling may require manual code adjustments.

High parallelism and memory usage

At high concurrency levels the crawler may:

Consume large amounts of RAM
Trigger repeated warnings about memory limits
Become unstable or slower

Recommended approach:

Use low concurrency
For authenticated crawling, use concurrency = 1

Hardware tuning

You can tune Crawlee memory behavior via environment variables:

CRAWLEE_MEMORY_MBYTES: absolute memory limit (in MB) used by Crawlee autoscaling
CRAWLEE_MAX_USED_MEMORY_RATIO: fraction of that limit that can be used before throttling

Example .env values:

CRAWLEE_MEMORY_MBYTES=20000
CRAWLEE_MAX_USED_MEMORY_RATIO=0.95

Tuning guidance:

Lower values can reduce OOM risk on smaller machines
Higher values can improve throughput on larger machines, but may increase RAM pressure

Large number of HTTP errors in output

During crawling you may see large amounts of:

404 responses
Failed asset requests
Transient navigation errors

This is expected behavior for modern SPAs and does not necessarily indicate crawler failure.

The crawler intentionally prioritizes successful page mirroring over eliminating every failed request.

Not all assets can be mirrored

The crawler downloads many static assets but cannot guarantee complete asset capture.

Some resources may be skipped due to:

Streaming or opaque responses
Dynamically generated URLs
Authentication-protected resources
Browser caching behavior
Implementation complexity
Unsafe or ambiguous query strings for static-server mapping

The mirrored site may occasionally require manual fixes.

URL discovery is heuristic

The crawler attempts to discover routes using:

DOM extraction
__NEXT_DATA__ parsing
/_next/data/**.json parsing

However, if a route is only accessible via complex client logic or hidden interactions, it may never be discovered automatically.

Manual entrypoints may be required.

Redirect export is observational

out/redirects.caddy and generated HTML redirect pages are based only on redirects observed during the crawl.

This means:

Paths never visited during the crawl will not have redirect rules
Ambiguous source URLs may be ignored if confidence is below threshold
Only one best target per source URL is exported

Stability vs. completeness trade-off

The project intentionally favors:

Simplicity
Maintainability
Ease of experimentation

over:

Perfect site replication
Exhaustive browser instrumentation

Tips / troubleshooting

SPA login inputs reset while typing

Some SPAs rerender login forms during hydration.

Increase the rerender timeout to allow DOM stabilization.

Pages exist but never get crawled

Common causes:

Routes exposed only via buttons or JS logic
Routes hidden in JSON menus
Conditional client routing

Possible fixes:

Add include globs/regexes
Add manual entrypoints via --additional-crawl-entrypoint-url
Extend URL extraction logic for project-specific patterns

Assets missing / CSS not loading

Assets are mirrored using Playwright request interception.

Some resource types cannot be reliably captured and will be skipped.

HTML document responses are intentionally stored from DOM snapshots in out/pages/** instead of being mirrored from raw route interception responses.

Unexpected logout or broken authentication

Recommended configuration:

concurrency = 1
Single session pool
No session rotation

Development status

This project is:

Experimental
Evolving
Intentionally pragmatic rather than complete

It is useful for:

Offline mirrors
Testing mirrored SPAs
Migration experiments
Static hosting tests

It is not intended as a universal or production-grade website archiving solution.

Ethics and legality

Only crawl content you are authorized to access and store.

Respect:

Website terms of service
Privacy rules
Copyright and licensing restrictions

Do not use this tool to extract or redistribute restricted data without permission.

Name		Name	Last commit message	Last commit date
Latest commit History 50 Commits
.github		.github
spa_crawler		spa_crawler
tests		tests
.editorconfig		.editorconfig
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Caddyfile		Caddyfile
Dockerfile.spa		Dockerfile.spa
Dockerfile.spa-crawler		Dockerfile.spa-crawler
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
docker-compose.spa.yml		docker-compose.spa.yml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SPA crawler

Features

Output structure

Install

Usage

Published crawler image

CLI filtering defaults

Deployment of mirrored site

If you want a server other than Caddy

Limitations

Session configuration

High parallelism and memory usage

Hardware tuning

Large number of HTTP errors in output

Not all assets can be mirrored

URL discovery is heuristic

Redirect export is observational

Stability vs. completeness trade-off

Tips / troubleshooting

SPA login inputs reset while typing

Pages exist but never get crawled

Assets missing / CSS not loading

Unexpected logout or broken authentication

Development status

Ethics and legality

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SPA crawler

Features

Output structure

Install

Usage

Published crawler image

CLI filtering defaults

Deployment of mirrored site

If you want a server other than Caddy

Limitations

Session configuration

High parallelism and memory usage

Hardware tuning

Large number of HTTP errors in output

Not all assets can be mirrored

URL discovery is heuristic

Redirect export is observational

Stability vs. completeness trade-off

Tips / troubleshooting

SPA login inputs reset while typing

Pages exist but never get crawled

Assets missing / CSS not loading

Unexpected logout or broken authentication

Development status

Ethics and legality

About

Topics

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages