Save web pages as Markdown, HTML, screenshots, metadata, and searchable local archives.
WebCapsule is a local-first web archiving CLI for researchers, writers, journalists, students, and digital preservation workflows. It turns any URL into a durable, offline-readable folder with clean Markdown, original HTML, metadata, a full-page screenshot, checksums, and SQLite full-text search.
Use it when bookmarks are not enough: archive articles, preserve web pages, save online research, build a personal knowledge base, or keep link rot from destroying your notes. Everything stays on your own machine. No cloud. No account. No lock-in.
If WebCapsule disappears tomorrow, your archive stays readable forever.
Keywords: web archiver, local web archive, offline web archive, save web pages as Markdown, website screenshot CLI, personal archive, digital preservation, link rot, Python archiving tool, searchable article archive.
| Bookmarks | WebCapsule |
|---|---|
| Save where something was | Save what it said |
| Break when links die | Survive forever on your disk |
| Need the original site | Need nothing but a file explorer |
| Format changes whenever a platform wants | Markdown and JSON - always the same |
- Saving web pages as Markdown for Obsidian, notes, research, and long-term reading
- Preserving articles before links break, pages disappear, or platforms change
- Creating a local, searchable archive without a hosted service
- Capturing full-page screenshots plus original HTML for reference
- Keeping citation-friendly metadata such as title, author, date, source URL, and tags
- Exporting an archive in portable ZIP or TAR formats
WebCapsule is not a public crawler like the Wayback Machine and it is not a read-it-later cloud service like Pocket or Instapaper. It is closer to a personal, local alternative to bookmarking: one command saves the content itself in open formats.
If you need a self-hosted web archiving server, tools such as ArchiveBox may be a better fit. If you want a small Python CLI that produces human-readable Markdown folders you can keep forever, WebCapsule is built for that.
archives/
climate/
2026-05-20-the-great-barrier-reef-article/
README.md <- open this first - human-friendly summary
content.md <- full article in clean Markdown
original.html <- exact HTML snapshot
screenshot.png <- full-page screenshot
metadata.json <- structured metadata (title, author, date...)
checksum.sha256 <- file integrity hashes
Every file is readable without WebCapsule. Open content.md in any text editor. Done.
From a local checkout:
pip install -e ".[dev]"
playwright install chromium # one-time browser download (~150 MB)After a PyPI release, regular users will be able to install with pip install webcapsule.
webcapsule save https://example.com/articleWith options:
# Assign to a collection and add tags
webcapsule save https://example.com/article --collection climate --tag science --tag ocean
# Force headless browser rendering (for JavaScript-heavy sites)
webcapsule save https://example.com/spa --browser
# Skip screenshot (faster)
webcapsule save https://example.com/article --no-screenshot
# Check what would be archived without writing files
webcapsule save https://example.com/article --dry-run
# Save the same URL again even if it was archived recently
webcapsule save https://example.com/article --forceBy default, WebCapsule skips re-saving the same URL if it was archived in the last 24 hours.
webcapsule search "climate change"
webcapsule search "title:python" # column-filtered search
webcapsule search "ocean AND reef" # boolean operatorswebcapsule list
webcapsule list --limit 10webcapsule open ~/webcapsule/general/2026-05-20-example-articleIf the path is a capsule folder, WebCapsule opens its README.md.
webcapsule export ~/backups/my-archive.zip
webcapsule export ~/backups/my-archive --format tarwebcapsule rebuild-indexBy default, capsules are saved to ~/webcapsule/.
Change it with an environment variable:
export WEBCAPSULE_ARCHIVE=/mnt/external-drive/my-archive- Local-first - everything stays on your machine
- Offline-first - works without any internet after saving
- Open formats - Markdown, JSON, HTML, PNG - readable anywhere
- No vendor lock-in - no accounts, no subscriptions, no APIs
- Human-readable - a non-technical person can open and understand the archive
- AI optional - never required, never forced
- Useful even if abandoned - the archive outlives the tool
WebCapsule is intentionally modular. Each concern lives in its own file:
| Module | Responsibility |
|---|---|
fetcher.py |
Retrieve HTML (httpx fast path, Playwright fallback) |
parser.py |
Extract readable content (readability-lxml + BeautifulSoup) |
markdown_gen.py |
Convert parsed HTML to Markdown |
metadata.py |
Scrape OpenGraph, JSON-LD, and meta tags |
screenshot.py |
Full-page PNG via Playwright |
archive.py |
Write the capsule folder and compute checksums |
search.py |
SQLite FTS5 local search index, including body text |
cli.py |
Typer-powered command-line interface |
- Python 3.11+
- Playwright - headless browser for JS-heavy pages and screenshots
- readability-lxml - article extraction (same algorithm as Firefox Reader View)
- BeautifulSoup4 - HTML parsing fallback and metadata scraping
- markdownify - HTML to Markdown conversion
- httpx - fast HTTP client
- Typer + Rich - CLI and pretty output
- SQLite FTS5 - local full-text search (no external search server)
WebCapsule is built for:
- Personal research and note-taking
- Journalism and fact-checking
- Education and study
- Cultural preservation
- Digital memory
It is not designed to bypass paywalls, violate robots.txt, or enable mass scraping. Archive responsibly.
See ROADMAP.md for the full plan.
Short version:
- MVP CLI (save, search, list, export)
- Full-page screenshots
- SQLite full-text search
- Full-body text indexing
- Checksum integrity
- Recent-URL deduplication
- Browser extension
- Obsidian vault export
- Wayback Machine integration
- WARC support
- Local web UI
- Citation generation (APA, MLA, Chicago)
- Archive deduplication
Contributions are welcome - see CONTRIBUTING.md to get started.
The single guiding rule: don't sacrifice simplicity.
MIT - see LICENSE.