A Git-like web page archival system that stores only deltas between versions, providing an efficient and scalable solution for tracking website changes over time.
- Efficient Delta Storage: Instead of storing full copies of each page version, the archiver stores only the differences (deltas) between them. This significantly reduces storage requirements, especially for frequently updated sites.
- HTTP Conditional Requests: The archiver uses `ETag` and `Last-Modified` headers to avoid re-downloading pages that have not changed, minimizing bandwidth usage and improving performance.
- Ad and Tracker Stripping: A built-in mechanism removes common ad and tracker scripts from archived pages, providing a cleaner and more secure offline reading experience.
- Automatic Asset Embedding: All external assets, such as CSS, JavaScript, and images, are automatically embedded into the HTML as data URIs. This creates a complete, self-contained offline archive of each page.
- SQLite Backend: The system uses a simple and portable SQLite database to store all page data, including versions, metadata, and deltas.
- Dockerized and Automated: The entire application is containerized for easy deployment and management. The archiver runs on an automated cron schedule, ensuring that websites are regularly checked for updates.
- Web Interface: A Flask-based web interface allows you to view archived pages, compare versions, and monitor the status of all tracked sites.
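The conditional-request feature above can be sketched as follows. This is a minimal illustration, not the archiver's actual code; the helper names `conditional_headers` and `fetch_if_modified` are hypothetical, and it assumes the previous response's validators were saved alongside the archived version.

```python
import requests

def conditional_headers(etag=None, last_modified=None):
    """Build validator headers from the previously stored response metadata."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def fetch_if_modified(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified), or None on 304 Not Modified."""
    resp = requests.get(url, headers=conditional_headers(etag, last_modified), timeout=30)
    if resp.status_code == 304:
        return None  # page unchanged; nothing new to archive
    # Store these validators for the next run's conditional request.
    return resp.text, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
```

When a server supports neither validator, the request simply returns `200` with the full body, so the archiver degrades gracefully to an unconditional fetch.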
- Backend: Python, Flask
- Database: SQLite
- Containerization: Docker, Docker Compose
- Libraries:
  - `requests` for HTTP requests
  - `BeautifulSoup` for HTML parsing
  - `diff-match-patch` for generating deltas
  - `gzip` for data compression
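To illustrate how BeautifulSoup is used for the asset-embedding feature, here is a hedged sketch that inlines `<img>` sources as data URIs. The function `embed_images` and the injected `fetch_bytes` callable are hypothetical names for this example; the real archiver also embeds CSS and JavaScript, which is omitted here for brevity.

```python
import base64
from bs4 import BeautifulSoup

def embed_images(html, fetch_bytes):
    """Replace each <img> src with a data: URI.

    `fetch_bytes` is any callable returning (content_bytes, mime_type)
    for a URL, so this sketch stays free of network access.
    """
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img", src=True):
        data, mime = fetch_bytes(img["src"])
        encoded = base64.b64encode(data).decode("ascii")
        img["src"] = f"data:{mime};base64,{encoded}"
    return str(soup)
```

Injecting the fetcher also makes the embedding step easy to unit-test with canned bytes instead of live HTTP requests.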
The application is divided into two main services:
- `web`: A Flask web server that provides a user-friendly interface to view the archived pages.
- `archiver`: A Python script that runs on a cron schedule (e.g., every 15 minutes) to fetch and archive the pages listed in `sites.json`.
When the archiver runs, it performs the following steps for each URL:
- It sends a conditional HTTP request to check if the page has been modified since the last backup.
- If the page has changed, it downloads the new content and strips out any ads or trackers.
- It generates a "delta" by comparing the new version with the last archived version.
- This delta is then compressed and stored in the SQLite database.
This delta-based approach ensures that only the changes are stored, making the system highly efficient.
1. Clone the repository:

   ```sh
   git clone https://github.com/your-username/web-page-archiver.git
   cd web-page-archiver
   ```

2. Configure the archiver:

   - Add the URLs you want to monitor to the `sites.json` file:

     ```json
     {
       "https://example.com": null
     }
     ```

   - (Optional) Create a `.env` file to configure the subdomain limit:

     ```
     SUBDOMAIN_LIMIT=10
     ```

3. Build and run the Docker containers:

   ```sh
   docker-compose up --build
   ```

- The archiver will automatically run every 15 minutes to fetch and archive the pages.
- The web interface will be available at `http://localhost:4444`.
- Improved diff visualization: Implement a more user-friendly side-by-side diff view in the web interface.
- Full-text search: Add the ability to search the content of all archived pages.
- Support for more content types: Add support for archiving PDFs, images, and other file types.