Author: @SagarBiswas-MultiHAT
Category: Educational Web Crawling & Client-Side Security Analysis
Status: Learning-grade, interview-safe, portfolio-ready
“This project performs depth-controlled crawling and client-side source reconstruction, capturing everything a browser can observe from a given URL, while intentionally respecting server-side trust boundaries.”
Tested example: `python sourceDownloader.py https://sagarbiswas-multihat.github.io/ --depth 2`
This project is an educational website source code downloader and crawler that extracts and reconstructs everything a browser can observe from a given URL.
It crawls a site with depth-controlled BFS, downloads client-visible resources (HTML, CSS, JS, images, fonts, PDFs, etc.), and rewrites links so the pages work offline, even on nested paths like /blog/*.
The tool respects server trust boundaries and does not attempt to fetch backend code, databases, or private data.
This tool captures everything a browser can retrieve from a URL:
- HTML pages (multiple pages via crawling)
- Linked CSS files
- JavaScript files
- Images (including `srcset`)
- Fonts and media files
- PDFs and other static assets
- XML files (e.g., `sitemap.xml`)
- Correct offline reconstruction via path rewriting
Ideal for:
- Learning how real websites are structured
- Offline inspection and analysis
- Client-side security research
- Understanding exposure and attack surface
- Portfolio demonstrations of crawling logic
By design, this project does not:
- Download backend source code (PHP, Python, Node.js, etc.)
- Access databases or APIs that require authentication
- Execute JavaScript (SPA/React/Vue rendering)
- Bypass authentication, paywalls, or access controls
- Retrieve secrets, tokens, or server configuration
These limitations are intentional and make the project accurate and interview-safe.
- Depth-controlled crawling (`--depth 2` or `--depth 1-2`)
- Breadth-First Search (BFS) for reliable depth measurement
- Crawl vs analyze separation to control what gets saved
- Same-origin enforcement (no external domain crawling)
- Offline-safe path rewriting for nested pages
- Asset handling for `src`, `href`, and `srcset`
- URL decoding (`%20` → spaces)
- Query collision handling via hash suffix
- Content-type aware saving for missing extensions
- XML-aware parsing for sitemaps and RSS
- Export visited URLs (`--export-urls`) to a text file for analysis
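The URL-decoding and query-collision ideas above can be sketched as follows. This is a minimal illustration, not the tool's actual code; `local_filename` is a hypothetical helper name.

```python
import hashlib
from urllib.parse import urlsplit, unquote

def local_filename(url: str) -> str:
    """Hypothetical helper: map a URL to a collision-safe local filename.

    Mirrors the listed ideas: percent-decoding (%20 -> space) and a short
    hash suffix so URLs differing only in their query string do not
    overwrite each other on disk.
    """
    parts = urlsplit(url)
    path = unquote(parts.path)
    # Last path segment, falling back to index.html for directory URLs.
    name = path.rstrip("/").rsplit("/", 1)[-1] or "index.html"
    if parts.query:
        # Stable short hash of the query string, inserted before the extension.
        digest = hashlib.sha1(parts.query.encode()).hexdigest()[:8]
        if "." in name:
            stem, ext = name.rsplit(".", 1)
            name = f"{stem}.{digest}.{ext}"
        else:
            name = f"{name}.{digest}"
    return name
```

With this scheme, `style.css?v=1` and `style.css?v=2` land in distinct local files instead of clobbering each other.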
Depth is measured in link hops from the base URL:
- `0` → only the base URL
- `1` → base URL + pages directly linked from it
- `2` → links from depth-1 pages
- `1-2` → crawl broadly, analyze only depth 1–2
Example structure:

```
Depth 0
└── https://example.com
Depth 1
├── /about
├── /blog
└── /login
Depth 2
├── /blog/post-1
├── /blog/post-2
└── /about/team
```
Depth control reduces noise and focuses on pages where real-world issues usually live.
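The BFS depth logic described above can be sketched roughly like this. It is illustrative only; `get_links` stands in for the tool's real fetch-and-parse step, and the function name is not the tool's actual API.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def bfs_crawl(base_url, max_depth, get_links):
    """Illustrative breadth-first crawl with per-page depth tracking."""
    origin = urlparse(base_url).netloc
    visited = {base_url}
    queue = deque([(base_url, 0)])
    while queue:
        url, depth = queue.popleft()
        yield url, depth
        if depth >= max_depth:
            continue  # do not expand links beyond the requested depth
        for link in get_links(url):
            if link.startswith("#"):
                continue  # fragment-only links are ignored
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc != origin:
                continue  # same-origin enforcement
            if absolute not in visited:
                visited.add(absolute)
                queue.append((absolute, depth + 1))

# Demo on a tiny in-memory link graph:
links = {"https://example.com/": ["/about", "/blog"],
         "https://example.com/blog": ["/blog/post-1"]}
for url, depth in bfs_crawl("https://example.com/", 1, lambda u: links.get(u, [])):
    print(depth, url)
# 0 https://example.com/
# 1 https://example.com/about
# 1 https://example.com/blog
```

Because a queue (FIFO) is used instead of a stack, each page is first reached along a shortest link path, which is what makes the reported depth reliable.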
Recommended: use a virtual environment.

```
python -m venv .venv
```

Activate:

Windows (PowerShell):

```
.venv\Scripts\Activate.ps1
```

Linux / macOS:

```
source .venv/bin/activate
```

Install dependencies:

- Install from the bundled `requirements.txt` (recommended):

  ```
  pip install -r requirements.txt
  ```

- Or install packages individually (equivalent):

  ```
  pip install beautifulsoup4 lxml
  ```

Notes:

- `lxml` is optional but recommended (it provides a robust XML parser and removes XML parsing warnings).
- Use the Python provided in your `.venv` when running the `pip` command to ensure packages install into the virtual environment.
```
python sourceDownloader.py <BASE_URL> --depth <DEPTH> [--export-urls]
```

| Option | Description |
|---|---|
| `url` | Base URL to crawl (include http:// or https://) |
| `--depth` | Crawl depth (e.g., `2` or `1-2`). Default: `0` |
| `--export-urls` | Export all visited URLs to `urls.txt` in the output directory |
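A minimal sketch of how the `--depth` value (`2` or `1-2`) might be parsed, assuming an argparse-style CLI; `parse_depth` is a hypothetical name, not necessarily the tool's internal one.

```python
import argparse

def parse_depth(spec: str) -> tuple:
    """Hypothetical: turn "2" or "1-2" into an (analyze_min, crawl_max) pair."""
    if "-" in spec:
        lo, hi = spec.split("-", 1)
        return (int(lo), int(hi))  # e.g. "1-2": crawl to 2, analyze 1..2
    return (0, int(spec))          # e.g. "2": crawl and analyze 0..2

parser = argparse.ArgumentParser()
parser.add_argument("url", help="Base URL to crawl (include http:// or https://)")
parser.add_argument("--depth", default="0")
parser.add_argument("--export-urls", action="store_true")

args = parser.parse_args(["https://example.com", "--depth", "1-2"])
print(parse_depth(args.depth))  # (1, 2)
```

Returning a pair is one way to model the crawl-vs-analyze separation: crawl up to the maximum, but only save pages whose depth falls inside the range.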
```
# Crawl only the base URL
python sourceDownloader.py https://example.com --depth 0

# Crawl base URL and pages directly linked from it
python sourceDownloader.py https://example.com --depth 1

# Crawl depth 1-2, analyze pages at depth 1 and 2
python sourceDownloader.py https://example.com --depth 1-2

# Crawl and export all visited URLs to urls.txt
python sourceDownloader.py https://example.com --depth 2 --export-urls
```

Output structure:

```
example_com/
├── index.html
├── assets/
│   ├── css/
│   ├── js/
│   ├── images/
│   └── fonts/
├── blog/
│   ├── post-1.html
│   └── post-2.html
└── urls.txt   (when using --export-urls)
```
All pages open offline without broken CSS or images.
Many crawlers rewrite assets relative to the project root, which breaks pages like /blog/post.html.
This project rewrites assets relative to each HTML file’s directory, so both root and nested pages load properly.
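The file-relative rewriting described above can be illustrated with `posixpath.relpath`. This is a sketch, and `rewrite_href` is a hypothetical helper, not the tool's actual function.

```python
import posixpath

def rewrite_href(page_local_path: str, asset_local_path: str) -> str:
    """Illustrative: compute a link from a saved HTML file to a saved asset.

    Rewriting relative to each page's own directory (not the project root)
    is what keeps nested pages like blog/post-1.html working offline.
    """
    page_dir = posixpath.dirname(page_local_path) or "."
    return posixpath.relpath(asset_local_path, start=page_dir)

# A root page and a nested blog post reference the same stylesheet:
print(rewrite_href("index.html", "assets/css/style.css"))
# assets/css/style.css
print(rewrite_href("blog/post-1.html", "assets/css/style.css"))
# ../assets/css/style.css
```

A root-relative rewrite would emit `assets/css/style.css` in both cases, which breaks as soon as the page lives one directory down; the `../` prefix is what root-relative rewriting gets wrong.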
- Assets are saved as binary, so images and PDFs stay intact.
- Pages are saved as HTML, with rewritten local paths.
- External URLs (GitHub badges, CDNs) are kept external.
- Fragment-only links (`#about`) are ignored to reduce crawl noise.
- Images missing on nested pages → fixed by file-relative rewriting
- Responsive images missing → `srcset` entries are downloaded and rewritten
- Resume/PDF not opening → binary assets are saved directly
- XML warnings → install `lxml` or ignore (HTML fallback is handled)
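As an illustration of the `srcset` handling mentioned above, a simple candidate-URL extractor might look like this. It is a sketch that ignores edge cases (such as commas inside data URIs) and is not the tool's exact parser.

```python
def parse_srcset(value: str) -> list:
    """Extract candidate URLs from an HTML srcset attribute value.

    Each comma-separated candidate is "URL [descriptor]"; only the URL
    part is needed for downloading.
    """
    urls = []
    for candidate in value.split(","):
        candidate = candidate.strip()
        if candidate:
            urls.append(candidate.split()[0])  # drop "2x" / "480w" descriptors
    return urls

print(parse_srcset("img/small.png 480w, img/large.png 1024w"))
# ['img/small.png', 'img/large.png']
```

Each extracted URL can then be downloaded like any other asset and the attribute rewritten with the local paths, which is what fixes the "responsive images missing" symptom above.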
This tool is for educational and authorized testing only. Always respect terms of service and robots policies.
- Headless rendering (Playwright) for JS-heavy sites
- robots.txt enforcement
- JSON crawl reports
- Security header analysis
- Rate limiting and concurrency
- Authentication support for authorized environments
This project is designed to be honest, technically correct, and impressive without exaggeration. It demonstrates strong understanding of web architecture, crawling logic, and security boundaries.

