A compact, fast URL analysis pipeline:
- Extract URLs from arbitrary text files
- Classify domains using suffix-based “site runner” rules (government, edu, private, etc.)
- Optional HTTP checks (status, redirect, CAPTCHA/human-check heuristic)
- Output results as CSV or JSONL
- Standalone URL classifier (classify-url)
- Batch classification mode (classify)
- Supports rule presets (Japan/EU/global), custom YAML rules, explain mode, quiet mode
- Classification: Assigns categories (e.g., government, education) based on domain suffix rules.
- HTTP Verification: Checks reachability and captures status codes.
- Soft 404 Detection: Identifies pages that return a 200 OK status but contain "Page Not Found" text.
- Human-Check Detection: Flags URLs that likely lead to CAPTCHA or bot-detection screens.
Many websites are configured to return a standard 200 OK status even when a page is missing, often displaying a custom "not found" message to users. urlcheck-smith detects this by scanning the first 2000 characters of the response for common markers like:
- "page not found"
- "error 404"
- "the page you requested cannot be found"
If a marker is found, the soft_404_detected field in the output is set to True, allowing you to filter out these "ghost" pages from your results.
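
The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the marker list and the 2000-character window come from the description above, while the function name `looks_like_soft_404`, the constant names, and the case-insensitive comparison are assumptions made for the example.

```python
# Hypothetical sketch of a soft-404 check; names are illustrative only.
SOFT_404_MARKERS = (
    "page not found",
    "error 404",
    "the page you requested cannot be found",
)
SCAN_LIMIT = 2000  # only the start of the response body is inspected


def looks_like_soft_404(status_code: int, body: str) -> bool:
    """Return True when a 200 response looks like a 'not found' page."""
    if status_code != 200:
        # A real 404 (or any other status) is not a *soft* 404.
        return False
    # Assumption: matching is case-insensitive.
    head = body[:SCAN_LIMIT].lower()
    return any(marker in head for marker in SOFT_404_MARKERS)


if __name__ == "__main__":
    print(looks_like_soft_404(200, "<h1>Page Not Found</h1>"))
    print(looks_like_soft_404(200, "<h1>Welcome</h1>"))
```

Limiting the scan to the start of the body keeps the check cheap, at the cost of missing a marker that appears later in a long page.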
python3 -m venv .venv
. .venv/bin/activate
pip install -e .[dev]
pytest

urlcheck-smith scan sample.txt -o urls.csv

urlcheck-smith scan sample.txt \
  --no-http \
  --format jsonl \
  -o urls.jsonl

urlcheck-smith scan notes.txt --no-http -o urls_wo_status.csv

urlcheck-smith scan urls.txt \
  --rules my_rules.yaml \
  -o result.csv

urlcheck-smith scan urls.txt --preset japan -o out.csv
urlcheck-smith scan urls.txt --preset eu -o out.csv
urlcheck-smith scan urls.txt --preset global -o out.csv

urlcheck-smith classify-url https://www.soumu.go.jp/

urlcheck-smith classify-url https://www.soumu.go.jp/ --explain

Output example:
{
  "url": "https://www.soumu.go.jp/",
  "base_url": "www.soumu.go.jp",
  "category": "government",
  "explain": {
    "matched_suffix": ".go.jp",
    "category": "government"
  }
}

urlcheck-smith classify-url https://www.soumu.go.jp/ --quiet

urlcheck-smith classify-url https://www.gov.uk/ --preset eu
urlcheck-smith classify-url https://policy.example.com/ --rules org_rules.yaml

Input file should contain one URL per line.

urlcheck-smith classify urls.txt -o classified.csv

urlcheck-smith classify urls.txt --format jsonl -o out.jsonl

urlcheck-smith classify urls.txt --quiet

urlcheck-smith classify urls.txt --explain -o out.jsonl

suffix_rules:
  - suffix: ".go.jp"
    category: government
  - suffix: ".example.com"
    category: internal
default_category: private

- --preset japan
- --preset eu
- --preset global

Each preset corresponds to a YAML file under urlcheck_smith/data/.
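
To make the rule format concrete, here is a rough sketch of how suffix rules like the ones above could be applied to a URL. It is an illustration under stated assumptions rather than the package's real code: the `classify` function, the in-memory `RULES` dictionary (mirroring the YAML example), and the longest-suffix-first ordering are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical in-memory copy of the YAML rule example above.
RULES = {
    "suffix_rules": [
        {"suffix": ".go.jp", "category": "government"},
        {"suffix": ".example.com", "category": "internal"},
    ],
    "default_category": "private",
}


def classify(url: str, rules: dict = RULES) -> dict:
    """Classify a URL by the first (longest) host suffix that matches."""
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: longer suffixes win, so specific rules beat general ones.
    ordered = sorted(
        rules["suffix_rules"], key=lambda r: len(r["suffix"]), reverse=True
    )
    for rule in ordered:
        if host.endswith(rule["suffix"]):
            return {
                "url": url,
                "category": rule["category"],
                "matched_suffix": rule["suffix"],
            }
    # No rule matched: fall back to the default category.
    return {
        "url": url,
        "category": rules["default_category"],
        "matched_suffix": None,
    }


if __name__ == "__main__":
    print(classify("https://www.soumu.go.jp/"))
    print(classify("https://policy.example.com/"))
```

With these sample rules, a host that matches no suffix falls through to `default_category`.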
make install
make test

License: MIT