
yeiichi/urlcheck-smith


urlcheck-smith


A compact, fast URL analysis pipeline:

  • Extract URLs from arbitrary text files
  • Classify domains using suffix-based “site runner” rules (government, edu, private, etc.)
  • Optional HTTP checks (status, redirect, CAPTCHA/human-check heuristic)
  • Output results as CSV or JSONL
  • Standalone URL classifier (classify-url)
  • Batch classification mode (classify)
  • Supports rule presets (Japan/EU/global), custom YAML rules, explain mode, quiet mode
  • Classification: Assigns categories (e.g., government, education) based on domain suffix rules.
  • HTTP Verification: Checks reachability and captures status codes.
  • Soft 404 Detection: Identifies pages that return a 200 OK status but contain "Page Not Found" text.
  • Human-Check Detection: Flags URLs that likely lead to CAPTCHA or bot-detection screens.

Features in Detail

Soft 404 Detection

Many websites are configured to return a standard 200 OK status even when a page is missing, often displaying a custom "not found" message to users. urlcheck-smith detects this by scanning the first 2000 characters of the response for common markers like:

  • "page not found"
  • "error 404"
  • "the page you requested cannot be found"

If a marker is found, the soft_404_detected field in the output is set to True, allowing you to filter out these "ghost" pages from your results.
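
The heuristic above can be sketched in a few lines. This is an illustration, not urlcheck-smith's actual implementation; the marker list and 2000-character window follow the description above:

```python
# Minimal sketch of the soft-404 heuristic described above
# (illustrative only, not the package's actual code).

SOFT_404_MARKERS = (
    "page not found",
    "error 404",
    "the page you requested cannot be found",
)

def looks_like_soft_404(body: str, window: int = 2000) -> bool:
    """Return True if the first `window` characters contain a known marker."""
    head = body[:window].lower()
    return any(marker in head for marker in SOFT_404_MARKERS)

# A 200 OK page whose body apologizes is flagged:
print(looks_like_soft_404("<html><h1>Page Not Found</h1></html>"))  # True
```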

Installation (development)

python3 -m venv .venv
. .venv/bin/activate
pip install -e '.[dev]'
pytest

Commands Overview


1. scan — extract → classify → (optional) HTTP check

CSV output (default)

urlcheck-smith scan sample.txt -o urls.csv

JSONL output

urlcheck-smith scan sample.txt \
  --no-http \
  --format jsonl \
  -o urls.jsonl

Skip HTTP check

urlcheck-smith scan notes.txt --no-http -o urls_wo_status.csv

Custom rules

urlcheck-smith scan urls.txt \
  --rules my_rules.yaml \
  -o result.csv

Built-in rule presets

urlcheck-smith scan urls.txt --preset japan -o out.csv
urlcheck-smith scan urls.txt --preset eu -o out.csv
urlcheck-smith scan urls.txt --preset global -o out.csv
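
The CSV that `scan` produces can be post-processed with the standard library, for example to drop soft-404 "ghost" pages. The column names here (`url`, `category`, `soft_404_detected`) are assumptions inferred from the fields described above, not a documented schema:

```python
# Hypothetical post-processing of scan output: keep only live pages.
# Column names are assumed from the README's field descriptions.
import csv
import io

sample = io.StringIO(
    "url,category,soft_404_detected\n"
    "https://www.soumu.go.jp/,government,False\n"
    "https://example.com/gone,private,True\n"
)

live = [row for row in csv.DictReader(sample)
        if row["soft_404_detected"] != "True"]
print([row["url"] for row in live])  # ['https://www.soumu.go.jp/']
```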

2. classify-url — classify a single URL

Default (JSON)

urlcheck-smith classify-url https://www.soumu.go.jp/

Explain mode

urlcheck-smith classify-url https://www.soumu.go.jp/ --explain

Output example:

{
  "url": "https://www.soumu.go.jp/",
  "base_url": "www.soumu.go.jp",
  "category": "government",
  "explain": {
    "matched_suffix": ".go.jp",
    "category": "government"
  }
}
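
The output above boils down to hostname extraction plus suffix matching. A rough sketch of that logic, with an illustrative rule table (the real rules live in the package's YAML presets):

```python
# Sketch of suffix-based classification, inferred from the example output.
from urllib.parse import urlparse

SUFFIX_RULES = {".go.jp": "government", ".ac.jp": "education"}  # illustrative

def classify_url(url: str, default: str = "private") -> dict:
    host = urlparse(url).netloc
    for suffix, category in SUFFIX_RULES.items():
        if host.endswith(suffix):
            return {"url": url, "base_url": host, "category": category,
                    "explain": {"matched_suffix": suffix, "category": category}}
    return {"url": url, "base_url": host, "category": default,
            "explain": {"matched_suffix": None, "category": default}}

print(classify_url("https://www.soumu.go.jp/")["category"])  # government
```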

Quiet mode (machine-friendly)

urlcheck-smith classify-url https://www.soumu.go.jp/ --quiet

Presets & custom rules

urlcheck-smith classify-url https://www.gov.uk/ --preset eu
urlcheck-smith classify-url https://policy.example.com/ --rules org_rules.yaml

3. classify — batch classify (no HTTP check)

Input file should contain one URL per line.

CSV output

urlcheck-smith classify urls.txt -o classified.csv

JSONL output

urlcheck-smith classify urls.txt --format jsonl -o out.jsonl

Quiet mode

urlcheck-smith classify urls.txt --quiet

Explain mode

urlcheck-smith classify urls.txt --explain -o out.jsonl
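
JSONL output from batch classification is easy to consume with the standard library; each line is one JSON record. The field names here mirror the `classify-url` example above and should be treated as assumptions:

```python
# Grouping JSONL classify output by category with the standard library.
# Field names mirror the classify-url example; treat them as assumptions.
import json
import io

jsonl = io.StringIO(
    '{"url": "https://www.soumu.go.jp/", "category": "government"}\n'
    '{"url": "https://example.com/", "category": "private"}\n'
)

records = [json.loads(line) for line in jsonl if line.strip()]
by_category = {}
for rec in records:
    by_category.setdefault(rec["category"], []).append(rec["url"])
print(by_category)
```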

Rule System

Custom rule file example

suffix_rules:
  - suffix: ".go.jp"
    category: government
  - suffix: ".example.com"
    category: internal

default_category: private
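
One way rules like these could be applied: try each suffix in order and fall back to `default_category` when nothing matches. The matching order is an assumption for illustration; the package defines the actual precedence:

```python
# Applying rules like the YAML above: first matching suffix wins,
# falling back to default_category. Matching order is an assumption.
RULES = [(".go.jp", "government"), (".example.com", "internal")]
DEFAULT_CATEGORY = "private"

def category_for(host: str) -> str:
    for suffix, category in RULES:
        if host.endswith(suffix):
            return category
    return DEFAULT_CATEGORY

print(category_for("www.soumu.go.jp"))   # government
print(category_for("api.example.com"))   # internal
print(category_for("example.org"))       # private
```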

Built-in presets

  • --preset japan
  • --preset eu
  • --preset global

Each corresponds to a YAML file under urlcheck_smith/data/.


Development

make install
make test

License

MIT
