san64777/veriscrape

veriscrape

fetch, but it tells you the truth. A verified-fetch primitive for web scraping: every fetch returns the bytes plus a portable trust verdict, so you know the moment your data is silently wrong, not three days later through a broken downstream report.

$ pip install veriscrape

import veriscrape

r = veriscrape.get("https://example.com")
r.verdict      # OK | BLOCKED | CHALLENGE | HONEYPOT | SOFT_404 | LOGIN_WALL | EMPTY_SHELL | UNVERIFIED
r.cause        # "cloudflare_challenge" | "datadome" | "js_app_shell" | ...
r.confidence   # 0.0 to 1.0
r.ok           # True only when the content is positively real
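A minimal sketch of how a caller might gate storage on the record, assuming only the four attributes shown above. The names `store_if_trusted` and `sink` are hypothetical, and a stand-in object replaces a real fetch so the sketch runs without network access.

```python
from types import SimpleNamespace


def store_if_trusted(record, body, sink):
    """Append body to sink only when the record is positively OK.

    `record` is any object exposing .ok, .verdict, .cause and .confidence.
    Returns True if the body was stored, False otherwise.
    """
    if record.ok:
        sink.append(body)
        return True
    # Not positively verified: report the evidence for later triage
    # instead of silently saving the bytes as data.
    print(f"skipped: {record.verdict} ({record.cause}) "
          f"confidence={record.confidence:.2f}")
    return False


# Stand-in for a real record (hypothetical values, for illustration only).
fake = SimpleNamespace(ok=False, verdict="EMPTY_SHELL",
                       cause="js_app_shell", confidence=0.97)
stored = []
store_if_trusted(fake, b"<html></html>", stored)
```

Gating on `r.ok` rather than on the HTTP status is the point: an UNVERIFIED page is skipped here too, which a stricter or looser pipeline may want to handle differently.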

The problem

Every scraping tool hands you bytes and a 200 and calls it success. In 2026 a 200 OK is no longer ground truth: it is often a challenge page, a login wall, a soft-404, or an empty JS shell. Status-code retry logic (the industry default) never notices, so the corruption is stored as data and surfaces days later. veriscrape classifies the response deterministically (no LLM) into a verdict, with the evidence and a confidence score.
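To make the failure concrete, here is a small sketch of status-code retry logic accepting a challenge page. It is an illustration only, not code from veriscrape; `fetch_with_retry` and `fake_fetch` are hypothetical names, and the body is a made-up interstitial served with a 200.

```python
def status_only_success(status: int) -> bool:
    """The status-code check most retry loops rely on."""
    return 200 <= status < 300


def fetch_with_retry(fetch, attempts: int = 3):
    """Retry until the status looks successful.

    `fetch` is any zero-argument callable returning (status, body).
    """
    for _ in range(attempts):
        status, body = fetch()
        if status_only_success(status):
            return status, body  # accepted without looking at the bytes
    raise RuntimeError("all attempts failed")


def fake_fetch():
    """Hypothetical challenge interstitial served with HTTP 200."""
    return 200, "<html><title>Just a moment...</title></html>"


status, body = fetch_with_retry(fake_fetch)
# The loop returns on the first attempt: the status is 200, so the
# interstitial is handed back as if it were the requested page.
```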

Verdicts

| verdict | meaning |
| --- | --- |
| OK | genuine origin content |
| BLOCKED | a hard anti-bot deny |
| CHALLENGE | a JS / CAPTCHA interstitial (solvable, not content) |
| HONEYPOT | a decoy / AI-Labyrinth trap |
| SOFT_404 | a "not found" served as 200 |
| LOGIN_WALL | a sign-in / paywall gate instead of the data |
| EMPTY_SHELL | a JS app skeleton with no server-rendered content |
| UNVERIFIED | couldn't tell, abstains rather than guess |

Detection is two-key and conservative: it would rather abstain (UNVERIFIED) than emit a confident wrong OK, because a silent false OK is the exact failure the tool exists to prevent. Today it detects BLOCKED, CHALLENGE, HONEYPOT, SOFT_404, LOGIN_WALL, and EMPTY_SHELL across seven anti-bot vendors (Cloudflare, DataDome, Akamai, PerimeterX/HUMAN, Kasada, Imperva/Incapsula, F5 BIG-IP ASM) and three CAPTCHA gates (reCAPTCHA, Turnstile, hCaptcha), plus vendor-agnostic content signals. A positive OK is emitted for a 200 with substantial server-rendered content, but it stays conservative: a thin or ambiguous page abstains to UNVERIFIED rather than risk a guessed OK.
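The abstain-first rule can be sketched as a toy classifier. This is not veriscrape's detector: the marker strings, the 500-character threshold, and the confidence numbers are all assumptions chosen for illustration, and only three of the eight verdicts are modeled.

```python
from enum import Enum


class Verdict(Enum):
    OK = "OK"
    CHALLENGE = "CHALLENGE"
    UNVERIFIED = "UNVERIFIED"


MIN_TEXT_CHARS = 500  # hypothetical threshold for "substantial" content
CHALLENGE_MARKERS = ("just a moment", "verify you are human")  # assumed


def classify(status: int, body_text: str):
    """Return (verdict, confidence); abstain unless evidence is positive."""
    lowered = body_text.lower()
    # Negative evidence first: a known interstitial phrase wins outright.
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return Verdict.CHALLENGE, 0.9
    # Positive evidence: a 200 with enough text to plausibly be content.
    if status == 200 and len(body_text.strip()) >= MIN_TEXT_CHARS:
        return Verdict.OK, 0.8
    # Neither key turned: abstain rather than guess OK.
    return Verdict.UNVERIFIED, 0.0
```

The ordering matters in this sketch: OK is only reachable after the negative checks have passed, and the default branch is UNVERIFIED rather than OK.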

CLI

$ veriscrape check https://discord.com/app
https://discord.com/app
  !! EMPTY_SHELL (js_app_shell)  confidence=0.97
  HTTP 200

The exit code is pipeline-friendly: 0 when content looks fine (OK / UNVERIFIED), 1 when a problem is detected. Drop it into CI to fail a job that silently scraped a wall. To classify a saved response without any network access, run veriscrape check --file response.html; add --json to emit the record as JSON.
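The exit-code rule described above amounts to a small mapping. The sketch below restates it in Python; `exit_code_for` and its `strict` parameter are hypothetical (the latter is not a CLI flag) and show how a pipeline could choose to treat UNVERIFIED as a failure as well.

```python
# Verdicts the CLI treats as "content looks fine", per the text above.
PASSING_VERDICTS = frozenset({"OK", "UNVERIFIED"})


def exit_code_for(verdict: str, strict: bool = False) -> int:
    """Map a verdict name to a process exit code.

    strict=False mirrors the documented behavior: 0 for OK / UNVERIFIED,
    1 for any detected problem. strict=True is a hypothetical variant
    that only passes a positive OK.
    """
    if strict:
        return 0 if verdict == "OK" else 1
    return 0 if verdict in PASSING_VERDICTS else 1
```

Because UNVERIFIED exits 0 by default, a job that must never store unconfirmed pages would need its own stricter check on the verdict.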

The finding

We ran popular fetchers against a set of targets, captured the raw bodies, and labeled each one independently of veriscrape (benchmark/):

discord.com/app and web.telegram.org return HTTP 200 with an empty JavaScript app-shell: no server-rendered content, just a mount point and a wall of scripts. Every status-code-only fetcher (requests, curl_cffi, scrapling) stores that husk as a successful page. The status says success, the bytes are a skeleton, and the corruption is saved as data with no signal anything went wrong.
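One way to picture an empty app-shell is to compare visible text against the number of script tags. The sketch below does that with the standard library; it is a rough heuristic under assumed thresholds (`max_text_chars`, `min_scripts`), not the rule veriscrape applies.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside script/style counts as visible content.
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def looks_like_empty_shell(html: str, max_text_chars: int = 50,
                           min_scripts: int = 3) -> bool:
    """True when the page is nearly all scripts and almost no text."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return (counter.text_chars <= max_text_chars
            and counter.script_tags >= min_scripts)
```

A status-code check sees the same 200 for a shell and for a real page; a measure like this one is what separates them.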

A note on rigor: an earlier cut of this benchmark reported a higher, scrapling-specific rate. Independent re-labeling of the captured bodies showed one headline cell was a veriscrape false positive (a real homepage mislabeled as a login wall), so that framing is retracted and the detector is fixed. Catching that is the point: the tool is built to flag silently-wrong data, and that discipline has to apply to its own output first.

Reproduce: uv run --extra benchmark python -m benchmark.run.

For the longer story (why a 200 stopped being ground truth, and the design rules behind the verdicts), see why veriscrape exists or the dev.to write-up.

Use it with your existing stack

veriscrape.get() is the drop-in for requests.get, but you don't have to switch fetchers. Add the verdict to what you already have:

from veriscrape.adapters import from_requests, from_response

record = from_requests(requests.get(url))          # a requests.Response
record = from_response(status, headers, body, url=url)   # any stack (httpx, Playwright, ...)

Scrapy: add veriscrape.adapters.VeriscrapeMiddleware to DOWNLOADER_MIDDLEWARES, then read response.meta["veriscrape"] in your spider. Same verdict object everywhere.

Why a verdict, not just bytes

The FetchRecord verdict is portable JSON you own: the same shape travels across stacks (requests / Scrapy / Playwright) and trends per-domain over time. Every fetch emits one; that shared object is the spine. Deterministic-first by design: verdicts are computed from status / headers / cookies / body, dated and reproducible, never a black box.
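As an illustration of trending per domain, the sketch below tallies verdicts by hostname from JSON records. The exact FetchRecord schema is not given here, so the `url` and `verdict` field names are assumptions, and `verdicts_by_domain` is a hypothetical helper.

```python
import json
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def verdicts_by_domain(json_lines):
    """Tally verdicts per hostname from an iterable of JSON strings.

    Each line is assumed to be an object with "url" and "verdict" keys.
    """
    tally = defaultdict(Counter)
    for line in json_lines:
        rec = json.loads(line)
        host = urlsplit(rec["url"]).hostname or "unknown"
        tally[host][rec["verdict"]] += 1
    return tally


# Hypothetical records, for illustration only.
sample = [
    '{"url": "https://example.com/a", "verdict": "OK"}',
    '{"url": "https://example.com/b", "verdict": "SOFT_404"}',
    '{"url": "https://example.org/", "verdict": "OK"}',
]
trend = verdicts_by_domain(sample)
```

Because the record is plain JSON, the same tally works whether the records were produced through requests, Scrapy, or Playwright.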

Status

Pre-alpha · deterministic-first · Apache-2.0 · drop-in for requests.get.

$ uv sync          # for local development from a clone
$ uv run pytest    # 157 tests
