veriscrape

fetch, but it tells you the truth. A verified-fetch primitive for web scraping: every fetch returns the bytes plus a portable trust verdict, so you know the moment your data is silently wrong, not three days later through a broken downstream report.

pip install veriscrape

import veriscrape

r = veriscrape.get("https://example.com")
r.verdict      # OK | BLOCKED | CHALLENGE | HONEYPOT | SOFT_404 | LOGIN_WALL | EMPTY_SHELL | UNVERIFIED
r.cause        # "cloudflare_challenge" | "datadome" | "js_app_shell" | ...
r.confidence   # 0.0 to 1.0
r.ok           # True only when the content is positively real
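
A minimal sketch of how a caller might act on these fields. The `handle` helper, its return labels, and the `RETRYABLE` grouping are hypothetical; only `verdict`, `cause`, `confidence`, and `ok` come from the API above. A `SimpleNamespace` stands in for the real result so the snippet runs without a network.

```python
from types import SimpleNamespace

# Verdicts where fetching again (for example with a real browser) might help.
# This grouping is an assumption made for illustration, not part of veriscrape.
RETRYABLE = {"CHALLENGE", "EMPTY_SHELL"}

def handle(r):
    """Return 'store', 'review', 'retry', or 'discard' for a fetch result."""
    if r.ok:
        return "store"       # positively real content
    if r.verdict == "UNVERIFIED":
        return "review"      # the classifier abstained; do not treat as data yet
    if r.verdict in RETRYABLE:
        return "retry"
    return "discard"

# Stand-in for a veriscrape result, using the field names shown above.
fake = SimpleNamespace(verdict="EMPTY_SHELL", cause="js_app_shell",
                       confidence=0.97, ok=False)
print(handle(fake))
```

The point of the design is that storage is gated on `r.ok` rather than on the HTTP status.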

The problem

Every scraping tool hands you bytes and a 200 and calls it success. In 2026 a 200 OK is no longer ground truth: it is often a challenge page, a login wall, a soft-404, or an empty JS shell. Status-code retry logic (the industry default) never notices, so the corruption is stored as data and surfaces days later. veriscrape classifies the response deterministically (no LLM) into a verdict, with the evidence and a confidence score.

Verdicts

| verdict | meaning |
| --- | --- |
| `OK` | genuine origin content |
| `BLOCKED` | a hard anti-bot deny |
| `CHALLENGE` | a JS / CAPTCHA interstitial (solvable, not content) |
| `HONEYPOT` | a decoy / AI-Labyrinth trap |
| `SOFT_404` | a "not found" served as `200` |
| `LOGIN_WALL` | a sign-in / paywall gate instead of the data |
| `EMPTY_SHELL` | a JS app skeleton with no server-rendered content |
| `UNVERIFIED` | couldn't tell, abstains rather than guess |

Detection is two-key and conservative: it would rather abstain (UNVERIFIED) than emit a confident wrong OK, because a silent false OK is the exact failure the tool exists to prevent. Today it detects BLOCKED, CHALLENGE, HONEYPOT, SOFT_404, LOGIN_WALL, and EMPTY_SHELL across seven anti-bot vendors (Cloudflare, DataDome, Akamai, PerimeterX/HUMAN, Kasada, Imperva/Incapsula, F5 BIG-IP ASM) and three CAPTCHA gates (reCAPTCHA, Turnstile, hCaptcha), plus vendor-agnostic content signals. A positive OK is emitted for a 200 with substantial server-rendered content, but it stays conservative: a thin or ambiguous page abstains to UNVERIFIED rather than risk a guessed OK.
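
One way to read "two-key" is that a problem verdict needs two independent pieces of evidence to agree, and anything short of that abstains. The toy classifier below illustrates that reading only; it is not veriscrape's detector. The `classify` function, the marker string, the header check, and the `MIN_BODY_CHARS` threshold are all assumptions chosen for the example.

```python
MIN_BODY_CHARS = 2000  # arbitrary cutoff for "substantial" content in this sketch

def classify(status: int, headers: dict[str, str], body: str) -> str:
    """Toy two-key classifier: a problem verdict needs both keys to agree."""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    text = body.lower()

    # Key 1: a vendor fingerprint in the headers (illustrative check).
    vendor_header = lowered.get("server", "") == "cloudflare"
    # Key 2: an interstitial marker in the body (illustrative marker).
    challenge_body = "just a moment" in text

    if vendor_header and challenge_body:
        return "CHALLENGE"       # both keys turned
    if challenge_body:
        return "UNVERIFIED"      # suspicious but unconfirmed: abstain
    if status == 200 and len(body.strip()) >= MIN_BODY_CHARS:
        return "OK"              # no problem signal and substantial content
    return "UNVERIFIED"          # thin or ambiguous: abstain rather than guess

print(classify(200, {"Server": "cloudflare"}, "<title>Just a moment...</title>"))
print(classify(200, {}, "hi"))
```

Note that the vendor header alone never produces a problem verdict, since many genuine pages are served through the same vendor.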

CLI

$ veriscrape check https://discord.com/app
https://discord.com/app
  !! EMPTY_SHELL (js_app_shell)  confidence=0.97
  HTTP 200

The exit code is pipeline-friendly: 0 when no problem is detected (OK / UNVERIFIED), 1 when one is. Drop it into CI to fail a job that silently scraped a wall. veriscrape check --file response.html classifies a saved response without touching the network; --json emits the record.
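
The exit-code rule can be written down as a small function, which is handy if you aggregate many verdicts in your own pipeline script. This is a sketch of the rule as described here, not the CLI's source; `exit_code` and `job_exit_code` are hypothetical names.

```python
# Verdicts that the CLI description above maps to exit code 0.
PASSING = {"OK", "UNVERIFIED"}

def exit_code(verdict: str) -> int:
    """0 when no problem is detected, 1 for any problem verdict."""
    return 0 if verdict in PASSING else 1

def job_exit_code(verdicts) -> int:
    """Fail the whole job if any single fetch hit a problem verdict."""
    return max((exit_code(v) for v in verdicts), default=0)

print(job_exit_code(["OK", "UNVERIFIED", "EMPTY_SHELL"]))
```

A stricter pipeline might choose to fail on UNVERIFIED as well; that is a policy decision for the caller.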

The finding

We ran popular fetchers against a set of targets, captured the raw bodies, and labeled each one independently of veriscrape (benchmark/):

discord.com/app and web.telegram.org return HTTP 200 with an empty JavaScript app-shell: no server-rendered content, just a mount point and a wall of scripts. Every status-code-only fetcher (requests, curl_cffi, scrapling) stores that husk as a successful page. The status says success, the bytes are a skeleton, and the corruption is saved as data with no signal anything went wrong.
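
To make the "mount point and a wall of scripts" pattern concrete, here is a rough standard-library heuristic: count `<script>` tags and measure how much text lives outside them. It is an illustration of the idea, not veriscrape's detector; `looks_like_empty_shell` and both thresholds are assumptions.

```python
from html.parser import HTMLParser

class _VisibleText(HTMLParser):
    """Collect text outside <script>, <style>, and <noscript>; count scripts."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

def looks_like_empty_shell(html: str, max_text_chars: int = 200,
                           min_scripts: int = 3) -> bool:
    """True when the page has several scripts and almost no visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    visible = " ".join("".join(parser.chunks).split())  # collapse whitespace
    return len(visible) <= max_text_chars and parser.script_count >= min_scripts

shell = ('<html><head><title>App</title><script src="a.js"></script>'
         '<script src="b.js"></script><script>window.x=1;</script></head>'
         '<body><div id="app-mount"></div></body></html>')
print(looks_like_empty_shell(shell))
```

A status-code check sees nothing wrong with `shell`; a content check like this one does.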

A note on rigor: an earlier cut of this benchmark reported a higher, scrapling-specific rate. Independent re-labeling of the captured bodies showed one headline cell was a veriscrape false positive (a real homepage mislabeled as a login wall), so that framing is retracted and the detector is fixed. Catching that is the point: the tool is built to flag silently-wrong data, and that discipline has to apply to its own output first.

Reproduce: uv run --extra benchmark python -m benchmark.run.

For the longer story (why a 200 stopped being ground truth, and the design rules behind the verdicts), see why veriscrape exists (WHY.md) or the dev.to write-up.

Use it with your existing stack

veriscrape.get() is the drop-in for requests.get, but you don't have to switch fetchers. Add the verdict to what you already have:

from veriscrape.adapters import from_requests, from_response

record = from_requests(requests.get(url))          # a requests.Response
record = from_response(status, headers, body, url=url)   # any stack (httpx, Playwright, ...)

Scrapy: add veriscrape.adapters.VeriscrapeMiddleware to DOWNLOADER_MIDDLEWARES, then read response.meta["veriscrape"] in your spider. Same verdict object everywhere.
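
A sketch of the Scrapy wiring described above. The middleware path and the `response.meta["veriscrape"]` key come from this README; the priority `543` is an assumed value (check where the middleware should sit relative to your other downloader middlewares), `item_or_none` is a hypothetical helper, and the `.ok` and `.verdict` attributes on the record are assumed to match the fields shown at the top. A `SimpleNamespace` stands in for a Scrapy response so the snippet runs on its own.

```python
from types import SimpleNamespace

# settings.py fragment. 543 is an assumed priority, not a documented one.
DOWNLOADER_MIDDLEWARES = {
    "veriscrape.adapters.VeriscrapeMiddleware": 543,
}

def item_or_none(response):
    """Call from a spider callback: build an item only for verified content."""
    record = response.meta["veriscrape"]
    if not record.ok:
        return None  # skip walls, shells, and abstentions instead of storing them
    return {"url": response.url, "verdict": record.verdict}

# Stand-in response for demonstration only.
fake_response = SimpleNamespace(
    url="https://example.com",
    meta={"veriscrape": SimpleNamespace(ok=True, verdict="OK")},
)
print(item_or_none(fake_response))
```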

Why a verdict, not just bytes

The FetchRecord verdict is portable JSON you own: the same shape travels across stacks (requests / Scrapy / Playwright) and trends per-domain over time. Every fetch emits one; that shared object is the spine. Deterministic-first by design: verdicts are computed from status / headers / cookies / body, dated and reproducible, never a black box.
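
Because the record is plain JSON, per-domain trending needs nothing beyond the standard library. The sketch below assumes each record is one JSON object per line with `url` and `verdict` keys; the actual FetchRecord schema is not shown here, so treat those key names and the `verdicts_by_domain` helper as hypothetical.

```python
import json
from collections import Counter, defaultdict
from urllib.parse import urlsplit

def verdicts_by_domain(json_lines):
    """Count verdicts per hostname from an iterable of JSON-encoded records."""
    counts = defaultdict(Counter)
    for line in json_lines:
        rec = json.loads(line)
        host = urlsplit(rec["url"]).hostname or "unknown"
        counts[host][rec["verdict"]] += 1
    return counts

# Hand-written sample records, not real benchmark output.
lines = [
    '{"url": "https://discord.com/app", "verdict": "EMPTY_SHELL"}',
    '{"url": "https://discord.com/app", "verdict": "EMPTY_SHELL"}',
    '{"url": "https://example.com/", "verdict": "OK"}',
]
for host, tally in verdicts_by_domain(lines).items():
    print(host, dict(tally))
```

Run over a day's records, a tally like this shows when a domain that used to return OK starts returning walls or shells.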

Status

Pre-alpha · deterministic-first · Apache-2.0 · drop-in for requests.get.

$ uv sync          # for local development from a clone
$ uv run pytest    # 157 tests

