fetch, but it tells you the truth. A verified-fetch primitive for web scraping: every fetch returns the bytes plus a portable trust verdict, so you know the moment your data is silently wrong, not three days later through a broken downstream report.

    pip install veriscrape

    import veriscrape
    r = veriscrape.get("https://example.com")
    r.verdict     # OK | BLOCKED | CHALLENGE | HONEYPOT | SOFT_404 | LOGIN_WALL | EMPTY_SHELL | UNVERIFIED
    r.cause       # "cloudflare_challenge" | "datadome" | "js_app_shell" | ...
    r.confidence  # 0.0 to 1.0
    r.ok          # True only when the content is positively real

Every scraping tool hands you bytes and a 200 and calls it success. In 2026 a 200 OK is no longer
ground truth: it is often a challenge page, a login wall, a soft-404, or an empty JS shell.
Status-code retry logic (the industry default) never notices, so the corruption is stored as data and
surfaces days later. veriscrape classifies the response deterministically (no LLM) into a
verdict, with the evidence and a confidence score.
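
One way to act on the verdict is to gate storage on `ok` rather than on the status code. The sketch below assumes only the attributes listed above (`verdict`, `cause`, `confidence`, `ok`); the `partition` helper, the `url` attribute, and the stand-in records are hypothetical, used so the example runs without network access.

```python
from types import SimpleNamespace


def partition(records):
    """Split records into (keep, quarantine) using the verdict, not the status code."""
    keep, quarantine = [], []
    for r in records:
        # r.ok is True only when the content is positively real.
        (keep if r.ok else quarantine).append(r)
    return keep, quarantine


# Stand-in records mimicking the attribute names shown above.
# The values (and the url attribute) are made up for illustration.
records = [
    SimpleNamespace(url="https://example.com/a", verdict="OK",
                    cause=None, confidence=0.9, ok=True),
    SimpleNamespace(url="https://example.com/b", verdict="CHALLENGE",
                    cause="cloudflare_challenge", confidence=0.95, ok=False),
    SimpleNamespace(url="https://example.com/c", verdict="UNVERIFIED",
                    cause=None, confidence=0.0, ok=False),
]

keep, quarantine = partition(records)
for r in quarantine:
    print(f"skipped {r.url}: {r.verdict} ({r.cause})")
```

Note that an UNVERIFIED record lands in quarantine here. That is a choice of this sketch: a stricter pipeline may prefer to re-fetch or review abstentions rather than store them.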
| verdict | meaning |
|---|---|
| OK | genuine origin content |
| BLOCKED | a hard anti-bot deny |
| CHALLENGE | a JS / CAPTCHA interstitial (solvable, not content) |
| HONEYPOT | a decoy / AI-Labyrinth trap |
| SOFT_404 | a "not found" served as 200 |
| LOGIN_WALL | a sign-in / paywall gate instead of the data |
| EMPTY_SHELL | a JS app skeleton with no server-rendered content |
| UNVERIFIED | couldn't tell, abstains rather than guess |
Detection is two-key and conservative: it would rather abstain (UNVERIFIED) than emit a
confident wrong OK, because a silent false OK is the exact failure the tool exists to prevent.
Today it detects BLOCKED, CHALLENGE, HONEYPOT, SOFT_404, LOGIN_WALL, and EMPTY_SHELL across
seven anti-bot vendors (Cloudflare, DataDome, Akamai, PerimeterX/HUMAN, Kasada, Imperva/Incapsula, F5 BIG-IP ASM)
and three CAPTCHA gates (reCAPTCHA, Turnstile, hCaptcha), plus vendor-agnostic content signals. A
positive OK is emitted for a 200 with substantial server-rendered content, but it stays
conservative: a thin or ambiguous page abstains to UNVERIFIED rather than risk a guessed OK.
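
The abstain-first behaviour can be illustrated with a toy classifier. This is not veriscrape's detector: it reads "two-key" as "commit only when two independent signals agree", which is an assumption, and the markers, the `MIN_WORDS` threshold, and the `classify` function are all hypothetical.

```python
import re

MIN_WORDS = 50  # assumed threshold for "substantial" content


def classify(status, headers, body):
    """Toy two-key classifier: commit only when independent signals agree."""
    header_names = {k.lower() for k in headers}
    text = body.lower()

    # Key 1: a header-level hint that an anti-bot layer answered (illustrative).
    header_key = "cf-mitigated" in header_names
    # Key 2: a body-level hint that the page is an interstitial (illustrative).
    body_key = "just a moment" in text

    if header_key and body_key:
        return "CHALLENGE"   # both keys turned: commit to a verdict
    if header_key or body_key:
        return "UNVERIFIED"  # only one key turned: abstain

    # A positive OK needs a 200 plus enough visible text; otherwise abstain.
    words = re.sub(r"<[^>]+>", " ", body).split()
    if status == 200 and len(words) >= MIN_WORDS:
        return "OK"
    return "UNVERIFIED"


print(classify(403, {"cf-mitigated": "challenge"}, "<title>Just a moment...</title>"))
print(classify(200, {}, "<p>hi</p>"))
```

The point of the shape is that every uncertain path falls through to UNVERIFIED, so a wrong OK requires the content check itself to be fooled, not merely a missing signal.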

    $ veriscrape check https://discord.com/app
    https://discord.com/app
    !! EMPTY_SHELL (js_app_shell) confidence=0.97
    HTTP 200

The exit code is pipeline-friendly: 0 when content looks fine (OK / UNVERIFIED), 1 when a
problem is detected. Drop it into CI to fail a job that silently scraped a wall. veriscrape check --file response.html classifies a saved response with no network; --json emits the record.
We ran popular fetchers against a set of targets, captured the raw bodies, and labeled each one
independently of veriscrape (benchmark/):
discord.com/app and web.telegram.org return HTTP 200 with an empty JavaScript app-shell: no server-rendered content, just a mount point and a wall of scripts. Every status-code-only fetcher (requests, curl_cffi, scrapling) stores that husk as a successful page. The status says success, the bytes are a skeleton, and the corruption is saved as data with no signal anything went wrong.
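
To make "a mount point and a wall of scripts" concrete, here is a rough heuristic that compares visible text against the number of script tags. It is a standalone sketch using only the standard library, not veriscrape's actual detection logic; the thresholds and the names `looks_like_empty_shell` and `_TextVsScript` are assumptions.

```python
from html.parser import HTMLParser


class _TextVsScript(HTMLParser):
    """Count visible text characters and <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script>, <style>, or <noscript>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style", "noscript"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style", "noscript") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def looks_like_empty_shell(html, max_visible_chars=200, min_scripts=3):
    """Heuristic: very little visible text combined with several script tags."""
    parser = _TextVsScript()
    parser.feed(html)
    parser.close()
    return (parser.visible_chars <= max_visible_chars
            and parser.script_tags >= min_scripts)


shell = ('<html><head><title>App</title></head><body><div id="app-mount"></div>'
         + '<script src="/a.js"></script>' * 5 + "</body></html>")
article = ("<html><body><article>" + "Real paragraph text. " * 40
           + '</article><script src="/a.js"></script></body></html>')

print(looks_like_empty_shell(shell), looks_like_empty_shell(article))
```

Both pages would come back as HTTP 200, which is why a status-code check cannot tell them apart.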
A note on rigor: an earlier cut of this benchmark reported a higher, scrapling-specific rate. Independent re-labeling of the captured bodies showed one headline cell was a veriscrape false positive (a real homepage mislabeled as a login wall), so that framing is retracted and the detector is fixed. Catching that is the point: the tool is built to flag silently-wrong data, and that discipline has to apply to its own output first.
Reproduce: uv run --extra benchmark python -m benchmark.run.
For the longer story (why a 200 stopped being ground truth, and the design rules behind the verdicts), see why veriscrape exists or the dev.to write-up.
veriscrape.get() is the drop-in for requests.get, but you don't have to switch fetchers. Add
the verdict to what you already have:

    import requests
    from veriscrape.adapters import from_requests, from_response

    record = from_requests(requests.get(url))               # a requests.Response
    record = from_response(status, headers, body, url=url)  # any stack (httpx, Playwright, ...)

Scrapy: add veriscrape.adapters.VeriscrapeMiddleware to DOWNLOADER_MIDDLEWARES, then read
response.meta["veriscrape"] in your spider. Same verdict object everywhere.
The FetchRecord verdict is portable JSON you own: the same shape travels across stacks
(requests / Scrapy / Playwright) and trends per-domain over time. Every fetch emits one; that shared
object is the spine. Deterministic-first by design: verdicts are computed from status / headers /
cookies / body, dated and reproducible, never a black box.
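
As an illustration of a portable record that "trends per-domain", the sketch below serializes records as JSON lines and tallies verdicts by hostname. The real FetchRecord schema is not shown in this README, so `RecordSketch`, its `url`, `status`, and `fetched_at` fields, the sample values, and `verdicts_by_domain` are all assumptions; only `verdict`, `cause`, and `confidence` come from the text above.

```python
import json
from collections import Counter, defaultdict
from dataclasses import asdict, dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class RecordSketch:
    """Hypothetical stand-in for a fetch record; not the real FetchRecord schema."""
    url: str
    status: int
    verdict: str
    cause: Optional[str]
    confidence: float
    fetched_at: str  # ISO 8601 date, so the verdict is dated


def verdicts_by_domain(json_lines):
    """Tally verdicts per hostname from an iterable of JSON-encoded records."""
    counts = defaultdict(Counter)
    for line in json_lines:
        rec = json.loads(line)
        counts[urlsplit(rec["url"]).hostname][rec["verdict"]] += 1
    return counts


# Made-up sample records for illustration only.
records = [
    RecordSketch("https://example.com/a", 200, "OK", None, 0.9, "2026-01-01"),
    RecordSketch("https://example.com/b", 200, "OK", None, 0.8, "2026-01-02"),
    RecordSketch("https://shop.example.org/p/1", 200, "CHALLENGE",
                 "cloudflare_challenge", 0.95, "2026-01-02"),
]

# One JSON object per line is easy to append to and to ship between stacks.
lines = [json.dumps(asdict(r)) for r in records]
counts = verdicts_by_domain(lines)
for host, tally in counts.items():
    print(host, dict(tally))
```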
Pre-alpha · deterministic-first · Apache-2.0 · drop-in for requests.get.

    $ uv sync        # for local development from a clone
    $ uv run pytest  # 157 tests