AI benchmarks from curious people, in one place.
The most interesting AI benchmarks right now aren't the academic ones. They're the personal ones. SnitchBench measures whether a model will report you to the FBI. BullshitBench measures whether it pushes back on a nonsense question. SlopBench measures how much AI slop a model writes. Each one lives on its own site, in its own format. I kept losing track of them, so I put them in one place.
Every score comes from the person who built the benchmark. They run it and publish the numbers. BenchDirectory pulls those numbers in and links straight back to them, so credit and clicks go to the people doing the work.
| Benchmark | Built by | Where the scores come from |
|---|---|---|
| SnitchBench | Theo | snitching-analysis.json in T3-Content/SnitchBench |
| BullshitBench | Peter Gostev | data/v2/latest/leaderboard.csv in petergpt/bullshit-benchmark (data/latest/ is the stale v1 export) |
| SlopBench | Dan Cleary | public leaderboard query on the production SlopBench backend |
| SkateBench | Theo | server-rendered leaderboard at skatebench.t3.gg (the committed JSON is a stale v1 run; the live v2 data was never pushed) |
| ScreenshotBench | Dan Cleary | public matrix query on the production screenshotbench.com backend |
| DeepSWE | Datacurve | server-rendered leaderboard at deepswe.datacurve.ai. No structured export, so the adapter parses the page and fails loudly if the markup changes |
| Planning Benchmark | bladnman | per-run branches in bladnman/planning_benchmark, each branch's results/PLAN_EVAL.md |
| Senior Engineer Bench | Every / Dan Shipper | hand-curated from the published Vibe Check articles (scores live in prose only) |
| CursorBench | Cursor | hand-curated from the published leaderboard page (the benchmark itself is closed) |
Hand-curated tables are marked in the UI and carry the creator's link. They're the fallback for benches whose owners publish numbers but not data files.
adapters/*.ts ──ingest──▶ src/data/*.json ──build──▶ static site
adapters/is one file per benchmark. It grabs the creator's published numbers (raw JSON/CSV from their repo, or a public API) and turns them into a single shape (adapters/types.ts).src/data/is the saved snapshots. Plain JSON you can diff and review, no backend required.src/is the React/Vite site that renders them, grouped by benchmark.
npm install
npm run ingest # refresh every snapshot from its source
npm run ingest -- snitchbench # refresh just one
npm run dev # local dev server
npm run build # type-check + production buildTwo ways.
Easiest: hit Submit a bench on the site and drop a link to your repo. That opens a pre-filled issue and we take it from there.
Or open a PR:
- Publish your results somewhere structured. A JSON or CSV in your repo is perfect.
- Copy any file in
adapters/, point it at your data, and fill in theBenchmarkMeta(name, links, what the score means, which direction is better). - Register it in
adapters/run.ts, runnpm run ingest, and commit the snapshot.
The UI picks up new snapshots on its own.
A GitHub Action re-checks every benchmark daily and commits only when a creator has actually published something new. People update on their own schedule. New models trickle in for days after a launch, so a daily check catches everyone without anyone needing to coordinate. Each section shows when the creator generated the data and when we pulled it.
- More benches (got one? open an issue or use the submit button)
- Cross-bench model report cards: one model, every indie bench
- Bar charts per benchmark
- A
benchmark.jsonmanifest so creators can push instead of being pulled
MIT