BugFind-15 is a visual benchmark for comparing how well LLMs identify and fix bugs without hallucinating extra problems. It provides 15 debugging scenarios with live run traces, deterministic rubric scoring, and execution-backed fix verification, all defined in METHODOLOGY.md.
BugFind-15 is organized into 5 categories with 3 scenarios each:
- A: Syntax & Surface Errors
- B: Logic & Algorithmic Errors
- C: Subtle & Tricky Bugs
- D: Red Herring Resistance
- E: Multi-Turn Debugging
Each scenario is graded across three axes:
- Identification
- Fix Quality
- Discipline
Category E can also apply a multi-turn bonus or penalty when the model asks especially good or bad clarification questions.
The Docker verifier service is required for real benchmark runs. BugFind-15 is not just a prompt-and-score UI; it depends on native execution to verify the model's submitted fix.
For every scenario, each model receives:
- A shared debugger system prompt.
- The scenario's user message and code sample.
- For multi-turn scenarios, one scripted clarification only if the model asks a question.
The runner then:
- Calls the model through `/chat/completions`.
- Records the response trace.
- Injects the scripted follow-up when the scenario allows it.
- Sends the model's final answer to the verifier sandbox service for exact execution-backed fix checking.
- Evaluates identification and discipline with deterministic scenario-specific checks, plus sandbox-backed fix verification.
- Streams progress into the dashboard over Server-Sent Events.
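The last step above, streaming progress over Server-Sent Events, can be sketched from the consumer side. A minimal parser for an SSE chunk as the dashboard might receive it from `/api/run`; the event name and payload fields here are illustrative, not the app's actual schema:

```typescript
// Parse one Server-Sent Events chunk into (event, data) pairs.
// Blocks are separated by a blank line; each block holds "event:" / "data:" lines.
function parseSseEvents(chunk: string): { event: string; data: string }[] {
  return chunk
    .split("\n\n")
    .filter((block) => block.trim())
    .map((block) => {
      let event = "message"; // SSE default when no event: line is present
      const dataLines: string[] = [];
      for (const line of block.split("\n")) {
        if (line.startsWith("event:")) event = line.slice(6).trim();
        else if (line.startsWith("data:")) dataLines.push(line.slice(5).trim());
      }
      return { event, data: dataLines.join("\n") };
    });
}
```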
Official execution verification uses exactly one tagged payload from the model's final answer:

```
<solution language="python|javascript|rust|go" verdict="fix">
corrected code here
</solution>
```

Trap scenarios must instead use:

```
<solution language="python|javascript|rust|go" verdict="no_bug"></solution>
```

If the final answer omits the tag, uses the wrong language, or uses the wrong verdict, official sandbox verification fails for that scenario.
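A hedged sketch of how a runner might extract and validate this payload. The tag shape and allowed languages follow this README; the function and type names are illustrative:

```typescript
// A parsed solution payload: language, verdict, and the code body.
type Solution = { language: string; verdict: "fix" | "no_bug"; code: string };

// Languages the verifier accepts, per the tag format above.
const LANGS = new Set(["python", "javascript", "rust", "go"]);

// Returns null when the tag is missing, the language is unknown, or the
// verdict is malformed -- the cases where official verification fails.
function parseSolution(answer: string): Solution | null {
  const m = answer.match(
    /<solution language="([^"]+)" verdict="(fix|no_bug)">([\s\S]*?)<\/solution>/
  );
  if (!m || !LANGS.has(m[1])) return null;
  return { language: m[1], verdict: m[2] as Solution["verdict"], code: m[3].trim() };
}
```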
Provider errors and request timeouts are retried up to 3 total attempts with backoff. Model requests time out after 30 seconds by default, and the timeout can be overridden with MODEL_REQUEST_TIMEOUT_SECONDS in .env.
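The retry policy can be sketched roughly like this; the exponential-backoff delays are illustrative, not the runner's actual values:

```typescript
// Retry a model call up to `attempts` total attempts, backing off between
// failures. Delay doubles each attempt (baseDelayMs, 2x, 4x, ...).
async function withRetry<T>(
  call: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError; // all attempts exhausted
}
```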
BugFind-15 includes a separate Docker-based verification sandbox service for executing code with real runtimes and compilers. This service is required during benchmark runs because Fix Quality depends on execution-backed verification.
- Python via `python3`
- JavaScript via `node`
- Rust via `rustc`
- Go via pinned `go 1.21`
The canonical runner uses a locked-down container with no network access, a read-only root filesystem, and a temporary writable /tmp. The long-running service container uses the same verifier image and runtime limits, but exposes port 4010 so the app server can send model replies to it.
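Illustratively, the sandbox posture described above maps onto standard Docker flags; the image name and entrypoints below are assumptions, not the repo's actual values:

```shell
# One-shot canonical run: no network, read-only root, writable /tmp only.
docker run --rm --network none --read-only --tmpfs /tmp \
  bugfind-verifier run-canonical

# Long-running service: same image and limits, but port 4010 exposed
# so the app server can reach it.
docker run -d --read-only --tmpfs /tmp -p 4010:4010 \
  bugfind-verifier serve
```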
For normal usage, you should have two processes running:
- The verifier sandbox service
- The web app
Start the verifier web service:
```
npm run verify:sandbox:serve
```

If the Docker image does not exist yet, that command builds it automatically before starting the service.

Stop the verifier web service:

```
npm run verify:sandbox:stop
```

Run all canonical scenario checks:

```
npm run verify:canonical
```

Rebuild the image and run all checks:

```
npm run verify:canonical:rebuild
```

Run a single scenario or variant:

```
node scripts/verify-sandbox.mjs run --scenario BF-08
node scripts/verify-sandbox.mjs run --scenario BF-15 --variant fixed
```

The app server reads the verifier URL from `BUGFIND_SANDBOX_URL` and calls `POST /verify-answer` after each final model reply. If the service is not running, the benchmark cannot perform official fix verification, so the run should be treated as incomplete rather than authoritative. If the service is running but the model does not provide a valid `<solution>` block, the verifier returns a failure rather than inferring a fix.
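A rough sketch of how the app server's verifier call might be assembled. The `POST /verify-answer` endpoint and default URL come from this README, but the request-body fields are assumptions about the wire format:

```typescript
// Build the fetch arguments for one verification request.
// `baseUrl` would normally come from BUGFIND_SANDBOX_URL.
function verifierRequest(
  scenarioId: string,
  finalAnswer: string,
  baseUrl = "http://127.0.0.1:4010"
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    url: `${baseUrl}/verify-answer`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ scenarioId, finalAnswer }),
    },
  };
}
```

The actual call would then be `fetch(r.url, r.init)`, bounded by the `BUGFIND_SANDBOX_TIMEOUT_MS` timeout.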
- Each scenario produces a 0-100 score from the weighted axis rubric in METHODOLOGY.md.
- Scenario cells are shown as `pass`, `partial`, or `fail` based on score thresholds.
- Category scores are averaged per category.
- The final score is a weighted average of the 5 category scores:
- A: 15%
- B: 25%
- C: 25%
- D: 20%
- E: 15%
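A minimal sketch of that weighted final-score computation, using the category weights above (the function name is illustrative):

```typescript
// Category weights from the rubric: A 15%, B 25%, C 25%, D 20%, E 15%.
const CATEGORY_WEIGHTS: Record<string, number> = {
  A: 0.15,
  B: 0.25,
  C: 0.25,
  D: 0.2,
  E: 0.15,
};

// Combine per-category averages (each 0-100) into the final 0-100 score.
function finalScore(categoryAverages: Record<string, number>): number {
  let total = 0;
  for (const [category, weight] of Object.entries(CATEGORY_WEIGHTS)) {
    total += weight * (categoryAverages[category] ?? 0);
  }
  return total;
}
```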
BugFind-15 accepts models from five OpenAI-compatible providers:
`openrouter`, `ollama`, `llamacpp`, `mlx`, `lmstudio`
Model configuration uses comma-separated provider:model entries.
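Parsing those entries might look like the following sketch; the provider list matches this README, while the function name, error handling, and example model names are illustrative:

```typescript
// Providers BugFind-15 accepts, per the list above.
const PROVIDERS = new Set(["openrouter", "ollama", "llamacpp", "mlx", "lmstudio"]);

// Split a comma-separated "provider:model" list into structured entries.
// Only the first ":" separates provider from model, so model names may
// themselves contain ":" or "/".
function parseModels(list: string): { provider: string; model: string }[] {
  return list
    .split(",")
    .map((entry) => entry.trim())
    .filter(Boolean)
    .map((entry) => {
      const idx = entry.indexOf(":");
      if (idx === -1 || !PROVIDERS.has(entry.slice(0, idx))) {
        throw new Error(`Invalid model entry: ${entry}`);
      }
      return { provider: entry.slice(0, idx), model: entry.slice(idx + 1) };
    });
}
```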
BugFind-15 reads configuration from .env. The main variables are:
- `OPENROUTER_API_KEY`: Required only if any configured model uses the `openrouter` provider.
- `OLLAMA_HOST`: Base URL for Ollama. Required only if `LLM_MODELS` or `LLM_MODELS_2` contains an `ollama:` model.
- `LLAMACPP_HOST`: Base URL for a llama.cpp OpenAI-compatible server. Required only if you use a `llamacpp:` model.
- `MLX_HOST`: Base URL for an `mlx_lm` OpenAI-compatible server. Required only if you use an `mlx:` model.
- `LMSTUDIO_HOST`: Base URL for LM Studio. Required only if you use an `lmstudio:` model.
- `MODEL_REQUEST_TIMEOUT_SECONDS`: Per-request model timeout in seconds. Defaults to `30`. Timeout failures are retried up to 3 total attempts.
- `BUGFIND_SANDBOX_URL`: URL of the required verifier service. Defaults to `http://127.0.0.1:4010`.
- `BUGFIND_SANDBOX_TIMEOUT_MS`: Timeout for requests from the app server to the verifier service, in milliseconds. Defaults to `20000`.
- `LLM_MODELS`: Comma-separated `provider:model` list for the primary benchmark table.
- `LLM_MODELS_2`: Optional second comma-separated `provider:model` list for a secondary table/group in the UI.
Notes:
- Local provider hosts can usually be given as either the raw host or an existing `/v1` endpoint. The app normalizes them to the expected OpenAI-compatible base URL.
- The verifier service is required for authoritative runs, so `BUGFIND_SANDBOX_URL` should point to a running sandbox instance.
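A minimal `.env` might look like this; the model names and API key are placeholders, not recommendations:

```shell
# Primary benchmark table: one hosted and one local model (values illustrative)
LLM_MODELS=openrouter:openai/gpt-4o-mini,ollama:llama3.1
OPENROUTER_API_KEY=sk-or-...
OLLAMA_HOST=http://127.0.0.1:11434
BUGFIND_SANDBOX_URL=http://127.0.0.1:4010
MODEL_REQUEST_TIMEOUT_SECONDS=30
```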
One-time setup:
```
npm install
cp .env.example .env
```

Then run the two required processes.
Terminal 1:
```
npm run verify:sandbox:serve
```

Terminal 2:

```
npm run dev
```

Open http://localhost:3000.
Required runtime workflow:
- Terminal 1: `npm run verify:sandbox:serve`
- Terminal 2: `npm run dev`
```
npm run lint
npm run typecheck
npm run build
npm run verify:canonical
```

- app/ contains the Next.js app router entry points and styles.
- components/dashboard.tsx renders the benchmark UI and live event handling.
- app/api/run/route.ts streams benchmark progress over Server-Sent Events.
- lib/benchmark.ts defines the BugFind-15 scenarios, scoring rubric, and multi-turn follow-ups.
- lib/orchestrator.ts runs scenarios and captures traces.
- lib/llm-client.ts contains the OpenAI-compatible client adapter.
- lib/sandbox-client.ts sends model replies to the verifier service.
- lib/models.ts parses provider configuration and model groups.
- verification/ contains the Docker image, verifier service, and native execution fixtures for all 15 scenarios.
- scripts/verify-sandbox.mjs builds and runs the verifier image from the repo root.
This project is licensed under the MIT License. See LICENSE.
