Add public BU_Bench_V1 framework reverification runner by Alezander9 · Pull Request #14 · browser-use/benchmark

Alezander9 · 2026-05-12T22:34:18Z

Summary

- Add a local framework reverification runner for BU_Bench_V1 that uses the encrypted public task artifact.
- Add public-safe framework adapters, local result and trace writing, and no-op Laminar compatibility shims.
- Document the short framework eval entry point and guard against committing decrypted task JSON or traces.

Safety notes

- Does not add plaintext tasks or internal result files.
- `run_data` remains local-only and gitignored because it can contain decrypted tasks, ground truth, model outputs, and screenshots.

Validation

uv run python -m compileall run_framework_eval.py frameworks browsers laminar.py lmnr.py models.py
uv run python run_framework_eval.py --list-frameworks
uv run python -c "from frameworks import FRAMEWORKS, load_tasks; assert len(load_tasks('BU_Bench_V1')) == 100; assert 'bcode-v012' in FRAMEWORKS; print('ok')"
uv run python run_framework_eval.py --framework browser-use --model bu-2-0 --tasks 0
uv run python run_framework_eval.py --framework bcode-v012 --model gpt-5 --tasks 0

cubic-dev-ai

15 issues found across 52 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="frameworks/browserbase_agent/package.json">

<violation number="1" location="frameworks/browserbase_agent/package.json:7">
P2: The Stagehand dependency is declared as a floating range (`^3.4.0`) even though this runner is documented as pinned to 3.4.0. Use an exact version to keep eval behavior reproducible.</violation>
</file>

<file name="frameworks/claude_code_harness_js/system_prompt.md">

<violation number="1" location="frameworks/claude_code_harness_js/system_prompt.md:16">
P2: The prompt has contradictory file-write rules: it requires saving screenshots to `/tmp/shots` but also forbids edits outside the working directory.</violation>
</file>

<file name="frameworks/browser_use/run_task.py">

<violation number="1" location="frameworks/browser_use/run_task.py:58">
P2: Create the agent inside the `try` so provider/browser cleanup still runs if agent initialization fails.</violation>
</file>
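
A minimal sketch of the fix suggested for `frameworks/browser_use/run_task.py`, assuming an async runner. `FakeBrowser`, `make_agent`, and this `run_task` signature are hypothetical stand-ins, not the adapter's real code; the point is only that agent construction sits inside the `try`, so the `finally` cleanup still runs when initialization fails.

```python
import asyncio


class FakeBrowser:
    """Hypothetical stand-in for a provisioned browser/provider session."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


def make_agent(browser, fail):
    """Hypothetical agent factory that may raise during initialization."""
    if fail:
        raise RuntimeError("agent init failed")
    return object()


async def run_task(browser, fail_init=False):
    try:
        # Agent construction happens inside the try, so a failure here
        # still reaches the finally block below.
        agent = make_agent(browser, fail_init)
        return "done" if agent is not None else "no agent"
    finally:
        await browser.close()
```

With this shape, a call such as `asyncio.run(run_task(FakeBrowser(), fail_init=True))` raises `RuntimeError`, but the browser is closed before the exception propagates.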

<file name="frameworks/pibt/run_task.py">

<violation number="1" location="frameworks/pibt/run_task.py:275">
P2: Cloud browser session leaks if code between `_start_browser()` and the `try` block raises. If `SYSTEM_PROMPT_FILE.read_text()` or `create_subprocess_exec` fails, `_stop_browser` is never reached because it's only in the inner `finally`.</violation>
</file>
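
One possible way to close the gap described for `frameworks/pibt/run_task.py` is to tie the browser's lifetime to a context manager, so everything after startup is covered by a single `try/finally`. The sketch below is illustrative: `start_browser`, `stop_browser`, `browser_session`, and the `read_prompt` callback are hypothetical names, and the `events` list exists only to make the ordering observable.

```python
from contextlib import contextmanager

events = []  # records lifecycle calls so the ordering can be inspected


def start_browser():
    """Hypothetical stand-in for provisioning a cloud browser."""
    events.append("start")
    return {"id": "session-1"}


def stop_browser(session):
    """Hypothetical stand-in for tearing the browser down."""
    events.append("stop")


@contextmanager
def browser_session():
    session = start_browser()
    try:
        yield session
    finally:
        # Runs for any exception raised inside the with-block,
        # including failures before the subprocess is ever created.
        stop_browser(session)


def run_task(read_prompt):
    with browser_session() as session:
        prompt = read_prompt()  # may raise, e.g. a missing prompt file
        return f"{session['id']}: prompt has {len(prompt)} chars"
```

If `read_prompt` raises, the session is still stopped, because the failure now occurs inside the protected block rather than before it.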

<file name="frameworks/stagehand/executor.js">

<violation number="1" location="frameworks/stagehand/executor.js:49">
P1: Do not emit a successful placeholder result for an unimplemented executor; fail fast so the runner can mark Stagehand as unsupported/not ready.</violation>
</file>

<file name="frameworks/claude_code_harness_ab/run_task.py">

<violation number="1" location="frameworks/claude_code_harness_ab/run_task.py:295">
P2: Browser sessions can leak when subprocess startup fails because `_start_browser()` runs before the protected cleanup block.</violation>
</file>

<file name="frameworks/__init__.py">

<violation number="1" location="frameworks/__init__.py:309">
P2: Avoid catching `BaseException` here; it swallows interrupts/exit signals and can make the runner ignore Ctrl+C or shutdown events.</violation>
</file>
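
To illustrate the `BaseException` concern: in Python, `KeyboardInterrupt` and `SystemExit` derive from `BaseException` but not from `Exception`, so narrowing the handler lets them propagate. `run_guarded` below is a hypothetical helper, not the code at `frameworks/__init__.py:309`.

```python
def run_guarded(fn):
    """Run fn, converting ordinary errors into a result tuple.

    Catching Exception (not BaseException) means KeyboardInterrupt and
    SystemExit are not swallowed and still reach the caller.
    """
    try:
        return ("ok", fn())
    except Exception as exc:
        return ("error", repr(exc))
```

An ordinary failure such as a `ZeroDivisionError` is reported as an `"error"` result, while Ctrl+C during `fn` still interrupts the runner.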

<file name="frameworks/bcode/run_task.py">

<violation number="1" location="frameworks/bcode/run_task.py:151">
P2: Handle failures after browser creation in `_start_browser` to avoid leaking cloud browser sessions.</violation>
</file>
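
The `bcode` comment concerns a failure inside the start routine itself, after the session already exists. A sketch of one way to handle that, assuming the start step can be split into create and connect phases (`create`, `connect`, and `stop` are hypothetical callables passed in for illustration):

```python
def start_browser(create, connect, stop):
    """Create a session and connect to it, stopping it if setup fails."""
    session = create()
    ready = False
    try:
        connect(session)
        ready = True
        return session
    finally:
        # Only tear down on the failure path; a successfully started
        # session is handed to the caller, who owns its cleanup.
        if not ready:
            stop(session)
```

The `ready` flag keeps the cleanup in a `finally` block without catching and re-raising, so interrupts during `connect` also trigger teardown.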

<file name="frameworks/codex_harness/run_task.py">

<violation number="1" location="frameworks/codex_harness/run_task.py:264">
P2: Start browser lifecycle management inside an outer try/finally. Right now failures before the inner try block can leak the provisioned browser daemon.</violation>
</file>

<file name="frameworks/claude_cua/run_task.py">

<violation number="1" location="frameworks/claude_cua/run_task.py:68">
P2: The unimplemented adapter returns placeholder results instead of failing fast, which can silently produce misleading benchmark/judgement outputs.</violation>
</file>
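
A sketch of the fail-fast behavior requested for unimplemented adapters. The exception type `AdapterNotReady`, the result dictionary shape, and the `evaluate` wrapper are all assumptions made for illustration; the actual runner may report unsupported frameworks differently.

```python
class AdapterNotReady(RuntimeError):
    """Hypothetical error raised by adapters that are not implemented."""


def run_task(task):
    # Fail fast instead of returning a placeholder that looks like success.
    raise AdapterNotReady("claude_cua adapter is not implemented")


def evaluate(task, runner):
    """Hypothetical runner-side wrapper that records unsupported adapters."""
    try:
        return {"status": "completed", "output": runner(task)}
    except AdapterNotReady as exc:
        return {"status": "unsupported", "error": str(exc)}
```

Because the adapter raises, nothing resembling a real answer reaches the judge, and the run is recorded as unsupported rather than scored.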

<file name="frameworks/but_rust/run_task.py">

<violation number="1" location="frameworks/but_rust/run_task.py:192">
P1: Browser lifecycle is not protected by a global `try/finally`, so exceptions can leak started browser instances.</violation>

<violation number="2" location="frameworks/but_rust/run_task.py:329">
P2: Fallback final-text parsing ignores `payload.content`, which can drop valid assistant output for `assistant.message` events.</violation>
</file>
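
For the second `but_rust` comment, a fallback parser could consult `payload.content` when the primary field is absent. The event layout below is an assumption: only the `assistant.message` event type and `payload.content` come from the review comment, while the `text` field and the `final_text` helper are hypothetical.

```python
def final_text(events):
    """Return the last non-empty assistant text found in events, or None."""
    for event in reversed(events):
        if event.get("type") != "assistant.message":
            continue
        payload = event.get("payload") or {}
        # "text" is a hypothetical primary field; "content" is the
        # fallback the review says is currently ignored.
        text = payload.get("text") or payload.get("content")
        if isinstance(text, str) and text.strip():
            return text
    return None
```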

<file name="run_framework_eval.py">

<violation number="1" location="run_framework_eval.py:193">
P2: Validate `--parallel` as >= 1 before creating the semaphore; `0` causes a deadlock where no task can start.</violation>
</file>
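
One way to enforce the `--parallel` constraint is at argument-parsing time, before any semaphore exists. This is a sketch using the standard `argparse` module; `positive_int` and `build_parser` are hypothetical names, and the default of 1 is an assumption rather than the runner's actual default.

```python
import argparse


def positive_int(value):
    """argparse type that accepts only integers >= 1."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError(f"must be >= 1, got {number}")
    return number


def build_parser():
    parser = argparse.ArgumentParser()
    # Default of 1 is an assumption for this sketch.
    parser.add_argument("--parallel", type=positive_int, default=1)
    return parser
```

With this in place, `--parallel 0` is rejected with a usage error instead of producing a semaphore that no task can ever acquire.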

<file name="frameworks/claude_code_harness_bu_cli/run_task.py">

<violation number="1" location="frameworks/claude_code_harness_bu_cli/run_task.py:273">
P2: Using `browser-use close --all` can kill sessions from other concurrent runs on the same host, creating cross-task interference.</violation>

<violation number="2" location="frameworks/claude_code_harness_bu_cli/run_task.py:323">
P1: If Claude process startup fails, the pre-provisioned cloud browser session is leaked because cleanup is only inside a later try/finally block.</violation>
</file>


Add public framework reverification runner

3824beb

cubic-dev-ai Bot reviewed May 12, 2026


Address framework verifier review comments

e3a3ebf

Alezander9 wants to merge 2 commits into main from codex/bu-bench-framework-reverification
