Add public BU_Bench_V1 framework reverification runner#14

Open
Alezander9 wants to merge 2 commits into main from codex/bu-bench-framework-reverification

Alezander9 (Member) commented May 12, 2026

Summary

  • add a local framework reverification runner for BU_Bench_V1 using the encrypted public task artifact
  • add public-safe framework adapters, local result and trace writing, and no-op Laminar compatibility shims
  • document the short framework eval entry point and guard against committing decrypted task JSON or traces

Safety notes

  • does not add plaintext tasks or internal result files
  • run_data remains local-only and gitignored because it can contain decrypted tasks, ground truth, model outputs, and screenshots

Validation

  • uv run python -m compileall run_framework_eval.py frameworks browsers laminar.py lmnr.py models.py
  • uv run python run_framework_eval.py --list-frameworks
  • uv run python -c "from frameworks import FRAMEWORKS, load_tasks; assert len(load_tasks('BU_Bench_V1')) == 100; assert 'bcode-v012' in FRAMEWORKS; print('ok')"
  • uv run python run_framework_eval.py --framework browser-use --model bu-2-0 --tasks 0
  • uv run python run_framework_eval.py --framework bcode-v012 --model gpt-5 --tasks 0

cubic-dev-ai (Bot) left a comment

15 issues found across 52 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="frameworks/browserbase_agent/package.json">

<violation number="1" location="frameworks/browserbase_agent/package.json:7">
P2: The Stagehand dependency is declared as a floating range (`^3.4.0`) even though this runner is documented as pinned to 3.4.0. Use an exact version to keep eval behavior reproducible.</violation>
</file>
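
A pinning rule like this one can be checked mechanically. The sketch below is only an illustration, not code from this PR: `floating_dependencies` and the package name in the example manifest are hypothetical, and the regex accepts plain `MAJOR.MINOR.PATCH` versions with optional pre-release or build suffixes.

```python
import json
import re

# An exact version looks like "3.4.0"; specs with ^, ~, >, <, *, or ranges are floating.
EXACT_VERSION = re.compile(
    r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$"
)


def floating_dependencies(package_json_text: str) -> dict[str, str]:
    """Return the dependencies whose version spec is not an exact version."""
    manifest = json.loads(package_json_text)
    floating = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if not EXACT_VERSION.match(spec):
                floating[name] = spec
    return floating


# Hypothetical manifest; the package name is an assumption for illustration.
example_manifest = '{"dependencies": {"example-stagehand-pkg": "^3.4.0"}}'
print(floating_dependencies(example_manifest))
```

Running a check like this in CI would flag `^3.4.0` while letting `3.4.0` pass.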

<file name="frameworks/claude_code_harness_js/system_prompt.md">

<violation number="1" location="frameworks/claude_code_harness_js/system_prompt.md:16">
P2: The prompt has contradictory file-write rules: it requires saving screenshots to `/tmp/shots` but also forbids edits outside the working directory.</violation>
</file>

<file name="frameworks/browser_use/run_task.py">

<violation number="1" location="frameworks/browser_use/run_task.py:58">
P2: Create the agent inside the `try` so provider/browser cleanup still runs if agent initialization fails.</violation>
</file>

<file name="frameworks/pibt/run_task.py">

<violation number="1" location="frameworks/pibt/run_task.py:275">
P2: Cloud browser session leaks if code between `_start_browser()` and the `try` block raises. If `SYSTEM_PROMPT_FILE.read_text()` or `create_subprocess_exec` fails, `_stop_browser` is never reached because it's only in the inner `finally`.</violation>
</file>
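
Several of the lifecycle issues in this review share one shape: the browser is started, then something raises before the protected block is entered. A minimal sketch of the usual fix follows, assuming the start, stop, and agent steps can be passed in as coroutines. `run_task`, `start_browser`, `stop_browser`, and `launch_agent` are hypothetical names, not the signatures used in the adapters.

```python
import asyncio


async def run_task(start_browser, stop_browser, launch_agent):
    """Run one task and always release the browser once it has been started."""
    browser = await start_browser()
    try:
        # Everything that can raise after provisioning belongs inside the try:
        # prompt loading, subprocess startup, and the agent run itself.
        return await launch_agent(browser)
    finally:
        await stop_browser(browser)


async def _demo():
    stopped = []

    async def start_browser():
        return "session-1"

    async def stop_browser(browser):
        stopped.append(browser)

    async def launch_agent(browser):
        raise RuntimeError("subprocess failed to start")

    try:
        await run_task(start_browser, stop_browser, launch_agent)
    except RuntimeError:
        pass
    return stopped


stopped_sessions = asyncio.run(_demo())
print(stopped_sessions)
```

The `try` opens on the line immediately after the browser is provisioned, so no statement sits between provisioning and protection.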

<file name="frameworks/stagehand/executor.js">

<violation number="1" location="frameworks/stagehand/executor.js:49">
P1: Do not emit a successful placeholder result for an unimplemented executor; fail fast so the runner can mark Stagehand as unsupported/not ready.</violation>
</file>

<file name="frameworks/claude_code_harness_ab/run_task.py">

<violation number="1" location="frameworks/claude_code_harness_ab/run_task.py:295">
P2: Browser sessions can leak when subprocess startup fails because `_start_browser()` runs before the protected cleanup block.</violation>
</file>

<file name="frameworks/__init__.py">

<violation number="1" location="frameworks/__init__.py:309">
P2: Avoid catching `BaseException` here; it swallows interrupts/exit signals and can make the runner ignore Ctrl+C or shutdown events.</violation>
</file>
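
To illustrate the difference: `KeyboardInterrupt` and `SystemExit` derive from `BaseException` but not from `Exception`, and since Python 3.8 the same holds for `asyncio.CancelledError`. The sketch below is a hypothetical wrapper (`run_guarded` is not a name from this PR) that records ordinary failures while letting interrupts propagate.

```python
def run_guarded(task):
    """Run task() and turn ordinary failures into an error record."""
    try:
        return {"ok": True, "result": task()}
    except Exception as exc:
        # KeyboardInterrupt, SystemExit, and asyncio.CancelledError are not
        # subclasses of Exception, so they pass straight through this handler.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


def failing_task():
    raise ValueError("bad input")


def interrupted_task():
    raise KeyboardInterrupt


print(run_guarded(failing_task))

try:
    run_guarded(interrupted_task)
    interrupt_propagated = False
except KeyboardInterrupt:
    interrupt_propagated = True
```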

<file name="frameworks/bcode/run_task.py">

<violation number="1" location="frameworks/bcode/run_task.py:151">
P2: Handle failures after browser creation in `_start_browser` to avoid leaking cloud browser sessions.</violation>
</file>
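
When the leak happens inside the start helper itself, the caller never receives a handle it could clean up, so the helper has to do it. One possible shape, assuming creation and connection are separate steps: `create_session`, `connect`, and `stop_session` are hypothetical stand-ins, not the functions in `frameworks/bcode/run_task.py`.

```python
import asyncio


async def start_browser(create_session, connect, stop_session):
    """Provision a session and connect to it, stopping the session if connecting fails."""
    session = await create_session()
    try:
        connection = await connect(session)
    except BaseException:
        # Catching BaseException is acceptable here only because the error is
        # re-raised immediately; cancellation and Ctrl+C also stop the session.
        await stop_session(session)
        raise
    return session, connection


async def _demo():
    stopped = []

    async def create_session():
        return "session-1"

    async def connect(session):
        raise ConnectionError("could not connect to the new session")

    async def stop_session(session):
        stopped.append(session)

    try:
        await start_browser(create_session, connect, stop_session)
    except ConnectionError:
        pass
    return stopped


cleaned_up = asyncio.run(_demo())
print(cleaned_up)
```

On success the session is handed to the caller, which then owns cleanup. On failure the helper releases it before the error leaves the function.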

<file name="frameworks/codex_harness/run_task.py">

<violation number="1" location="frameworks/codex_harness/run_task.py:264">
P2: Start browser lifecycle management inside an outer try/finally. Right now failures before the inner try block can leak the provisioned browser daemon.</violation>
</file>

<file name="frameworks/claude_cua/run_task.py">

<violation number="1" location="frameworks/claude_cua/run_task.py:68">
P2: The unimplemented adapter returns placeholder results instead of failing fast, which can silently produce misleading benchmark/judgement outputs.</violation>
</file>
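
A fail-fast adapter could look roughly like the sketch below. It assumes the runner is free to treat `NotImplementedError` as "unsupported". The function names and the result dictionary are hypothetical and do not reflect the actual runner's schema.

```python
def run_task(task):
    """Unimplemented adapter: refuse to produce a result instead of faking one."""
    raise NotImplementedError(
        "claude_cua adapter is not implemented; refusing to emit a placeholder result"
    )


def run_or_mark_unsupported(adapter, task):
    """Call an adapter and record unimplemented ones explicitly."""
    try:
        return {"status": "completed", "result": adapter(task)}
    except NotImplementedError as exc:
        # The task is never scored, so no placeholder reaches the judge.
        return {"status": "unsupported", "reason": str(exc)}


outcome = run_or_mark_unsupported(run_task, {"task_id": 0})
print(outcome["status"])
```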

<file name="frameworks/but_rust/run_task.py">

<violation number="1" location="frameworks/but_rust/run_task.py:192">
P1: Browser lifecycle is not protected by a global `try/finally`, so exceptions can leak started browser instances.</violation>

<violation number="2" location="frameworks/but_rust/run_task.py:329">
P2: Fallback final-text parsing ignores `payload.content`, which can drop valid assistant output for `assistant.message` events.</violation>
</file>
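
For the fallback parsing issue, a sketch of the intended behavior might read as follows. The event layout is an assumption: only the `assistant.message` event type and `payload.content` come from the review comment, while the dict structure and the `text` field are hypothetical.

```python
def final_text_from_events(events):
    """Return the text of the last non-empty assistant.message event, or None."""
    for event in reversed(events):
        if event.get("type") != "assistant.message":
            continue
        payload = event.get("payload") or {}
        # Assumed layout: prefer a "text" field, then fall back to "content".
        text = payload.get("text") or payload.get("content")
        if isinstance(text, str) and text.strip():
            return text
    return None


events = [
    {"type": "tool.call", "payload": {"name": "click"}},
    {"type": "assistant.message", "payload": {"content": "The price is 42 USD."}},
]
print(final_text_from_events(events))
```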

<file name="run_framework_eval.py">

<violation number="1" location="run_framework_eval.py:193">
P2: Validate `--parallel` as >= 1 before creating the semaphore; `0` causes a deadlock where no task can start.</violation>
</file>
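
One way to reject bad values before any semaphore exists is an `argparse` type function. This is a sketch, not the parser in `run_framework_eval.py`: `positive_int` is a hypothetical helper and the default of `1` is an assumption.

```python
import argparse


def positive_int(value: str) -> int:
    """Parse an integer argument and require it to be at least 1."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected an integer, got {value!r}") from None
    if number < 1:
        raise argparse.ArgumentTypeError(f"must be >= 1, got {number}")
    return number


parser = argparse.ArgumentParser()
# The default of 1 is an assumption for this sketch.
parser.add_argument("--parallel", type=positive_int, default=1)

args = parser.parse_args(["--parallel", "4"])
print(args.parallel)
```

With this in place, `--parallel 0` makes `argparse` exit with a usage error instead of producing a semaphore that never admits a task.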

<file name="frameworks/claude_code_harness_bu_cli/run_task.py">

<violation number="1" location="frameworks/claude_code_harness_bu_cli/run_task.py:273">
P2: Using `browser-use close --all` can kill sessions from other concurrent runs on the same host, creating cross-task interference.</violation>

<violation number="2" location="frameworks/claude_code_harness_bu_cli/run_task.py:323">
P1: If Claude process startup fails, the pre-provisioned cloud browser session is leaked because cleanup is only inside a later try/finally block.</violation>
</file>

Comment thread frameworks/stagehand/executor.js Outdated
Comment thread frameworks/but_rust/run_task.py
Comment thread frameworks/claude_code_harness_bu_cli/run_task.py Outdated
Comment thread frameworks/browserbase_agent/package.json Outdated
Comment thread frameworks/claude_code_harness_js/system_prompt.md Outdated
Comment thread frameworks/codex_harness/run_task.py
Comment thread frameworks/claude_cua/run_task.py Outdated
Comment thread frameworks/but_rust/run_task.py Outdated
Comment thread run_framework_eval.py
Comment thread frameworks/claude_code_harness_bu_cli/run_task.py Outdated