Add public BU_Bench_V1 framework reverification runner #14
Open
Alezander9 wants to merge 2 commits into
15 issues found across 52 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="frameworks/browserbase_agent/package.json">
<violation number="1" location="frameworks/browserbase_agent/package.json:7">
P2: The Stagehand dependency is declared as a floating range (`^3.4.0`) even though this runner is documented as pinned to 3.4.0. Use an exact version to keep eval behavior reproducible.</violation>
</file>
<file name="frameworks/claude_code_harness_js/system_prompt.md">
<violation number="1" location="frameworks/claude_code_harness_js/system_prompt.md:16">
P2: The prompt has contradictory file-write rules: it requires saving screenshots to `/tmp/shots` but also forbids edits outside the working directory.</violation>
</file>
<file name="frameworks/browser_use/run_task.py">
<violation number="1" location="frameworks/browser_use/run_task.py:58">
P2: Create the agent inside the `try` so provider/browser cleanup still runs if agent initialization fails.</violation>
</file>
<file name="frameworks/pibt/run_task.py">
<violation number="1" location="frameworks/pibt/run_task.py:275">
P2: Cloud browser session leaks if code between `_start_browser()` and the `try` block raises. If `SYSTEM_PROMPT_FILE.read_text()` or `create_subprocess_exec` fails, `_stop_browser` is never reached because it's only in the inner `finally`.</violation>
</file>
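A minimal sketch of the lifecycle shape this violation asks for, and which the other browser-leak violations in this review share. `start_browser`, `stop_browser`, and `run_agent` are hypothetical stand-ins, since the real `_start_browser()` and `_stop_browser` signatures are not shown here. The only point is that the `try` opens immediately after the session is provisioned, so a later failure (reading the system prompt, spawning the subprocess) still reaches the `finally`.

```python
import asyncio

# Hypothetical stand-ins: the real helpers in run_task.py are not shown in this review.
stopped = []


async def start_browser() -> str:
    """Pretend to provision a cloud browser session and return its id."""
    return "session-1"


async def stop_browser(session_id: str) -> None:
    """Pretend to release the session; records the id so cleanup is observable."""
    stopped.append(session_id)


async def run_agent(session_id: str, fail_during_setup: bool) -> str:
    if fail_during_setup:
        # Simulates a failure such as read_text() or create_subprocess_exec raising.
        raise RuntimeError("setup failed")
    return "done"


async def run_task(fail_during_setup: bool = False) -> str:
    session_id = await start_browser()
    try:
        # Everything that happens after provisioning lives inside the try.
        return await run_agent(session_id, fail_during_setup)
    finally:
        # Runs on success, on failure, and on cancellation.
        await stop_browser(session_id)
```

If provisioning itself can fail partway through, the same idea applies one level down: whatever was created before the failure needs its own cleanup path.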
<file name="frameworks/stagehand/executor.js">
<violation number="1" location="frameworks/stagehand/executor.js:49">
P1: Do not emit a successful placeholder result for an unimplemented executor; fail fast so the runner can mark Stagehand as unsupported/not ready.</violation>
</file>
<file name="frameworks/claude_code_harness_ab/run_task.py">
<violation number="1" location="frameworks/claude_code_harness_ab/run_task.py:295">
P2: Browser sessions can leak when subprocess startup fails because `_start_browser()` runs before the protected cleanup block.</violation>
</file>
<file name="frameworks/__init__.py">
<violation number="1" location="frameworks/__init__.py:309">
P2: Avoid catching `BaseException` here; it swallows interrupts/exit signals and can make the runner ignore Ctrl+C or shutdown events.</violation>
</file>
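As an illustration of the difference, here is a small sketch with a hypothetical wrapper `run_safely` (not a function from this PR). Catching `Exception` records ordinary failures while letting `KeyboardInterrupt` and `SystemExit` propagate, because neither is a subclass of `Exception`. Since Python 3.8 the same is true of `asyncio.CancelledError`.

```python
def run_safely(fn):
    """Run fn and convert ordinary errors into an error record.

    Hypothetical helper for illustration; the real code at
    frameworks/__init__.py:309 is not shown in this review.
    """
    try:
        return {"ok": True, "value": fn()}
    except Exception as exc:
        # KeyboardInterrupt and SystemExit derive from BaseException only,
        # so they are not caught here and still stop the runner.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

If the original handler relied on `BaseException` to record cancelled tasks, that case would need an explicit `except asyncio.CancelledError` branch that re-raises after recording.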
<file name="frameworks/bcode/run_task.py">
<violation number="1" location="frameworks/bcode/run_task.py:151">
P2: Handle failures after browser creation in `_start_browser` to avoid leaking cloud browser sessions.</violation>
</file>
<file name="frameworks/codex_harness/run_task.py">
<violation number="1" location="frameworks/codex_harness/run_task.py:264">
P2: Start browser lifecycle management inside an outer try/finally. Right now failures before the inner try block can leak the provisioned browser daemon.</violation>
</file>
<file name="frameworks/claude_cua/run_task.py">
<violation number="1" location="frameworks/claude_cua/run_task.py:68">
P2: The unimplemented adapter returns placeholder results instead of failing fast, which can silently produce misleading benchmark/judgement outputs.</violation>
</file>
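One way to fail fast, sketched under assumptions: `AdapterNotImplementedError` and the `run_task(task)` signature are hypothetical names chosen for illustration, not taken from the PR. The idea is that an unimplemented adapter raises instead of returning a result-shaped placeholder, so nothing downstream can mistake it for a real run.

```python
class AdapterNotImplementedError(NotImplementedError):
    """Hypothetical error type marking a framework adapter as not ready."""


def run_task(task: dict) -> dict:
    # Raise instead of returning a placeholder result, so the runner
    # (or the judge) never scores output that no agent produced.
    raise AdapterNotImplementedError(
        "claude_cua adapter is not implemented; refusing to emit a placeholder result"
    )
```

Subclassing `NotImplementedError` lets the runner catch this case specifically and report the framework as unsupported rather than as a failed task.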
<file name="frameworks/but_rust/run_task.py">
<violation number="1" location="frameworks/but_rust/run_task.py:192">
P1: Browser lifecycle is not protected by a global `try/finally`, so exceptions can leak started browser instances.</violation>
<violation number="2" location="frameworks/but_rust/run_task.py:329">
P2: Fallback final-text parsing ignores `payload.content`, which can drop valid assistant output for `assistant.message` events.</violation>
</file>
<file name="run_framework_eval.py">
<violation number="1" location="run_framework_eval.py:193">
P2: Validate `--parallel` as >= 1 before creating the semaphore; `0` causes a deadlock where no task can start.</violation>
</file>
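A sketch of the validation using an `argparse` type callable. `positive_int` and `build_parser` are hypothetical names, and the default of `1` is an assumption, since the real parser in `run_framework_eval.py` is not shown here. Rejecting the value at parse time means `asyncio.Semaphore(0)`, whose `acquire()` would block forever, is never constructed.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type callable that accepts only integers >= 1."""
    try:
        parsed = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected an integer, got {value!r}")
    if parsed < 1:
        raise argparse.ArgumentTypeError(f"must be >= 1, got {parsed}")
    return parsed


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # The default of 1 is an assumption made for this sketch.
    parser.add_argument("--parallel", type=positive_int, default=1)
    return parser
```

With this in place, `--parallel 0` exits with a usage error before any task is scheduled.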
<file name="frameworks/claude_code_harness_bu_cli/run_task.py">
<violation number="1" location="frameworks/claude_code_harness_bu_cli/run_task.py:273">
P2: Using `browser-use close --all` can kill sessions from other concurrent runs on the same host, creating cross-task interference.</violation>
<violation number="2" location="frameworks/claude_code_harness_bu_cli/run_task.py:323">
P1: If Claude process startup fails, the pre-provisioned cloud browser session is leaked because cleanup is only inside a later try/finally block.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.