feat(ci): add visual bug-fix agent with live-app access#2146
feat(ci): add visual bug-fix agent with live-app access#2146
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new, label-gated “Visual Bug Fix Agent” workflow that runs in parallel with the existing bug-fix agent, booting the project devcontainer to give the agent live-app (noVNC/X11) access for visual repro and debugging.
Changes:
- Add
visual-bug-fix-on-label.ymlworkflow to trigger onauto-fix-visual(andBug) labels. - Add
_visual-bug-fix-agent.ymlreusable workflow that mirrors the production bug-fix agent while also building/starting the devcontainer, locating the container, launching the app, and gating readiness before running the agent phases.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| .github/workflows/visual-bug-fix-on-label.yml | New label-trigger workflow that dispatches the reusable visual agent workflow. |
| .github/workflows/_visual-bug-fix-agent.yml | New reusable agent workflow that provisions a devcontainer-backed live app and extends the agent tool allowlist for container interaction. |
| Useful commands: | ||
| # Capture screen to a runner-side path (NOT a container path), then Read it: | ||
| scripts/devcontainer-screenshot.sh ./shot.png | ||
| # Extract any other tmpfs file (logs, generated artifacts): | ||
| scripts/devcontainer-steal.sh "$DEVCONTAINER_ID" /tmp/foo ./foo | ||
| # Inspect the X11 window tree — the main app window has WM_CLASS=ToolHive: | ||
| docker exec -u node "$DEVCONTAINER_ID" bash -c 'DISPLAY=:99 xwininfo -root -tree' | ||
| # Drive input — keyboard is more reliable than mouse-coordinate clicks: | ||
| docker exec -u node "$DEVCONTAINER_ID" bash -c 'DISPLAY=:99 xdotool key Tab' | ||
| docker exec -u node "$DEVCONTAINER_ID" bash -c 'DISPLAY=:99 xdotool key Return' | ||
| # Mouse coordinates are fragile (modals shift the layout); use sparingly: | ||
| docker exec -u node "$DEVCONTAINER_ID" bash -c 'DISPLAY=:99 xdotool mousemove X Y click 1' | ||
|
|
||
| WHEN TO USE THE LIVE APP: | ||
| Default to a unit-test reproduction. Reach for the live app only when | ||
| the bug genuinely requires it: cross-route side effects, async backend | ||
| lifecycle (`thv` workload events), real IPC. Pixel-driving is fragile | ||
| and slow — even when you use the live app for repro, your eventual | ||
| FAILING TEST should be a unit test that captures the underlying | ||
| invariant (e.g. mock the IPC event and assert the router behavior). | ||
| The unit test is what gates Phase 2. | ||
|
|
||
| Your task (Phase 1 — Analysis & Test): | ||
| 1. Analyze the bug report: understand description, steps to reproduce, expected vs actual behavior. | ||
| 2. Search the codebase to find the relevant source code. | ||
| 3. Reproduce the bug. Use the live app if a unit test alone is not enough. | ||
| 4. Write a unit test that captures the bug — it MUST FAIL when you run it. | ||
| 5. Run the test with: pnpm run test:nonInteractive -- <test-file-path> | ||
| 6. Verify the test fails FOR THE RIGHT REASON (not import errors or unrelated failures). | ||
| 7. If the test passes (bug not reproduced), try a different approach (max 3 attempts). | ||
|
|
||
| Output: | ||
| - The test file in the correct __tests__/ directory. | ||
| - bug-analysis.md with your findings (follow the format in the skill). | ||
| Include a "Visual repro:" section if you used the live app. | ||
|
|
||
| Do NOT modify any source files. Only create/edit test files and bug-analysis.md. | ||
| claude_args: >- | ||
| --model opus | ||
| --max-turns 75 | ||
| --allowedTools "Read,Grep,Glob,Edit,Write,Bash(pnpm run test:nonInteractive *),Bash(cat *),Bash(ls *),Bash(docker exec *),Bash(docker cp *),Bash(docker ps *),Bash(scripts/devcontainer-screenshot.sh *),Bash(scripts/devcontainer-steal.sh *)" |
There was a problem hiding this comment.
This workflow references helper scripts (scripts/devcontainer-screenshot.sh and scripts/devcontainer-steal.sh) that are not present in the repository as of this PR, so the agent will fail as soon as it tries to use the documented commands / allowedTools. Either include those scripts in this PR, or add an early preflight step that checks for them and fails fast with a clear message (or skips) when they’re missing (e.g., when this PR is merged without its dependency).
| if docker exec -u node "$C" bash -c ' | ||
| curl -fsS http://localhost:6080/ >/dev/null 2>&1 \ | ||
| && pgrep -f "[e]lectron/dist/electron" >/dev/null \ | ||
| && pgrep -f "[t]hv serve" >/dev/null \ | ||
| && DISPLAY=:99 xdotool search --class ToolHive >/dev/null 2>&1 | ||
| '; then |
There was a problem hiding this comment.
The readiness probe curls http://localhost:6080/. In this repo’s devcontainer startup script, readiness is checked against /vnc.html (noVNC) rather than /, so probing / risks a false-negative if the web root doesn’t return 200. Align this check with the existing readiness convention (e.g., curl the same path used by scripts/devcontainer-dev.sh).
| timeout-minutes: 10 | ||
| run: | | ||
| C=${{ steps.container.outputs.id }} | ||
| for i in $(seq 1 120); do | ||
| # Process gates AND a window-mapped check via xdotool. Process | ||
| # presence is not the same as "UI rendered" — see PR #2120. | ||
| if docker exec -u node "$C" bash -c ' | ||
| curl -fsS http://localhost:6080/ >/dev/null 2>&1 \ | ||
| && pgrep -f "[e]lectron/dist/electron" >/dev/null \ | ||
| && pgrep -f "[t]hv serve" >/dev/null \ | ||
| && DISPLAY=:99 xdotool search --class ToolHive >/dev/null 2>&1 | ||
| '; then | ||
| sleep 2 | ||
| echo "::notice::App ready in ${i} attempts" | ||
| exit 0 | ||
| fi | ||
| sleep 5 | ||
| done | ||
| echo "::error::App did not become ready within 10 minutes" | ||
| docker exec "$C" tail -200 /tmp/entrypoint.log || true | ||
| exit 1 |
There was a problem hiding this comment.
The loop here can take exactly 10 minutes of sleep time alone (120 × 5s = 600s), and the step also has timeout-minutes: 10. Any command overhead can cause the step to hit the GitHub timeout before reaching your explicit failure path (printing logs + exiting 1), leading to harder-to-diagnose flakiness. Consider increasing the step timeout (e.g., 11–12 minutes) or reducing the iteration count to leave headroom.
| run: | | ||
| VOLUME=toolhive-node-modules-toolhive-studio | ||
| rm -rf nm-cache && mkdir -p nm-cache | ||
| docker run --rm \ | ||
| -v "$VOLUME:/src:ro" \ | ||
| -v "$PWD/nm-cache:/dst" \ | ||
| alpine:3 sh -c 'cp -a /src/. /dst/' || true |
There was a problem hiding this comment.
Dump devcontainer node_modules for cache runs under always() on cache misses, even if earlier devcontainer setup failed. In that case the named volume may be empty/nonexistent, and this step will still create/mount it and potentially save an empty cache entry for this key, making subsequent runs think they have a warm cache when they don’t. Gate this dump on successful devcontainer start (or verify expected content exists in the volume) before populating nm-cache/ for actions/cache to save.
| run: | | |
| VOLUME=toolhive-node-modules-toolhive-studio | |
| rm -rf nm-cache && mkdir -p nm-cache | |
| docker run --rm \ | |
| -v "$VOLUME:/src:ro" \ | |
| -v "$PWD/nm-cache:/dst" \ | |
| alpine:3 sh -c 'cp -a /src/. /dst/' || true | |
| && steps.container.outcome == 'success' | |
| run: | | |
| VOLUME=toolhive-node-modules-toolhive-studio | |
| rm -rf nm-cache | |
| if docker run --rm \ | |
| -v "$VOLUME:/src:ro" \ | |
| alpine:3 sh -c 'find /src -mindepth 1 -print -quit | grep -q .' | |
| then | |
| mkdir -p nm-cache | |
| docker run --rm \ | |
| -v "$VOLUME:/src:ro" \ | |
| -v "$PWD/nm-cache:/dst" \ | |
| alpine:3 sh -c 'cp -a /src/. /dst/' | |
| else | |
| echo "::notice::Skipping node_modules cache dump because the devcontainer volume is empty or unavailable." | |
| fi |
Summary
Adds a parallel "Visual Bug Fix Agent" — a label-gated copy of the production bug-fix agent (
_bug-fix-agent.yml) with one structural addition: the project's devcontainer is booted alongside the agent and exposed viadocker exec, so the agent can drive the live app (xdotool, screenshots, in-container test runs) for bugs that resist a pure unit-test reproduction.Production behavior is unchanged. This adds two new files; nothing existing is modified.
Depends on PR #2120
The agent's prompt references
scripts/devcontainer-screenshot.shandscripts/devcontainer-steal.sh, plus the correcteddevcontainer-devskill. All three land in #2120. #2120 should merge first; otherwise this agent's first run would fail when invoking missing scripts.What changes
_visual-bug-fix-agent.yml_bug-fix-agent.ymlnode_modulescache + populate,devcontainers/cibuild/start, find container, launch entrypoint, readiness gate (process +xdotool search --class ToolHive+ 2s settle); +DEVCONTAINER_IDenv on each phase; + live-app paragraph in Phase 1 / 2 / 2b prompts; +--allowedToolsextended withdocker exec/cp/psand the two helper scripts; + nm-cache dump step; + diagnostics-on-failure step; concurrencyvisual-bug-fix-*; branch suffix-visual; PR labelauto-fix-visual; failure-comment markerVisual Bug Fix Agent;timeout-minutes45→60; Phase 1--max-turns50→75visual-bug-fix-on-label.ymlbug-fix-on-label.ymlauto-fix-visual(still requiresBuglabel)The structural shape, gating, phases, hard gates, branch/PR creation, and failure handling are byte-equivalent to the production agent modulo the renames required for parallel runs.
Cost expectations
Phase 1 (5–9 min agent thinking) dominates regardless of substrate. The +4 min is the devcontainer build + boot. Once #2136 lands, that drops to ~30s.
Activation
The trigger is the
auto-fix-visuallabel on aBug-labeled issue. To test:auto-fix-visuallabel to a candidate bug. The natural first candidate is [Bug] When the server deletion finishes it focuses the MCP servers tab #663 (server deletion focuses MCP Servers tab), chosen earlier in the experiment for its async-IPC-driven cross-route behavior.Both labels (
auto-fixandauto-fix-visual) can be applied to the same issue if you want a head-to-head comparison.What this PR does NOT prove
That live-app access actually improves the agent's success rate. The substrate is proven (in #2120). The agent integration is mechanical. Whether the +22% (or +5% post-#2136) wall-time tax is worth it depends entirely on how Phase 1 performs on bugs that resist unit-test repro — and that can only be measured by labeling real issues and observing outcomes.
Risk surface
git rmof two files.claude-code-action's workflow validation means the FIRST real run happens after merge — there is no pre-merge dry run available.