feat(ci): add visual bug-fix agent with live-app access by kantord · Pull Request #2146 · stacklok/toolhive-studio

kantord · 2026-04-29T13:49:03Z

Summary

Adds a parallel "Visual Bug Fix Agent" — a label-gated copy of the production bug-fix agent (_bug-fix-agent.yml) with one structural addition: the project's devcontainer is booted alongside the agent and exposed via docker exec, so the agent can drive the live app (xdotool, screenshots, in-container test runs) for bugs that resist a pure unit-test reproduction.

Production behavior is unchanged. This adds two new files; nothing existing is modified.

Depends on PR #2120

The agent's prompt references scripts/devcontainer-screenshot.sh and scripts/devcontainer-steal.sh, plus the corrected devcontainer-dev skill. All three land in #2120. #2120 should merge first; otherwise this agent's first run would fail when invoking missing scripts.

What changes

File	Mirrors	Differences
`_visual-bug-fix-agent.yml`	`_bug-fix-agent.yml`	+ buildx setup, `node_modules` cache + populate, `devcontainers/ci` build/start, find container, launch entrypoint, readiness gate (process + `xdotool search --class ToolHive` + 2s settle); + `DEVCONTAINER_ID` env on each phase; + live-app paragraph in Phase 1 / 2 / 2b prompts; + `--allowedTools` extended with `docker exec/cp/ps` and the two helper scripts; + nm-cache dump step; + diagnostics-on-failure step; concurrency `visual-bug-fix-*`; branch suffix `-visual`; PR label `auto-fix-visual`; failure-comment marker `Visual Bug Fix Agent`; `timeout-minutes` 45→60; Phase 1 `--max-turns` 50→75
`visual-bug-fix-on-label.yml`	`bug-fix-on-label.yml`	Triggers on label `auto-fix-visual` (still requires `Bug` label)

The structural shape, gating, phases, hard gates, branch/PR creation, and failure handling are byte-equivalent to the production agent modulo the renames required for parallel runs.

Cost expectations

Rough estimates, sample of 2 production runs.

	Production agent today	+ visual variant (current)	+ #2136 (prebuilt image)
Per-fix wall time	~18 min	~22 min (+22%)	~19 min (+5%)

Phase 1 (5–9 min agent thinking) dominates regardless of substrate. The +4 min is the devcontainer build + boot. Once #2136 lands, that drops to ~30s.

Activation

The trigger is the auto-fix-visual label on a Bug-labeled issue. To test:

Merge chore: add visual bugfix agent #2120 (helper scripts + skill update).
Merge this PR.
Apply the auto-fix-visual label to a candidate bug. The natural first candidate is [Bug] When the server deletion finishes it focuses the MCP servers tab #663 (server deletion focuses MCP Servers tab), chosen earlier in the experiment for its async-IPC-driven cross-route behavior.

Both labels (auto-fix and auto-fix-visual) can be applied to the same issue if you want a head-to-head comparison.

What this PR does NOT prove

That live-app access actually improves the agent's success rate. The substrate is proven (in #2120). The agent integration is mechanical. Whether the +22% (or +5% post-#2136) wall-time tax is worth it depends entirely on how Phase 1 performs on bugs that resist unit-test repro — and that can only be measured by labeling real issues and observing outcomes.

Risk surface

Production agent is untouched; worst case rollback is git rm of two files.
If the agent misbehaves on a labeled issue, the label is trivially removed and the workflow has a "give-up" guard (final comment marker) that prevents repeat retries.
claude-code-action's workflow validation means the FIRST real run happens after merge — there is no pre-merge dry run available.

Copilot

Pull request overview

Adds a new, label-gated “Visual Bug Fix Agent” workflow that runs in parallel with the existing bug-fix agent, booting the project devcontainer to give the agent live-app (noVNC/X11) access for visual repro and debugging.

Changes:

Add visual-bug-fix-on-label.yml workflow to trigger on auto-fix-visual (and Bug) labels.
Add _visual-bug-fix-agent.yml reusable workflow that mirrors the production bug-fix agent while also building/starting the devcontainer, locating the container, launching the app, and gating readiness before running the agent phases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
.github/workflows/visual-bug-fix-on-label.yml	New label-trigger workflow that dispatches the reusable visual agent workflow.
.github/workflows/_visual-bug-fix-agent.yml	New reusable agent workflow that provisions a devcontainer-backed live app and extends the agent tool allowlist for container interaction.

Copilot · 2026-04-30T13:52:53Z

+            Useful commands:
+              # Capture screen to a runner-side path (NOT a container path), then Read it:
+              scripts/devcontainer-screenshot.sh ./shot.png
+              # Extract any other tmpfs file (logs, generated artifacts):
+              scripts/devcontainer-steal.sh "$DEVCONTAINER_ID" /tmp/foo ./foo
+              # Inspect the X11 window tree — the main app window has WM_CLASS=ToolHive:
+              docker exec -u node "$DEVCONTAINER_ID" bash -c 'DISPLAY=:99 xwininfo -root -tree'
+              # Drive input — keyboard is more reliable than mouse-coordinate clicks:
+              docker exec -u node "$DEVCONTAINER_ID" bash -c 'DISPLAY=:99 xdotool key Tab'
+              docker exec -u node "$DEVCONTAINER_ID" bash -c 'DISPLAY=:99 xdotool key Return'
+              # Mouse coordinates are fragile (modals shift the layout); use sparingly:
+              docker exec -u node "$DEVCONTAINER_ID" bash -c 'DISPLAY=:99 xdotool mousemove X Y click 1'
+
+            WHEN TO USE THE LIVE APP:
+            Default to a unit-test reproduction. Reach for the live app only when
+            the bug genuinely requires it: cross-route side effects, async backend
+            lifecycle (`thv` workload events), real IPC. Pixel-driving is fragile
+            and slow — even when you use the live app for repro, your eventual
+            FAILING TEST should be a unit test that captures the underlying
+            invariant (e.g. mock the IPC event and assert the router behavior).
+            The unit test is what gates Phase 2.
+
+            Your task (Phase 1 — Analysis & Test):
+            1. Analyze the bug report: understand description, steps to reproduce, expected vs actual behavior.
+            2. Search the codebase to find the relevant source code.
+            3. Reproduce the bug. Use the live app if a unit test alone is not enough.
+            4. Write a unit test that captures the bug — it MUST FAIL when you run it.
+            5. Run the test with: pnpm run test:nonInteractive -- <test-file-path>
+            6. Verify the test fails FOR THE RIGHT REASON (not import errors or unrelated failures).
+            7. If the test passes (bug not reproduced), try a different approach (max 3 attempts).
+
+            Output:
+            - The test file in the correct __tests__/ directory.
+            - bug-analysis.md with your findings (follow the format in the skill).
+              Include a "Visual repro:" section if you used the live app.
+
+            Do NOT modify any source files. Only create/edit test files and bug-analysis.md.
+          claude_args: >-
+            --model opus
+            --max-turns 75
+            --allowedTools "Read,Grep,Glob,Edit,Write,Bash(pnpm run test:nonInteractive *),Bash(cat *),Bash(ls *),Bash(docker exec *),Bash(docker cp *),Bash(docker ps *),Bash(scripts/devcontainer-screenshot.sh *),Bash(scripts/devcontainer-steal.sh *)"


This workflow references helper scripts (scripts/devcontainer-screenshot.sh and scripts/devcontainer-steal.sh) that are not present in the repository as of this PR, so the agent will fail as soon as it tries to use the documented commands / allowedTools. Either include those scripts in this PR, or add an early preflight step that checks for them and fails fast with a clear message (or skips) when they’re missing (e.g., when this PR is merged without its dependency).

Copilot · 2026-04-30T13:52:53Z

+            if docker exec -u node "$C" bash -c '
+                curl -fsS http://localhost:6080/ >/dev/null 2>&1 \
+                && pgrep -f "[e]lectron/dist/electron" >/dev/null \
+                && pgrep -f "[t]hv serve" >/dev/null \
+                && DISPLAY=:99 xdotool search --class ToolHive >/dev/null 2>&1
+              '; then


The readiness probe curls http://localhost:6080/. In this repo’s devcontainer startup script, readiness is checked against /vnc.html (noVNC) rather than /, so probing / risks a false-negative if the web root doesn’t return 200. Align this check with the existing readiness convention (e.g., curl the same path used by scripts/devcontainer-dev.sh).

Copilot · 2026-04-30T13:52:53Z

+        timeout-minutes: 10
+        run: |
+          C=${{ steps.container.outputs.id }}
+          for i in $(seq 1 120); do
+            # Process gates AND a window-mapped check via xdotool. Process
+            # presence is not the same as "UI rendered" — see PR #2120.
+            if docker exec -u node "$C" bash -c '
+                curl -fsS http://localhost:6080/ >/dev/null 2>&1 \
+                && pgrep -f "[e]lectron/dist/electron" >/dev/null \
+                && pgrep -f "[t]hv serve" >/dev/null \
+                && DISPLAY=:99 xdotool search --class ToolHive >/dev/null 2>&1
+              '; then
+              sleep 2
+              echo "::notice::App ready in ${i} attempts"
+              exit 0
+            fi
+            sleep 5
+          done
+          echo "::error::App did not become ready within 10 minutes"
+          docker exec "$C" tail -200 /tmp/entrypoint.log || true
+          exit 1


The loop here can take exactly 10 minutes of sleep time alone (120 × 5s = 600s), and the step also has timeout-minutes: 10. Any command overhead can cause the step to hit the GitHub timeout before reaching your explicit failure path (printing logs + exiting 1), leading to harder-to-diagnose flakiness. Consider increasing the step timeout (e.g., 11–12 minutes) or reducing the iteration count to leave headroom.

Copilot · 2026-04-30T13:52:54Z

+        run: |
+          VOLUME=toolhive-node-modules-toolhive-studio
+          rm -rf nm-cache && mkdir -p nm-cache
+          docker run --rm \
+            -v "$VOLUME:/src:ro" \
+            -v "$PWD/nm-cache:/dst" \
+            alpine:3 sh -c 'cp -a /src/. /dst/' || true


Dump devcontainer node_modules for cache runs under always() on cache misses, even if earlier devcontainer setup failed. In that case the named volume may be empty/nonexistent, and this step will still create/mount it and potentially save an empty cache entry for this key, making subsequent runs think they have a warm cache when they don’t. Gate this dump on successful devcontainer start (or verify expected content exists in the volume) before populating nm-cache/ for actions/cache to save.

Suggested change

run: |

VOLUME=toolhive-node-modules-toolhive-studio

rm -rf nm-cache && mkdir -p nm-cache

docker run --rm \

-v "$VOLUME:/src:ro" \

-v "$PWD/nm-cache:/dst" \

alpine:3 sh -c 'cp -a /src/. /dst/' || true

&& steps.container.outcome == 'success'

run: |

VOLUME=toolhive-node-modules-toolhive-studio

rm -rf nm-cache

if docker run --rm \

-v "$VOLUME:/src:ro" \

alpine:3 sh -c 'find /src -mindepth 1 -print -quit | grep -q .'

then

mkdir -p nm-cache

docker run --rm \

-v "$VOLUME:/src:ro" \

-v "$PWD/nm-cache:/dst" \

alpine:3 sh -c 'cp -a /src/. /dst/'

else

echo "::notice::Skipping node_modules cache dump because the devcontainer volume is empty or unavailable."

fi

kantord added 2 commits April 28, 2026 19:05

feat(ci): add visual bug-fix agent with live-app access

d11441e

feat(ci): add visual bug-fix agent with live-app access

1944277

github-actions Bot added the size/M label Apr 29, 2026

kantord marked this pull request as ready for review April 30, 2026 13:47

Copilot AI review requested due to automatic review settings April 30, 2026 13:47

Copilot started reviewing on behalf of kantord April 30, 2026 13:48 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

take improvements from cloud ui

03e92a4

github-actions Bot added size/XL and removed size/M labels Apr 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ci): add visual bug-fix agent with live-app access#2146

feat(ci): add visual bug-fix agent with live-app access#2146
kantord wants to merge 3 commits intomainfrom
feat/visual-bug-fix-agent

kantord commented Apr 29, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kantord commented Apr 29, 2026

Summary

Depends on PR #2120

What changes

Cost expectations

Activation

What this PR does NOT prove

Risk surface

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants