
feat(ci): add visual bug-fix agent with live-app access#2146

Open
kantord wants to merge 3 commits into main from feat/visual-bug-fix-agent
Conversation


@kantord kantord commented Apr 29, 2026

Summary

Adds a parallel "Visual Bug Fix Agent" — a label-gated copy of the production bug-fix agent (_bug-fix-agent.yml) with one structural addition: the project's devcontainer is booted alongside the agent and exposed via docker exec, so the agent can drive the live app (xdotool, screenshots, in-container test runs) for bugs that resist a pure unit-test reproduction.

Production behavior is unchanged. This adds two new files; nothing existing is modified.

Depends on PR #2120

The agent's prompt references scripts/devcontainer-screenshot.sh and scripts/devcontainer-steal.sh, plus the corrected devcontainer-dev skill. All three land in #2120, which should merge first; otherwise this agent's first run would fail when invoking the missing scripts.

What changes

| File | Mirrors | Differences |
| --- | --- | --- |
| `_visual-bug-fix-agent.yml` | `_bug-fix-agent.yml` | + buildx setup; + node_modules cache + populate; + devcontainers/ci build/start; find container; launch entrypoint; readiness gate (process + `xdotool search --class ToolHive` + 2s settle); + `DEVCONTAINER_ID` env on each phase; + live-app paragraph in Phase 1 / 2 / 2b prompts; + `--allowedTools` extended with `docker exec`/`cp`/`ps` and the two helper scripts; + nm-cache dump step; + diagnostics-on-failure step; concurrency `visual-bug-fix-*`; branch suffix `-visual`; PR label `auto-fix-visual`; failure-comment marker "Visual Bug Fix Agent"; timeout-minutes 45→60; Phase 1 `--max-turns` 50→75 |
| `visual-bug-fix-on-label.yml` | `bug-fix-on-label.yml` | Triggers on label `auto-fix-visual` (still requires `Bug` label) |

The structural shape, gating, phases, hard gates, branch/PR creation, and failure handling are byte-equivalent to the production agent modulo the renames required for parallel runs.

Cost expectations

Rough estimates, based on a sample of 2 production runs.

| | Production agent today | Visual variant (current) | + #2136 (prebuilt image) |
| --- | --- | --- | --- |
| Per-fix wall time | ~18 min | ~22 min (+22%) | ~19 min (+5%) |

Phase 1 (5–9 min agent thinking) dominates regardless of substrate. The +4 min is the devcontainer build + boot. Once #2136 lands, that drops to ~30s.

Activation

The trigger is the auto-fix-visual label on a Bug-labeled issue. To test:

  1. Merge chore: add visual bugfix agent #2120 (helper scripts + skill update).
  2. Merge this PR.
  3. Apply the auto-fix-visual label to a candidate bug. The natural first candidate is [Bug] When the server deletion finishes it focuses the MCP servers tab #663 (server deletion focuses MCP Servers tab), chosen earlier in the experiment for its async-IPC-driven cross-route behavior.

Both labels (auto-fix and auto-fix-visual) can be applied to the same issue if you want a head-to-head comparison.

What this PR does NOT prove

That live-app access actually improves the agent's success rate. The substrate is proven (in #2120). The agent integration is mechanical. Whether the +22% (or +5% post-#2136) wall-time tax is worth it depends entirely on how Phase 1 performs on bugs that resist unit-test repro — and that can only be measured by labeling real issues and observing outcomes.

Risk surface

  • Production agent is untouched; worst case rollback is git rm of two files.
  • If the agent misbehaves on a labeled issue, the label is trivially removed and the workflow has a "give-up" guard (final comment marker) that prevents repeat retries.
  • claude-code-action's workflow validation means the FIRST real run happens after merge — there is no pre-merge dry run available.

@kantord kantord marked this pull request as ready for review April 30, 2026 13:47
Copilot AI review requested due to automatic review settings April 30, 2026 13:47

Copilot AI left a comment


Pull request overview

Adds a new, label-gated “Visual Bug Fix Agent” workflow that runs in parallel with the existing bug-fix agent, booting the project devcontainer to give the agent live-app (noVNC/X11) access for visual repro and debugging.

Changes:

  • Add visual-bug-fix-on-label.yml workflow to trigger on auto-fix-visual (and Bug) labels.
  • Add _visual-bug-fix-agent.yml reusable workflow that mirrors the production bug-fix agent while also building/starting the devcontainer, locating the container, launching the app, and gating readiness before running the agent phases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `.github/workflows/visual-bug-fix-on-label.yml` | New label-trigger workflow that dispatches the reusable visual agent workflow. |
| `.github/workflows/_visual-bug-fix-agent.yml` | New reusable agent workflow that provisions a devcontainer-backed live app and extends the agent tool allowlist for container interaction. |

Comment on lines +213 to +253
```yaml
Useful commands:
# Capture screen to a runner-side path (NOT a container path), then Read it:
scripts/devcontainer-screenshot.sh ./shot.png
# Extract any other tmpfs file (logs, generated artifacts):
scripts/devcontainer-steal.sh "$DEVCONTAINER_ID" /tmp/foo ./foo
# Inspect the X11 window tree — the main app window has WM_CLASS=ToolHive:
docker exec -u node "$DEVCONTAINER_ID" bash -c 'DISPLAY=:99 xwininfo -root -tree'
# Drive input — keyboard is more reliable than mouse-coordinate clicks:
docker exec -u node "$DEVCONTAINER_ID" bash -c 'DISPLAY=:99 xdotool key Tab'
docker exec -u node "$DEVCONTAINER_ID" bash -c 'DISPLAY=:99 xdotool key Return'
# Mouse coordinates are fragile (modals shift the layout); use sparingly:
docker exec -u node "$DEVCONTAINER_ID" bash -c 'DISPLAY=:99 xdotool mousemove X Y click 1'

WHEN TO USE THE LIVE APP:
Default to a unit-test reproduction. Reach for the live app only when
the bug genuinely requires it: cross-route side effects, async backend
lifecycle (`thv` workload events), real IPC. Pixel-driving is fragile
and slow — even when you use the live app for repro, your eventual
FAILING TEST should be a unit test that captures the underlying
invariant (e.g. mock the IPC event and assert the router behavior).
The unit test is what gates Phase 2.

Your task (Phase 1 — Analysis & Test):
1. Analyze the bug report: understand description, steps to reproduce, expected vs actual behavior.
2. Search the codebase to find the relevant source code.
3. Reproduce the bug. Use the live app if a unit test alone is not enough.
4. Write a unit test that captures the bug — it MUST FAIL when you run it.
5. Run the test with: pnpm run test:nonInteractive -- <test-file-path>
6. Verify the test fails FOR THE RIGHT REASON (not import errors or unrelated failures).
7. If the test passes (bug not reproduced), try a different approach (max 3 attempts).

Output:
- The test file in the correct __tests__/ directory.
- bug-analysis.md with your findings (follow the format in the skill).
  Include a "Visual repro:" section if you used the live app.

Do NOT modify any source files. Only create/edit test files and bug-analysis.md.

claude_args: >-
  --model opus
  --max-turns 75
  --allowedTools "Read,Grep,Glob,Edit,Write,Bash(pnpm run test:nonInteractive *),Bash(cat *),Bash(ls *),Bash(docker exec *),Bash(docker cp *),Bash(docker ps *),Bash(scripts/devcontainer-screenshot.sh *),Bash(scripts/devcontainer-steal.sh *)"
```

Copilot AI Apr 30, 2026


This workflow references helper scripts (scripts/devcontainer-screenshot.sh and scripts/devcontainer-steal.sh) that are not present in the repository as of this PR, so the agent will fail as soon as it tries to use the documented commands / allowedTools. Either include those scripts in this PR, or add an early preflight step that checks for them and fails fast with a clear message (or skips) when they’re missing (e.g., when this PR is merged without its dependency).
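The fail-fast check the review asks for could look like the sketch below. The step shape, paths, and messages are assumptions drawn from the PR description, not code from this PR; the sketch runs against a scratch directory so the behavior is demonstrable without the real repository.

```shell
# Sketch of a preflight step: fail fast when the #2120 helper scripts
# are absent, instead of failing mid-agent-run. Demonstrated against an
# empty scratch directory, where both scripts are missing.
workdir=$(mktemp -d)
cd "$workdir"
status=0
for f in scripts/devcontainer-screenshot.sh scripts/devcontainer-steal.sh; do
  if [ ! -x "$f" ]; then
    echo "missing: $f"
    status=1
  fi
done
[ "$status" -eq 0 ] || echo "preflight failed: merge #2120 before enabling the agent"
```

In a real workflow step this would `exit "$status"` so the job stops before any agent phase spends turns on commands that cannot succeed.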

Comment on lines +164 to +169
```shell
if docker exec -u node "$C" bash -c '
  curl -fsS http://localhost:6080/ >/dev/null 2>&1 \
    && pgrep -f "[e]lectron/dist/electron" >/dev/null \
    && pgrep -f "[t]hv serve" >/dev/null \
    && DISPLAY=:99 xdotool search --class ToolHive >/dev/null 2>&1
'; then
```

Copilot AI Apr 30, 2026


The readiness probe curls http://localhost:6080/. In this repo’s devcontainer startup script, readiness is checked against /vnc.html (noVNC) rather than /, so probing / risks a false-negative if the web root doesn’t return 200. Align this check with the existing readiness convention (e.g., curl the same path used by scripts/devcontainer-dev.sh).

Comment on lines +158 to +178
```yaml
timeout-minutes: 10
run: |
  C=${{ steps.container.outputs.id }}
  for i in $(seq 1 120); do
    # Process gates AND a window-mapped check via xdotool. Process
    # presence is not the same as "UI rendered" — see PR #2120.
    if docker exec -u node "$C" bash -c '
      curl -fsS http://localhost:6080/ >/dev/null 2>&1 \
        && pgrep -f "[e]lectron/dist/electron" >/dev/null \
        && pgrep -f "[t]hv serve" >/dev/null \
        && DISPLAY=:99 xdotool search --class ToolHive >/dev/null 2>&1
    '; then
      sleep 2
      echo "::notice::App ready in ${i} attempts"
      exit 0
    fi
    sleep 5
  done
  echo "::error::App did not become ready within 10 minutes"
  docker exec "$C" tail -200 /tmp/entrypoint.log || true
  exit 1
```

Copilot AI Apr 30, 2026


The loop here can take exactly 10 minutes of sleep time alone (120 × 5s = 600s), and the step also has timeout-minutes: 10. Any command overhead can cause the step to hit the GitHub timeout before reaching your explicit failure path (printing logs + exiting 1), leading to harder-to-diagnose flakiness. Consider increasing the step timeout (e.g., 11–12 minutes) or reducing the iteration count to leave headroom.
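One way to guarantee that headroom is to bound the loop by a wall-clock deadline rather than an iteration count. A minimal sketch, where `probe` is a stand-in for the real `docker exec` readiness check (here it succeeds immediately so the sketch terminates):

```shell
# Sketch: poll until a wall-clock deadline that sits safely inside the
# step's timeout-minutes, so the explicit failure path (log dump, exit 1)
# is always reachable. "probe" stands in for the docker exec check.
probe() { true; }   # stand-in; always "ready" in this demonstration

deadline=$(( $(date +%s) + 540 ))   # 9 minutes of polling under a 10-minute step
ready=0
while [ "$(date +%s)" -lt "$deadline" ]; do
  if probe; then ready=1; break; fi
  sleep 5
done
echo "ready=$ready"
```

With a deadline 60 seconds short of `timeout-minutes`, per-iteration command overhead can no longer push the loop past the GitHub step timeout before the diagnostic tail runs.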

Comment on lines +470 to +476
```yaml
run: |
  VOLUME=toolhive-node-modules-toolhive-studio
  rm -rf nm-cache && mkdir -p nm-cache
  docker run --rm \
    -v "$VOLUME:/src:ro" \
    -v "$PWD/nm-cache:/dst" \
    alpine:3 sh -c 'cp -a /src/. /dst/' || true
```

Copilot AI Apr 30, 2026


The "Dump devcontainer node_modules for cache" step runs under always(), so it executes on cache misses even if earlier devcontainer setup failed. In that case the named volume may be empty or nonexistent, and this step will still create/mount it and potentially save an empty cache entry for this key, making subsequent runs think they have a warm cache when they don't. Gate this dump on successful devcontainer start (or verify expected content exists in the volume) before populating nm-cache/ for actions/cache to save.

Suggested change

```yaml
# current
run: |
  VOLUME=toolhive-node-modules-toolhive-studio
  rm -rf nm-cache && mkdir -p nm-cache
  docker run --rm \
    -v "$VOLUME:/src:ro" \
    -v "$PWD/nm-cache:/dst" \
    alpine:3 sh -c 'cp -a /src/. /dst/' || true
```

```yaml
# suggested (the leading fragment extends the step's if: condition)
    && steps.container.outcome == 'success'
run: |
  VOLUME=toolhive-node-modules-toolhive-studio
  rm -rf nm-cache
  if docker run --rm \
       -v "$VOLUME:/src:ro" \
       alpine:3 sh -c 'find /src -mindepth 1 -print -quit | grep -q .'
  then
    mkdir -p nm-cache
    docker run --rm \
      -v "$VOLUME:/src:ro" \
      -v "$PWD/nm-cache:/dst" \
      alpine:3 sh -c 'cp -a /src/. /dst/'
  else
    echo "::notice::Skipping node_modules cache dump because the devcontainer volume is empty or unavailable."
  fi
```
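The emptiness probe the suggestion relies on (`find /src -mindepth 1 -print -quit | grep -q .`) can be exercised on plain directories, no Docker required. A small demonstration:

```shell
# The pipeline succeeds only when the directory has at least one entry:
# find prints the first entry it encounters (or nothing), and grep -q .
# turns "printed anything" into an exit status.
dir=$(mktemp -d)

find "$dir" -mindepth 1 -print -quit | grep -q . && echo "non-empty" || echo "empty"

touch "$dir/marker"
find "$dir" -mindepth 1 -print -quit | grep -q . && echo "non-empty" || echo "empty"

rm -rf "$dir"
```

The `-print -quit` pair keeps the cost O(1) even for a fully populated node_modules volume, since find stops at the first entry.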

@github-actions github-actions Bot added size/XL and removed size/M labels Apr 30, 2026