stacklok · kantord · Apr 28, 2026 · Apr 29, 2026 · Apr 30, 2026
diff --git a/.claude/agents/toolhive-studio-driver.md b/.claude/agents/toolhive-studio-driver.md
@@ -0,0 +1,59 @@
+---
+name: toolhive-studio-driver
+description: >-
+  Drive the toolhive-studio devcontainer for live UI exploration: launch
+  the dev environment, take screenshots, click and type via xdotool, tail
+  dev-server logs, and report observations back to the parent. Use when
+  the parent needs to delegate visual / interactive testing of the
+  ToolHive Studio Electron app — e.g. verifying a UI change, reproducing
+  a UI bug, or capturing screenshots for a PR. NOT for backend-only work
+  or for the standard host-side `pnpm start` flow.
+skills:
+  - devcontainer-dev
+tools: Bash, Read, Grep, Glob
+---
+
+# ToolHive Studio Driver
+
+You drive toolhive-studio's containerized dev environment to explore,
+test, or reproduce issues against the live Electron app. The
+`devcontainer-dev` skill is preloaded — it documents the launch flow,
+the wrapper recipes, and known gotchas. **Read it once at the start**
+and keep its rules in mind throughout the session.
+
+## Operating principles
+
+- **Drive only via `bash scripts/agent.sh` subcommands** (`shot`, `xdo`,
+  `tail`, `health`). These are pre-approved on a narrow surface. Use
+  raw `docker exec` only if a task genuinely needs something the
+  wrappers don't cover, and expect a permission prompt.
+- **Always start with a health check.** If the devcontainer isn't up,
+  launch it via `pnpm devContainer:dev` (long-running — run in
+  background). Don't try to drive a non-running stack.
+- **Screenshot before and after every action.** Pixel-coordinate
+  driving is brittle; modal/CSS changes invalidate prior coordinates.
+  Verify state-by-state instead of chaining many actions.
+- **Use cropped screenshots when zooming on a region** — vision models
+  perform meaningfully better on a focused crop than on the full
+  1920×1200 framebuffer. The wrapper takes `--crop WxH+X+Y`.
+- **Wait for HMR to settle.** After editing renderer code, `sleep 2` before
+  re-screenshotting. If your edit touches anything under `main/`, the
+  live app will not pick it up without an Electron restart — call that
+  out and rely on unit tests for main-process changes.
+- **Save artifacts to `/tmp/<task-slug>-<step>.png`** so they survive
+  past your exit and the parent can read them. Reference them by
+  absolute path in your final report.
+
+## Reporting back
+
+When you finish a task, report concisely:
+
+- What you did, step by step (action → observation).
+- Paths to the screenshots that ground each observation.
+- Any unexpected app behavior, errors in `entrypoint`/`xvfb`/`fluxbox`
+  logs, or rough edges in the wrapper itself.
+- If the devcontainer was already up at the start, mention it (so the
+  parent knows nothing was launched on its behalf).
+
+Keep the report tight — under 400 words unless the parent asked for a
+deep dive. The screenshots are the evidence; the prose is the index.
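
The operating principles in the file above amount to a small loop: screenshot, perform one action, let the UI settle, screenshot again. A minimal shell sketch of that loop follows, assuming only the `shot` and `xdo` subcommands and the `/tmp/<task-slug>-<step>.png` convention named in the diff. The helper names `artifact_path` and `drive_step`, and the `AGENT` and `SETTLE_SECONDS` variables, are hypothetical and not part of the PR.

```shell
# Hypothetical helpers; only the agent.sh subcommands come from the PR.
AGENT="${AGENT:-bash scripts/agent.sh}"   # override to dry-run, e.g. AGENT=echo

# Build the artifact path /tmp/<task-slug>-<step>.png.
artifact_path() {
  printf '/tmp/%s-%s.png\n' "$1" "$2"
}

# Screenshot, perform one xdotool action, wait, then screenshot again.
# Usage: drive_step <task-slug> <step> <xdotool args...>
drive_step() {
  local slug="$1" step="$2"
  shift 2
  $AGENT shot "$(artifact_path "$slug" "${step}-before")" || return 1
  $AGENT xdo "$@" || return 1
  # Give the UI time to repaint before capturing the result.
  sleep "${SETTLE_SECONDS:-2}"
  $AGENT shot "$(artifact_path "$slug" "${step}-after")"
}
```

Running `bash scripts/agent.sh health` once before the first `drive_step` matches the "always start with a health check" rule. A dry run such as `AGENT=echo SETTLE_SECONDS=0 drive_step login 1 key Return` prints the three commands instead of executing them, which is a cheap way to check the sequence before touching the live app.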
diff --git a/.claude/skills/devcontainer-dev/SKILL.md b/.claude/skills/devcontainer-dev/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: devcontainer-dev
 description: Spin up and interact with ToolHive Studio's containerized dev environment (Xvfb + noVNC + DinD). Use when running, testing, or debugging the app in isolation — locally, in a git worktree, or in GitHub Codespaces; when touching `.devcontainer/*`, `scripts/devcontainer-*.sh`, or the `devContainer:dev` npm script; or when debugging "blank white window", "Docker daemon failed to start", or "Missing X server" errors in the devcontainer. The container is fully isolated: no host pnpm install, no host Docker socket, no host X11/GPU — experiment freely without contaminating the host.
-allowed-tools: Read, Grep, Glob, Bash
+allowed-tools: Read, Grep, Glob, Bash(bash scripts/agent.sh:*), Bash(pnpm devContainer:dev)
 ---
 
 # Containerized Dev Environment
@@ -142,21 +142,41 @@ docker exec "$CONTAINER" bash -c '
 
 The container has enough tools for an AI agent to both **see** and **interact with** the running Electron window without going through the browser / noVNC. Useful for headless testing, reproducing user-reported UI bugs, or validating a feature end-to-end.
 
-All commands run via `docker exec` against the container with `DISPLAY=:99` set (matches Xvfb's display).
+The recipes below are wrapped in `scripts/agent.sh`, which finds the devcontainer for *this* worktree by label, validates inputs, and execs only the documented commands. Pre-approving `Bash(bash scripts/agent.sh:*)` gives an agent the full driving surface without granting broad `Bash(docker exec *)` (which would let it run arbitrary commands in any container on the host). The raw `docker exec` recipes still work and are listed below as the explicit-approval fallback for ad-hoc cases the wrapper doesn't cover.
 
 ### See the screen (screenshots)
 
 ```bash
-# Take a PNG of the whole virtual framebuffer
+# Wrapped (recommended): screenshot to /tmp/shot.png on the host
+bash scripts/agent.sh shot
+
+# With an output path
+bash scripts/agent.sh shot ./shot.png
+
+# Cropped — vision models perform much better on a focused region than on
+# the full 1920×1200 framebuffer. Format is ImageMagick's WxH+X+Y.
+bash scripts/agent.sh shot --crop 800x600+560+300 ./dialog.png
+```
+
+Fallback (raw docker exec):
+
+```bash
 docker exec "$CONTAINER" bash -c 'DISPLAY=:99 import -window root /tmp/shot.png'
-# Copy it to the host for viewing / feeding to a vision model
 docker cp "$CONTAINER:/tmp/shot.png" /tmp/shot.png
 ```
 
 `import` is from ImageMagick. For a specific window only, use `xwininfo` to get the WID then `import -window <WID>`.
 
 ### See the window tree (what's there, where, which is focused)
 
+```bash
+# Wrapped: xdotool passthrough
+bash scripts/agent.sh xdo search --class ToolHive
+bash scripts/agent.sh xdo getactivewindow getwindowname
+```
+
+Fallback (raw docker exec):
+
 ```bash
 docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xwininfo -root -tree'
 docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xdotool getactivewindow getwindowname'
@@ -167,23 +187,61 @@ The main app window has `WM_CLASS=ToolHive` (see gotcha below). Its geometry is
 
 ### Interact with the UI (mouse, keyboard)
 
+`xdo` is a passthrough to `xdotool` inside the container. Its argument surface is intentionally narrow (X events only, no shell, no file access), so a single wrapper covers click / key / type / mousemove without validating each verb separately.
+
 ```bash
-# Move the mouse and click
-docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xdotool mousemove 400 300 click 1'
+# Click at coordinates
+bash scripts/agent.sh xdo mousemove --sync 400 300 click 1
 
 # Type into whatever has focus
+bash scripts/agent.sh xdo type --delay 20 'hello world'
+
+# Send a keystroke (DevTools)
+bash scripts/agent.sh xdo key ctrl+shift+i
+
+# Focus the main window by class
+bash scripts/agent.sh xdo search --class ToolHive windowactivate
+```
+
+Fallback (raw docker exec):
+
+```bash
+docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xdotool mousemove 400 300 click 1'
 docker exec "$CONTAINER" bash -c "DISPLAY=:99 xdotool type --delay 20 'hello world'"
+docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xdotool key ctrl+shift+i'
+```
+
+### Tail a log
+
+The dev-stack writes canonical logs under `/tmp/*.log` (see [Log files](#log-files-written-by-the-entrypoint)).
+
+```bash
+# Last 50 lines (default)
+bash scripts/agent.sh tail entrypoint
 
-# Send a keystroke
-docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xdotool key ctrl+shift+i'   # DevTools
+# Last N lines
+bash scripts/agent.sh tail xvfb 200
+```
+
+Allowed log names: `xvfb`, `fluxbox`, `x11vnc`, `websockify`, `keyring`, `entrypoint`.
+
+### Health check
+
+One-call readiness check — Electron running, `thv serve` running, the ToolHive X window mapped, and noVNC reachable. Exits non-zero if anything is down.
 
-# Focus the main window by name
-docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xdotool search --name ToolHive windowactivate'
+```bash
+bash scripts/agent.sh health
 ```
 
 ### Typical agent loop
 
-Screenshot → feed to vision model → decide next action → `xdotool` → screenshot again. The usual caveats apply: pixel-coordinate automation is fragile, and CSS changes or modal dialogs can throw it off. For robust long-term automation, prefer a DOM-level approach (e.g. Chrome DevTools Protocol against Electron's remote debugging port — not currently wired in, see [Future: CDP access](#future-cdp-access)).
+Screenshot → Read it with the Read tool → reason about what you see → act via `xdo` → screenshot again. Pixel-coordinate driving is brittle (modal/CSS changes invalidate prior coordinates), so verify state-by-state instead of chaining many actions. Use cropped screenshots when zooming on a region — vision models perform meaningfully better on a focused crop than on the full 1920×1200 framebuffer.
+
+For robust DOM-level automation (querying / clicking by selector, scraping state), prefer a CDP-based approach once it's wired in — see [Future: CDP access](#future-cdp-access). Pixel ops via `xdo` and `shot` remain the right tool for OS-level interactions, multi-window coordination, and visual-regression checks.
+
+### After editing renderer code
+
+Vite HMR applies the change in-place — `sleep 2` before re-screenshotting so the UI has time to repaint. If your edit touches `main/`, a manual Electron restart is needed (~15s); rely on unit tests for main-process changes when driving the live app.
 
 ### Future: CDP access
 

diff --git a/.codex/skills/devcontainer-dev/SKILL.md b/.codex/skills/devcontainer-dev/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: devcontainer-dev
 description: Spin up and interact with ToolHive Studio's containerized dev environment (Xvfb + noVNC + DinD). Use when running, testing, or debugging the app in isolation — locally, in a git worktree, or in GitHub Codespaces; when touching `.devcontainer/*`, `scripts/devcontainer-*.sh`, or the `devContainer:dev` npm script; or when debugging "blank white window", "Docker daemon failed to start", or "Missing X server" errors in the devcontainer. The container is fully isolated: no host pnpm install, no host Docker socket, no host X11/GPU — experiment freely without contaminating the host.
-allowed-tools: Read, Grep, Glob, Bash
+allowed-tools: Read, Grep, Glob, Bash(bash scripts/agent.sh:*), Bash(pnpm devContainer:dev)
 ---
 
 # Containerized Dev Environment
@@ -142,21 +142,41 @@ docker exec "$CONTAINER" bash -c '
 
 The container has enough tools for an AI agent to both **see** and **interact with** the running Electron window without going through the browser / noVNC. Useful for headless testing, reproducing user-reported UI bugs, or validating a feature end-to-end.
 
-All commands run via `docker exec` against the container with `DISPLAY=:99` set (matches Xvfb's display).
+The recipes below are wrapped in `scripts/agent.sh`, which finds the devcontainer for *this* worktree by label, validates inputs, and execs only the documented commands. Pre-approving `Bash(bash scripts/agent.sh:*)` gives an agent the full driving surface without granting broad `Bash(docker exec *)` (which would let it run arbitrary commands in any container on the host). The raw `docker exec` recipes still work and are listed below as the explicit-approval fallback for ad-hoc cases the wrapper doesn't cover.
 
 ### See the screen (screenshots)
 
 ```bash
-# Take a PNG of the whole virtual framebuffer
+# Wrapped (recommended): screenshot to /tmp/shot.png on the host
+bash scripts/agent.sh shot
+
+# With an output path
+bash scripts/agent.sh shot ./shot.png
+
+# Cropped — vision models perform much better on a focused region than on
+# the full 1920×1200 framebuffer. Format is ImageMagick's WxH+X+Y.
+bash scripts/agent.sh shot --crop 800x600+560+300 ./dialog.png
+```
+
+Fallback (raw docker exec):
+
+```bash
 docker exec "$CONTAINER" bash -c 'DISPLAY=:99 import -window root /tmp/shot.png'
-# Copy it to the host for viewing / feeding to a vision model
 docker cp "$CONTAINER:/tmp/shot.png" /tmp/shot.png
 ```
 
 `import` is from ImageMagick. For a specific window only, use `xwininfo` to get the WID then `import -window <WID>`.
 
 ### See the window tree (what's there, where, which is focused)
 
+```bash
+# Wrapped: xdotool passthrough
+bash scripts/agent.sh xdo search --class ToolHive
+bash scripts/agent.sh xdo getactivewindow getwindowname
+```
+
+Fallback (raw docker exec):
+
 ```bash
 docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xwininfo -root -tree'
 docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xdotool getactivewindow getwindowname'
@@ -167,23 +187,61 @@ The main app window has `WM_CLASS=ToolHive` (see gotcha below). Its geometry is
 
 ### Interact with the UI (mouse, keyboard)
 
+`xdo` is a passthrough to `xdotool` inside the container. Its argument surface is intentionally narrow (X events only, no shell, no file access), so a single wrapper covers click / key / type / mousemove without validating each verb separately.
+
 ```bash
-# Move the mouse and click
-docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xdotool mousemove 400 300 click 1'
+# Click at coordinates
+bash scripts/agent.sh xdo mousemove --sync 400 300 click 1
 
 # Type into whatever has focus
+bash scripts/agent.sh xdo type --delay 20 'hello world'
+
+# Send a keystroke (DevTools)
+bash scripts/agent.sh xdo key ctrl+shift+i
+
+# Focus the main window by class
+bash scripts/agent.sh xdo search --class ToolHive windowactivate
+```
+
+Fallback (raw docker exec):
+
+```bash
+docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xdotool mousemove 400 300 click 1'
 docker exec "$CONTAINER" bash -c "DISPLAY=:99 xdotool type --delay 20 'hello world'"
+docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xdotool key ctrl+shift+i'
+```
+
+### Tail a log
+
+The dev-stack writes canonical logs under `/tmp/*.log` (see [Log files](#log-files-written-by-the-entrypoint)).
+
+```bash
+# Last 50 lines (default)
+bash scripts/agent.sh tail entrypoint
 
-# Send a keystroke
-docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xdotool key ctrl+shift+i'   # DevTools
+# Last N lines
+bash scripts/agent.sh tail xvfb 200
+```
+
+Allowed log names: `xvfb`, `fluxbox`, `x11vnc`, `websockify`, `keyring`, `entrypoint`.
+
+### Health check
+
+One-call readiness check — Electron running, `thv serve` running, the ToolHive X window mapped, and noVNC reachable. Exits non-zero if anything is down.
 
-# Focus the main window by name
-docker exec "$CONTAINER" bash -c 'DISPLAY=:99 xdotool search --name ToolHive windowactivate'
+```bash
+bash scripts/agent.sh health
 ```
 
 ### Typical agent loop
 
-Screenshot → feed to vision model → decide next action → `xdotool` → screenshot again. The usual caveats apply: pixel-coordinate automation is fragile, and CSS changes or modal dialogs can throw it off. For robust long-term automation, prefer a DOM-level approach (e.g. Chrome DevTools Protocol against Electron's remote debugging port — not currently wired in, see [Future: CDP access](#future-cdp-access)).
+Screenshot → Read it with the Read tool → reason about what you see → act via `xdo` → screenshot again. Pixel-coordinate driving is brittle (modal/CSS changes invalidate prior coordinates), so verify state-by-state instead of chaining many actions. Use cropped screenshots when zooming on a region — vision models perform meaningfully better on a focused crop than on the full 1920×1200 framebuffer.
+
+For robust DOM-level automation (querying / clicking by selector, scraping state), prefer a CDP-based approach once it's wired in — see [Future: CDP access](#future-cdp-access). Pixel ops via `xdo` and `shot` remain the right tool for OS-level interactions, multi-window coordination, and visual-regression checks.
+
+### After editing renderer code
+
+Vite HMR applies the change in-place — `sleep 2` before re-screenshotting so the UI has time to repaint. If your edit touches `main/`, a manual Electron restart is needed (~15s); rely on unit tests for main-process changes when driving the live app.
 
 ### Future: CDP access
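
The diff does not include `scripts/agent.sh` itself, so the input validation it describes is not visible here. As an illustration only, the allow-list for `tail` could be enforced along the following lines. The function name `validate_tail_args` and the `/tmp/<name>.log` path layout are assumptions; the log names and the default of 50 lines are taken from the hunk above.

```shell
# Hypothetical sketch of argument validation for a tail-style subcommand.
# Prints "<lines> <path>" on success and returns non-zero otherwise.
validate_tail_args() {
  local name="$1" lines="${2:-50}"
  # Accept only the documented log names; this also rejects path tricks.
  case "$name" in
    xvfb|fluxbox|x11vnc|websockify|keyring|entrypoint) ;;
    *) echo "unknown log name: $name" >&2; return 2 ;;
  esac
  # Accept only a positive decimal line count.
  case "$lines" in
    ''|*[!0-9]*) echo "invalid line count: $lines" >&2; return 2 ;;
  esac
  if [ "$lines" -le 0 ]; then
    echo "invalid line count: $lines" >&2
    return 2
  fi
  # Assumption: logs live at /tmp/<name>.log, inferred from the /tmp/*.log note.
  printf '%s /tmp/%s.log\n' "$lines" "$name"
}
```

Matching against a fixed list, rather than interpolating the name into a path and checking afterwards, keeps values such as `../etc/passwd` from ever reaching the command that runs inside the container.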