OS-native vision automation agent — record yourself, teach the AI, and let it repeat your workflows autonomously.
Built with Tauri 2 (Rust backend) + React (frontend). Runs natively on macOS with full screen capture, virtual cursor actuation, and a floating HUD that stays out of your way.
- Captures the screen, sends it to a vision model (via OpenRouter or direct API), and executes the model's decision — all in a tight loop.
- The model now selects named tools instead of emitting raw action JSON. Core tools are
click_target,type_text,run_shell, andtask_done, plus a catalog of named shortcut tools such asbrowser_new_tabandsystem_open_spotlight. - Tool calls are still normalized into the internal action types click, hotkey, type, shell, and none for execution, telemetry, and session logs.
- Step history is passed between iterations so the agent remembers what it already did.
- Configurable confidence threshold — low-confidence actions are rejected automatically.
- Window context awareness — detects the currently active application and window title, feeding it to the model so it understands the current context.
- App state awareness — queries scriptable macOS apps (Spotify, Chrome, Safari, Finder, etc.) via AppleScript for their internal state (player state, current track, active URL, folder path, etc.). Unknown apps get a generic focused-element probe. All queries have a 2-second kill timeout to prevent hangs.
The agent uses macOS CGEvent synthetic clicks instead of moving your physical mouse cursor. This means:
- Your cursor stays put — the agent clicks at target coordinates via the OS event stream, not by hijacking your mouse.
- Clicks bypass overlays — synthetic events go directly to the target window, even through the always-on-top overlay.
- You keep full control — use your mouse normally while the agent works.
MOUSE_EVENT_CLICK_STATEis set for proper single-click registration.- Falls back to
enigomouse actuation on non-macOS platforms.
Every click goes through a multi-stage precision pipeline:
- Pixel coordinates — the vision model returns
x_norm/y_normas pixel coordinates within the (possibly downscaled) screenshot image.(0, 0)is the top-left corner. - Scale-factor-aware conversion — pixel coordinates are scaled back to original screenshot pixels, then translated to macOS logical points, accounting for Retina scaling (
2.0x,3.0x) and multi-monitor offsets. - Confidence gating — each action carries a
confidencescore (0–1). Clicks below the threshold (default0.60, configurable viaAGENT_CONFIDENCE_THRESHOLD) are automatically rejected. - Image compression pipeline — screenshots are compressed before inference to optimize speed and cost:
- Adaptive downscaling — images are resized to fit within
AGENT_INFER_MAX_DIM(default2048px, range640–4096) while preserving aspect ratio. - Triangle filter — uses bilinear interpolation for clean downscaling without aliasing artifacts.
- PNG re-encoding — compressed images are re-encoded as PNG to minimize payload size while staying lossless.
- Adaptive downscaling — images are resized to fit within
- Self-correction — the system prompt instructs the model to detect missed clicks (no state change) and adjust coordinates toward the element center on retry.
- Visual verification — the transparent overlay shows a pulsing cursor at the exact click target, so you can visually confirm accuracy.
- Full telemetry — every click logs: pixel coords, screenshot dims, scale factor, monitor origin, and final point in logical coordinates.
Every inference call assembles a rich prompt from 7 distinct sources — this is what the model "sees" and "knows" on each step:
| Layer | Source | Injected Into | Description |
|---|---|---|---|
| 1. System Prompt | system_prompt.txt (loaded via include_str!) |
system message |
5-gate decision flow (objective → login detection → app check → last action → next action), tool-calling instructions, Spotlight rules, click accuracy heuristics, and shortcut-tool preference |
| 2. Screenshot | CGWindowListCreateImage (excludes HUD) / xcap fallback |
user message (image part) |
PNG of the primary monitor with HUD/overlay excluded, adaptively downscaled to AGENT_INFER_MAX_DIM (default 2048px). Encoded as base64 data URI |
| 3. Task Instruction | User input (HUD / dashboard) | user message (text) |
The natural-language goal, e.g. "Open Chrome and go to github.com" |
| 4. Coordinate System | Computed from screenshot dims | user message (text) |
Pixel coordinate bounds: "The screenshot image is {W}x{H} pixels… (0,0) is top-left…" |
| 5. OS Context | gather_os_context() via AppleScript |
user message (text) |
Frontmost app name + window title, running GUI apps list, open window names in frontmost app, app-specific state (e.g. Spotify track, Chrome URL, Finder path — queried via per-app AppleScript with 2s kill timeout), system time (unix) |
| 6. App-Specific Shortcuts | shortcuts::get_or_fetch_global() via LLM |
user message (text) |
Top 20 macOS keyboard shortcuts for the frontmost app, dynamically fetched on first encounter, parsed into app_* tool definitions when possible, and session-cached |
| 7. Step History | stepHistory[] (in-memory, frontend) + getWindowContext() |
user message (text) |
Accumulated action log from prior steps in the current run: "Step 1: click (x,y) — reason", "Step 2: hotkey Meta+t+Meta — reason", etc. Includes the currently active window context at the top |
What is NOT persisted: The full model prompt, raw model response, and thinking/reasoning traces are not saved to disk. Only the extracted VisionAction (action, coordinates, confidence, reason) survives — optionally written to activity_log.json per session. Step history lives in-memory for the duration of one agent run only.
Vision inference routes through OpenRouter, so you can switch models without changing API keys. Currently supported:
| ID | Label |
|---|---|
mistralai/mistral-large-2512 |
Mistral Large 3 (default) |
qwen/qwen3.5-35b-a3b |
Qwen 3.5 35B A3B |
openai/gpt-5.4 |
GPT-5.4 |
anthropic/claude-sonnet-4.6 |
Claude Sonnet 4.6 |
google/gemini-3.1-pro-preview-customtools |
Gemini 3.1 Pro Preview (Custom Tools) |
- Model selection persists across sessions via
localStorage. - Saved sessions remember which model was used and auto-select it on replay.
- Cost tracking — each inference returns token usage and estimated USD cost via OpenRouter pricing API, with fallback to a built-in pricing table for known models (GPT-4o, Claude, Gemini, Mistral, Llama). Totals are logged per run and per replay.
The agent always prefers keyboard shortcuts over clicking — clicking is the last resort.
- Dynamic shortcut discovery — on first encounter with any app (Spotify, Chrome, VS Code, etc.), the agent makes a lightweight LLM call to fetch the top 20 macOS shortcuts for that app.
- Session-scoped cache — shortcuts are cached per app name in memory, so subsequent agent steps with the same app pay zero latency.
- System apps skipped — loginwindow, Dock, SystemUIServer, and other system processes are automatically excluded.
- Static fallback — the system prompt also includes a built-in catalog of named shortcut tools for browser, macOS, editing, and navigation actions.
- Priority order — app-specific
app_*tools → built-in shortcut tools → clicking (last resort). - Wrong-app safety — if the frontmost app is not the target app, the agent checks the Dock and installed native apps first (Linear, Slack, Notion, Discord, Spotify, etc.) before falling back to Spotlight. Prefers native macOS apps over browser-based alternatives.
The model is expected to call exactly one tool on each step. These are the built-in tools currently registered in vision.rs.
flowchart LR
APP["Frontmost app name"]
CACHE["Session shortcut cache"]
FETCH["OpenRouter shortcut lookup"]
PARSE["Parse 'Shortcut - Description' lines"]
APPTOOLS["Dynamic app_* tools"]
BUILTIN["Built-in local tools<br/>browser_*, system_*, edit_*, press_*, move_*"]
CORE["Core tools<br/>click_target, type_text, run_shell, task_done"]
REGISTRY["Per-step tool registry"]
MODEL["Vision model"]
ACTION["Resolved internal action<br/>click / hotkey / type / shell / none"]
APP --> CACHE
CACHE -->|miss| FETCH
FETCH --> PARSE
CACHE -->|hit| PARSE
PARSE --> APPTOOLS
BUILTIN --> REGISTRY
CORE --> REGISTRY
APPTOOLS --> REGISTRY
REGISTRY --> MODEL
MODEL --> ACTION
| Tool | Meaning |
|---|---|
click_target |
Click a target using pixel coordinates from the current screenshot |
type_text |
Type text into the currently focused control |
run_shell |
Run a shell command when the user explicitly asked for terminal/CLI work |
task_done |
Mark the task complete when the goal is visually confirmed |
| Tool | Shortcut |
|---|---|
browser_focus_address_bar |
Cmd+L |
browser_new_tab |
Cmd+T |
browser_close_tab |
Cmd+W |
browser_reload |
Cmd+R |
browser_back |
Cmd+[ |
browser_forward |
Cmd+] |
| Tool | Shortcut |
|---|---|
system_open_spotlight |
Cmd+Space |
system_quit_app |
Cmd+Q |
edit_select_all |
Cmd+A |
edit_copy |
Cmd+C |
edit_paste |
Cmd+V |
file_save |
Cmd+S |
| Tool | Shortcut |
|---|---|
press_return |
Return |
press_escape |
Escape |
press_tab |
Tab |
press_shift_tab |
Shift+Tab |
press_space |
Space |
press_backspace |
Backspace |
press_delete |
Delete |
move_up |
Up |
move_down |
Down |
move_left |
Left |
move_right |
Right |
press_home |
Home |
press_end |
End |
press_page_up |
PageUp |
press_page_down |
PageDown |
- When app shortcuts can be parsed into a single key combo, the backend registers additional tools named like
app_new_tab,app_toggle_sidebar, orapp_move_tab_left. - These dynamic tools are derived from
Shortcut - Descriptionlines returned byget_app_shortcuts_cmd. - If a shortcut line cannot be parsed into one combo, it is kept as prompt context only and does not become a callable tool.
A compact, transparent pill that floats above everything — your command center without leaving your workflow.
| Control | What it does |
|---|---|
| ☰ Menu | Toggle the main dashboard window |
| ◉ Record | Open the session recording panel |
| ✏️ Command | Type an instruction and run the agent loop |
| ▼ Activity | Live timestamped feed of every agent step |
| 🎯 Overlay | Toggle the visual cursor overlay (red when off) |
| ◄ Collapse | Shrink HUD to a single circle, click to expand |
- Elapsed timer shows a running clock (▶ 0:05) during agent runs and (● 0:12) during recording.
- Activity feed shows HH:MM:SS timestamps on every step — auto-opens when the agent starts a task.
- Blue glow border pulses around the entire screen while the agent is actively running.
- Model selector — pick a model directly in the HUD's command or record panels.
- Collapsible to a 48px circle centered on screen.
Record yourself performing a task — the AI watches, learns, and can repeat it.
The backend now uses a single session recording implementation in recording.rs. The older screenshot-only recorder has been removed, so the dashboard and HUD both talk to the same session lifecycle.
Recording captures:
- Screen frames (configurable FPS)
- Mouse movements, clicks, and scroll events (via
rdev) - Every keystroke — key presses and releases
- Session name, instruction, task context, and selected model
Replay features:
- Select any saved session and replay it with the AI
- Auto-fills the instruction from the recording so the AI knows the goal
- Repeat count: run once, N times, or ∞ (infinite loop until stopped)
- Activity logs saved per session for post-run review
- Save Runs — toggle in the command panel so every agent run automatically records as a replayable session
- Screen captures use macOS
CGWindowListCreateImagewithkCGWindowListOptionOnScreenBelowWindowto capture the entire desktop excluding the HUD and overlay windows. - The capture function finds the app's topmost window (highest z-layer) via
CGWindowListCopyWindowInfomatching the process PID, then captures everything below it. - Falls back to
xcapfull-screen capture if the native API fails. - No more hiding/showing windows — zero flicker, clean captures every time.
- Full-screen transparent window spanning all monitors.
- Visual cursor shows exactly where the agent clicked — with pulse animations.
- Blue glow border — pulsing blue border around the entire screen while the agent is in control.
- The overlay is
pointer-events: none— the agent's CGEvent clicks pass right through it.
- Global E-STOP:
Cmd+Shift+Esc— immediately halts all agent actions. - Restore window:
Cmd+Shift+Enter— brings the dashboard back if minimized. - Max action cap (30 per run) with auto-stop.
- Per-action confidence threshold gating.
- The agent can run CLI commands via
/bin/sh -cwhen a task is better handled through the terminal. - WhiteCircle integration — every shell command is validated through WhiteCircle's guardrail API before execution:
- Input guard: blocks unsafe/malicious commands before they run.
- Output guard: screens command output for sensitive data leaks.
- Strict mode (
WHITECIRCLE_STRICT=true): hard-blocks commands when the guardrail API is unreachable. - Graceful degradation: without an API key, commands execute with a logged warning.
- 10-second timeout per command, 4KB output cap to protect model context.
flowchart LR
subgraph FE["Frontend Windows"]
MAIN["Main dashboard<br/>Run / Sessions / Dev Tools"]
HUD["Floating HUD<br/>command, record, replay"]
OVERLAY["Transparent overlay<br/>cursor pulse + blue border"]
RUNNER["agentRunner.ts<br/>shared capture → infer → execute loop"]
MAIN --> RUNNER
HUD --> RUNNER
end
subgraph TAURI["Tauri / Rust backend"]
ROOT["main.rs<br/>composition root, state, command registry"]
VISION["vision.rs<br/>prompt build, image compression, model call, tool parsing"]
OSX["os_context.rs<br/>frontmost app, AppleScript state, window context"]
RECORD["recording.rs<br/>session capture, input listener, manifests, activity logs"]
SHELL["shell.rs<br/>WhiteCircle guard + shell execution"]
SHORTCUTS["shortcuts.rs<br/>LLM shortcut lookup + cache"]
MODELS["models.rs<br/>shared request/response types"]
ROOT --> VISION
ROOT --> OSX
ROOT --> RECORD
ROOT --> SHELL
ROOT --> SHORTCUTS
ROOT --> MODELS
end
subgraph LOOP["Agent decision loop"]
CAP["capture_primary_cmd"]
PROMPT["instruction + task context<br/>+ OS context + shortcut tools + step history"]
INFER["infer_click_cmd"]
ROUTER{"tool call"}
CLICK["CGEvent click"]
HOTKEY["named shortcut tool<br/>→ enigo hotkey"]
TYPE["enigo text input"]
SH["guarded shell command"]
NONE["stop"]
CAP --> PROMPT --> INFER --> ROUTER
ROUTER --> CLICK
ROUTER --> HOTKEY
ROUTER --> TYPE
ROUTER --> SH
ROUTER --> NONE
end
subgraph STORAGE["Session persistence"]
MANIFEST["manifest.json"]
INPUTS["input_events.json"]
ACTIVITY["activity_log.json"]
FRAMES["monitor-*/frame-*.png"]
end
RUNNER --> ROOT
CAP --> OVERLAY
CLICK --> OVERLAY
RECORD --> FRAMES
RECORD --> INPUTS
RECORD --> MANIFEST
RUNNER --> ACTIVITY
OSX --> PROMPT
SHORTCUTS --> PROMPT
SHELL --> SH
The Rust backend is no longer a monolithic main.rs. main.rs is now the Tauri composition root: it owns shared runtime state, low-level input/capture helpers, plugin setup, and command registration. The feature logic lives in focused modules:
src-tauri/src/
├── main.rs # app setup, shared helpers, command wiring, CGWindowList capture
├── models.rs # shared request/response and runtime types
├── vision.rs # prompt assembly, screenshot preprocessing, tool registry, response parsing, cost estimation
├── system_prompt.txt # externalized system prompt (loaded via include_str!)
├── os_context.rs # AppleScript-based app/window/system context
├── recording.rs # session recording, input capture, manifest/activity persistence
├── shell.rs # WhiteCircle validation and shell command execution
└── shortcuts.rs # app shortcut discovery and in-memory caching
This split matters for debugging: session issues stay in recording.rs, model/output issues stay in vision.rs, and macOS context problems stay in os_context.rs instead of being buried in one multi-thousand-line file.
The frontend is split into focused modules for maintainability:
src/
├── App.tsx # Router — delegates to window components
├── types.ts # Shared TypeScript type definitions
├── constants.ts # Labels, model options, dimensions
├── lib/
│ ├── tauri.ts # Tauri invoke wrappers, window management
│ └── agentRunner.ts # Deduplicated capture→infer→execute loop
├── hooks/
│ └── useAgentLoop.ts # Agent loop state & controls
├── components/
│ ├── OverlayWindow.tsx # Transparent cursor overlay
│ ├── HudWindow.tsx # Floating HUD pill (activity, command, record)
│ ├── MainApp.tsx # Dashboard orchestrator
│ ├── RunTab.tsx # Run tab — health + live command
│ ├── SessionsTab.tsx # Sessions tab — recording + replay
│ ├── DevTab.tsx # Dev Tools tab — step controls + raw data
│ ├── HealthGrid.tsx # Reusable status chip grid
│ ├── ModelActivityPanel.tsx # Model activity sidebar
│ └── ActivityLogPanel.tsx # Activity log sidebar
├── HudWidgets.tsx # ElapsedTimer, ActivityFeed components
├── shortcuts.tsx # App shortcut fetching
├── taskRunLock.ts # Task run mutex via backend
├── main.tsx # React entry point
└── styles.css # All styling (dark/light, glassmorphism)
The agent loop logic (capture → infer → execute → emit) lives in a single agentRunner.ts module, shared by both the HUD and the main dashboard — no duplication.
session-<unix-ms>/
manifest.json # name, instruction, model, fps, frame count, duration, input event count
activity_log.json # agent step history
input_events.json # mouse moves, clicks, key presses/releases, scroll events
monitor-<id>/
frame-000001.png
frame-000002.png
...
| Tab | Purpose |
|---|---|
| Run | Permissions, API key status, E-STOP, overlay/HUD toggles, model selection, action counter, live command execution |
| Sessions | Recording status cards, saved session list with names/instructions/input counts, replay with instruction auto-fill |
| Dev Tools | Step-by-step capture/infer/click controls, raw state inspection |
| Layer | Technology |
|---|---|
| Framework | Tauri 2 (Rust + WebView) |
| Frontend | React + TypeScript + Vite |
| Screen Capture | CGWindowListCreateImage (HUD-excluded) / xcap (fallback) |
| Mouse Click | core-graphics CGEvent (virtual cursor) |
| Keyboard Actuation | enigo |
| Input Event Capture | rdev (global mouse/keyboard listener) |
| Vision Model | OpenRouter (any model) / direct API |
| HTTP Client | reqwest |
| AI Guardrails | WhiteCircle API (input/output protection) |
| Styling | Vanilla CSS with glassmorphism, dark mode |
Create .env in repo root:
OPENROUTER_API_KEY=YOUR_OPENROUTER_API_KEY
OPENROUTER_API_BASE=https://openrouter.ai/api/v1
AGENT_CONFIDENCE_THRESHOLD=0.60
AGENT_INFER_MAX_DIM=2048
# WhiteCircle Guardrails (for shell command safety)
WHITECIRCLE_API_KEY=YOUR_WHITECIRCLE_KEY
WHITECIRCLE_API_BASE=https://eu.whitecircle.ai/api/v1
WHITECIRCLE_STRICT=truebun install
bun run tauri:devThis repo is Bun-only on the frontend side. Use bun run build for production assets and avoid npm, pnpm, or yarn.
Requires macOS with Screen Recording and Accessibility permissions (prompted on first launch).
| Command | Description |
|---|---|
capture_primary_cmd |
Capture primary monitor screenshot (HUD-excluded via CGWindowList) |
infer_click_cmd |
Send screenshot + instruction to vision model and resolve one tool call |
execute_real_click_cmd |
Perform CGEvent synthetic click at pixel coords |
press_keys_cmd |
Execute resolved shortcut key sequences |
type_text_cmd |
Type text string |
run_shell_cmd |
Execute shell command with WhiteCircle guard |
get_frontmost_app_cmd |
Get active window name and title |
open_path_cmd |
Open a folder or file in Finder |
export_markdown_cmd |
Export recorded activity as Markdown |
| Command | Description |
|---|---|
recordings_root_cmd |
Return the root folder for all saved sessions |
start_session_cmd |
Start recording (frames + rdev input capture) |
stop_session_cmd |
Stop recording, save manifest + input events |
session_status_cmd |
Get current recording status |
list_sessions_cmd |
List all saved sessions |
load_session_cmd |
Load a specific session manifest |
delete_session_cmd |
Delete a saved session |
save_activity_log_cmd |
Save agent activity log to session |
load_activity_log_cmd |
Load saved activity log from session |
| Command | Description |
|---|---|
check_permissions_cmd |
Check macOS permissions |
request_permissions_cmd |
Prompt for permissions |
set_estop_cmd |
Toggle emergency stop |
get_runtime_state_cmd |
Get runtime state (E-STOP, action count) |
get_app_shortcuts_cmd |
Fetch/cache keyboard shortcuts for an app via LLM |
clear_shortcuts_cache_cmd |
Clear the in-memory shortcuts cache |
Prev. Agenticify. - kms after 14hrs of straight grind at this dumbahhh hackathon, nothign feels better than to coming back to doing homework after this L fml