Autonomous GUI agent — give it a task, it operates the desktop.
Visual memory • One-shot UI learning • Any LLM provider • Local or VM
🇺🇸 English · 🇨🇳 中文
- [2026-04-14] 🏆 OSWorld Multi-Apps 79.8% — 72.6/91 evaluated tasks. 4-phase step loop + CLI session persistence + PRESERVE FORMAT work habit. Results →
- [2026-04-18] 📦 OpenProgram — Agentic Programming graduated from concept to product: repo/package/CLI renamed to OpenProgram. Agentic Programming remains the paradigm name; OpenProgram is the shippable framework. Harness imports migrated to `from openprogram import ...`.
- [2026-04-07] 🤖 Agent-native architecture — Rebuilt execution core on the Agentic Programming paradigm, unifying GUI perception and free-form agent actions under a single decision loop. Eliminates task-specific scripting.
- [2026-03-30] 📐 ImageContext coordinate system — Replaced the dual-space model with an `ImageContext` class; scale-independent cropping fixes crop bugs on non-fullscreen images.
- [2026-03-29] 🎬 v0.3 — Unified Actions & Cross-Platform GUI — `gui_action.py` as single entry point. Platform backends auto-selected via `--remote`.
- [2026-03-23] 🏆 OSWorld Chrome 93.5% — One attempt (43/46), 97.8% two attempts (45/46). Results →
- [2026-03-23] 🔄 Memory overhaul — Split storage, automatic component forgetting, state merging by Jaccard similarity.
- [2026-03-10] 🚀 Initial release — GPA-GUI-Detector + Apple Vision OCR + template matching + per-app visual memory.
A CLI tool that turns any LLM into a GUI automation agent. You give it a natural-language task, it operates the desktop autonomously — screenshots, clicks, types, verifies, and repeats until the task is done.
```shell
gui-agent "Install the Orchis GNOME theme"
gui-agent --vm http://172.16.82.132:5000 "Open GitHub in Chrome and Python docs"
```

Designed as an LLM tool. The intended workflow is:
- An LLM (Claude Code, OpenClaw, etc.) receives a GUI task from the user
- The LLM's skill/prompt tells it to call `gui-agent` as a CLI tool
- `gui-agent` handles all GUI perception and interaction internally
- The LLM gets back a result summary

The LLM doesn't need to know how GUI automation works — it just calls the tool.
- Visual memory — UI components are detected once, labeled by a VLM, and stored as templates. On subsequent encounters, template matching replaces expensive re-detection (~5x faster, ~60x fewer tokens).
- State transitions — The UI is modeled as a graph of states (sets of visible components). Successful action sequences are recorded as transitions for future replay.
- 4-phase step loop — Each step follows: Observe (screenshot + detect) → Verify (check previous action) → Plan (LLM decides next action) → Dispatch (execute). All phases are `@agentic_function` calls with structured feedback between steps.
- Provider-agnostic — Works with Claude Code CLI, OpenClaw, Anthropic API, or OpenAI API. Auto-detects the best available provider.
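The state-transition memory above can be sketched in a few lines. This is an illustrative model only (the names `state_id`, `record_transition`, and `replay_hint` are assumptions, not the harness's actual API): a state is the set of visible component names, and a successful action becomes an edge for future replay.

```python
def state_id(components):
    """A UI state is identified by the set of visible component names."""
    return frozenset(components)

# Edges of the state graph: (from_state, action) -> to_state
transitions = {}

def record_transition(before, action, after):
    """After a verified-successful action, remember where it led."""
    transitions[(state_id(before), action)] = state_id(after)

def replay_hint(current, action):
    """If this action was taken from this state before, predict the outcome."""
    return transitions.get((state_id(current), action))

# Example: clicking the Firefox icon from the bare desktop opened a window.
record_transition(["desktop"], "click firefox_icon", ["desktop", "firefox_window"])
```

On the next encounter of the same state, `replay_hint` lets the planner skip re-deriving what a known action does.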
Multi-Apps domain: 79.8% (72.6/91 evaluated tasks)
| Metric | Value |
|---|---|
| Total tasks | 101 |
| Evaluated | 91 |
| Blocked (no credentials) | 10 |
| Passed (score = 1.0) | 63 |
| Partial (0 < score < 1.0) | 11 |
Full results: benchmarks/osworld/multi_apps.md
```shell
pip install git+https://github.com/Fzkuji/GUI-Agent-Harness.git
```

All dependencies are installed automatically, including OpenProgram (the Agentic Programming runtime), ultralytics (GPA-GUI-Detector), OpenCV, Pillow, etc.

Local development: the upstream PyPI package name is `openprogram`. If you're building against an unreleased branch, install OpenProgram first (`pip install -e /path/to/OpenProgram`) and then run `pip install -e . --no-deps` inside this repo to avoid the git-URL fetch.
For development (editable install):
```shell
git clone https://github.com/Fzkuji/GUI-Agent-Harness.git
cd GUI-Agent-Harness
pip install -e .
```

GUI Agent Harness needs an LLM to make decisions. Install at least one provider:
Option A: Claude Code CLI (recommended)
```shell
npm install -g @anthropic-ai/claude-code
claude login
```

Uses your Claude subscription — no per-token cost. The agent runs as `claude -p` under the hood.
Option B: Anthropic API
```shell
export ANTHROPIC_API_KEY=sk-ant-...
```

Pay-per-token. Set the key in your shell profile for persistence.
Option C: OpenAI API
```shell
export OPENAI_API_KEY=sk-...
```

The system auto-detects the best available provider. You can also force one with `--provider`.
macOS:
- Grant accessibility permissions: System Settings → Privacy & Security → Accessibility → add your Terminal app
- Apple Vision OCR works automatically (no extra install)
Linux:
- Install EasyOCR for text detection: `pip install easyocr`
- Or install with the OCR extra: `pip install "gui-agent-harness[ocr] @ git+https://github.com/Fzkuji/GUI-Agent-Harness.git"`
```shell
# Local desktop
gui-agent "Open Firefox and go to google.com"

# Remote VM (e.g., OSWorld)
gui-agent --vm http://VM_IP:5000 "Install the Orchis GNOME theme"

# Specify provider and model
gui-agent --provider claude-code --model opus "Send hello in WeChat"
```

GUI Agent Harness is designed to be called by an LLM as a tool. After `pip install`, register the project as a skill so your LLM can discover and use it.
LLM skill systems typically scan a skills directory for subdirectories containing a SKILL.md file. To register GUI Agent Harness, copy or symlink the project into your LLM's skills directory:
```shell
# Example: copy into OpenClaw's skills directory
cp -r GUI-Agent-Harness ~/.openclaw/skills/gui-agent

# Or symlink (recommended — stays in sync with git)
ln -s /path/to/GUI-Agent-Harness ~/.openclaw/skills/gui-agent
```

Claude Code auto-discovers SKILL.md from the current working directory or configured skill paths:
```shell
# Option 1: work from the project directory (auto-discovered)
cd /path/to/GUI-Agent-Harness

# Option 2: add to Claude Code's skill search paths
claude config set skillPaths '["<path-to-GUI-Agent-Harness>"]'
```

Once registered, the LLM reads SKILL.md and knows when and how to call `gui-agent` — no further configuration needed.
```
gui-agent [OPTIONS] TASK

Arguments:
  TASK               Natural language task description

Options:
  --vm URL           Remote VM HTTP API (e.g., http://172.16.82.132:5000)
  --provider NAME    Force LLM provider: claude-code, openclaw, anthropic, openai
  --model NAME       Override model name (e.g., opus, sonnet, gpt-4o)
  --max-steps N      Max actions before stopping (default: 15)
  --app NAME         App name for component memory (default: desktop)
```
```
gui-agent "task description"
    │
    ▼
gui_agent()          ← @agentic_function, drives the loop
    │
    ├── for step in 1..max_steps:
    │     │
    │     ▼
    │   gui_step()   ← @agentic_function, orchestration
    │     │
    │     ├── 1. Observe  (Python) — screenshot + detect + match + state ID
    │     ├── 2. Verify   (LLM)    — check previous action's result
    │     ├── 3. Plan     (LLM)    — decide next action
    │     └── 4. Dispatch (Python) — execute: click/type/scroll/general
    │     │
    │     ▼
    │   build_step_feedback()  ← structured result → next iteration
    │
    └── return result summary
```
Observe — Pure Python. Takes a screenshot, runs GPA-GUI-Detector + OCR, matches against stored component templates, identifies the current UI state.
Verify — LLM call. Examines the screenshot after the previous action. Reports whether the action succeeded. Does not decide task completion.
Plan — LLM call. Sees the screenshot, detected components, verification result, and known state transitions. Chooses one action (click, type, scroll, general, done).
Dispatch — Pure Python. Executes the planned action. For clicks, uses template matching to find precise coordinates. For general, delegates to the LLM with full tool access (Bash, file I/O, etc.).
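The four phases above compose into a simple control loop. Here is a minimal, runnable sketch of that loop with the phases stubbed out — in the real harness, Verify and Plan are LLM calls and Observe/Dispatch touch the screen, so everything below the loop is an illustrative assumption, not the harness's code.

```python
def run_agent(task, max_steps=15):
    """Drive the 4-phase loop: Observe → Verify → Plan → Dispatch."""
    feedback = None
    for step in range(1, max_steps + 1):
        obs = observe()                    # 1. Observe  (Python)
        check = verify(feedback)           # 2. Verify   (LLM in the real harness)
        action = plan(task, obs, check)    # 3. Plan     (LLM in the real harness)
        if action["type"] == "done":
            return {"status": "done", "steps": step}
        feedback = dispatch(action)        # 4. Dispatch (Python), feeds next step
    return {"status": "max_steps_reached", "steps": max_steps}

# --- stub phases so the control flow runs end to end ---
def observe():
    return {"components": ["firefox_icon", "address_bar"]}

def verify(feedback):
    # First step has nothing to verify; afterwards, check the dispatch result.
    return {"previous_ok": feedback is None or feedback["ok"]}

def plan(task, obs, check, _state={"clicked": False}):
    # Mutable-default dict keeps toy state across calls; finish after one click.
    if _state["clicked"] and check["previous_ok"]:
        return {"type": "done"}
    _state["clicked"] = True
    return {"type": "click", "target": "firefox_icon"}

def dispatch(action):
    return {"ok": True, "executed": action["type"]}
```

The key property this sketch preserves is that each step's Dispatch result flows into the next step's Verify, which is what `build_step_feedback()` does in the diagram above.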
When a UI element is first detected, it gets a dual representation: a cropped visual template (for fast matching) and a VLM-assigned label (for reasoning). Stored per-app, reused across all future sessions.
```
memory/
├── linux/                       # Platform-specific memory
│   └── apps/
│       ├── desktop/             # General desktop components
│       ├── chromium/            # Browser UI
│       │   └── sites/           # Per-website memory
│       ├── gimp/
│       └── libreoffice-calc/
│           ├── components.json  # Component registry
│           ├── states.json      # UI states (component sets)
│           ├── transitions.json # State graph edges
│           └── components/      # Template images
```
Activity-based forgetting — Components track consecutive misses. After 15 misses, auto-removed. Keeps memory aligned with the app's current UI.
State matching — States are sets of visible components, matched by Jaccard similarity (>0.7 = same state, >0.85 = auto-merge).
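The Jaccard thresholds above can be sketched directly. This is a pure-Python illustration of the matching rule, not the harness's actual `component_memory.py` implementation:

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B| over sets of visible component names."""
    if not a and not b:
        return 1.0  # two empty states are identical
    return len(a & b) / len(a | b)

def classify_state(current: set, known: set) -> str:
    """Apply the thresholds from the text: >0.85 auto-merge, >0.7 same state."""
    sim = jaccard(current, known)
    if sim > 0.85:
        return "merge"
    if sim > 0.7:
        return "same"
    return "new"

# Example: 7 shared components out of 8 total → 0.875 → auto-merge
known   = {"menu", "toolbar", "canvas", "statusbar", "tab1", "tab2", "btn_ok"}
current = known | {"tooltip"}
```

A set-based identity like this is deliberately tolerant: a transient tooltip or notification changes one element, not the state.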
| Detector | Speed | Finds |
|---|---|---|
| GPA-GUI-Detector | ~0.3s | Icons, buttons, input fields |
| Apple Vision OCR / EasyOCR | ~1.6s | Text elements |
| Template Match | ~0.3s | Known components (after first detection) |
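To make the Template Match row concrete, here is a toy version of matching a stored component crop against a screenshot: slide the template over the image and keep the position with the smallest sum of squared differences. The harness uses OpenCV for this; the pure-Python grids below only illustrate the idea.

```python
def match_template(image, template):
    """Return ((x, y), score) of the best match; lower score = better."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best, best_pos = None, None
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            # Sum of squared differences over the window at (x, y)
            ssd = sum(
                (image[y + j][x + i] - template[j][i]) ** 2
                for j in range(th) for i in range(tw)
            )
            if best is None or ssd < best:
                best, best_pos = ssd, (x, y)
    return best_pos, best

# Tiny grayscale "screenshot" with a 2×2 "icon" at (1, 1)
screen = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
icon = [[9, 9], [9, 9]]
```

Because the template was cropped from a real prior screenshot, an exact or near-exact window usually exists, which is why this is so much cheaper than re-running detection.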
GUI Agent Harness is built on OpenProgram — the reference implementation of the Agentic Programming paradigm, where Python functions with LLM-powered docstrings become autonomous agents. Each function (`verify_step`, `plan_next_action`, `general_action`) is an `@agentic_function` that calls the LLM exactly once and returns structured data.
```python
from openprogram import agentic_function

@agentic_function(summarize={"siblings": -1})
def plan_next_action(task, img_path, ..., runtime=None) -> dict:
    """Decide the next action to take toward completing the task.

    You are a GUI automation agent. Choose one action to execute next.
    ...
    """
    reply = runtime.exec(content=[
        {"type": "text", "text": context},
        {"type": "image", "path": img_path},
    ])
    return parse_json(reply)
```

The docstring IS the prompt. The function signature defines the interface. The framework handles context management, history summarization, and provider abstraction.
Naming: Agentic Programming is the paradigm (the philosophy — decorator + context tree + meta functions). OpenProgram is the product (the Python package that ships the runtime). The `@agentic_function` decorator keeps the paradigm name as a visible badge of lineage.
| Priority | Provider | Cost | Notes |
|---|---|---|---|
| 1 | OpenClaw | Subscription | Auto-detected if openclaw CLI exists |
| 2 | Claude Code CLI | Subscription | Auto-detected if claude CLI exists |
| 3 | Anthropic API | Per-token | Requires ANTHROPIC_API_KEY |
| 4 | OpenAI API | Per-token | Requires OPENAI_API_KEY |
Override with the `--provider` and `--model` flags.
```
GUI-Agent-Harness/
├── gui_harness/
│   ├── main.py                  # CLI entry point + gui_agent loop
│   ├── runtime.py               # LLM provider auto-detection
│   ├── tasks/
│   │   └── execute_task.py      # 4-phase step: observe → verify → plan → dispatch
│   ├── action/
│   │   ├── input.py             # Mouse/keyboard primitives
│   │   └── general_action.py    # Free-form LLM action with tool access
│   ├── perception/
│   │   └── screenshot.py        # Screenshot capture (local + VM)
│   ├── planning/
│   │   ├── component_memory.py  # Template matching + state management
│   │   └── learn.py             # First-time app component learning
│   ├── memory/                  # Memory management utilities
│   └── adapters/
│       └── vm_adapter.py        # Redirect all I/O to remote VM
├── libs/
│   └── agentic-programming/     # OpenProgram runtime, pinned as git submodule
│                                # (legacy name kept for compat during upstream rename)
├── benchmarks/
│   └── osworld/                 # OSWorld benchmark runner + results
├── memory/                      # Visual memory storage (per-platform, per-app)
├── SKILL.md                     # LLM skill definition for gui-agent
└── pyproject.toml
```
- Python 3.12+
- macOS (Apple Silicon recommended for Vision OCR) or Linux
- At least one LLM provider (Claude Code CLI, OpenClaw, or API key)
- For VM automation: OSWorld or compatible HTTP API
MIT — see LICENSE for details.
```bibtex
@misc{fu2026gui-agent-harness,
  author    = {Fu, Zichuan},
  title     = {GUI Agent Harness: Autonomous GUI Automation with Visual Memory},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/Fzkuji/GUI-Agent-Harness},
}
```

Built with OpenProgram — the Agentic Programming paradigm, productized.