GitHub - Fzkuji/GUI-Agent-Harness: Autonomous GUI agent. Give it a task, it operates the desktop. Visual memory, one-shot UI learning.

Autonomous GUI agent — give it a task, it operates the desktop.
Visual memory • One-shot UI learning • Any LLM provider • Local or VM


News

[2026-04-14] 🏆 OSWorld Multi-Apps 79.8% — 72.6/91 evaluated tasks. 4-phase step loop + CLI session persistence + PRESERVE FORMAT work habit. Results →
[2026-04-18] 📦 OpenProgram: Agentic Programming graduated from concept to product, and the repo, package, and CLI were renamed to OpenProgram. Agentic Programming remains the paradigm name; OpenProgram is the shippable framework. Harness imports now use `from openprogram import ...`.
[2026-04-07] 🤖 Agent-native architecture — Rebuilt execution core on the Agentic Programming paradigm, unifying GUI perception and free-form agent actions under a single decision loop. Eliminates task-specific scripting.
[2026-03-30] 📐 ImageContext coordinate system — Replaced dual-space model with ImageContext class; scale-independent cropping, fixes crop bugs on non-fullscreen images.
[2026-03-29] 🎬 v0.3 — Unified Actions & Cross-Platform GUI — gui_action.py as single entry point. Platform backends auto-selected via --remote.
[2026-03-23] 🏆 OSWorld Chrome 93.5% — One attempt (43/46), 97.8% two attempts (45/46). Results →
[2026-03-23] 🔄 Memory overhaul — Split storage, automatic component forgetting, state merging by Jaccard similarity.
[2026-03-10] 🚀 Initial release — GPA-GUI-Detector + Apple Vision OCR + template matching + per-app visual memory.

What is GUI Agent Harness?

A CLI tool that turns any LLM into a GUI automation agent. You give it a natural-language task and it operates the desktop autonomously: it takes screenshots, clicks, types, verifies the result, and repeats until the task is done.

gui-agent "Install the Orchis GNOME theme"
gui-agent --vm http://172.16.82.132:5000 "Open GitHub in Chrome and Python docs"

Designed as an LLM tool. The intended workflow is:

1. An LLM (Claude Code, OpenClaw, etc.) receives a GUI task from the user
2. The LLM's skill/prompt tells it to call gui-agent as a CLI tool
3. gui-agent handles all GUI perception and interaction internally
4. The LLM gets back a result summary

The LLM doesn't need to know how GUI automation works — it just calls the tool.
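As a rough illustration of the tool-call workflow above, a host program could shell out to the CLI as sketched below. The flags come from the CLI Options section of this README. The helper names, the timeout, and the assumption that the result summary arrives on stdout are hypothetical and not confirmed by the project.

```python
import subprocess


def build_gui_agent_command(task, vm=None, provider=None, model=None, max_steps=None):
    """Build the argv list for gui-agent using only flags documented in this README."""
    cmd = ["gui-agent"]
    if vm:
        cmd += ["--vm", vm]
    if provider:
        cmd += ["--provider", provider]
    if model:
        cmd += ["--model", model]
    if max_steps is not None:
        cmd += ["--max-steps", str(max_steps)]
    cmd.append(task)  # TASK is the single positional argument
    return cmd


def run_gui_task(task, timeout_s=1800, **options):
    """Run gui-agent and return (exit_code, stdout).

    Assumption: the result summary is printed to stdout. The timeout value
    is an arbitrary placeholder, not a project default.
    """
    proc = subprocess.run(
        build_gui_agent_command(task, **options),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode, proc.stdout.strip()
```

Passing the command as a list avoids shell quoting problems when the task text contains spaces or quotes.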

Key Ideas

Visual memory — UI components are detected once, labeled by a VLM, and stored as templates. On subsequent encounters, template matching replaces expensive re-detection (~5x faster, ~60x fewer tokens).
State transitions — The UI is modeled as a graph of states (sets of visible components). Successful action sequences are recorded as transitions for future replay.
4-phase step loop — Each step follows: Observe (screenshot + detect) → Verify (check previous action) → Plan (LLM decides next action) → Dispatch (execute). All phases are @agentic_function calls with structured feedback between steps.
Provider-agnostic — Works with Claude Code CLI, OpenClaw, Anthropic API, or OpenAI API. Auto-detects the best available provider.
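To make the state-transition idea concrete, here is a minimal sketch of how recorded transitions could be searched for a replayable action sequence. The README does not describe the actual replay mechanism, so the data layout, the state names, and the action strings below are all hypothetical.

```python
from collections import deque


def find_replay_path(transitions, start, goal):
    """Breadth-first search over recorded transitions.

    transitions: dict mapping a state name to a list of (action, next_state)
    pairs (a hypothetical layout, not the project's transitions.json schema).
    Returns the shortest list of actions from start to goal, or None.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action, next_state in transitions.get(state, []):
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, actions + [action]))
    return None


# Illustrative graph with made-up states and actions.
recorded = {
    "desktop": [("click:Files", "file_manager"), ("click:Firefox", "browser")],
    "browser": [("click:address_bar", "url_focused")],
}
path = find_replay_path(recorded, "desktop", "url_focused")
# path == ["click:Firefox", "click:address_bar"]
```

Breadth-first search returns the shortest recorded route, which keeps replay cheap when several routes lead to the same state.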

OSWorld Results

Multi-Apps domain: 79.8% (72.6/91 evaluated tasks)

Metric	Value
Total tasks	101
Evaluated	91
Blocked (no credentials)	10
Passed (score = 1.0)	63
Partial (0 < score < 1.0)	11

Full results: benchmarks/osworld/multi_apps.md
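The headline figure can be reproduced from the table with a few lines of arithmetic. This sketch assumes the 72.6 is the sum of per-task scores over the evaluated tasks; the failed-task count and the partial-credit total are derived here and are not stated in the README.

```python
total_tasks = 101
blocked = 10
evaluated = total_tasks - blocked  # 91, matches the "Evaluated" row

passed = 63       # tasks with score = 1.0
partial = 11      # tasks with 0 < score < 1.0
score_sum = 72.6  # numerator of the headline figure

success_rate = round(100 * score_sum / evaluated, 1)  # 79.8
partial_credit = round(score_sum - passed, 1)         # 9.6 points spread over the 11 partial tasks
failed = evaluated - passed - partial                 # 17 tasks, inferred to have scored 0
```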

Quick Start

Step 1: Install GUI Agent Harness

pip install git+https://github.com/Fzkuji/GUI-Agent-Harness.git

All dependencies are installed automatically, including OpenProgram (the Agentic Programming runtime), ultralytics (GPA-GUI-Detector), OpenCV, Pillow, etc.

Local development: the upstream PyPI package name is `openprogram`. If you're building against an unreleased branch, install OpenProgram first (`pip install -e /path/to/OpenProgram`), then run `pip install -e . --no-deps` inside this repo to avoid the git-URL fetch.

For development (editable install):

git clone https://github.com/Fzkuji/GUI-Agent-Harness.git
cd GUI-Agent-Harness
pip install -e .

Step 2: Set up an LLM provider

GUI Agent Harness needs an LLM to make decisions. Install at least one provider:

Option A: Claude Code CLI (recommended)

npm install -g @anthropic-ai/claude-code
claude login

Uses your Claude subscription — no per-token cost. The agent runs as claude -p under the hood.

Option B: Anthropic API

export ANTHROPIC_API_KEY=sk-ant-...

Pay-per-token. Set the key in your shell profile for persistence.

Option C: OpenAI API

export OPENAI_API_KEY=sk-...

The system auto-detects the best available provider. You can also force one with --provider.

Step 3: Platform setup

macOS:

Grant accessibility permissions: System Settings → Privacy & Security → Accessibility → add your Terminal app
Apple Vision OCR works automatically (no extra install)

Linux:

Install EasyOCR for text detection: pip install easyocr
Or install with: pip install "gui-agent-harness[ocr] @ git+https://github.com/Fzkuji/GUI-Agent-Harness.git"

Step 4: Run

# Local desktop
gui-agent "Open Firefox and go to google.com"

# Remote VM (e.g., OSWorld)
gui-agent --vm http://VM_IP:5000 "Install the Orchis GNOME theme"

# Specify provider and model
gui-agent --provider claude-code --model opus "Send hello in WeChat"

Use as LLM skill

GUI Agent Harness is designed to be called by an LLM as a tool. After pip install, register the project as a skill so your LLM can discover and use it.

LLM skill systems typically scan a skills directory for subdirectories containing a SKILL.md file. To register GUI Agent Harness, copy or symlink the project into your LLM's skills directory:

# Example: copy into OpenClaw's skills directory
cp -r GUI-Agent-Harness ~/.openclaw/skills/gui-agent

# Or symlink (recommended — stays in sync with git)
ln -s /path/to/GUI-Agent-Harness ~/.openclaw/skills/gui-agent

Claude Code auto-discovers SKILL.md from the current working directory or configured skill paths:

# Option 1: work from the project directory (auto-discovered)
cd /path/to/GUI-Agent-Harness

# Option 2: add to Claude Code's skill search paths
claude config set skillPaths '["<path-to-GUI-Agent-Harness>"]'

Once registered, the LLM reads SKILL.md and knows when and how to call gui-agent — no further configuration needed.

CLI Options

gui-agent [OPTIONS] TASK

Arguments:
  TASK                  Natural language task description

Options:
  --vm URL              Remote VM HTTP API (e.g., http://172.16.82.132:5000)
  --provider NAME       Force LLM provider: claude-code, openclaw, anthropic, openai
  --model NAME          Override model name (e.g., opus, sonnet, gpt-4o)
  --max-steps N         Max actions before stopping (default: 15)
  --app NAME            App name for component memory (default: desktop)

Architecture

gui-agent "task description"
    │
    ▼
gui_agent()                    ← @agentic_function, drives the loop
    │
    ├── for step in 1..max_steps:
    │       │
    │       ▼
    │   gui_step()             ← @agentic_function, orchestration
    │       │
    │       ├── 1. Observe     (Python) — screenshot + detect + match + state ID
    │       ├── 2. Verify      (LLM)   — check previous action's result
    │       ├── 3. Plan        (LLM)   — decide next action
    │       └── 4. Dispatch    (Python) — execute: click/type/scroll/general
    │       │
    │       ▼
    │   build_step_feedback()  ← structured result → next iteration
    │
    └── return result summary

Observe — Pure Python. Takes a screenshot, runs GPA-GUI-Detector + OCR, matches against stored component templates, identifies the current UI state.

Verify — LLM call. Examines the screenshot after the previous action. Reports whether the action succeeded. Does not decide task completion.

Plan — LLM call. Sees the screenshot, detected components, verification result, and known state transitions. Chooses one action (click, type, scroll, general, done).

Dispatch — Pure Python. Executes the planned action. For clicks, uses template matching to find precise coordinates. For general, delegates to the LLM with full tool access (Bash, file I/O, etc.).
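The control flow of the four phases can be sketched with plain callables. This is not the project's `gui_agent` or `gui_step` code: the function name, the shape of the feedback dictionary, and the return values are assumptions chosen only to show the ordering of the phases and the `max_steps` cutoff.

```python
def run_task(task, observe, verify, plan, dispatch, max_steps=15):
    """Drive Observe -> Verify -> Plan -> Dispatch until the planner returns 'done'.

    The four callables are stand-ins for the real phases; their signatures
    are hypothetical.
    """
    feedback = None  # structured result of the previous step, None on the first step
    for step in range(1, max_steps + 1):
        observation = observe()                                             # 1. Observe
        verification = verify(observation, feedback) if feedback else None  # 2. Verify
        action = plan(task, observation, verification)                      # 3. Plan
        if action["type"] == "done":
            return {"status": "done", "steps": step}
        result = dispatch(action)                                           # 4. Dispatch
        feedback = {"step": step, "action": action, "result": result}
    return {"status": "max_steps_reached", "steps": max_steps}
```

Verify is skipped on the first step because there is no previous action to check, and only the planner can end the task, which mirrors the note above that Verify does not decide task completion.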

Visual Memory

When a UI element is first detected, it gets a dual representation: a cropped visual template (for fast matching) and a VLM-assigned label (for reasoning). Stored per-app, reused across all future sessions.

memory/
├── linux/                     # Platform-specific memory
│   └── apps/
│       ├── desktop/           # General desktop components
│       ├── chromium/          # Browser UI
│       │   └── sites/         # Per-website memory
│       ├── gimp/
│       └── libreoffice-calc/
│           ├── components.json    # Component registry
│           ├── states.json        # UI states (component sets)
│           ├── transitions.json   # State graph edges
│           └── components/        # Template images
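A small helper can resolve the files for one app from the layout shown above. The function name is hypothetical, and it assumes every app directory carries the same four entries that the tree lists under libreoffice-calc.

```python
from pathlib import Path


def app_memory_paths(root, platform, app):
    """Return the expected memory locations for one app.

    Mirrors the tree above: <root>/<platform>/apps/<app>/...
    Assumption: all app directories share the libreoffice-calc layout.
    """
    base = Path(root) / platform / "apps" / app
    return {
        "components": base / "components.json",    # component registry
        "states": base / "states.json",            # UI states (component sets)
        "transitions": base / "transitions.json",  # state graph edges
        "templates": base / "components",          # template images
    }


paths = app_memory_paths("memory", "linux", "gimp")
```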

Activity-based forgetting: each component tracks its consecutive misses. A component that is missed 15 times in a row is removed automatically, which keeps memory aligned with the app's current UI.
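The forgetting rule can be sketched as a counter update. The function name and the dictionary layout are hypothetical, and the README does not define exactly what counts as a miss; here a miss means a tracked component was not seen in the current observation.

```python
MISS_LIMIT = 15  # from the README: removed after 15 consecutive misses


def update_misses(miss_counts, seen_now, limit=MISS_LIMIT):
    """Return updated consecutive-miss counts, dropping forgotten components.

    miss_counts: dict of component name -> consecutive misses so far.
    seen_now: set of component names matched in the current observation.
    """
    updated = {}
    for name, misses in miss_counts.items():
        misses = 0 if name in seen_now else misses + 1  # a hit resets the streak
        if misses < limit:
            updated[name] = misses  # keep; otherwise the component is forgotten
    return updated


counts = update_misses({"save_btn": 14, "menu": 3, "old_icon": 14}, seen_now={"save_btn"})
# counts == {"save_btn": 0, "menu": 4}; "old_icon" reached 15 misses and was dropped
```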

State matching — States are sets of visible components, matched by Jaccard similarity (>0.7 = same state, >0.85 = auto-merge).
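The state-matching rule can be written down directly from the two thresholds given above. Jaccard similarity is the size of the intersection divided by the size of the union. The function names and the returned labels are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two component sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty states are identical
    return len(a & b) / len(a | b)


def classify_state(current, known, same=0.7, merge=0.85):
    """Apply the README thresholds: >0.85 auto-merge, >0.7 same state."""
    score = jaccard(current, known)
    if score > merge:
        return "merge"
    if score > same:
        return "same"
    return "different"


# 9 shared components out of 10 distinct ones gives 0.9, above the merge threshold.
label = classify_state(set(range(10)), set(range(9)))
```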

Detection Stack

Detector	Speed	Finds
GPA-GUI-Detector	~0.3s	Icons, buttons, input fields
Apple Vision OCR / EasyOCR	~1.6s	Text elements
Template Match	~0.3s	Known components (after first detection)

Built on OpenProgram

GUI Agent Harness is built on OpenProgram — the reference implementation of the Agentic Programming paradigm, where Python functions with LLM-powered docstrings become autonomous agents. Each function (verify_step, plan_next_action, general_action) is an @agentic_function that calls the LLM exactly once and returns structured data.

from openprogram import agentic_function

@agentic_function(summarize={"siblings": -1})
def plan_next_action(task, img_path, ..., runtime=None) -> dict:
    """Decide the next action to take toward completing the task.

    You are a GUI automation agent. Choose one action to execute next.
    ...
    """
    reply = runtime.exec(content=[
        {"type": "text", "text": context},
        {"type": "image", "path": img_path},
    ])
    return parse_json(reply)

The docstring IS the prompt. The function signature defines the interface. The framework handles context management, history summarization, and provider abstraction.
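The "docstring is the prompt" idea can be demonstrated with a toy decorator. This is explicitly not OpenProgram's `@agentic_function`: the decorator name, the `llm` keyword argument, and the JSON reply convention are hypothetical, and the real framework also handles context management, history summarization, and provider abstraction.

```python
import functools
import json


def docstring_prompt(fn):
    """Toy stand-in: send fn's docstring plus its arguments to an LLM callable."""
    @functools.wraps(fn)
    def wrapper(*args, llm, **kwargs):
        prompt = (fn.__doc__ or "").strip()  # the docstring is used as the prompt
        payload = json.dumps({"args": args, "kwargs": kwargs}, default=str)
        reply = llm(f"{prompt}\n\nInputs: {payload}")
        return json.loads(reply)  # assumes the model replies with JSON
    return wrapper


@docstring_prompt
def choose_action(task, components):
    """Choose one action for the task. Reply with JSON: {"type": ..., "target": ...}."""


def fake_llm(prompt):
    # Hypothetical stub that returns a fixed reply instead of calling a model.
    return '{"type": "click", "target": "OK"}'


action = choose_action("Close the dialog", ["OK", "Cancel"], llm=fake_llm)
```

Swapping `fake_llm` for a real client call is the only change needed to try the pattern against a model.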

Naming: Agentic Programming is the paradigm (the philosophy — decorator + context tree + meta functions). OpenProgram is the product (the Python package that ships the runtime). The @agentic_function decorator keeps the paradigm name as a visible badge of lineage.

LLM Provider Priority

Priority	Provider	Cost	Notes
1	OpenClaw	Subscription	Auto-detected if `openclaw` CLI exists
2	Claude Code CLI	Subscription	Auto-detected if `claude` CLI exists
3	Anthropic API	Per-token	Requires `ANTHROPIC_API_KEY`
4	OpenAI API	Per-token	Requires `OPENAI_API_KEY`

Override with --provider and --model flags.
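The priority table can be read as a simple first-match check. The sketch below is an assumption about how such detection might look, not the contents of `runtime.py`; the function name and the injectable `env` and `which` parameters are hypothetical and exist only to make the logic easy to test.

```python
import os
import shutil


def detect_provider(env=None, which=shutil.which):
    """Return the first available provider in the README's priority order, or None."""
    env = os.environ if env is None else env
    if which("openclaw"):             # priority 1: OpenClaw CLI on PATH
        return "openclaw"
    if which("claude"):               # priority 2: Claude Code CLI on PATH
        return "claude-code"
    if env.get("ANTHROPIC_API_KEY"):  # priority 3: Anthropic API key set
        return "anthropic"
    if env.get("OPENAI_API_KEY"):     # priority 4: OpenAI API key set
        return "openai"
    return None
```

The returned names match the values accepted by `--provider` in the CLI Options section.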

Project Structure

GUI-Agent-Harness/
├── gui_harness/
│   ├── main.py                # CLI entry point + gui_agent loop
│   ├── runtime.py             # LLM provider auto-detection
│   ├── tasks/
│   │   └── execute_task.py    # 4-phase step: observe → verify → plan → dispatch
│   ├── action/
│   │   ├── input.py           # Mouse/keyboard primitives
│   │   └── general_action.py  # Free-form LLM action with tool access
│   ├── perception/
│   │   └── screenshot.py      # Screenshot capture (local + VM)
│   ├── planning/
│   │   ├── component_memory.py  # Template matching + state management
│   │   └── learn.py           # First-time app component learning
│   ├── memory/                # Memory management utilities
│   └── adapters/
│       └── vm_adapter.py      # Redirect all I/O to remote VM
├── libs/
│   └── agentic-programming/   # OpenProgram runtime, pinned as git submodule
│                              # (legacy name kept for compat during upstream rename)
├── benchmarks/
│   └── osworld/               # OSWorld benchmark runner + results
├── memory/                    # Visual memory storage (per-platform, per-app)
├── SKILL.md                   # LLM skill definition for gui-agent
└── pyproject.toml

Requirements

Python 3.12+
macOS (Apple Silicon recommended for Vision OCR) or Linux
At least one LLM provider (Claude Code CLI, OpenClaw, or API key)
For VM automation: OSWorld or compatible HTTP API

License

MIT — see LICENSE for details.

Citation

@misc{fu2026gui-agent-harness,
  author       = {Fu, Zichuan},
  title        = {GUI Agent Harness: Autonomous GUI Automation with Visual Memory},
  year         = {2026},
  publisher    = {GitHub},
  url          = {https://github.com/Fzkuji/GUI-Agent-Harness},
}

Built with OpenProgram: the Agentic Programming paradigm, productized
