🤖 Desktop Automation for Apps Without APIs
Give Claude hands and eyes – automate any desktop app, even if it has no API
*(demo video: automate.mp4)*
autoMate is an MCP server that gives AI assistants (Claude, GPT, etc.) the ability to control any desktop application – even apps with no API, no plugin system, and no automation support.
What makes it different from filesystem / browser / Windows MCP:
| MCP Server | What it automates |
|---|---|
| filesystem MCP | Files and folders |
| browser MCP | Web pages |
| Windows MCP | OS settings and system calls |
| autoMate | Any desktop GUI app with no API – 剪映 (JianYing), Photoshop, AutoCAD, WeChat, SAP, internal tools… |
Two modes:
| Mode | How it works | Requires |
|---|---|---|
| Basic | Claude sees the screen, autoMate clicks/types | Nothing – zero config |
| Cloud Vision | autoMate parses UI itself + reasons via cloud VLM | HuggingFace token + endpoints |
- 🖥️ Automates apps with no API – if it has a GUI, autoMate can drive it
- 📜 Reusable script library – save workflows once, run forever; install community scripts in one command
- ☁️ Cloud Vision – screen parsing via OmniParser + action reasoning via UI-TARS, all in the cloud, zero local GPU
- 🧠 Claude knows when to use it – a clear identity prevents autoMate from being bypassed by other MCPs
- 🤖 Zero config for basic use – no API keys, no env vars needed to get started
- 🌍 Cross-platform – Windows, macOS, Linux
Prerequisite:

```shell
pip install uv
```
Open Settings → Developer → Edit Config, then add:
```json
{
  "mcpServers": {
    "automate": {
      "command": "uvx",
      "args": ["automate-mcp@latest"]
    }
  }
}
```

Restart Claude Desktop and you're done. `@latest` keeps autoMate up to date automatically.
Edit `~/.openclaw/openclaw.json`:
```json
{
  "mcpServers": {
    "automate": {
      "command": "uvx",
      "args": ["automate-mcp@latest"]
    }
  }
}
```

Then restart the gateway:

```shell
openclaw gateway restart
```

Settings → MCP Servers → Add:
```json
{
  "automate": {
    "command": "uvx",
    "args": ["automate-mcp@latest"]
  }
}
```

Cloud Vision adds autonomous screen parsing and action reasoning to autoMate – no local GPU required.
It uses two HuggingFace Inference Endpoints:
- OmniParser V2 – detects all UI elements (icons, buttons, text) in a screenshot
- UI-TARS / Qwen-VL – a vision-language model that decides what action to take next
Add these env vars to your MCP config:
```json
{
  "mcpServers": {
    "automate": {
      "command": "uvx",
      "args": ["automate-mcp@latest"],
      "env": {
        "AUTOMATE_HF_TOKEN": "hf_...",
        "AUTOMATE_SCREEN_PARSER_URL": "https://your-omniparser-endpoint.aws.endpoints.huggingface.cloud",
        "AUTOMATE_ACTION_MODEL_URL": "https://your-uitars-endpoint.aws.endpoints.huggingface.cloud",
        "AUTOMATE_ACTION_MODEL_NAME": "ByteDance-Seed/UI-TARS-1.5-7B",
        "AUTOMATE_HF_NAMESPACE": "your-hf-username",
        "AUTOMATE_SCREEN_PARSER_ENDPOINT": "omniparser-v2",
        "AUTOMATE_ACTION_MODEL_ENDPOINT": "ui-tars-1-5-7b"
      }
    }
  }
}
```

See `.env.example` in the repo for the full reference.
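Before relying on Cloud Vision it can help to sanity-check that all seven variables are set. A minimal sketch (the variable names match the config above; the helper function itself is hypothetical, not part of autoMate):

```python
import os

# The seven env vars Cloud Vision expects (names taken from the config above).
CLOUD_VISION_VARS = [
    "AUTOMATE_HF_TOKEN",
    "AUTOMATE_SCREEN_PARSER_URL",
    "AUTOMATE_ACTION_MODEL_URL",
    "AUTOMATE_ACTION_MODEL_NAME",
    "AUTOMATE_HF_NAMESPACE",
    "AUTOMATE_SCREEN_PARSER_ENDPOINT",
    "AUTOMATE_ACTION_MODEL_ENDPOINT",
]

def missing_cloud_vision_vars(env=os.environ):
    """Return the names of any Cloud Vision variables that are unset or empty."""
    return [name for name in CLOUD_VISION_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_cloud_vision_vars()
    if missing:
        print("Cloud Vision incomplete; missing:", ", ".join(missing))
    else:
        print("Cloud Vision fully configured")
```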
1. `warm_endpoints` – wake up scaled-to-zero endpoints (1–5 min)
2. `parse_screen` – detect all UI elements via cloud OmniParser
3. `reason_action` – ask the VLM what to click/type next

or, in a single call:

`smart_act` – full autonomous loop: parse → reason → execute → repeat
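The parse → reason → execute loop can be pictured as a small control loop. This is an illustrative Python sketch, not autoMate's actual implementation: `parse_screen`, `reason_action`, and `execute` are stand-ins for the real tools, passed in as callables.

```python
# Illustrative sketch of the smart_act loop: parse → reason → execute → repeat.
# The three callables are hypothetical stand-ins for the real cloud tools.

def smart_act(goal, parse_screen, reason_action, execute, max_steps=10):
    """Run the autonomous loop until the model reports the goal is done."""
    history = []
    for _ in range(max_steps):
        elements = parse_screen()                        # e.g. cloud OmniParser
        action = reason_action(goal, elements, history)  # e.g. UI-TARS
        if action["type"] == "done":
            break
        execute(action)                                  # click / type / scroll / ...
        history.append(action)
    return history
```

The `max_steps` cap is a safety net so a confused model cannot loop forever on a screen it misreads.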
Script library – save once, run forever:
| Tool | Description |
|---|---|
| `list_scripts` | Show all saved automation scripts |
| `run_script` | Run a saved script by name |
| `save_script` | Save the current workflow as a reusable script |
| `show_script` | View a script's contents |
| `delete_script` | Delete a script |
| `install_script` | Install a script from a URL or the community library |
Cloud Vision – autonomous UI understanding (requires HF config):
| Tool | Description |
|---|---|
| `cloud_vision_config` | Show current cloud vision configuration status |
| `warm_endpoints` | Wake up scaled-to-zero HF endpoints before use |
| `parse_screen` | Detect all UI elements via cloud OmniParser |
| `reason_action` | Ask a VLM what GUI action to take next |
| `smart_act` | Full autonomous loop: parse → reason → execute → repeat |
Low-level desktop control – used when building or executing scripts:
| Tool | Description |
|---|---|
| `screenshot` | Capture the screen and return it as a base64 PNG |
| `click` | Click at screen coordinates |
| `double_click` | Double-click at screen coordinates |
| `type_text` | Type text (full Unicode / CJK support) |
| `press_key` | Press a key or combo (e.g. `ctrl+c`, `win`) |
| `scroll` | Scroll up or down |
| `mouse_move` | Move the cursor without clicking |
| `drag` | Drag from one position to another |
Scripts are saved as `.md` files in `~/.automate/scripts/` – human-readable, git-friendly, shareable.
```markdown
---
name: jianying_export_douyin
description: Export the current 剪映 project as a 9:16 Douyin video
created: 2025-01-01
---

## Steps

1. Open export dialog [key:ctrl+e]
2. Select resolution 1080×1920 [click:coord=320,480]
3. Set format to MP4 [click:coord=320,560]
4. Click export [click:coord=800,650]
5. Wait for export to finish [wait:5]
```

Inline hint syntax:
| Hint | Action |
|---|---|
| `[click:coord=320,240]` | Click at absolute screen coordinates |
| `[type:hello]` | Type text |
| `[key:ctrl+s]` | Press a keyboard shortcut |
| `[wait:2]` | Wait 2 seconds |
| `[scroll_up]` / `[scroll_down]` | Scroll the page |
Steps without hints are interpreted by the AI vision model at runtime.
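To make the syntax concrete, here is one way the inline hints could be pulled out of a step line with a regular expression. This is a hypothetical sketch, not autoMate's actual parser:

```python
import re

# Matches hints like [click:coord=320,240], [key:ctrl+s], [wait:2], [scroll_up].
# Group 1 is the action name; group 2 is the optional argument after the colon.
HINT_RE = re.compile(r"\[([a-z_]+)(?::([^\]]+))?\]")

def parse_hints(step):
    """Return (action, argument) pairs for every hint in a script step line."""
    return [(m.group(1), m.group(2)) for m in HINT_RE.finditer(step)]
```

A step with no hints yields an empty list, which is exactly the case handed off to the vision model at runtime.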
Q: How is this different from just using Claude's computer-use capability?
autoMate provides persistent, reusable scripts. Once you automate a task, it's saved and runs instantly next time. Cloud Vision mode also lets autoMate do its own screen parsing without relying on Claude's vision.
Q: Why does Claude sometimes use Windows MCP / filesystem MCP instead of autoMate?
Update to v0.4.0+ – the server description now explicitly tells Claude when to use autoMate vs other MCPs.
Q: Do I need a GPU for Cloud Vision?
No – everything runs on HuggingFace Inference Endpoints in the cloud. You only need an HF token and deployed endpoints.
Q: Does it work on macOS / Linux?
Yes – all three platforms. This is the main advantage over Quicker (Windows-only).