Synthetic grounding data generation pipeline for training GUI agents. Given a text context description (e.g., "MacOS system, using VSCode") or a reference screenshot image, the pipeline uses an LLM to generate self-contained HTML, renders it as a screenshot, automatically extracts bounding boxes for all visible UI elements, and then annotates each element with semantic names and interaction intents.
```bash
conda create --name guisyn python=3.11
conda activate guisyn
pip install -r requirements.txt
playwright install chromium
```

Then export your API keys:

```bash
export OPENAI_API_KEY=your-key       # if using GPT models
export ANTHROPIC_API_KEY=your-key    # if using Claude models
```

The pipeline has two stages:
- **Generation** (`generate.py`): An LLM generates HTML from a context description or reference image -> Playwright renders it to a screenshot -> bounding boxes are extracted for every visible element (a minimal sketch of this step follows the list).
- **Annotation** (`annotate.py`): An LLM annotates each bounding box with a human-readable name and 5 diverse interaction intents (e.g., "Click the Bold button to bold the selected text").
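The render-and-extract part of the Generation stage can be pictured with a short Playwright sketch. This is an illustration, not the repo's `bbox_extractor.py`; the viewport size and visibility filter are assumptions:

```python
# Minimal sketch of rendering generated HTML and collecting element bounding boxes
# with Playwright's sync API. Not the actual bbox_extractor.py implementation.
from pathlib import Path
from playwright.sync_api import sync_playwright

def render_and_extract(html_path: str, png_path: str) -> list[dict]:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})  # assumed size
        page.goto(Path(html_path).resolve().as_uri())
        page.screenshot(path=png_path, full_page=True)
        boxes = []
        for el in page.query_selector_all("*"):
            box = el.bounding_box()  # None for elements that don't render
            if box and box["width"] > 0 and box["height"] > 0:
                boxes.append({"tag": el.evaluate("e => e.tagName.toLowerCase()"), **box})
        browser.close()
    return boxes
```

The real extractor additionally supports filtering parent boxes via `--only_leaf_boxes` (see the argument table below).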
```bash
python generate.py --context "MacOS system, using VSCode with a Python project open"
```

This creates a directory under `./output/data/<example_id>/` containing:

- `<id>.png` — rendered screenshot
- `<id>_bbox.png` — screenshot with bounding box overlays
- `<id>_bboxes.json` — extracted bounding boxes
- `page.html` — source HTML
- `metadata.json` — generation metadata
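For quick sanity checks beyond the built-in `<id>_bbox.png` overlay, the outputs can be re-drawn in a few lines. The list layout and the `x`/`y`/`width`/`height` field names assumed for `<id>_bboxes.json` below are guesses, not a documented schema; check the actual file:

```python
# Re-draw extracted boxes on the rendered screenshot (assumes <id>_bboxes.json is a
# list of dicts with x/y/width/height; verify against a real file before relying on it).
import json
from pathlib import Path
from PIL import Image, ImageDraw

example_dir = Path("./output/data/grounding-xxx")  # hypothetical example ID
example_id = example_dir.name
boxes = json.loads((example_dir / f"{example_id}_bboxes.json").read_text())

img = Image.open(example_dir / f"{example_id}.png").convert("RGB")
draw = ImageDraw.Draw(img)
for b in boxes:
    x, y, w, h = b["x"], b["y"], b["width"], b["height"]
    draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
img.save(example_dir / "check_overlay.png")
```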
```bash
python generate.py --reference_image path/to/screenshot.png
```

The model reproduces the reference image's appearance as HTML and extracts bounding boxes.
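Under the hood the reference screenshot is sent to the LLM as an image input. A minimal sketch of such a call, assuming the Anthropic Messages API and a PNG screenshot (the prompt wording is illustrative, not the pipeline's actual prompt):

```python
# Sketch of the reference-image path: send the screenshot as base64 and ask the
# model to reproduce it as self-contained HTML. Not the repo's actual prompt/code.
import base64
import anthropic

def html_from_screenshot(image_path: str, model: str = "claude-sonnet-4-6") -> str:
    data = base64.standard_b64encode(open(image_path, "rb").read()).decode()
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model=model,
        max_tokens=8192,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": data}},
                {"type": "text",
                 "text": "Reproduce this UI as a single self-contained HTML page."},
            ],
        }],
    )
    return resp.content[0].text
```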
```bash
python annotate.py --mode demo --example_dir ./output/data/grounding-xxx
```

This adds `annotated_bboxes.json` to the example directory with semantic names and interaction intents for each UI element.
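Downstream, each annotated element can be expanded into several grounding samples. The field names below (`intents`, `x`, `y`, `width`, `height`) are assumptions about `annotated_bboxes.json`, used only to illustrate the shape of the data:

```python
# Hypothetical consumer of annotated_bboxes.json: one training sample per
# (element, intent) pair. Field names are assumed, not the pipeline's exact schema.
import json
from pathlib import Path

example_dir = Path("./output/data/grounding-xxx")  # hypothetical example ID
annotated = json.loads((example_dir / "annotated_bboxes.json").read_text())

samples = []
for el in annotated:
    for intent in el.get("intents", []):
        samples.append({
            "image": str(example_dir / f"{example_dir.name}.png"),
            "instruction": intent,  # e.g. "Click the Bold button to bold the selected text"
            "target_bbox": [el["x"], el["y"], el["width"], el["height"]],
        })
print(f"{len(samples)} grounding samples from {len(annotated)} elements")
```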
For generating data at scale, the pipeline supports the OpenAI and Anthropic batch APIs.
Create a text file with one context description per line:

```
MacOS system, using VSCode with a Python project open
Windows 11 desktop, Chrome browser showing Gmail
Ubuntu terminal running htop with high CPU usage
```
Then generate and upload the batch:

```bash
# Text-based batch
python generate.py --mode batch_text \
    --batch_input_file contexts.txt \
    --upload_batch

# Image-based batch (one image path per line)
python generate.py --mode batch_image \
    --batch_input_file images.txt \
    --upload_batch
```

After the batch job completes, download the results JSONL and process it:
```bash
python generate.py --mode batch_process \
    --batch_response_file results.jsonl
```
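`--mode batch_process` parses the downloaded file for you. If you want to peek at it first, each line of an OpenAI Batch API results file pairs a `custom_id` with the full response (Anthropic's batch results use a different per-line layout):

```python
# Inspect an OpenAI-format batch results JSONL before handing it to batch_process.
import json

with open("results.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("error"):
            print(rec["custom_id"], "failed:", rec["error"])
            continue
        body = rec["response"]["body"]
        text = body["choices"][0]["message"]["content"]
        print(rec["custom_id"], f"{len(text)} characters of model output")
```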
Annotation follows the same batch workflow:

```bash
# Auto-discover all unannotated examples and submit
python annotate.py --mode batch --discover --upload_batch

# Process annotation results
python annotate.py --mode batch_process \
    --batch_response_file anno_results.jsonl
```

A browser-based visualizer (`visualize.html`) is included for inspecting generated data. It overlays bounding boxes on screenshots and shows element names, intents, and HTML snippets on hover.
- **Option 1: Open a folder directly (no server needed).** Click "Open Folder" and select an example directory.
- **Option 2: Serve via HTTP.** Run from the repo root:

```bash
python -m http.server 8080
```

Then open `http://localhost:8080/visualize.html` and enter an example ID to load it.
| Argument | Default | Description |
|---|---|---|
| `--model_name` | `claude-sonnet-4-6` | LLM to use (must contain `gpt` or `claude`) |
| `--output_dir` | `./output` | Root output directory |
| `--context` | — | Text context for single generation |
| `--reference_image` | — | Path to reference screenshot |
| `--mode` | `demo` | `demo`, `batch_text`, `batch_image`, or `batch_process` |
| `--only_leaf_boxes` | `False` | Only keep leaf bounding boxes (filter parents) |
| `--seed` | `42` | Random seed |
| `--persona_cache` | — | JSON file to track used personas (prevents repeats) |
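The same flags can also be driven from a script, for example when sweeping over many contexts; this is just a thin `subprocess` wrapper around the CLI, and the context string and seed are arbitrary examples:

```python
# Programmatic invocation of generate.py using flags from the table above;
# equivalent to running the same command in a shell.
import subprocess

subprocess.run(
    [
        "python", "generate.py",
        "--context", "Ubuntu terminal running htop with high CPU usage",
        "--model_name", "claude-sonnet-4-6",
        "--output_dir", "./output",
        "--seed", "7",
    ],
    check=True,
)
```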
| Argument | Default | Description |
|---|---|---|
| `--model_name` | `claude-sonnet-4-6` | LLM to use |
| `--output_dir` | `./output` | Root output directory |
| `--mode` | `demo` | `demo`, `batch`, or `batch_process` |
| `--example_dir` | — | Full path to example directory (demo mode) |
| `--discover` | `False` | Auto-discover unannotated examples (batch mode) |
```
MolmoPoint-GUISyn/
├── generate.py # CLI: screenshot generation
├── annotate.py # CLI: bbox annotation
├── visualize.html # Browser-based data visualizer
├── pipeline/
│ ├── __init__.py
│ ├── grounding_generator.py # GroundingGenerator class
│ ├── grounding_annotator.py # GroundingAnnotator class
│ ├── bbox_extractor.py # HTML -> bounding box extraction (Playwright)
│ └── llm_utils.py # LLM API helpers (OpenAI & Anthropic)
├── data/
│ └── persona.jsonl # Persona descriptions for diverse generation
├── requirements.txt
└── README.md
```
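`pipeline/llm_utils.py` wraps both providers. A rough sketch of such a dual-provider helper, keyed off the rule that `--model_name` must contain `gpt` or `claude` (this is an assumption about the design, not the actual implementation):

```python
# Sketch of a dual-provider text-completion helper; not the repo's llm_utils.py.
import os

def call_llm(model_name: str, prompt: str, max_tokens: int = 4096) -> str:
    if "gpt" in model_name:
        from openai import OpenAI
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        resp = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return resp.choices[0].message.content
    elif "claude" in model_name:
        import anthropic
        client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
        resp = client.messages.create(
            model=model_name,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    raise ValueError("model_name must contain 'gpt' or 'claude'")
```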
If you use this codebase or our datasets in your work, please cite:
```bibtex
@article{clark2026molmopoint,
  title={MolmoPoint: Better Pointing for VLMs with Grounding Tokens},
  author={Clark, Christopher and Yang, Yue and Park, Jae Sung and Ma, Zixian and Zhang, Jieyu and Tripathi, Rohun and Salehi, Mohammadreza and Lee, Sangho and Anderson, Taira and Han, Winson and others},
  journal={arXiv preprint arXiv:2603.28069},
  year={2026}
}

@article{yang2025scaling,
  title={Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation},
  author={Yang, Yue and Patel, Ajay and Deitke, Matt and Gupta, Tanmay and Weihs, Luca and Head, Andrew and Yatskar, Mark and Callison-Burch, Chris and Krishna, Ranjay and Kembhavi, Aniruddha and others},
  journal={arXiv preprint arXiv:2502.14846},
  year={2025}
}
```