Mobile-Agent: The Powerful GUI Agent Family
Updated Apr 14, 2026 - Python
[EMNLP-2024] Build multimodal language agents for fast prototyping and production
AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification.
Developer-first Python framework for AI agents with built-in budget control, context, memory and observability.
🖼️ Workshop: Build a multimodal AI agent with Haystack & GPT-4o — featuring image understanding, document retrieval, conversational memory, and human-in-the-loop safety controls
Claude Code in Docker. Drop-in OpenAI-compatible API, MCP server, Telegram bot, and CLI — five interfaces, one image. Persistent sessions, file ops, always-on skill injection, and a full dev toolchain (Go, Python, Node, K8s, Terraform, databases) or a minimal image with just the basics.
[COLM 2024] ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
A persistent, emotionally reactive 3D Digital Persona powered by Gemini 2.5 Flash Native Audio. Features sub-100ms conversational latency and procedural ARKit emotive realism.
🧮 Multi-agent AI math tutor built with LangGraph — CRAG retrieval, episodic & semantic long-term memory, Tavily MCP web search, Google OAuth, and Neo4j-style memory graph. Powered by LLaMA 3.3 70B on Groq.
A multimodal coding assistant that analyzes images containing code problems and generates solutions in multiple programming languages.
Build an end-to-end system that ingests inventory report PDFs/images, runs OCR to normalize and extract tabular data, stores the cleaned dataset, and exposes a secure, conversational agent that can answer business queries over the data (aggregation, filtering, joins, trends), returning tables, charts, and exportable results.
DoodleSoul is a multimodal AI agent for special education. It brings children's drawings to life via Gemini Live API and weaves real-time voice, Imagen 4 illustrations, and Veo 3 videos into therapeutic Social Stories using interleaved output.