X-PLUG / MobileAgent

Mobile-Agent: The Powerful GUI Agent Family

android agent app gui automation mobile copilot multimodal mobile-agents mllm multimodal-large-language-models multimodal-agent

Updated Apr 14, 2026
Python

om-ai-lab / OmAgent

[EMNLP-2024] Build multimodal language agents for fast prototyping and production

Updated Mar 19, 2025
Python

bz-lab / AUITestAgent

AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification.

testing agent gui automation mobile-app multi-agent multimodal llm mllm multimodal-agent gpt-4o

Updated Jul 19, 2024

syrin-labs / syrin-python

Developer-first Python framework for AI agents with built-in budget control, context, memory and observability.

ai budgeting multi-agent memory-management observability rag ai-agent multimodal-agent agentic-rag context-engineering model-routing harness-engineering

Updated Apr 11, 2026
Python

bilgeyucel / multimodal-agent-workshop

🖼️ Workshop: Build a multimodal AI agent with Haystack & GPT-4o — featuring image understanding, document retrieval, conversational memory, and human-in-the-loop safety controls

memory haystack human-in-the-loop multimodal ai-agent short-term-memory haystack-ai multimodal-agent

Updated Jan 30, 2026
Jupyter Notebook

psyb0t / docker-claudebox

Claude Code in Docker. Drop-in OpenAI-compatible API, MCP server, Telegram bot, and CLI — five interfaces, one image. Persistent sessions, file ops, always-on skill injection, and a full dev toolchain (Go, Python, Node, K8s, Terraform, databases) or a minimal image with just the basics.

api docker wrapper ai telegram docker-image container web-api programmatic claude ai-agent multimodal-agent claude-code code-agent development-agent

Updated Apr 16, 2026
Shell

bigai-nlco / ExoViP

[COLM 2024] ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning

agent verification multimodal llm multimodal-agent compositional-reasoning

Updated Oct 1, 2024
Python

AbhyudayPatel / DuoTalk

Multi-Agent voice conversation platform

multimodal-agent

Updated Oct 3, 2025
Python

Kshitijm7 / digital-persona

A persistent, emotionally reactive 3D Digital Persona powered by Gemini 2.5 Flash Native Audio. Features sub-100ms conversational latency and procedural ARKit emotive realism.

Updated Mar 26, 2026
TypeScript

dikshant182004 / MathTutor

🧮 Multi-agent AI math tutor built with LangGraph — CRAG retrieval, episodic & semantic long-term memory, Tavily MCP web search, Google OAuth, and Neo4j-style memory graph. Powered by LLaMA 3.3 70B on Groq.

python redis ai graph educational google-vision-api agents google-auth multimodal streamlit vector-database crag langchain react-agent multimodal-agent ai-tutor agentic-ai langraph tavily-mcp

Updated Apr 6, 2026
Python

hari7261 / MultimodalCodingAgent-AI

A multimodal coding assistant that analyzes images containing code problems and generates solutions in multiple programming languages.

code-generation gradio bug-fixing huggingface ai-agent multimodal-agent image-to-code hari7261 coding-agent

Updated Sep 3, 2025
Python

teibok / InventoryQA_OCR_GenAI

Build an end-to-end system that ingests inventory report PDFs/images, runs OCR to normalize and extract tabular data, stores the cleaned dataset, and exposes a secure, conversational agent that can answer business queries over the data (aggregation, filtering, joins, trends), returning tables, charts, and exportable results.

pandas ocr-recognition genai multimodal-agent agentic-ai

Updated Dec 3, 2025
Python

matheus896 / DoodleSoul

DoodleSoul is a multimodal AI agent for special education. It brings children's drawings to life via Gemini Live API and weaves real-time voice, Imagen 4 illustrations, and Veo 3 videos into therapeutic Social Stories using interleaved output.

google-cloud multimodal-agent gemini-live-api interleaved-output

Updated Mar 16, 2026
Python

teibokchyne / InventoryQA_OCR_GenAI

Build an end-to-end system that ingests inventory report PDFs/images, runs OCR to normalize and extract tabular data, stores the cleaned dataset, and exposes a secure, conversational agent that can answer business queries over the data (aggregation, filtering, joins, trends), returning tables, charts, and exportable results.

docker pandas-dataframe flask-application flask-sqlalchemy ocr-recognition genai multimodal-agent agentic-ai

Updated Dec 5, 2025
Python

multimodal-agent

Here are 14 public repositories matching this topic...
