A curated, builder-first list of Vision Language Models (VLMs), local runtimes, document AI tools, UI agents, robotics vision stacks, datasets, benchmarks, and production resources.
The goal is not to collect every paper. The goal is to help developers build with vision.
Most multimodal lists are strong on papers but weak on execution.
This one is optimized for builders who want to answer questions like:
- Which VLMs can I run locally?
- Which models are good for documents, charts, OCR, and screenshots?
- What should I use for UI agents and computer-use workflows?
- Which stacks help with robotics, grounding, and action?
- How do I benchmark, fine-tune, and ship a multimodal product?
- Foundation VLMs
- Local Inference and Serving
- Document AI, OCR, and Chart Understanding
- UI Understanding and Computer Use
- Agents, Grounding, and Robotics
- Video and Long-Context Multimodality
- Training and Fine-Tuning
- Benchmarks and Evaluation
- Datasets
- Applications and Demos
- Learning Resources
- Contributing
- LLaVA - One of the most influential open visual instruction-tuned models.
- Qwen2-VL - Strong open multimodal family for image, document, and video understanding.
- Qwen-VL - Earlier Qwen multimodal line with broad ecosystem support.
- InternVL - Strong family of open large vision-language models.
- CogVLM - Open visual language model family from THUDM.
- MiniGPT-4 - Early and influential image-chat system.
- InstructBLIP - Instruction-tuned extension of BLIP-style architectures.
- BLIP-2 - Efficient VLM architecture connecting frozen vision and language models.
- IDEFICS - Hugging Face open multimodal family.
- DeepSeek-VL - Open multimodal reasoning models from DeepSeek.
- Molmo - Open multimodal assistant from Ai2 with strong grounding focus.
- Phi-3 Vision - Compact multimodal model useful for practical deployments.
- Fuyu - Multimodal autoregressive model with a distinct design.
- PaliGemma - Google's open vision-language model family; efficient, with strong document and image reasoning for its size.
- SmolVLM - Small multimodal models for lightweight use cases.
- Moondream - Lightweight open-source VLM optimized for efficiency and local deployment.
- MedGemma - Medical-focused vision-language model from Google for healthcare applications.
- Flamingo - Landmark few-shot visual language model.
- Kosmos-1 - Early multimodal reasoning and grounding work.
- PaLI - Scalable multilingual vision-language model.
- PaLI-X - Larger multimodal extension of PaLI.
- Kosmos-2 - Grounded multimodal large language model.
- SEED-Bench ecosystem - Useful benchmark family around multimodal reasoning.
- LLaVA-style models for rapid prototyping
- Qwen2-VL family for stronger general multimodal performance
- InternVL for competitive open performance
- Phi-3 Vision and SmolVLM for lighter deployments
- Molmo for grounding-heavy exploration
- Ollama - Local model runtime with growing multimodal support.
- LM Studio - Desktop app for running local models with a friendly UI.
- Jan - Open local AI runtime and desktop app.
- llama.cpp - Core local inference stack; important for lightweight experimentation.
- MLC-LLM - Compile and deploy models on edge and mobile devices.
- OpenVINO - Useful for Intel-optimized deployments.
- WebLLM - Run VLMs natively in the browser, with no backend server required.
- Transformers.js - Hugging Face models (vision and audio) in JavaScript; enables client-side multimodal inference.
- ONNX Runtime Web - In-browser inference for models exported to the standardized ONNX format.
- TensorFlow.js - TensorFlow models in browser and Node.js; useful for vision tasks and edge deployment.
- vLLM - High-throughput inference engine increasingly relevant for multimodal serving.
- SGLang - Fast serving and structured generation framework.
- TensorRT-LLM - NVIDIA-optimized inference stack.
- TGI - Hugging Face serving stack.
- BentoML - Production model serving and packaging.
- Ray Serve - Scalable service orchestration for model workloads.
- Open WebUI - Popular self-hosted chat UI for local models.
- Lobe Chat - Polished interface for model backends.
- LibreChat - Open chat UI with multi-backend support.
- Flowise - Visual builder for LLM and multimodal pipelines.
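Most of these runtimes expose a simple chat API, so a first local VLM call is a few lines. A minimal sketch using the `ollama` Python client; the `llava` model tag and image path are assumptions, and any vision-capable model you have pulled locally should work:

```python
def build_vision_messages(prompt: str, image_paths: list[str]) -> list[dict]:
    """Build Ollama-style chat messages that attach local image files."""
    return [{"role": "user", "content": prompt, "images": image_paths}]

def describe_image(prompt: str, image_path: str, model: str = "llava") -> str:
    """Send one image plus a prompt to a locally running Ollama vision model.

    Requires `pip install ollama`, a running Ollama server, and a vision
    model pulled first (e.g. `ollama pull llava`).
    """
    import ollama  # imported lazily so the helper above stays dependency-free

    response = ollama.chat(
        model=model, messages=build_vision_messages(prompt, [image_path])
    )
    return response["message"]["content"]
```

For example, `describe_image("What does this dashboard show?", "screen.png")` returns the model's free-text answer.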
This is one of the biggest practical VLM opportunities.
- PaddleOCR - Strong OCR toolkit and a common baseline for document pipelines.
- docTR - OCR for document text detection and recognition.
- Nougat - OCR-style document understanding for scientific PDFs.
- Donut - OCR-free document understanding model.
- Pix2Struct - Document understanding without OCR; excellent for structured layouts, charts, and complex documents.
- LayoutLM - Important family for document layout understanding.
- DocLayout-YOLO - Modern layout detection and segmentation for complex documents.
- MinerU - Open document parsing and PDF extraction tooling.
- Marker - PDF-to-markdown/document extraction workflow.
- Surya - OCR and layout toolkit.
- PyMuPDF - Robust PDF handling and preprocessing; critical for document pipelines.
- ChartOCR - Chart understanding reference.
- ChartQA - Dataset and benchmark for chart reasoning.
- OCR + VLM reranking for robust document QA
- PDF-to-markdown + VLM summarization
- screenshot parsing for SaaS analytics and support tooling
- chart extraction + VLM reasoning for BI workflows
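The PDF-to-text + VLM summarization pattern above can be sketched in a few lines with PyMuPDF; the truncation limit and prompt wording here are illustrative assumptions, not a fixed recipe:

```python
def build_doc_prompt(pages: list[str], question: str, max_chars: int = 4000) -> str:
    """Pack extracted page text (truncated) into a single question prompt."""
    context = "\n\n".join(pages)[:max_chars]
    return f"Document text:\n{context}\n\nQuestion: {question}"

def extract_pages(pdf_path: str) -> list[str]:
    """Extract plain text per page with PyMuPDF (`pip install pymupdf`)."""
    import fitz  # PyMuPDF; imported lazily

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

Feed `build_doc_prompt(extract_pages("invoice.pdf"), "What is the total?")` to any VLM or LLM above; for scanned PDFs with no text layer, swap `extract_pages` for an OCR step (PaddleOCR, Surya) or an OCR-free model like Donut.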
Another high-value gap: builders want models that can read screens and act.
- Anthropic Computer Use - Direct VLM-to-action API for web automation; Claude 3.5 Sonnet integration shipped late 2024.
- SeeAct - Visual web agent framework.
- OpenHands - Software agent platform; relevant for multimodal and browser-use workflows.
- Browser Use - Browser automation with model control.
- Stagehand - Browser automation framework aimed at AI-native workflows.
- OmniParser - Screen parsing for GUI grounding and action planning.
- UI-TARS - UI-centric agent/model direction.
- GroundingDINO - Key building block for screen and visual grounding.
- SAM 2 - Segmentation backbone useful for visual agents and annotation loops.
- ScreenQA - Benchmark for evaluating screenshot understanding and UI navigation.
- support copilots that understand screenshots
- QA agents for web apps
- workflow automation over dashboards and back-office tools
- computer-use agents with grounding and action planning
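Whichever framework you choose, keep a strict boundary between what the model proposes and what your executor actually runs. A minimal sketch; the JSON action schema and allow-list are illustrative assumptions, not any framework's API:

```python
import json

ALLOWED_ACTIONS = {"click", "type", "scroll", "done"}

def parse_action(model_output: str) -> dict:
    """Validate a VLM's proposed UI action before executing it.

    Expects JSON like {"action": "click", "x": 120, "y": 340} and rejects
    anything outside a small allow-list, so a malformed or hallucinated
    action never reaches the browser or OS layer.
    """
    action = json.loads(model_output)
    if action.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action.get('action')!r}")
    return action
```

The executor (Playwright, Browser Use, Stagehand) then only ever sees validated actions, which makes failure modes loggable and recoverable.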
- CLIP & OpenCLIP - Foundation models for zero-shot classification, retrieval, and visual embeddings; critical building blocks for grounding and cross-modal search.
- GroundingDINO - Open-set detection from text prompts.
- YOLO-World - Open-vocabulary detection combining YOLO efficiency with open-set capabilities.
- Segment Anything - Foundational segmentation model.
- SAM 2 - Video-capable segmentation and tracking direction.
- OWL-ViT - Open-vocabulary detection.
- Detic - Detection with image-level supervision and open-vocabulary flavor.
- Florence-2 - Dense prediction and structured understanding for documents, captions, and visual grounding.
- Depth-Anything - Monocular depth estimation; essential for 3D understanding, robotics, and spatial reasoning.
- OpenVLA - Vision-language-action direction for robotics.
- LeRobot - Hugging Face robotics stack.
- Open X-Embodiment - Cross-robot learning ecosystem.
- RT-1 - Robotics transformer reference.
- RT-2 - Vision-language-action robotics reference.
Many “awesome VLM” repos ignore the path from seeing to acting. This is where vision goes from demo to system.
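Open-vocabulary detectors like GroundingDINO are the usual first step on that path: text phrases in, boxes out. A minimal sketch via the Hugging Face Transformers integration; the checkpoint id and phrases are illustrative, and post-processing kwarg names have shifted across `transformers` versions, so check the docs for the version you install:

```python
def filter_detections(boxes, scores, threshold: float = 0.4):
    """Keep only detections whose confidence clears the threshold."""
    return [(box, score) for box, score in zip(boxes, scores) if score >= threshold]

def detect_phrases(image_path: str, phrases: str):
    """Open-set detection with GroundingDINO through Transformers.

    Requires `pip install transformers torch pillow` plus a one-time
    checkpoint download. GroundingDINO expects lowercase phrases, each
    terminated by a period, e.g. "a button. a text input field.".
    """
    import torch
    from PIL import Image
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    model_id = "IDEA-Research/grounding-dino-tiny"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    image = Image.open(image_path)
    inputs = processor(images=image, text=phrases, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Threshold kwarg names vary by transformers version; defaults shown here.
    return processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
    )[0]
```

Pair the resulting boxes with SAM 2 for masks, or hand them to a UI agent as grounded click targets.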
- Video-LLaVA - Video extension of LLaVA-style instruction tuning.
- VideoChat2 - Video multimodal conversation direction.
- LLaVA-NeXT-Video - Video-capable branch of the LLaVA family.
- Qwen2-VL - Practical option for image and video understanding.
- LongVU - Long video understanding direction.
- VideoMAE - Self-supervised masked autoencoder for video; strong backbone for video understanding and retrieval.
- Gemini 2.0 - Strong hosted option for multimodal video reasoning with native temporal understanding.
- meeting and call analysis
- industrial inspection
- sports video understanding
- security review
- agent memory over recorded workflows
- Transformers - The default ecosystem for many multimodal models.
- TRL - Preference optimization and instruction-tuning workflows.
- PEFT - Parameter-efficient fine-tuning.
- Axolotl - Popular fine-tuning framework.
- LLaMA-Factory - Large fine-tuning platform with multimodal support.
- DeepSpeed - Distributed training and optimization.
- PyTorch Lightning - Structured model training workflows.
- OpenFlamingo - Open framework for Flamingo-style multimodal modeling.
- LAVIS - Vision-language research and training toolkit.
- LoRA on the language model layers
- projector tuning between vision encoder and language model
- instruction tuning on screenshot or document tasks
- synthetic data generation for niche workflows
- DPO or preference tuning for agent outputs
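The LoRA pattern above is cheap to try with PEFT. A minimal sketch; the model id and `target_modules` are illustrative assumptions, so check the attention module names of the model you actually load:

```python
def trainable_fraction(trainable_params: int, total_params: int) -> float:
    """Fraction of weights updated; LoRA setups are typically well under 1%."""
    return trainable_params / total_params

def wrap_with_lora(model_id: str = "Qwen/Qwen2-0.5B"):
    """Attach LoRA adapters to a causal LM (`pip install peft transformers`)."""
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_id)
    config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # sanity-check before training
    return model
```

For full VLM fine-tuning (projector or vision encoder included), LLaMA-Factory and Axolotl wrap this same mechanism with data pipelines and multimodal templates.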
- MMMU - Massive multitask multimodal reasoning benchmark.
- MMBench - Broad multimodal benchmark.
- MM-Vet - Challenging evaluation for integrated multimodal capabilities.
- SEED-Bench - Foundation multimodal benchmark suite.
- POPE - Evaluates object hallucination.
- HallusionBench - Hallucination-focused benchmark.
- ScienceQA - Science reasoning benchmark with multimodal settings.
- TextVQA - Text-centric visual QA.
- VQAv2 - Classic visual question answering benchmark.
- ChartQA - Chart reasoning benchmark.
- DocVQA - Document visual question answering benchmark.
- hallucination on screenshots and documents
- OCR robustness on messy inputs
- grounding consistency
- latency and memory under real batch sizes
- action reliability for UI agents
- failure modes on out-of-domain images
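For task-specific evals on your own data, a normalized exact-match scorer covers a surprising amount of document and screenshot QA. A minimal stdlib sketch; the normalization rules are a common VQA-style convention, not any benchmark's official scorer:

```python
import string

def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(answer.split())

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Share of predictions matching their reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Run it over a few hundred of your own screenshots or PDF pages; the gap between this number and public leaderboard scores is usually the most informative result.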
- LAION-5B - Large-scale image-text resource.
- Conceptual Captions - Image-caption dataset.
- COCO Captions - Standard image captioning dataset.
- Visual Genome - Dense visual annotations and relationships.
- RedCaps - Large image-text dataset from Reddit.
- DataComp - Dataset curation benchmark ecosystem.
- LLaVA-Instruct-150K - Foundational multimodal instruction dataset.
- ShareGPT4V - Large multimodal instruction-style dataset.
- M3IT - Multimodal instruction tuning data collection.
- DocVQA
- InfographicVQA
- ChartQA
- PubLayNet - Document layout annotations.
- FUNSD - Form understanding benchmark.
- RefCOCO / RefCOCO+ / RefCOCOg - Referring expression benchmarks.
- Objects365 - Detection dataset.
- Open Images - Large detection and localization dataset.
- screenshot QA
- browser automation copilots
- invoice and contract extraction
- multimodal customer support
- chart analysis copilots
- accessibility tools
- robotics perception and action
- multimodal RAG for images, PDFs, and video
- ComfyUI - Important ecosystem for visual model workflows.
- InvokeAI - Image generation stack, useful alongside multimodal systems.
- Label Studio - Data labeling for multimodal training loops.
- FiftyOne - Dataset inspection and vision evaluation.
- Unstructured - Document preprocessing for multimodal pipelines.
- Haystack - RAG framework adaptable to multimodal retrieval.
- LlamaIndex - Retrieval and agents, with multimodal extensions.
- LangChain - Orchestration stack with multimodal integrations.
- Papers with Code - Vision Language
- Hugging Face Multimodal Tasks
- OpenCompass - Evaluation ecosystem.
- LAVIS - Great for studying model families.
- projector-based multimodal alignment
- OCR-free vs OCR-plus-VLM document stacks
- visual grounding for action
- hallucination detection in VLMs
- long-context video reasoning
- multimodal retrieval and indexing
Start with:
- Qwen2-VL or LLaVA-style model
- Ollama or vLLM
- OmniParser or GroundingDINO for UI structure
- Playwright, Browser Use, or Stagehand for actions
Start with:
- MinerU, Marker, Surya, PaddleOCR, or docTR
- Donut or a strong general VLM for semantic reasoning
- multimodal RAG over pages, tables, and extracted markdown
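Under the hood, that retrieval step is nearest-neighbor search over embeddings (from CLIP, a VLM, or a text embedder). A minimal stdlib sketch; in production you would swap this for a vector store:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query: list[float], index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return ids of the k chunks (pages, tables, crops) closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

The retrieved chunk ids map back to extracted markdown, table crops, or page images, which then go into the VLM's context for the final answer.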
Start with:
- OpenVLA
- LeRobot
- GroundingDINO + SAM 2
- RT-1 and RT-2 papers for system design
Start with:
- MMMU
- MMBench
- MM-Vet
- HallusionBench
- task-specific evals on your own screenshots or PDFs
We prioritize resources that are:
- useful for builders, not only for citation graphs
- open or accessible enough to try
- relevant to local deployment, evaluation, or productization
- focused on image, document, video, UI, or grounded multimodal workflows
We do not aim to be a giant paper dump.
Please read CONTRIBUTING.md.
Good additions include:
- practical VLM repos
- local runtimes with multimodal support
- document AI tooling
- UI agent frameworks
- evaluation suites
- strong tutorials and implementation guides