A curated, builder-first list of Vision Language Models (VLMs), local runtimes, document AI tools, UI agents, robotics vision stacks, datasets, benchmarks, and production resources.
The goal is not to collect every paper. The goal is to help developers build with vision.
Most multimodal lists are strong on papers but weak on execution.
This one is optimized for builders who want to answer questions like:
- Which VLMs can I run locally?
- Which models are good for documents, charts, OCR, and screenshots?
- What should I use for UI agents and computer-use workflows?
- Which stacks help with robotics, grounding, and action?
- How do I benchmark, fine-tune, and ship a multimodal product?
- Foundation VLMs
- Local Inference and Serving
- Document AI, OCR, and Chart Understanding
- UI Understanding and Computer Use
- Agents, Grounding, and Robotics
- Video and Long-Context Multimodality
- Training and Fine-Tuning
- Benchmarks and Evaluation
- Datasets
- Applications and Demos
- Learning Resources
- Contributing
- LLaVA - One of the most influential open visual instruction-tuned models.
- Qwen2-VL - Strong open multimodal family for image, document, and video understanding.
- Qwen-VL - Earlier Qwen multimodal line with broad ecosystem support.
- InternVL - Strong family of open large vision-language models.
- CogVLM - Open visual language model family from THUDM.
- MiniGPT-4 - Early and influential image-chat system.
- InstructBLIP - Instruction-tuned extension of BLIP-style architectures.
- BLIP-2 - Efficient VLM architecture connecting frozen vision and language models.
- IDEFICS - Hugging Face open multimodal family.
- DeepSeek-VL - Open multimodal reasoning models from DeepSeek.
- Molmo - Open multimodal assistant from Ai2 with strong grounding focus.
- Phi-3 Vision - Compact multimodal model useful for practical deployments.
- Fuyu - Multimodal autoregressive model with a distinct design.
- PaliGemma - Google's open vision-language model family; efficient, with strong document and image reasoning for its size.
- SmolVLM - Small multimodal models for lightweight use cases.
- Moondream - Lightweight open-source VLM optimized for efficiency and local deployment.
- MedGemma - Medical-focused vision-language model from Google for healthcare applications.
- Flamingo - Landmark few-shot visual language model.
- Kosmos-1 - Early multimodal reasoning and grounding work.
- PaLI - Scalable multilingual vision-language model.
- PaLI-X - Larger multimodal extension of PaLI.
- Kosmos-2 - Grounded multimodal large language model.
- SEED-Bench ecosystem - Useful benchmark family around multimodal reasoning.
- LLaVA-style models for rapid prototyping
- Qwen2-VL family for stronger general multimodal performance
- InternVL for competitive open performance
- Phi-3 Vision and SmolVLM for lighter deployments
- Molmo for grounding-heavy exploration
- Ollama - Local model runtime with growing multimodal support.
- LM Studio - Desktop app for running local models with a friendly UI.
- Jan - Open local AI runtime and desktop app.
- llama.cpp - Core local inference stack; important for lightweight experimentation.
- MLC-LLM - Compile and deploy models on edge and mobile devices.
- OpenVINO - Useful for Intel-optimized deployments.
- WebLLM - Run VLMs natively in the browser, with no backend server required.
- Transformers.js - Hugging Face models (vision and audio) in JavaScript; enables client-side multimodal inference.
- ONNX Runtime Web - In-browser inference for models exported to the standardized ONNX format.
- TensorFlow.js - TensorFlow models in browser and Node.js; useful for vision tasks and edge deployment.
- vLLM - High-throughput inference engine increasingly relevant for multimodal serving.
- SGLang - Fast serving and structured generation framework.
- TensorRT-LLM - NVIDIA-optimized inference stack.
- TGI - Hugging Face serving stack.
- BentoML - Production model serving and packaging.
- Ray Serve - Scalable service orchestration for model workloads.
- Open WebUI - Popular self-hosted chat UI for local models.
- Lobe Chat - Polished interface for model backends.
- LibreChat - Open chat UI with multi-backend support.
- Flowise - Visual builder for LLM and multimodal pipelines.
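Most of these runtimes expose a simple chat API, so a first local VLM call is a few lines. A minimal sketch using the `ollama` Python client; the `llava` model tag and image path are assumptions, and any vision-capable model you have pulled locally should work:

```python
def build_vision_messages(prompt: str, image_paths: list[str]) -> list[dict]:
    """Build Ollama-style chat messages that attach local image files."""
    return [{"role": "user", "content": prompt, "images": image_paths}]

def describe_image(prompt: str, image_path: str, model: str = "llava") -> str:
    """Send one image plus a prompt to a locally running Ollama vision model.

    Requires `pip install ollama`, a running Ollama server, and a vision
    model pulled first (e.g. `ollama pull llava`).
    """
    import ollama  # imported lazily so the helper above stays dependency-free

    response = ollama.chat(
        model=model, messages=build_vision_messages(prompt, [image_path])
    )
    return response["message"]["content"]
```

For example, `describe_image("What does this dashboard show?", "screen.png")` returns the model's free-text answer.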
This is one of the biggest practical VLM opportunities.
- PaddleOCR - Strong OCR toolkit and a common baseline for document pipelines.
- docTR - OCR for document text detection and recognition.
- Nougat - OCR-style document understanding for scientific PDFs.
- Donut - OCR-free document understanding model.
- Pix2Struct - Document understanding without OCR; excellent for structured layouts, charts, and complex documents.
- LayoutLM - Important family for document layout understanding.
- DocLayout-YOLO - Modern layout detection and segmentation for complex documents.
- MinerU - Open document parsing and PDF extraction tooling.
- Marker - PDF-to-markdown/document extraction workflow.
- Surya - OCR and layout toolkit.
- PyMuPDF - Robust PDF handling and preprocessing; critical for document pipelines.
- ChartOCR - Chart understanding reference.
- ChartQA - Dataset and benchmark for chart reasoning.
- OCR + VLM reranking for robust document QA
- PDF-to-markdown + VLM summarization
- screenshot parsing for SaaS analytics and support tooling
- chart extraction + VLM reasoning for BI workflows
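The PDF-to-text + VLM summarization pattern above can be sketched in a few lines with PyMuPDF; the truncation limit and prompt wording here are illustrative assumptions, not a fixed recipe:

```python
def build_doc_prompt(pages: list[str], question: str, max_chars: int = 4000) -> str:
    """Pack extracted page text (truncated) into a single question prompt."""
    context = "\n\n".join(pages)[:max_chars]
    return f"Document text:\n{context}\n\nQuestion: {question}"

def extract_pages(pdf_path: str) -> list[str]:
    """Extract plain text per page with PyMuPDF (`pip install pymupdf`)."""
    import fitz  # PyMuPDF; imported lazily

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

Feed `build_doc_prompt(extract_pages("invoice.pdf"), "What is the total?")` to any VLM or LLM above; for scanned PDFs with no text layer, swap `extract_pages` for an OCR step (PaddleOCR, Surya) or an OCR-free model like Donut.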
Another high-value gap: builders want models that can read screens and act.
- Anthropic Computer Use - Direct VLM-to-action API for web automation; Claude 3.5 Sonnet integration shipped late 2024.
- SeeAct - Visual web agent framework.
- OpenHands - Software agent platform; relevant for multimodal and browser-use workflows.
- Browser Use - Browser automation with model control.
- Stagehand - Browser automation framework aimed at AI-native workflows.
- OmniParser - Screen parsing for GUI grounding and action planning.
- UI-TARS - UI-centric agent/model direction.
- GroundingDINO - Key building block for screen and visual grounding.
- SAM 2 - Segmentation backbone useful for visual agents and annotation loops.
- ScreenQA - Benchmark for evaluating screenshot understanding and UI navigation.
- support copilots that understand screenshots
- QA agents for web apps
- workflow automation over dashboards and back-office tools
- computer-use agents with grounding and action planning
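Whichever framework you choose, keep a strict boundary between what the model proposes and what your executor actually runs. A minimal sketch; the JSON action schema and allow-list are illustrative assumptions, not any framework's API:

```python
import json

ALLOWED_ACTIONS = {"click", "type", "scroll", "done"}

def parse_action(model_output: str) -> dict:
    """Validate a VLM's proposed UI action before executing it.

    Expects JSON like {"action": "click", "x": 120, "y": 340} and rejects
    anything outside a small allow-list, so a malformed or hallucinated
    action never reaches the browser or OS layer.
    """
    action = json.loads(model_output)
    if action.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action.get('action')!r}")
    return action
```

The executor (Playwright, Browser Use, Stagehand) then only ever sees validated actions, which makes failure modes loggable and recoverable.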
- CLIP & OpenCLIP - Foundation models for zero-shot classification, retrieval, and visual embeddings; critical building blocks for grounding and cross-modal search.
- GroundingDINO - Open-set detection from text prompts.
- YOLO-World - Open-vocabulary detection combining YOLO efficiency with open-set capabilities.
- Segment Anything - Foundational segmentation model.
- SAM 2 - Video-capable segmentation and tracking direction.
- OWL-ViT - Open-vocabulary detection.
- Detic - Detection with image-level supervision and open-vocabulary flavor.
- Florence-2 - Dense prediction and structured understanding for documents, captions, and visual grounding.
- Depth-Anything - Monocular depth estimation; essential for 3D understanding, robotics, and spatial reasoning.
- OpenVLA - Vision-language-action direction for robotics.
- LeRobot - Hugging Face robotics stack.
- Open X-Embodiment - Cross-robot learning ecosystem.
- RT-1 - Robotics transformer reference.
- RT-2 - Vision-language-action robotics reference.
Many “awesome VLM” repos ignore the path from seeing to acting. This is where vision goes from demo to system.
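Open-vocabulary detectors like GroundingDINO are the usual first step on that path: text phrases in, boxes out. A minimal sketch via the Hugging Face Transformers integration; the checkpoint id and phrases are illustrative, and post-processing kwarg names have shifted across `transformers` versions, so check the docs for the version you install:

```python
def filter_detections(boxes, scores, threshold: float = 0.4):
    """Keep only detections whose confidence clears the threshold."""
    return [(box, score) for box, score in zip(boxes, scores) if score >= threshold]

def detect_phrases(image_path: str, phrases: str):
    """Open-set detection with GroundingDINO through Transformers.

    Requires `pip install transformers torch pillow` plus a one-time
    checkpoint download. GroundingDINO expects lowercase phrases, each
    terminated by a period, e.g. "a button. a text input field.".
    """
    import torch
    from PIL import Image
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    model_id = "IDEA-Research/grounding-dino-tiny"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    image = Image.open(image_path)
    inputs = processor(images=image, text=phrases, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Threshold kwarg names vary by transformers version; defaults shown here.
    return processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
    )[0]
```

Pair the resulting boxes with SAM 2 for masks, or hand them to a UI agent as grounded click targets.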
- Video-LLaVA - Video extension of LLaVA-style instruction tuning.
- VideoChat2 - Video multimodal conversation direction.
- LLaVA-NeXT-Video - Video-capable branch of the LLaVA family.
- Qwen2-VL - Practical option for image and video understanding.
- LongVU - Long video understanding direction.
- VideoMAE - Self-supervised masked autoencoder for video; strong backbone for video understanding and retrieval.
- Gemini 2.0 - Strong hosted option for multimodal video reasoning with native temporal understanding.
- meeting and call analysis
- industrial inspection
- sports video understanding
- security review
- agent memory over recorded workflows
- Transformers - The default ecosystem for many multimodal models.
- TRL - Preference optimization and instruction-tuning workflows.
- PEFT - Parameter-efficient fine-tuning.
- Axolotl - Popular fine-tuning framework.
- LLaMA-Factory - Large fine-tuning platform with multimodal support.
- DeepSpeed - Distributed training and optimization.
- PyTorch Lightning - Structured model training workflows.
- OpenFlamingo - Open framework for Flamingo-style multimodal modeling.
- LAVIS - Vision-language research and training toolkit.
- LoRA on the language model layers
- projector tuning between vision encoder and language model
- instruction tuning on screenshot or document tasks
- synthetic data generation for niche workflows
- DPO or preference tuning for agent outputs
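The LoRA pattern above is cheap to try with PEFT. A minimal sketch; the model id and `target_modules` are illustrative assumptions, so check the attention module names of the model you actually load:

```python
def trainable_fraction(trainable_params: int, total_params: int) -> float:
    """Fraction of weights updated; LoRA setups are typically well under 1%."""
    return trainable_params / total_params

def wrap_with_lora(model_id: str = "Qwen/Qwen2-0.5B"):
    """Attach LoRA adapters to a causal LM (`pip install peft transformers`)."""
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_id)
    config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # sanity-check before training
    return model
```

For full VLM fine-tuning (projector or vision encoder included), LLaMA-Factory and Axolotl wrap this same mechanism with data pipelines and multimodal templates.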
- MMMU - Massive multitask multimodal reasoning benchmark.
- MMBench - Broad multimodal benchmark.
- MM-Vet - Challenging evaluation for integrated multimodal capabilities.
- SEED-Bench - Foundation multimodal benchmark suite.
- POPE - Evaluates object hallucination.
- HallusionBench - Hallucination-focused benchmark.
- ScienceQA - Science reasoning benchmark with multimodal settings.
- TextVQA - Text-centric visual QA.
- VQAv2 - Classic visual question answering benchmark.
- ChartQA - Chart reasoning benchmark.
- DocVQA - Document visual question answering benchmark.
- hallucination on screenshots and documents
- OCR robustness on messy inputs
- grounding consistency
- latency and memory under real batch sizes
- action reliability for UI agents
- failure modes on out-of-domain images
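For task-specific evals on your own data, a normalized exact-match scorer covers a surprising amount of document and screenshot QA. A minimal stdlib sketch; the normalization rules are a common VQA-style convention, not any benchmark's official scorer:

```python
import string

def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(answer.split())

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Share of predictions matching their reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Run it over a few hundred of your own screenshots or PDF pages; the gap between this number and public leaderboard scores is usually the most informative result.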
- LAION-5B - Large-scale image-text resource.
- Conceptual Captions - Image-caption dataset.
- COCO Captions - Standard image captioning dataset.
- Visual Genome - Dense visual annotations and relationships.
- RedCaps - Large image-text dataset from Reddit.
- DataComp - Dataset curation benchmark ecosystem.
- LLaVA-Instruct-150K - Foundational multimodal instruction dataset.
- ShareGPT4V - Large multimodal instruction-style dataset.
- M3IT - Multimodal instruction tuning data collection.
- DocVQA
- InfographicVQA
- ChartQA
- PubLayNet - Document layout annotations.
- FUNSD - Form understanding benchmark.
- RefCOCO / RefCOCO+ / RefCOCOg - Referring expression benchmarks.
- Objects365 - Detection dataset.
- Open Images - Large detection and localization dataset.
- screenshot QA
- browser automation copilots
- invoice and contract extraction
- multimodal customer support
- chart analysis copilots
- accessibility tools
- robotics perception and action
- multimodal RAG for images, PDFs, and video
- ComfyUI - Important ecosystem for visual model workflows.
- InvokeAI - Image generation stack, useful alongside multimodal systems.
- Label Studio - Data labeling for multimodal training loops.
- FiftyOne - Dataset inspection and vision evaluation.
- Unstructured - Document preprocessing for multimodal pipelines.
- Haystack - RAG framework adaptable to multimodal retrieval.
- LlamaIndex - Retrieval and agents, with multimodal extensions.
- LangChain - Orchestration stack with multimodal integrations.
- Papers with Code - Vision Language
- Hugging Face Multimodal Tasks
- OpenCompass - Evaluation ecosystem.
- LAVIS - Great for studying model families.
- projector-based multimodal alignment
- OCR-free vs OCR-plus-VLM document stacks
- visual grounding for action
- hallucination detection in VLMs
- long-context video reasoning
- multimodal retrieval and indexing
Start with:
- Qwen2-VL or LLaVA-style model
- Ollama or vLLM
- OmniParser or GroundingDINO for UI structure
- Playwright, Browser Use, or Stagehand for actions
Start with:
- MinerU, Marker, Surya, PaddleOCR, or docTR
- Donut or a strong general VLM for semantic reasoning
- multimodal RAG over pages, tables, and extracted markdown
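Under the hood, that retrieval step is nearest-neighbor search over embeddings (from CLIP, a VLM, or a text embedder). A minimal stdlib sketch; in production you would swap this for a vector store:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query: list[float], index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return ids of the k chunks (pages, tables, crops) closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

The retrieved chunk ids map back to extracted markdown, table crops, or page images, which then go into the VLM's context for the final answer.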
Start with:
- OpenVLA
- LeRobot
- GroundingDINO + SAM 2
- RT-1 and RT-2 papers for system design
Start with:
- MMMU
- MMBench
- MM-Vet
- HallusionBench
- task-specific evals on your own screenshots or PDFs
We prioritize resources that are:
- useful for builders, not only for citation graphs
- open or accessible enough to try
- relevant to local deployment, evaluation, or productization
- focused on image, document, video, UI, or grounded multimodal workflows
We do not aim to be a giant paper dump.
Please read CONTRIBUTING.md.
Good additions include:
- practical VLM repos
- local runtimes with multimodal support
- document AI tooling
- UI agent frameworks
- evaluation suites
- strong tutorials and implementation guides