
Awesome Vision AI Stack

A curated, builder-first list of Vision Language Models (VLMs), local runtimes, document AI tools, UI agents, robotics vision stacks, datasets, benchmarks, and production resources.

The goal is not to collect every paper. The goal is to help developers build with vision.

Why this list exists

Most multimodal lists are strong on papers but weak on execution.

This one is optimized for builders who want to answer questions like:

  • Which VLMs can I run locally?
  • Which models are good for documents, charts, OCR, and screenshots?
  • What should I use for UI agents and computer-use workflows?
  • Which stacks help with robotics, grounding, and action?
  • How do I benchmark, fine-tune, and ship a multimodal product?

Contents

  • Foundation VLMs
  • Local Inference and Serving
  • Document AI, OCR, and Chart Understanding
  • UI Understanding and Computer Use
  • Agents, Grounding, and Robotics
  • Video and Long-Context Multimodality
  • Training and Fine-Tuning
  • Benchmarks and Evaluation
  • Datasets
  • Applications and Demos
  • Learning Resources
  • Opinionated Starter Paths
  • Contributing
  • License

Foundation VLMs

General-purpose open models

  • LLaVA - One of the most influential open visual instruction-tuned models.
  • Qwen2-VL - Strong open multimodal family for image, document, and video understanding.
  • Qwen-VL - Earlier Qwen multimodal line with broad ecosystem support.
  • InternVL - Strong family of open large vision-language models.
  • CogVLM - Open visual language model family from THUDM.
  • MiniGPT-4 - Early and influential image-chat system.
  • InstructBLIP - Instruction-tuned extension of BLIP-style architectures.
  • BLIP-2 - Efficient VLM architecture connecting frozen vision and language models.
  • IDEFICS - Hugging Face open multimodal family.
  • DeepSeek-VL - Open multimodal reasoning models from DeepSeek.
  • Molmo - Open multimodal assistant from Ai2 with strong grounding focus.
  • Phi-3 Vision - Compact multimodal model useful for practical deployments.
  • Fuyu - Multimodal autoregressive model with a distinct design.
  • Gemma Vision - Google's open multimodal model; efficient vision-language understanding with strong performance on document and image reasoning.
  • SmolVLM - Small multimodal models for lightweight use cases.
  • Moondream - Lightweight open-source VLM optimized for efficiency and local deployment.
  • MedGemma - Medical-focused vision language model from Google for healthcare applications.

Research landmarks

  • Flamingo - Landmark few-shot visual language model.
  • Kosmos-1 - Early multimodal reasoning and grounding work.
  • PaLI - Scalable multilingual vision-language model.
  • PaLI-X - Larger multimodal extension of PaLI.
  • Kosmos-2 - Grounded multimodal large language model.
  • SEED-Bench ecosystem - Useful benchmark family around multimodal reasoning.

Models people actually test in products

  • LLaVA-style models for rapid prototyping
  • Qwen2-VL family for stronger general multimodal performance
  • InternVL for competitive open performance
  • Phi-3 Vision and SmolVLM for lighter deployments
  • Molmo for grounding-heavy exploration

Local Inference and Serving

Run locally

  • Ollama - Local model runtime with growing multimodal support.
  • LM Studio - Desktop app for running local models with a friendly UI.
  • Jan - Open local AI runtime and desktop app.
  • llama.cpp - Core local inference stack; important for lightweight experimentation.
  • MLC-LLM - Compile and deploy models on edge and mobile devices.
  • OpenVINO - Useful for Intel-optimized deployments.
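
Most of these runtimes expose an OpenAI-compatible HTTP endpoint, so a vision request is usually just a chat payload carrying an inline base64 image. A minimal sketch in Python; the `/v1/chat/completions` payload shape and the model name `"llava"` are illustrative assumptions, so check your runtime's docs for the exact contract:

```python
import base64


def build_vision_request(prompt: str, image_bytes: bytes, model: str = "llava") -> dict:
    """Build an OpenAI-compatible chat payload with an inline base64 image.

    The model name "llava" and the data-URL image encoding are assumptions;
    some runtimes instead accept a top-level "images" list of base64 strings.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }
```

POST the resulting dict as JSON to your local server's chat-completions route and the same code works across Ollama, LM Studio, and llama.cpp's server mode with little change.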

Run in browser (zero-setup inference)

  • WebLLM - Run models natively in the browser (via WebGPU) without a backend server.
  • Transformers.js - Hugging Face models (vision and audio) in JavaScript; enables client-side multimodal inference.
  • ONNX Runtime Web - Cross-platform in-browser inference for models exported to the standardized ONNX format.
  • TensorFlow.js - TensorFlow models in the browser and Node.js; useful for vision tasks and edge deployment.

Serve at scale

  • vLLM - High-throughput inference engine increasingly relevant for multimodal serving.
  • SGLang - Fast serving and structured generation framework.
  • TensorRT-LLM - NVIDIA-optimized inference stack.
  • TGI - Hugging Face serving stack.
  • BentoML - Production model serving and packaging.
  • Ray Serve - Scalable service orchestration for model workloads.

APIs and interfaces

  • Open WebUI - Popular self-hosted chat UI for local models.
  • Lobe Chat - Polished interface for model backends.
  • LibreChat - Open chat UI with multi-backend support.
  • Flowise - Visual builder for LLM and multimodal pipelines.

Document AI, OCR, and Chart Understanding

This is one of the biggest practical VLM opportunities.

Models and tools

  • PaddleOCR - Strong OCR toolkit and a common baseline for document pipelines.
  • docTR - OCR for document text detection and recognition.
  • Nougat - OCR-style document understanding for scientific PDFs.
  • Donut - OCR-free document understanding model.
  • Pix2Struct - Document understanding without OCR; excellent for structured layouts, charts, and complex documents.
  • LayoutLM - Important family for document layout understanding.
  • DocLayout-YOLO - Modern layout detection and segmentation for complex documents.
  • MinerU - Open document parsing and PDF extraction tooling.
  • Marker - PDF-to-markdown/document extraction workflow.
  • Surya - OCR and layout toolkit.
  • PyMuPDF - Robust PDF handling and preprocessing; critical for document pipelines.
  • ChartOCR - Chart understanding reference.
  • ChartQA - Dataset and benchmark for chart reasoning.

Practical stacks

  • OCR + VLM reranking for robust document QA
  • PDF-to-markdown + VLM summarization
  • screenshot parsing for SaaS analytics and support tooling
  • chart extraction + VLM reasoning for BI workflows
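
The first pattern above can be sketched as a simple routing loop: run cheap OCR on every page and escalate only low-confidence pages to a VLM. The `ocr_fn` and `vlm_fn` callables below are hypothetical placeholders for tools like PaddleOCR or a VLM client:

```python
from typing import Callable


def extract_pages(
    pages: list[bytes],
    ocr_fn: Callable[[bytes], tuple[str, float]],  # returns (text, confidence)
    vlm_fn: Callable[[bytes], str],                # returns text read by a VLM
    min_confidence: float = 0.8,
) -> list[str]:
    """OCR-first extraction with VLM fallback for hard pages.

    ocr_fn and vlm_fn are placeholders you wire to real tools; the 0.8
    threshold is an illustrative default, not a recommendation.
    """
    texts = []
    for page in pages:
        text, confidence = ocr_fn(page)
        if confidence < min_confidence:
            text = vlm_fn(page)  # escalate scans, handwriting, dense charts
        texts.append(text)
    return texts
```

The same loop shape works for the other stacks: swap the escalation step for chart extraction or PDF-to-markdown conversion.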

UI Understanding and Computer Use

Another high-value gap: builders want models that can read screens and act.

Projects and references

  • Anthropic Computer Use - API for letting a model operate a computer from screenshots; shipped as a public beta with Claude 3.5 Sonnet in late 2024.
  • SeeAct - Visual web agent framework.
  • OpenHands - Software agent platform; relevant for multimodal and browser-use workflows.
  • Browser Use - Browser automation with model control.
  • Stagehand - Browser automation framework aimed at AI-native workflows.
  • OmniParser - Screen parsing for GUI grounding and action planning.
  • UI-TARS - UI-centric agent/model direction.
  • GroundingDINO - Key building block for screen and visual grounding.
  • SAM 2 - Segmentation backbone useful for visual agents and annotation loops.
  • ScreenQA - Benchmark for evaluating screenshot understanding and UI navigation.

Strong use cases

  • support copilots that understand screenshots
  • QA agents for web apps
  • workflow automation over dashboards and back-office tools
  • computer-use agents with grounding and action planning
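
The last use case hinges on one small step: converting grounding output into an action target. A minimal sketch, assuming detections arrive in a common `(label, score, box)` shape like GroundingDINO or OmniParser produce; the exact schema depends on your parser:

```python
def click_point(detections: list[dict], query: str, min_score: float = 0.3):
    """Pick the highest-scoring detection matching a query and return the
    center of its box as a click target, or None if nothing matches.

    The dict keys ("label", "score", "box") and the (x1, y1, x2, y2) box
    format are assumptions mirroring common detector outputs.
    """
    candidates = [
        d for d in detections
        if query.lower() in d["label"].lower() and d["score"] >= min_score
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda d: d["score"])
    x1, y1, x2, y2 = best["box"]
    return ((x1 + x2) / 2, (y1 + y2) / 2)  # screen coordinates to click
```

Returning None on a miss matters in practice: it gives the agent loop a clean signal to re-screenshot or ask the VLM to re-ground instead of clicking blindly.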

Agents, Grounding, and Robotics

Grounding and perception

  • CLIP & OpenCLIP - Foundation models for zero-shot detection, retrieval, and visual embeddings; critical for flexible grounding and cross-modal search.
  • GroundingDINO - Open-set detection from text prompts.
  • YOLO-World - Open-vocabulary detection combining YOLO efficiency with open-set capabilities.
  • Segment Anything - Foundational segmentation model.
  • SAM 2 - Video-capable segmentation and tracking direction.
  • OWL-ViT - Open-vocabulary detection.
  • Detic - Detection with image-level supervision and open-vocabulary flavor.
  • Florence-2 - Dense prediction and structured understanding for documents, captions, and visual grounding.
  • Depth-Anything - Monocular depth estimation; essential for 3D understanding, robotics, and spatial reasoning.

Robotics-oriented projects

  • OpenVLA - Vision-language-action direction for robotics.
  • LeRobot - Hugging Face robotics stack.
  • Open X-Embodiment - Cross-robot learning ecosystem.
  • RT-1 - Robotics transformer reference.
  • RT-2 - Vision-language-action robotics reference.

Why this section matters

Many “awesome VLM” repos ignore the path from seeing to acting. This is where vision goes from demo to system.


Video and Long-Context Multimodality

Models and systems

  • Video-LLaVA - Video extension of LLaVA-style instruction tuning.
  • VideoChat2 - Video multimodal conversation direction.
  • LLaVA-NeXT-Video - Video-capable branch of the LLaVA family.
  • Qwen2-VL - Practical option for image and video understanding.
  • LongVU - Long video understanding direction.
  • VideoMAE - Self-supervised masked autoencoder for video; strong backbone for video understanding and retrieval.
  • Gemini 2.0 Video - Strong closed-model option for multimodal video reasoning with native temporal understanding.

Product opportunities

  • meeting and call analysis
  • industrial inspection
  • sports video understanding
  • security review
  • agent memory over recorded workflows

Training and Fine-Tuning

Libraries

  • Transformers - The default ecosystem for many multimodal models.
  • TRL - Preference optimization and instruction-tuning workflows.
  • PEFT - Parameter-efficient fine-tuning.
  • Axolotl - Popular fine-tuning framework.
  • LLaMA-Factory - Large fine-tuning platform with multimodal support.
  • DeepSpeed - Distributed training and optimization.
  • PyTorch Lightning - Structured model training workflows.
  • OpenFlamingo - Open framework for Flamingo-style multimodal modeling.
  • LAVIS - Vision-language research and training toolkit.

Common tuning patterns

  • LoRA on the language model head
  • projector tuning between vision encoder and language model
  • instruction tuning on screenshot or document tasks
  • synthetic data generation for niche workflows
  • DPO or preference tuning for agent outputs
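
The arithmetic behind the LoRA patterns above is compact: the frozen weight W is augmented with a low-rank delta B·A, scaled by alpha/r. A toy illustration in plain Python (real fine-tuning would use PEFT; the matrices here are deliberately tiny):

```python
def matmul(a, b):
    """Plain nested-list matrix multiply, just for the illustration."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(W, A, B, alpha: float, r: int):
    """Effective weight after a LoRA update: W + (alpha / r) * (B @ A).

    W is frozen (out x in); A (r x in) and B (out x r) are the only trained
    parameters, so the delta has rank at most r.
    """
    delta = matmul(B, A)
    scale = alpha / r
    return [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]
```

The same shape explains projector tuning: the projector between vision encoder and language model is small enough that it is often trained fully rather than through a low-rank adapter.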

Benchmarks and Evaluation

General VLM evaluation

  • MMMU - Massive multitask multimodal reasoning benchmark.
  • MMBench - Broad multimodal benchmark.
  • MM-Vet - Challenging evaluation for integrated multimodal capabilities.
  • SEED-Bench - Foundation multimodal benchmark suite.
  • POPE - Evaluates object hallucination.
  • HallusionBench - Hallucination-focused benchmark.
  • ScienceQA - Science reasoning benchmark with multimodal settings.
  • TextVQA - Text-centric visual QA.
  • VQAv2 - Classic visual question answering benchmark.
  • ChartQA - Chart reasoning benchmark.
  • DocVQA - Document visual question answering benchmark.

What to evaluate in real products

  • hallucination on screenshots and documents
  • OCR robustness on messy inputs
  • grounding consistency
  • latency and memory under real batch sizes
  • action reliability for UI agents
  • failure modes on out-of-domain images
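
For the hallucination bullet, a POPE-style check is easy to wire up yourself: ask "is there an X in the image?" for objects you know are absent, and count false "yes" answers. A deliberately naive sketch (real evals normalize answers more carefully):

```python
def hallucination_rate(answers: list[str], present: list[bool]) -> float:
    """Fraction of absent-object questions answered "yes".

    answers[i] is the model's reply to "is there an X?"; present[i] says
    whether the object is actually in the image. Only absent objects count
    toward the rate, since a "yes" there is a hallucination by construction.
    """
    absent_total = 0
    hallucinated = 0
    for answer, is_present in zip(answers, present):
        if is_present:
            continue
        absent_total += 1
        if answer.strip().lower().startswith("yes"):
            hallucinated += 1
    return hallucinated / absent_total if absent_total else 0.0
```

Running this over your own screenshots and documents, rather than only public benchmarks, is usually what surfaces the failure modes that matter for your product.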

Datasets

Image-text pretraining

Instruction and conversational multimodal data

  • LLaVA-Instruct-150K - Foundational multimodal instruction dataset.
  • ShareGPT4V - Large multimodal instruction-style dataset.
  • M3IT - Multimodal instruction tuning data collection.

Document and chart data

Grounding and detection data


Applications and Demos

Great categories to build in

  • screenshot QA
  • browser automation copilots
  • invoice and contract extraction
  • multimodal customer support
  • chart analysis copilots
  • accessibility tools
  • robotics perception and action
  • multimodal RAG for images, PDFs, and video

Useful open projects

  • ComfyUI - Important ecosystem for visual model workflows.
  • InvokeAI - Image generation stack, useful alongside multimodal systems.
  • Label Studio - Data labeling for multimodal training loops.
  • FiftyOne - Dataset inspection and vision evaluation.
  • Unstructured - Document preprocessing for multimodal pipelines.
  • Haystack - RAG framework adaptable to multimodal retrieval.
  • LlamaIndex - Retrieval and agents, with multimodal extensions.
  • LangChain - Orchestration stack with multimodal integrations.

Learning Resources

Repositories and hubs

Topics worth learning deeply

  • projector-based multimodal alignment
  • OCR-free vs OCR-plus-VLM document stacks
  • visual grounding for action
  • hallucination detection in VLMs
  • long-context video reasoning
  • multimodal retrieval and indexing

Opinionated Starter Paths

I want to build a local screenshot copilot

Start with:

  • Qwen2-VL or LLaVA-style model
  • Ollama or vLLM
  • OmniParser or GroundingDINO for UI structure
  • Playwright, Browser Use, or Stagehand for actions

I want to build document AI

Start with:

  • MinerU, Marker, Surya, PaddleOCR, or docTR
  • Donut or a strong general VLM for semantic reasoning
  • multimodal RAG over pages, tables, and extracted markdown

I want to explore robotics

Start with:

  • OpenVLA
  • LeRobot
  • GroundingDINO + SAM 2
  • RT-1 and RT-2 papers for system design

I want to benchmark models before choosing one

Start with:

  • MMMU
  • MMBench
  • MM-Vet
  • HallusionBench
  • task-specific evals on your own screenshots or PDFs
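
For the last bullet, a minimal harness is often enough: run each candidate model over your own (image, question, expected) cases and compare exact-match accuracy. `model_fn` below is a placeholder for whatever client you wire up (Ollama, vLLM, or a hosted API):

```python
def exact_match_accuracy(model_fn, cases) -> float:
    """Score a model on your own eval cases by normalized exact match.

    model_fn(image, question) -> str is a hypothetical interface; cases is a
    list of (image, question, expected) tuples. Exact match is a blunt metric,
    so treat it as a first filter, not a verdict.
    """
    hits = sum(
        1 for image, question, expected in cases
        if model_fn(image, question).strip().lower() == expected.strip().lower()
    )
    return hits / len(cases)
```

Keeping the harness this small makes it cheap to rerun the same cases against every model on the shortlist before committing to one.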

What makes a resource a good fit for this repo?

We prioritize resources that are:

  • useful for builders, not only for citation graphs
  • open or accessible enough to try
  • relevant to local deployment, evaluation, or productization
  • focused on image, document, video, UI, or grounded multimodal workflows

We do not aim to be a giant paper dump.


Contributing

Please read CONTRIBUTING.md.

Good additions include:

  • practical VLM repos
  • local runtimes with multimodal support
  • document AI tooling
  • UI agent frameworks
  • evaluation suites
  • strong tutorials and implementation guides

License

MIT
