A curated list of tools, frameworks, and resources for running AI models on your own hardware. No cloud required.
If you run models on your own GPU instead of calling an API, this list is for you. Every tool listed here works offline after initial setup; the few cloud-based exceptions are flagged in their descriptions.
- Inference Engines
- Python & Language Bindings
- Quantization & Optimization
- Fine-Tuning
- RAG & Knowledge
- Orchestration & Agents
- Model Management
- Monitoring & Profiling
- UI & Interfaces
- Code Assistants
- Voice & Audio
- Image & Video
- Privacy-First Tools
- Unified Gateways
- Guides
- Hardware Guides
## Inference Engines

Run LLMs on your machine.
- llama.cpp - LLM inference in C/C++ with GGUF quantization. The backbone of most local AI setups. Supports NVIDIA, AMD, Intel, Apple Silicon.
- Ollama - Docker-like experience for running local models. One command to download and run. Wraps llama.cpp with a model registry.
- vLLM - High-throughput serving engine with PagedAttention for efficient VRAM usage. Production-grade; supports GPTQ, AWQ, and FP8, with experimental GGUF support.
- MLX LM - Apple Silicon optimized inference. Uses unified memory efficiently. If you have an M-series Mac, start here.
- LocalAI - OpenAI API-compatible server supporting 35+ backends. Drop-in replacement for any OpenAI client.
- ExLlamaV2 - Fast single-user inference on NVIDIA GPUs for EXL2 and GPTQ quantized models. Low VRAM overhead.
- llamafile - Distribute LLMs as single executable files. No install, just run. Cross-platform.
- KoboldCpp - One-file GGUF inference with built-in UI. Good for creative writing and roleplay.
- SGLang - Structured generation with RadixAttention for fast multi-turn conversations. Good for agentic workloads.
- TensorRT-LLM - NVIDIA optimized inference. High throughput on NVIDIA hardware, more setup required.
- Candle - Minimalist ML framework in Rust. Low memory footprint, good for embedded use cases.
- GPT4All - Desktop app for local models. Download a model, start chatting. Includes document Q&A.
- LM Studio - Desktop app for discovering, downloading, and running local LLMs. Built-in chat UI and local API server. Good for non-technical users.
- IPEX-LLM - Optimized inference for Intel GPUs (Arc, Flex, Max series). Required for native Intel GPU acceleration.
- mistral.rs - Rust-based inference with ISQ quantization and multimodal support. Low memory usage.
- MLC LLM - Universal deployment of LLMs across hardware backends. Compile once, run anywhere (GPU, mobile, browser).
- Xinference - Distributed inference platform for LLMs, speech, images. Built-in model hub and OpenAI-compatible API.
- FastChat - Serving platform from LMSYS (creators of Chatbot Arena). Multi-model serving with OpenAI-compatible API.
- Aphrodite Engine - High-throughput inference engine forked from vLLM. Optimized for multi-user serving with GPTQ, AWQ, EXL2 support.
- LMDeploy - Toolkit for compressing, quantizing, and serving LLMs from InternLM. Up to 1.8x throughput vs vLLM on supported models.
## Python & Language Bindings

Use inference engines from your code.
- llama-cpp-python - Python bindings for llama.cpp with OpenAI-compatible API server. The most popular way to use llama.cpp from Python.
- Ollama Python - Python client for Ollama API. Simple, well-documented.
- Ollama JS - JavaScript/TypeScript client for Ollama.
- ctransformers - Python bindings for GGML models with GPU acceleration. Lighter than llama-cpp-python, but no longer actively maintained.
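Because llama-cpp-python's server (like Ollama, LocalAI, and vLLM) speaks the OpenAI wire format, you can call it with nothing but the standard library. A minimal sketch; the port, endpoint path, and model name below are assumptions that depend on how you launched the server:

```python
import json
import urllib.request

# llama-cpp-python's built-in server defaults to port 8000; other
# OpenAI-compatible servers (Ollama, LocalAI, vLLM) use different ports.
BASE_URL = "http://localhost:8000/v1"

def build_chat_body(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

def chat(model: str, prompt: str) -> str:
    """POST the request to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_body(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("llama-3.1-8b-instruct", "Say hello in one word."))
```

The same snippet works against any server in this list that advertises OpenAI compatibility; only `BASE_URL` and the model name change.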
## Quantization & Optimization

Make models smaller and faster without retraining.
- GPTQ - Post-training quantization for GPT models. 4-bit with minimal quality loss. Widely used on HuggingFace.
- AWQ - Activation-aware weight quantization. Often better quality than GPTQ at same bit width.
- bitsandbytes - 8-bit and 4-bit quantization for PyTorch. Required for QLoRA fine-tuning.
- xformers - Memory-efficient attention implementations. Saves VRAM on long contexts.
- NVIDIA Model Optimizer - Quantization, pruning, and distillation tools for TensorRT and vLLM.
- llama.cpp convert scripts - Convert HuggingFace models to GGUF format. Essential for llama.cpp and Ollama.
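Before downloading a quant, a back-of-envelope size check helps: the weights take roughly parameters × bits-per-weight ÷ 8 bytes. The bits-per-weight figures below are rough community ballparks for common GGUF quant types, not exact values:

```python
def quantized_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk / in-VRAM size of the quantized weights alone
    (excludes KV cache and activation overhead)."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Approximate bits per weight for common GGUF quant types.
# Ballpark figures only; exact sizes vary by quant mix and architecture.
APPROX_BPW = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q2_K": 3.3}

for name, bpw in APPROX_BPW.items():
    print(f"7B at {name}: ~{quantized_size_gb(7, bpw):.1f} GB")
```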
## Fine-Tuning

Train and adapt models on your own data locally.
- Unsloth - 2x faster fine-tuning with 80% less memory. QLoRA and LoRA on a single GPU.
- axolotl - Streamlined fine-tuning with YAML configs. Supports LoRA, QLoRA, DPO, RLHF. Handles multi-GPU setups.
- PEFT - Parameter-Efficient Fine-Tuning from HuggingFace. LoRA, prefix tuning, adapters.
- TRL - Transformer Reinforcement Learning. RLHF, DPO, PPO training for language models. Works with PEFT.
- LLaMA-Factory - Web UI for fine-tuning 100+ models. No coding needed. Supports LoRA, QLoRA, full fine-tuning.
- torchtune - PyTorch-native fine-tuning library. Clean, hackable, well-documented. From the PyTorch team.
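A quick calculation shows why the LoRA-family methods above fit on a single GPU: instead of updating a full d_in × d_out weight matrix, they train two low-rank factors A (d_in × r) and B (r × d_out). The matrix size below assumes a Llama-7B-style attention projection:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one weight matrix:
    A is (d_in x rank), B is (rank x d_out)."""
    return rank * (d_in + d_out)

full = 4096 * 4096                        # one 4096x4096 attention projection
lora = lora_params(4096, 4096, rank=16)   # the low-rank adapter for it
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4f}")
# → full: 16,777,216  lora: 131,072  ratio: 0.0078
```

Under 1% of the parameters per adapted matrix is what makes single-GPU (and QLoRA's 4-bit base model) fine-tuning feasible.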
## RAG & Knowledge

Build local knowledge systems without sending your data anywhere.
- LlamaIndex - Data framework for RAG pipelines. Index your docs, query with LLMs. Supports 160+ data sources.
- LangChain - Framework for chaining LLM calls with tools, memory, and retrieval. Large ecosystem of integrations.
- ChromaDB - Embedded vector database. pip install, start storing embeddings. No server needed.
- Qdrant - Vector database with filtering, payload storage, and GPU acceleration. Good for production workloads.
- Milvus - Scalable vector database. Overkill for personal use, good for teams and large datasets.
- Weaviate - Vector database with hybrid search (vectors + BM25). Built-in modules for text/image.
- FAISS - Similarity search library from Meta. Fast, battle-tested. The underlying engine many tools use.
- Sentence-Transformers - Compute embeddings locally. Pair with any vector DB above. 5000+ pre-trained models.
- Unstructured - Extract text from PDFs, Word docs, HTML, images. Pre-processing for RAG pipelines.
- Mem0 - Memory layer for AI assistants. Persistent context across conversations with local embedding and vector storage.
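Under the hood, every RAG stack above performs the same retrieval step: embed the query, rank stored chunks by similarity, pass the top hits to the model. A dependency-free sketch of that ranking with toy vectors (real embeddings come from a model like Sentence-Transformers, storage from one of the vector DBs above):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 3-dim "embeddings"; real ones are hundreds of dimensions.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(top_k(query, docs, k=2))  # → [0, 1]
```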
## Orchestration & Agents

Coordinate multiple models and tools locally.
- CrewAI - Multi-agent framework with role-based collaboration. Works with local models via Ollama.
- LangGraph - Stateful multi-actor orchestration. Build complex agent workflows as graphs.
- AutoGen - Microsoft multi-agent framework. Note: AG2 is the community fork with active development.
- AG2 - Community-driven fork of AutoGen. Active development, multi-agent orchestration.
- Semantic Kernel - Microsoft SDK for AI agent apps. .NET, Python, Java. Enterprise-focused.
- DSPy - Programming framework for LM pipelines. Optimizes prompts and weights automatically. From Stanford NLP.
- Dify - Visual workflow builder for LLM apps. Drag-and-drop RAG, agents, chatbots. Self-hostable with local model support.
- Flowise - Low-code LLM app builder with drag-and-drop UI. Connects to Ollama and local APIs.
- Open Interpreter - Natural language interface to your computer. Runs code locally, controls files and apps. Works with local models via LiteLLM.
- NeMo Guardrails - Programmable guardrails for LLM apps from NVIDIA. Input/output filters, fact-checking, dialogue policies. Wraps any local model.
- Outlines - Structured generation for LLMs. Force outputs to follow JSON schema, regex, or context-free grammars. Works with llama.cpp, vLLM, transformers.
- Instructor - Reliable structured outputs from any LLM via Pydantic models. Retries on validation failure. Works with local models via Ollama and LiteLLM.
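Instructor and Outlines take different approaches (retry-on-validation vs. constrained decoding), but the retry loop is simple enough to sketch with a stubbed model call. The prompt and validator here are illustrative, not Instructor's actual API:

```python
import json

def generate_with_retry(call_model, validate, max_retries=3):
    """Instructor-style loop: call the model, validate the output, and
    retry with the error message appended to the prompt on failure."""
    prompt = "Return a JSON object with keys 'name' (str) and 'age' (int)."
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            obj = json.loads(raw)
            validate(obj)
            return obj
        except (json.JSONDecodeError, AssertionError) as err:
            prompt += f"\nPrevious output was invalid ({err}); fix it."
    raise RuntimeError("model never produced valid output")

def validate(obj):
    assert isinstance(obj.get("name"), str), "name must be a string"
    assert isinstance(obj.get("age"), int), "age must be an integer"

# Stub model: fails once, then returns valid JSON (stands in for an LLM).
replies = iter(['{"name": "Ada"}', '{"name": "Ada", "age": 36}'])
result = generate_with_retry(lambda p: next(replies), validate)
print(result)  # → {'name': 'Ada', 'age': 36}
```

Real tools replace the hand-rolled validator with a Pydantic model and the stub with a call to your local server.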
## Model Management

Download, convert, and organize your local model collection.
- HuggingFace Hub - Download models from HuggingFace. CLI and Python API. Where most open models are hosted.
- Transformers - Load and run most open models. Foundation library for HuggingFace ecosystem.
- GGML - Tensor library behind llama.cpp. Low-level but fast. Powers the GGUF ecosystem.
- Ollama Model Library - Pre-quantized models ready to pull with one command. Easiest way to get started.
## Monitoring & Profiling

Know what your GPU is actually doing.
- GPUStack - Manage GPU clusters. Scheduling, monitoring, multi-node inference.
- nvitop - Interactive NVIDIA GPU monitoring. Like htop for GPUs. Real-time process and memory tracking.
- Triton Inference Server - NVIDIA production inference server with model analytics and dynamic batching.
- Langfuse - Open-source LLM observability. Tracing, evals, and prompt management. Self-hostable.
- LangSmith - LLM tracing and evaluation platform. Cloud-based (not local), but useful for debugging local model pipelines.
- nvidia-smi - Built-in NVIDIA tool. Run `watch -n 1 nvidia-smi` for real-time GPU monitoring.
- gpu-memory-guard - Check available VRAM before loading GGUF models. Prevents OOM crashes with a single CLI command.
- Garak - LLM vulnerability scanner from NVIDIA. Probes for prompt injection, data leakage, jailbreaks, hallucination. Test your local model before shipping.
- Promptfoo - CLI and library for evaluating and red-teaming LLM apps. Run test cases against local models, compare prompts, find regressions.
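For scripting your own checks, nvidia-smi has a machine-readable query mode. A small sketch that reads VRAM usage; it assumes the NVIDIA driver is installed and that the queried fields report MiB:

```python
import subprocess

def parse_vram(csv_line: str) -> tuple[int, int]:
    """Parse one line of `--format=csv,noheader,nounits` output
    into (used_mib, total_mib)."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total

def gpu_vram() -> tuple[int, int]:
    """Query the first GPU's memory usage via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_vram(out.splitlines()[0])

if __name__ == "__main__":
    used, total = gpu_vram()
    print(f"{used} / {total} MiB used")
```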
## UI & Interfaces

Chat with your local models through a proper interface.
- Open WebUI - ChatGPT-like interface for Ollama and OpenAI-compatible APIs. RAG, web search, model switching built in. Active development.
- Text Generation WebUI - Gradio UI with extensions. Supports many backends and model formats. Highly configurable.
- Jan - Desktop app for local AI. Clean UI, offline-first. Downloads and manages models for you.
- Msty - Desktop app connecting to local and remote models. Free tier available. Clean, native feel.
- Hollama - Minimal web UI for Ollama. Lightweight alternative to Open WebUI. Quick to set up.
- SillyTavern - Chat interface with character cards, group chats, extensions. Popular in the creative/RP community.
## Code Assistants

AI-powered coding without sending your code to the cloud.
- Continue - VS Code / JetBrains AI pair programmer. Works with Ollama, llama.cpp, any local model.
- Tabby - Self-hosted code completion and chat. Runs on your GPU. GitHub Copilot alternative.
- Aider - AI pair programming in your terminal. Edit files, run tests, commit. Works with local models via Ollama/LiteLLM.
- llm - CLI tool for interacting with LLMs. Plugins for local models. From the creator of Datasette.
## Voice & Audio

Speech recognition and synthesis on your hardware.
- Whisper - OpenAI speech-to-text. Runs locally despite the name. Multiple model sizes from tiny to large-v3.
- Faster-Whisper - 4x faster Whisper using CTranslate2. Same quality, less VRAM. Drop-in replacement.
- WhisperX - Whisper with word-level timestamps and speaker diarization. Good for transcription workflows.
- Piper - Fast local text-to-speech. Runs on a Raspberry Pi. Many voice models available.
- Speaches - OpenAI-compatible speech API server. Drop-in local replacement for OpenAI TTS/STT.
- WhisperKit - On-device speech recognition optimized for Apple Silicon. Swift framework.
- WhisperLive - Real-time streaming transcription. Low latency, works with microphone input.
- Bark - Text-to-audio generation. Voice cloning, music, sound effects. Runs locally on GPU.
- F5-TTS - High-quality text-to-speech with voice cloning from 15s reference audio. Diffusion-based.
- Kokoro TTS - Lightweight text-to-speech with natural-sounding voices. 82M parameters, runs on CPU. Apache 2.0 licensed.
## Image & Video

Generate and edit images locally.
- ComfyUI - Node-based workflow for Stable Diffusion, FLUX, SD3, SDXL. Visual pipeline editor.
- Stable Diffusion WebUI - Feature-rich Gradio interface for Stable Diffusion. Large extension ecosystem.
- Forge - Optimized SD WebUI fork. Better VRAM efficiency and speed. Good for new setups.
- Fooocus - Simplified image generation. Fewer options, faster results. Good for non-technical users.
- InvokeAI - Creative engine for Stable Diffusion with professional UI. Node-based workflows.
- FLUX.1 - Image generation from Black Forest Labs. FLUX.1-dev runs locally on 12GB+ VRAM.
## Privacy-First Tools

Interact with documents without data leaving your machine.
- PrivateGPT - Chat with your documents 100% privately. RAG pipeline built in. No data leaves your machine.
- LocalGPT - Chat with documents using local models. Inspired by PrivateGPT with more model options.
- Danswer (now Onyx) - Search and chat over internal documents. Can use local models. Self-hosted.
- AnythingLLM - All-in-one AI desktop app with local RAG and agent support. Connects to Ollama.
## Unified Gateways

Route requests across multiple model backends.
- LiteLLM - Call 100+ LLM APIs through one interface. Route between local and cloud. Load balancing, fallbacks, cost tracking.
- OpenRouter - Unified API for many providers. Cloud-based (not local), but useful as fallback when local GPU is busy.
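The local-first routing these gateways provide boils down to ordered fallback. A stubbed sketch of the pattern (LiteLLM's real router adds retries, cooldowns, load balancing, and cost tracking on top):

```python
def route(prompt, backends):
    """Try each backend in order; return (name, reply) from the first
    one that succeeds."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as err:  # timeout, connection refused, OOM...
            errors.append(f"{name}: {err}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))

# Stubs standing in for a busy local server and a cloud fallback.
def local_llm(prompt):
    raise TimeoutError("GPU busy")

def cloud_llm(prompt):
    return "reply from fallback"

backend_used, reply = route("hello", [("local", local_llm), ("cloud", cloud_llm)])
print(backend_used, reply)  # → cloud reply from fallback
```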
## Guides

Step-by-step guides for common tasks.
- Getting Started - Install Ollama, run your first model, add a chat UI, use the API. Under 10 minutes.
- Choosing an Inference Engine - Decision tables, performance benchmarks, and model format compatibility across 10 engines.
- VRAM Requirements - GPU memory calculator, what fits on your card, popular model sizes, and tips for fitting larger models.
## Hardware Guides

Figure out what you need and what fits.
- GPU Benchmarks for LLM Inference - Performance comparison across consumer GPUs. Real benchmarks, not marketing.
- Can I Run This Model? - VRAM calculator. Enter your GPU, see what fits.
- LocalLLaMA Wiki - Community-maintained knowledge base for local AI hardware and software.
- LLM VRAM Requirements - How to calculate VRAM needs for any model.
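The calculators above implement roughly this arithmetic: quantized weights plus KV cache plus some overhead. A rough sketch; the architecture defaults assume a Llama-3-8B-style model with grouped-query attention, so adjust them for yours:

```python
def estimate_vram_gb(n_params_b: float, bits_per_weight: float,
                     ctx_len: int = 8192, n_layers: int = 32,
                     kv_dim: int = 1024, kv_bytes: int = 2) -> float:
    """Very rough VRAM estimate: quantized weights + KV cache + ~10% overhead.

    kv_dim is n_kv_heads * head_dim (1024 = 8 x 128 for a Llama-3-8B-style
    model with grouped-query attention); kv_bytes=2 assumes an fp16 cache.
    """
    weights = n_params_b * 1e9 * bits_per_weight / 8
    # K and V tensors, one pair per layer, ctx_len x kv_dim values each.
    kv_cache = 2 * n_layers * ctx_len * kv_dim * kv_bytes
    return (weights + kv_cache) * 1.1 / 1e9

print(f"~{estimate_vram_gb(8, 4.8):.1f} GB")  # → ~6.5 GB (8B, Q4-ish, 8k context)
```

This is why a Q4-quantized 8B model fits comfortably on a 8-12 GB card but struggles at very long contexts: the KV cache term grows linearly with `ctx_len`.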
Contributions welcome. See CONTRIBUTING.md for guidelines.
- Awesome LLM - Broader LLM resources.
- Awesome Generative AI - Generative AI tools and research.
- Awesome Self-Hosted - Self-hosted software.