Pucho AI is a lightweight, production-ready framework for running Large Language Models (LLMs) locally with a modern web-based chat interface. It enables GPU and CPU execution, works efficiently on systems with as little as 8GB RAM, and delivers response speeds comparable to cloud-hosted chatbots — while keeping all data fully private.
Designed for developers, AI engineers, researchers, and privacy-focused deployments.
- 🖥️ Fully Local Execution – No external APIs, complete data privacy
- ⚡ Optimized Inference Engine – Powered by FastAPI + vLLM
- 🎮 GPU & CPU Support – Automatically adapts to available hardware
- 💾 Low Resource Friendly – Runs smoothly on 8GB RAM systems
- 🌍 Multilingual Model Compatibility – Supports Hugging Face models
- 💬 Modern Web Chat UI – Clean, responsive interface with Markdown rendering
- 🧠 Reasoning Trace Support – Displays `<think>` outputs when enabled
- 🎨 Dark / Light Mode Support
Pucho AI follows a simple and modular data flow:
User Input
→ Frontend (HTML Chat Interface)
→ FastAPI Backend
→ vLLM Inference Engine
→ Locally Downloaded Hugging Face Model
→ Response returned to Frontend
All components run entirely on your local machine, ensuring maximum performance, full privacy, and offline capability.
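The backend speaks the OpenAI-compatible chat completions API that vLLM exposes, so any HTTP client can drive it the same way the bundled frontend does. A minimal stdlib-only sketch (the endpoint URL and model path below are the defaults used elsewhere in this README — adjust them to your setup):

```python
import json
from urllib import error, request

API_URL = "http://127.0.0.1:8000/v1/chat/completions"  # default backend endpoint
MODEL_NAME = "./llm_models/Qwen_Qwen3-0.6B"            # path of the downloaded model

# OpenAI-style chat payload: a model id plus a list of role/content messages
payload = {
    "model": MODEL_NAME,
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}

req = request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with request.urlopen(req, timeout=30) as resp:
        reply = json.load(resp)
        # The assistant's text lives in the first choice's message
        print(reply["choices"][0]["message"]["content"])
except (error.URLError, OSError) as exc:
    print(f"Backend not reachable: {exc}")
```

This is the same request shape the web UI sends, which is why pointing `API_URL` at the local server is all the frontend configuration needs.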
```
Pucho-AI/
├── download_model.py    # Script to download models from Hugging Face
├── run_llm_server.sh    # Script to start the LLM server
├── index.html           # Web-based chat UI
├── llm_models/          # Directory for downloaded models
└── requirements.txt     # Project dependencies
```
```bash
git clone https://github.com/shib1111111/Pucho-AI.git
cd Pucho-AI
```

Create a virtual environment:

```bash
python3 -m venv venv
```

Activate:

Linux / macOS:

```bash
source venv/bin/activate
```

Windows (PowerShell):

```powershell
venv\Scripts\activate
```

Install dependencies:

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

Download any compatible Hugging Face model:
```bash
python download_model.py --model_name Qwen/Qwen3-0.6B --cache_dir ./llm_models
```

Arguments:

- `--model_name` → Model ID from Hugging Face
- `--cache_dir` → Directory for storing models (default: `./llm_models`)

Models will be stored locally inside:

```
llm_models/<model_name>/
```
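The on-disk directory name replaces the `/` in the Hugging Face model ID with `_` (this is inferred from the `MODEL_NAME` value shown later in the `index.html` configuration; `download_model.py` itself may differ). A small helper to locate a downloaded model, assuming that convention:

```python
from pathlib import Path

def local_model_dir(model_name: str, cache_dir: str = "./llm_models") -> Path:
    """Map a Hugging Face model ID (org/name) to its assumed local directory."""
    # Hugging Face IDs contain a slash, which cannot appear in a directory name,
    # so it is replaced with an underscore on disk.
    return Path(cache_dir) / model_name.replace("/", "_")

print(local_model_dir("Qwen/Qwen3-0.6B"))
```

The resulting path is what you would paste into the frontend configuration.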
Make the script executable (Linux/macOS):

```bash
chmod +x run_llm_server.sh
```

Start the server:

```bash
./run_llm_server.sh
```

Default endpoint:

```
http://0.0.0.0:8000
```
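To verify the server came up, you can query the `GET /v1/models` route of vLLM's OpenAI-compatible API. A stdlib-only health-check sketch (the base URL assumes the default endpoint above):

```python
import json
from urllib import error, request

def list_models(base_url: str = "http://127.0.0.1:8000"):
    """Return the /v1/models payload if the server is up, else None."""
    try:
        with request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return json.load(resp)
    except (error.URLError, OSError):
        return None

models = list_models()
print("server up" if models else "server not reachable")
```

A `None` result usually means the server is still loading the model or is bound to a different port.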
Option 1: Open `index.html` directly in a browser

Option 2 (Recommended): Serve via local HTTP server:

```bash
python3 -m http.server 8001 --bind 0.0.0.0
```

Open:

```
http://localhost:8001/index.html
```
If required, update the following inside `index.html`:

```js
const API_URL = "http://127.0.0.1:8000/v1/chat/completions";
const MODEL_NAME = "./llm_models/Qwen_Qwen3-0.6B";
```

Ensure:
- Backend server is running
- Model path matches your downloaded model
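A quick way to confirm the second point before opening the UI is to check that the `MODEL_NAME` path actually exists on disk. A small sketch (the path below is the default from the configuration above):

```python
from pathlib import Path

def model_available(model_path: str) -> bool:
    """True when the MODEL_NAME directory from index.html exists locally."""
    return Path(model_path).is_dir()

# Default path from the index.html configuration
if not model_available("./llm_models/Qwen_Qwen3-0.6B"):
    print("Model directory missing; run download_model.py first")
```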
1️⃣ Download Model

```bash
python download_model.py --model_name Qwen/Qwen3-0.6B
```

2️⃣ Start Backend

```bash
./run_llm_server.sh
```

3️⃣ Start Frontend

```bash
python3 -m http.server 8001
```

4️⃣ Open in Browser and Start Chatting 🎉
- Optimized for lightweight models (0.5B – 3B parameters)
- GPU acceleration significantly improves generation speed
- CPU mode performs reliably on 8GB RAM systems
- Comparable response latency to hosted chatbot platforms (model-dependent)
Pucho AI ensures:
- No data leaves your machine
- No API keys required
- No third-party tracking
- Fully offline capability
Ideal for research labs, enterprise prototypes, and secure environments.
- Local AI Assistants
- Research & Model Evaluation
- Enterprise AI Prototyping
- Offline AI Systems
- Educational AI Deployments
- Dockerized deployment
- Model selector within UI
- Streaming token responses
- Quantized model support (GGUF, AWQ)
- Authentication & access control
We welcome contributions to enhance Pucho AI. Feel free to open issues or submit pull requests.
This project is licensed under the MIT License.
Thank you for using Pucho AI! Feel free to reach out with any questions or feedback.
✨ --- Designed & made with Love by Shib Kumar Saraf ✨