Quansloth
[ POWERED BY TURBOQUANT+ | NVIDIA CUDA ]
Breaking the VRAM Wall: based on an implementation of Google's TurboQuant (ICLR 2026), Quansloth brings elite KV cache compression to local LLM inference.
Quansloth is a fully private, air-gapped AI server that runs massive-context models natively on consumer hardware (such as an RTX 3060). By bridging a custom Gradio Python frontend with a highly optimized llama.cpp CUDA backend, Quansloth achieves extreme memory compression, cutting KV cache VRAM usage by up to 75%.
Standard LLM inference often hits a "Memory Wall" when processing long documents; as the context grows, the GPU runs out of memory (OOM) and the system crashes.
Quansloth prevents these crashes by:
- 75% Cache Shrink: Compressing the "memory" of the AI from 16-bit to 4-bit (TurboQuant).
- Massive Context on Budget GPUs: Run 32k+ token contexts on a 6GB RTX 3060 that would normally require a 24GB RTX 4090.
- Hardware-Level Stability: Our interface monitors the CUDA backend to ensure the model stays within your GPU's physical limits, allowing for stable, long-form document analysis without the fear of a system hang.
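The 75% figure follows directly from the bit-width arithmetic: a 4-bit cache entry is one quarter the size of a 16-bit one. A back-of-envelope sketch (the model shape below is illustrative, loosely Llama-3-8B-like, not measured from Quansloth itself):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bits_per_value: int) -> int:
    """Approximate KV cache size: keys + values for every layer."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V
    return values_per_token * context_len * bits_per_value // 8

# Illustrative shape: 32 layers, 8 KV heads, head_dim 128, 32k context.
fp16 = kv_cache_bytes(32, 8, 128, 32_768, 16)   # 16-bit cache
q4   = kv_cache_bytes(32, 8, 128, 32_768, 4)    # 4-bit (TurboQuant-style) cache

print(f"fp16 cache:  {fp16 / 2**30:.1f} GiB")   # 4.0 GiB
print(f"4-bit cache: {q4 / 2**30:.1f} GiB")     # 1.0 GiB
print(f"savings:     {1 - q4 / fp16:.0%}")      # 75%
```

For this shape the cache drops from 4 GiB to 1 GiB, which is exactly the margin that keeps a 6GB card out of OOM territory at long contexts.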
📸 Interface Preview
- Windows 10/11: Fully supported (via WSL2 Ubuntu). Features a 1-click `.bat` launcher.
- Linux: Fully supported (native).
- macOS: Not officially supported out-of-the-box (backend optimized for NVIDIA CUDA GPUs).
- TurboQuant Cache Compression: Run 8,192+ token contexts natively on 6GB GPUs without Out-Of-Memory (OOM) crashes.
- Live Hardware Analytics: The UI intercepts the C++ engine's logs to report your exact VRAM allocation and savings in real time.
- Context Injector: Upload long documents (PDF, TXT, CSV, MD) directly into the chat stream to test the AI's memory limits.
- Dual-Routing: Auto-scan your local `models/` folder, or input custom absolute paths to load any `.gguf` file.
- Cyberpunk UI: A sleek, fully responsive dark-mode dashboard built for power users.
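One way a "Live Hardware Analytics" panel can work is by scanning the backend's stderr for VRAM figures. A minimal sketch of that idea; the log format below is a hypothetical llama.cpp-style line, not Quansloth's actual output:

```python
import re

# Hypothetical backend log format; real llama.cpp formats vary by version.
VRAM_PATTERN = re.compile(r"VRAM used:\s*([\d.]+)\s*MiB", re.IGNORECASE)

def extract_vram_mib(log_lines):
    """Return the most recent VRAM figure (in MiB) seen in the log, or None."""
    vram = None
    for line in log_lines:
        m = VRAM_PATTERN.search(line)
        if m:
            vram = float(m.group(1))
    return vram

sample = [
    "llama_model_load: loading tensors ...",
    "llama_model_load: VRAM used: 3456.0 MiB",
]
print(extract_vram_mib(sample))  # 3456.0
```

Keeping only the latest match means the UI always reflects the current allocation rather than a figure from an earlier load.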
- Windows with WSL2 (Ubuntu) OR native Linux
- NVIDIA GPU with updated drivers
- Miniconda or Anaconda installed
```bash
conda create -n quansloth python=3.10 -y
conda activate quansloth
git clone https://github.com/PacifAIst/Quansloth.git
cd Quansloth
pip install -r requirements.txt
chmod +x install.sh
./install.sh
```

Download `.gguf` models (e.g., Llama 3 8B) and place them in `models/`.
- Windows: Double-click `Launch_Quansloth.bat`; it auto-launches WSL, Conda, and the server.
- Linux:

```bash
conda activate quansloth
python quansloth_gui.py
```

Then open http://127.0.0.1:7860 in your browser.
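If you are scripting the launch (for example from a wrapper that opens the browser automatically), you can poll the UI endpoint until it answers. A small sketch using only the standard library; the URL and timeout are assumptions matching the defaults above:

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str = "http://127.0.0.1:7860",
                    timeout_s: float = 60.0) -> bool:
    """Poll the UI endpoint until it responds with HTTP 200, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)  # server not up yet; retry
    return False
```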
- Symmetric (Turbo3) → Best overall compression
- Asymmetric (Q8/Turbo4) → Better for Q4_K_M models
- Monitor Hardware Stats for real-time VRAM savings
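The difference between the two modes comes down to whether the quantization grid is centered on zero (symmetric) or shifted by a zero-point to span the data's actual range (asymmetric). A toy round-trip illustration of the trade-off; this is generic 4-bit quantization for intuition, not TurboQuant's actual kernels:

```python
import numpy as np

def quant_dequant_sym(x, bits=4):
    """Symmetric: grid centered on zero, scale set by the max magnitude."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def quant_dequant_asym(x, bits=4):
    """Asymmetric: grid spans [min, max] via a zero-point offset."""
    levels = 2 ** bits - 1                      # 15 for 4-bit
    scale = (x.max() - x.min()) / levels
    zero = x.min()
    q = np.clip(np.round((x - zero) / scale), 0, levels)
    return q * scale + zero

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=0.5, size=4096)   # non-zero-centered data
err_sym = np.mean((x - quant_dequant_sym(x)) ** 2)
err_asym = np.mean((x - quant_dequant_asym(x)) ** 2)
print(f"symmetric MSE:  {err_sym:.5f}")
print(f"asymmetric MSE: {err_asym:.5f}")  # lower for shifted distributions
```

When the cached values are not centered on zero, the asymmetric grid wastes no levels on the empty half of the range, which is why it tends to pair better with asymmetric base formats like Q4_K_M.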
- License: This project is licensed under the Apache 2.0 License.
- Core Technology: Built upon the TurboQuant+ implementation developed by TheTom (@TheTom).
- Research & Algorithms: The underlying algorithm is based on research from Google Research (arXiv:2504.19874).
- CUDA Kernels: Special thanks to Gabe Ortiz (signalnine) for porting the CUDA kernels.
👤 Author
Dr. Manuel Herrador 📧 mherrador@ujaen.es
University of Jaén (UJA) - Spain
Made with ❤️ for the Local AI Community by PacifAIst

