HF Collections: Bonsai (1-bit) · Ternary-Bonsai (1.58-bit) | Whitepapers: 1-bit Bonsai 8B · Ternary-Bonsai 8B
Other Demos: Bonsai {1, 1.58}-bit GPU Demo · Bonsai WebGPU · Ternary-Bonsai WebGPU · Google Colab Notebook
This demo repository lets you run Bonsai (1-bit) and Ternary-Bonsai language models locally on Mac (Metal), Linux/Windows (CUDA, Vulkan, ROCm), or CPU.
Q1_0 support for the CPU, Metal, CUDA, and Vulkan backends is already merged into upstream llama.cpp; the remaining backends are pending (see the table below). In the meantime, our fork provides a more complete set of backends in one place:
- llama.cpp: PrismML-Eng/llama.cpp — pre-built binaries
- MLX: PrismML-Eng/mlx (branch `prism`)
| Backend | Status | PR |
|---|---|---|
| CPU (generic) | ✅ Merged | #21273 |
| Metal | ✅ Merged | #21528 |
| CUDA | ✅ Merged | #21629 |
| Vulkan | ✅ Merged | #21539 (community contribution) |
| CPU (optimized x86) | ✅ Merged | #21636 |
| MLX | ⏳ Pending | mlx#3161 |
Q2_0 is the format we currently use to pack ternary weights (~1.58 bits of information stored in 2 bits). It's a hardware-friendly choice: 2-bit alignment maps cleanly onto Metal and CUDA quantization paths and unlocks fast accelerated kernels, at the cost of being larger than a tight ternary packing.
More compact ternary formats are TBD. llama.cpp already has TQ1_0 and TQ2_0 formats, which are conceptually close, but they use group size 256 while Bonsai uses group size 128, so the existing TQ formats don't fit Bonsai weights exactly.
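For intuition, here is a minimal Python sketch of packing ternary weights into 2 bits each, four per byte. The code assignment below ({-1, 0, +1} → {0b00, 0b01, 0b10}) is illustrative only, not the actual Q2_0 bit layout, which is defined by the kernels in the fork:

```python
import random

def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) into 2-bit codes, 4 per byte."""
    assert len(weights) % 4 == 0
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 1) << (2 * j)  # map {-1, 0, 1} -> {0, 1, 2}
        out.append(byte)
    return bytes(out)

def unpack_ternary(data, n):
    """Inverse of pack_ternary: recover the first n ternary weights."""
    return [((b >> (2 * j)) & 0b11) - 1 for b in data for j in range(4)][:n]

# A group of 128 ternary weights packs into 32 bytes, i.e. 2 bits/weight,
# versus log2(3) ~= 1.58 bits of true information content per weight.
group = [random.choice((-1, 0, 1)) for _ in range(128)]
packed = pack_ternary(group)
assert len(packed) == 32
assert unpack_ternary(packed, 128) == group
```

The 2-bit alignment is what makes this layout cheap to decode in Metal/CUDA kernels; a tighter base-3 packing would save ~21% of weight storage but complicate the decode path.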
Q2_0 kernels for ternary inference are in our PrismML-Eng/llama.cpp fork (`prism` branch); upstream PRs are coming next. 2-bit quantization is already supported in stock MLX (no fork needed).
| Backend | Status | PR |
|---|---|---|
| CPU (NEON / generic) | prism fork (`9f31ffca`) | PR coming soon |
| Metal | prism fork (`0eed5340`) | PR coming soon |
| CUDA | prism fork (`e380897e`) | PR coming soon |
| CPU (optimized x86) | ⏳ TBD | — |
| Vulkan | ⏳ TBD | — |
| ROCm / HIP | ⏳ TBD | — |
| MLX (2-bit) | Already supported in stock MLX | — |
See community-benchmarks/ for results on different hardware and templates to submit your own.
Two model families are available, each in sizes 8B, 4B, and 1.7B.
Available in GGUF (llama.cpp) and MLX 1-bit formats.
| Model | Format | HuggingFace Repo |
|---|---|---|
| Bonsai-8B | GGUF | prism-ml/Bonsai-8B-gguf |
| Bonsai-8B | MLX | prism-ml/Bonsai-8B-mlx-1bit |
| Bonsai-4B | GGUF | prism-ml/Bonsai-4B-gguf |
| Bonsai-4B | MLX | prism-ml/Bonsai-4B-mlx-1bit |
| Bonsai-1.7B | GGUF | prism-ml/Bonsai-1.7B-gguf |
| Bonsai-1.7B | MLX | prism-ml/Bonsai-1.7B-mlx-1bit |
Set BONSAI_MODEL to choose which size to download and run (default: 8B).
Available in GGUF (Q2_0) and MLX (2-bit) formats. See the Ternary-Bonsai HF collection and the whitepaper.
| Model | Format | HuggingFace Repo |
|---|---|---|
| Ternary-Bonsai-8B | GGUF | prism-ml/Ternary-Bonsai-8B-gguf |
| Ternary-Bonsai-8B | MLX (2-bit) | prism-ml/Ternary-Bonsai-8B-mlx-2bit |
| Ternary-Bonsai-4B | GGUF | prism-ml/Ternary-Bonsai-4B-gguf |
| Ternary-Bonsai-4B | MLX (2-bit) | prism-ml/Ternary-Bonsai-4B-mlx-2bit |
| Ternary-Bonsai-1.7B | GGUF | prism-ml/Ternary-Bonsai-1.7B-gguf |
| Ternary-Bonsai-1.7B | MLX (2-bit) | prism-ml/Ternary-Bonsai-1.7B-mlx-2bit |
Set BONSAI_FAMILY=ternary to download and run this family (default family is bonsai).
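The repo names in the two tables above follow a regular pattern, so the family/size variables map onto them mechanically. A sketch of that mapping (for illustration; the actual resolution happens inside the download scripts):

```python
def gguf_repo(family: str, size: str) -> str:
    """Derive the GGUF HuggingFace repo id from BONSAI_FAMILY/BONSAI_MODEL.

    Based on the naming pattern in the tables above; 'all' is expanded
    by the setup/download scripts before this kind of lookup would apply.
    """
    assert family in ("bonsai", "ternary"), "run scripts need a concrete family"
    assert size in ("8B", "4B", "1.7B"), "run scripts need a concrete size"
    base = "Bonsai" if family == "bonsai" else "Ternary-Bonsai"
    return f"prism-ml/{base}-{size}-gguf"

print(gguf_repo("ternary", "4B"))  # prism-ml/Ternary-Bonsai-4B-gguf
```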
Both variables are optional. If you set neither, the default is Bonsai-8B (1-bit, 8 billion parameters) — that's what plain ./setup.sh downloads and runs. They're read by setup.sh, setup.ps1, download_models.sh, and every run_* / start_* script (Linux, macOS, and Windows).
| Variable | Default | Valid values | Purpose |
|---|---|---|---|
| `BONSAI_FAMILY` | `bonsai` | `bonsai`, `ternary`, `all` | Model family. `bonsai` = 1-bit Bonsai; `ternary` = 1.58-bit Ternary-Bonsai. `all` expands to both families (setup/download only). |
| `BONSAI_MODEL` | `8B` | `8B`, `4B`, `1.7B`, `all` | Model size. `all` expands to all three sizes (setup/download only). |
`all` is only valid for `setup.sh` / `setup.ps1` / `download_models.sh`; the run/server scripts need a concrete family and size.
Combine them freely:
```shell
./setup.sh                                         # Bonsai-8B (default)
BONSAI_MODEL=1.7B ./setup.sh                       # Bonsai-1.7B
BONSAI_FAMILY=ternary ./setup.sh                   # Ternary-Bonsai-8B
BONSAI_FAMILY=ternary BONSAI_MODEL=4B ./setup.sh   # Ternary-Bonsai-4B
BONSAI_MODEL=all ./setup.sh                        # All 3 Bonsai sizes
BONSAI_FAMILY=all BONSAI_MODEL=all ./setup.sh      # Full matrix (6 downloads)
```

On macOS/Linux:

```shell
git clone https://github.com/PrismML-Eng/Bonsai-demo.git
cd Bonsai-demo

# (Optional) Choose a model size: 8B (default), 4B, or 1.7B
export BONSAI_MODEL=8B

# One command does everything: installs deps, downloads models + binaries
./setup.sh
```

On Windows (PowerShell):

```powershell
git clone https://github.com/PrismML-Eng/Bonsai-demo.git
cd Bonsai-demo

# (Optional) Choose a model size: 8B (default), 4B, or 1.7B
$env:BONSAI_MODEL = "8B"

# Run setup
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\setup.ps1
```

You can switch between the 1-bit (default) and Ternary (1.58-bit) families, and between model sizes, instantly:
```shell
# Run Ternary-Bonsai 4B
BONSAI_FAMILY=ternary BONSAI_MODEL=4B ./scripts/download_models.sh
BONSAI_FAMILY=ternary BONSAI_MODEL=4B ./scripts/run_llama.sh -p "Who are you?"
```

For Windows:

```powershell
$env:BONSAI_FAMILY = "ternary"; $env:BONSAI_MODEL = "4B"
.\setup.ps1
.\scripts\run_llama.ps1 -p "Who are you?"
```

The setup script handles everything for you, even on a fresh machine:
- Checks/installs system deps: Xcode CLT on macOS, `build-essential` on Linux
- Installs `uv`, a fast Python package manager (user-local, not global)
- Creates a Python venv and runs `uv sync`, which installs `cmake`, `ninja`, and `huggingface-cli` from `pyproject.toml`
- Downloads models from HuggingFace (needs `PRISM_HF_TOKEN` while the repos are private)
- Downloads pre-built binaries from the GitHub Release (or builds from source if you prefer)
- Builds MLX from source (macOS only): clones our fork, then runs `uv sync --extra mlx` for the full ML stack
Re-running setup.sh is safe — it skips already-completed steps.
All run scripts respect BONSAI_MODEL (default 8B). Set it to run a different size:
```shell
./scripts/run_llama.sh -p "What is the capital of France?"

# Run a different model size
BONSAI_MODEL=4B ./scripts/run_llama.sh -p "Who are you? Introduce yourself in haiku"
```

On Windows:

```powershell
.\scripts\run_llama.ps1 -p "What is the capital of France?"

# Run a different model size
$env:BONSAI_MODEL = "4B"
.\scripts\run_llama.ps1 -p "Who are you? Introduce yourself in haiku"
```

For MLX (macOS):

```shell
source .venv/bin/activate
./scripts/run_mlx.sh -p "What is the capital of France?"
```

Start llama-server with its built-in chat UI:
```shell
./scripts/start_llama_server.sh                    # http://localhost:8080

# Serve a different model size
BONSAI_MODEL=4B ./scripts/start_llama_server.sh
```

For Windows PowerShell:

```powershell
.\scripts\start_llama_server.ps1
```

The 8B model supports up to 65,536 tokens of context.
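Once the server is up, any OpenAI-compatible client can talk to it: llama-server exposes a `/v1/chat/completions` endpoint. A stdlib-only Python sketch (the port assumes the default from start_llama_server.sh, and the server must already be running):

```python
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"

def build_payload(prompt: str, temperature: float = 0.7) -> dict:
    """Construct the JSON body for an OpenAI-style chat completion request."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt: str) -> str:
    """POST the prompt to llama-server and return the model's reply text."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("What is the capital of France?"))
```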
By default the scripts pass `-c 0`, which lets llama.cpp's `--fit` automatically size the KV cache to your available memory (no pre-allocation waste). If your build doesn't support `-c 0`, the scripts fall back to a safe value based on system RAM.
Estimates for Bonsai-8B (weights + KV cache + activations):
| Context Size | Est. Memory Usage |
|---|---|
| 8,192 tokens | ~2.5 GB |
| 32,768 tokens | ~5.9 GB |
| 65,536 tokens | ~10.5 GB |
Override with: `./scripts/run_llama.sh -c 8192 -p "Your prompt"`
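The KV-cache portion of those figures scales roughly linearly with context length. A back-of-the-envelope formula, with placeholder dimensions that are hypothetical and not Bonsai-8B's actual configuration:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, dtype_bytes: int = 2) -> int:
    """Size of the K and V caches: 2 tensors, each
    [n_layers, n_kv_heads, ctx_len, head_dim] at dtype_bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * dtype_bytes

# Hypothetical transformer dimensions, for illustration only:
gib = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                     ctx_len=65536) / 2**30
print(f"KV cache at 64K context: ~{gib:.1f} GiB")
```

Total usage adds the (1-bit or 2-bit) weights and activation buffers on top of this, which is why the table's numbers are higher than the KV cache alone.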
Open WebUI provides a ChatGPT-like browser interface. It auto-starts the backend servers if they're not already running. Ctrl+C stops everything.
```shell
# Install (heavy; separate from the base deps)
source .venv/bin/activate
uv pip install open-webui

# One command: starts the backends and opens http://localhost:9090
./scripts/start_openwebui.sh
```

If you prefer to build llama.cpp from source instead of using pre-built binaries:
```shell
./scripts/build_mac.sh
```

Clones PrismML-Eng/llama.cpp, builds with Metal, and outputs to `bin/mac/`. The script auto-detects Intel vs Apple Silicon; on Intel Macs it builds with `-DGGML_METAL=OFF` (CPU only), and MLX is skipped automatically since it requires Apple Silicon.

```shell
./scripts/build_cpu_linux.sh
```

Builds a CPU-only binary with no GPU dependencies. Works on both x64 and arm64. Outputs to `bin/cpu/`.

```shell
./scripts/build_cuda_linux.sh
```

Auto-detects the CUDA version. Pass `--cuda-path /usr/local/cuda-12.8` to use a specific toolkit.
```shell
# Install the Vulkan SDK first (e.g. sudo apt install libvulkan-dev glslc)
git clone -b prism https://github.com/PrismML-Eng/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
cmake --build build -j$(nproc)
# Binaries in build/bin/
```

```shell
# Requires the ROCm toolkit (hipcc)
git clone -b prism https://github.com/PrismML-Eng/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIP=ON
cmake --build build -j$(nproc)
# Binaries in build/bin/
```

```powershell
.\scripts\build_cuda_windows.ps1
```

Auto-detects the CUDA toolkit. Pass `-CudaPath "C:\path\to\cuda"` to use a specific version. Requires Visual Studio Build Tools (or full Visual Studio) and the CUDA toolkit.
```powershell
git clone -b prism https://github.com/PrismML-Eng/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
# Binaries in build\bin\Release\
```

Requires Visual Studio Build Tools or full Visual Studio with the C++ workload.
All binaries are available from the GitHub Release:
| Platform |
|---|
| macOS Apple Silicon (arm64) |
| macOS Apple Silicon (KleidiAI) |
| macOS Intel (x64) |
| Linux x64 (CPU) |
| Linux arm64 (CPU) |
| Linux x64 (CUDA 12.4) |
| Linux x64 (CUDA 12.8) |
| Linux x64 (Vulkan) |
| Linux arm64 (Vulkan) |
| Linux x64 (ROCm 7.2) |
| Windows x64 (CPU) |
| Windows arm64 (CPU) |
| Windows x64 (CUDA 12.4) |
| Windows x64 (Vulkan) |
| Windows x64 (HIP/ROCm) |
| iOS (XCFramework) |
After setup, the directory looks like this:
```
Bonsai-demo/
├── README.md
├── setup.sh                     # macOS/Linux setup
├── setup.ps1                    # Windows setup
├── pyproject.toml               # Python dependencies
├── scripts/
│   ├── common.sh                # Shared helpers + BONSAI_MODEL
│   ├── download_models.sh       # HuggingFace download
│   ├── download_binaries.sh     # GitHub release download
│   ├── run_llama.sh             # llama.cpp (auto-detects Mac/Linux)
│   ├── run_llama.ps1            # llama.cpp (Windows PowerShell)
│   ├── run_mlx.sh               # MLX inference
│   ├── mlx_generate.py          # MLX Python script
│   ├── start_llama_server.sh    # llama.cpp server (port 8080)
│   ├── start_llama_server.ps1   # llama.cpp server (Windows PowerShell)
│   ├── start_mlx_server.sh      # MLX server (port 8081)
│   ├── start_openwebui.sh       # Open WebUI + auto-starts backends
│   ├── build_mac.sh             # Build llama.cpp for Mac
│   ├── build_cpu_linux.sh       # Build llama.cpp for Linux (CPU only)
│   ├── build_cuda_linux.sh      # Build llama.cpp for Linux CUDA
│   └── build_cuda_windows.ps1   # Build llama.cpp for Windows CUDA
├── models/                      # ← downloaded by setup
│   ├── gguf/
│   │   ├── 8B/                  # GGUF 8B model
│   │   ├── 4B/                  # GGUF 4B model
│   │   └── 1.7B/                # GGUF 1.7B model
│   ├── Bonsai-8B-mlx/           # MLX 8B model (macOS)
│   ├── Bonsai-4B-mlx/           # MLX 4B model (macOS)
│   └── Bonsai-1.7B-mlx/         # MLX 1.7B model (macOS)
├── bin/                         # ← downloaded or built by setup
│   ├── mac/                     # macOS binaries (Metal or CPU)
│   ├── cuda/                    # CUDA binaries (Linux/Windows)
│   ├── cpu/                     # CPU-only binaries (Linux/Windows)
│   ├── vulkan/                  # Vulkan binaries
│   ├── rocm/                    # ROCm binaries (AMD Linux)
│   └── hip/                     # HIP binaries (AMD Windows)
├── mlx/                         # ← cloned by setup (macOS)
└── .venv/                       # ← created by setup
```
Items marked with ← are created at setup time and excluded from git.
Symptom: `cmake --build` hangs, the system becomes unresponsive, or the build process is killed with an OOM error when building llama.cpp from source with CUDA enabled.
Cause: compiling CUDA kernels is memory-intensive; each parallel compile job can consume several GB of GPU VRAM and/or system RAM. Running `make -j$(nproc)` on a machine with a low-VRAM GPU (< 16 GB) or limited system RAM can exhaust available memory.
How the build scripts handle this: `build_cuda_linux.sh` and `build_cuda_windows.ps1` automatically detect the GPU's VRAM before building. If the maximum detected VRAM is less than 16 GB, the scripts cap parallelism at `-j 2` instead of using all logical CPU cores. You will see a message like:
```
Detected GPU VRAM: 8.0 GB (< 16 GB) -- limiting CUDA build to -j 2
```
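The heuristic can be sketched in Python as follows (illustrative only; the real logic lives in the two build scripts, and `nvidia-smi` must be on PATH for VRAM detection to work):

```python
import os
import subprocess

def cuda_build_jobs(vram_threshold_gb: float = 16, capped_jobs: int = 2) -> int:
    """Pick a -j value for the CUDA build, capping parallelism on low-VRAM GPUs."""
    all_cores = os.cpu_count() or 1
    try:
        # Query total memory (in MiB) for every visible NVIDIA GPU.
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        vram_gb = max(int(line) for line in out.split()) / 1024
    except (OSError, subprocess.CalledProcessError, ValueError):
        return all_cores  # no usable GPU info: fall back to all cores
    if vram_gb < vram_threshold_gb:
        print(f"Detected GPU VRAM: {vram_gb:.1f} GB (< {vram_threshold_gb} GB) "
              f"-- limiting CUDA build to -j {capped_jobs}")
        return capped_jobs
    return all_cores
```

The same idea applies if you edit the scripts yourself: the number you feed to `cmake --build build -j N` is the knob that trades build speed against peak memory.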
Manual override: If you still encounter OOM errors, reduce parallelism further by editing the build invocation in the relevant script, or close other GPU-heavy applications before building.