From 40c6e9406666d57a7407165358e3c323357f4468 Mon Sep 17 00:00:00 2001 From: vonwallace Date: Sun, 3 May 2026 13:47:19 -0500 Subject: [PATCH] docs: Add Linux Intel Arc SYCL build and run instructions --- README.md | 772 +++++++++++++----------------------------------------- 1 file changed, 176 insertions(+), 596 deletions(-) diff --git a/README.md b/README.md index 28faf73c254..ce169b34c45 100644 --- a/README.md +++ b/README.md @@ -1,642 +1,222 @@ -> [!IMPORTANT] -> This is a fork of the [Prism-ML fork of llama.cpp](https://github.com/PrismML-Eng/llama.cpp) that is synced to the main llama.cpp repo -> It is not yet ready for production use and should be considered experimental -> -> The primary benefit of this fork is an up-to-date version of llama.cpp that also merges the capabiilities of the Prism-ML fork -> to support the [Bonsai 1-bit models](https://huggingface.co/prism-ml/Bonsai-8B-gguf) -> -> Note: this is not an official fork and is not supported by the Prism-ML team - this is just a personal fork to demo Bonsai until official support is added - - -## How to use this fork - -```bash -# On MacOS -git clone https://github.com/Mintplex-Labs/prism-ml-llama.cpp -cd prism-ml-llama.cpp -cmake -B build && cmake --build build -j -``` - -### You __must__ recode the public Bonsai 1-bit models to work with this fork -```bash -# Download the public Bonsai 1-bit models -wget https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/Bonsai-8B.gguf -O Bonsai-8B.gguf -``` - -### Run the model -```bash -# Llama cli -./build/bin/llama-cli \ - -m Bonsai-8B.gguf \ - -p "Explain quantum computing in simple terms." \ - -n 256 \ - --temp 0.5 \ - --top-p 0.85 \ - --top-k 20 \ - -ngl 99 - -# llama server -./build/bin/llama-server \ - -m Bonsai-8B.gguf \ - --host 0.0.0.0 \ - --port 8080 \ - -ngl 99 - --ctx-size 65536 -``` - -# llama.cpp - -![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png) - -[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT) -[![Release](https://img.shields.io/github/v/release/ggml-org/llama.cpp)](https://github.com/ggml-org/llama.cpp/releases) -[![Server](https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml/badge.svg)](https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml) - -[Manifesto](https://github.com/ggml-org/llama.cpp/discussions/205) / [ggml](https://github.com/ggml-org/ggml) / [ops](https://github.com/ggml-org/llama.cpp/blob/master/docs/ops.md) - -LLM inference in C/C++ +# llama.cpp (Mintplex Bonsai Fork) -## Recent API changes - -- [Changelog for `libllama` API](https://github.com/ggml-org/llama.cpp/issues/9289) -- [Changelog for `llama-server` REST API](https://github.com/ggml-org/llama.cpp/issues/9291) - -## Hot topics - -- **Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.** -- **[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)** -- [guide : running gpt-oss with llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/15396) -- [[FEEDBACK] Better packaging for llama.cpp to support downstream consumers πŸ€—](https://github.com/ggml-org/llama.cpp/discussions/15313) -- Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Comment](https://github.com/ggml-org/llama.cpp/discussions/15095) -- Multimodal support arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | [documentation](./docs/multimodal.md) -- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode -- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim -- Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggml-org/llama.cpp/discussions/9669 -- Hugging Face GGUF editor: [discussion](https://github.com/ggml-org/llama.cpp/discussions/9268) | [tool](https://huggingface.co/spaces/CISCai/gguf-editor) - ----- - -## Quick start - -Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine: - -- Install `llama.cpp` using [brew, nix or winget](docs/install.md) -- Run with Docker - see our [Docker documentation](docs/docker.md) -- Download pre-built binaries from the [releases page](https://github.com/ggml-org/llama.cpp/releases) -- Build from source by cloning this repository - check out [our build guide](docs/build.md) +> **Important** +> This is a fork of the Prism-ML fork of llama.cpp that is synced to the main llama.cpp repo. It is not yet ready for production use and should be considered experimental. +> +> The primary benefit of this fork is an up-to-date version of llama.cpp that also merges the capabilities of the Prism-ML fork to support the Bonsai 1-bit models. +> +> **Note:** this is not an official fork and is not supported by the Prism-ML team - this is just a personal fork to demo Bonsai until official support is added. -Once installed, you'll need a model to work with. Head to the [Obtaining and quantizing models](#obtaining-and-quantizing-models) section to learn more. +--- -Example command: +## What PR #1 Adds β€” Intel Arc SYCL Support -```sh -# Use a local model file -llama-cli -m my_model.gguf +Bonsai 8B uses a custom quantization format called **Q1_0_g128** β€” a 1-bit weight +format where every weight is either `+d` or `-d` (stored as a single bit), grouped +in blocks of 128. This format is defined by PrismML and does not exist in upstream +llama.cpp. -# Or download and run a model directly from Hugging Face -llama-cli -hf ggml-org/gemma-3-1b-it-GGUF +The Mintplex fork has CUDA kernels for Q1_0_g128, but the **SYCL/oneAPI backend** +(used by Intel Arc GPUs) had no kernels for this type. Attempting to run Bonsai on +Intel Arc via SYCL crashed immediately: -# Launch OpenAI-compatible API server -llama-server -hf ggml-org/gemma-3-1b-it-GGUF ``` - -## Description - -The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide -range of hardware - locally and in the cloud. - -- Plain C/C++ implementation without any dependencies -- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks -- AVX, AVX2, AVX512 and AMX support for x86 architectures -- RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures -- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use -- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA) -- Vulkan and SYCL backend support -- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity - -The `llama.cpp` project is the main playground for developing new features for the [ggml](https://github.com/ggml-org/ggml) library. - -
-Models - -Typically finetunes of the base models below are supported as well. - -Instructions for adding support for new models: [HOWTO-add-model.md](docs/development/HOWTO-add-model.md) - -#### Text-only - -- [X] LLaMA πŸ¦™ -- [x] LLaMA 2 πŸ¦™πŸ¦™ -- [x] LLaMA 3 πŸ¦™πŸ¦™πŸ¦™ -- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) -- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral) -- [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct) -- [x] [Jamba](https://huggingface.co/ai21labs) -- [X] [Falcon](https://huggingface.co/models?search=tiiuae/falcon) -- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2) -- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne) -- [X] [BERT](https://github.com/ggml-org/llama.cpp/pull/5423) -- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/) -- [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft) -- [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila) -- [X] [Starcoder models](https://github.com/ggml-org/llama.cpp/pull/3187) -- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim) -- [X] [MPT](https://github.com/ggml-org/llama.cpp/pull/3417) -- [X] [Bloom](https://github.com/ggml-org/llama.cpp/pull/3553) -- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi) -- [X] [StableLM models](https://huggingface.co/stabilityai) -- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek) -- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen) -- [x] [PLaMo-13B](https://github.com/ggml-org/llama.cpp/pull/3557) -- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi) -- [x] [PhiMoE](https://github.com/ggml-org/llama.cpp/pull/11003) -- [x] [GPT-2](https://huggingface.co/gpt2) -- [x] [Orion 14B](https://github.com/ggml-org/llama.cpp/pull/5118) -- [x] [InternLM2](https://huggingface.co/models?search=internlm2) -- [x] [CodeShell](https://github.com/WisdomShell/codeshell) -- [x] [Gemma](https://ai.google.dev/gemma) -- [x] [Mamba](https://github.com/state-spaces/mamba) -- [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf) -- [x] [Xverse](https://huggingface.co/models?search=xverse) -- [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r) -- [x] [SEA-LION](https://huggingface.co/models?search=sea-lion) -- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B) -- [x] [OLMo](https://allenai.org/olmo) -- [x] [OLMo 2](https://allenai.org/olmo) -- [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924) -- [x] [Granite models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330) -- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia) -- [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520) -- [x] [Smaug](https://huggingface.co/models?search=Smaug) -- [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B) -- [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM) -- [x] [Flan T5](https://huggingface.co/models?search=flan-t5) -- [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca) -- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b) + [GLMEdge-1.5b](https://huggingface.co/THUDM/glm-edge-1.5b-chat) + [GLMEdge-4b](https://huggingface.co/THUDM/glm-edge-4b-chat) -- [x] [GLM-4-0414](https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e) -- [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966) -- [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct) -- [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a) -- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat) -- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a) -- [x] [RWKV-7](https://huggingface.co/collections/shoumenchougou/rwkv7-gxx-gguf) -- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM) -- [x] [QRWKV-6](https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1) -- [x] [GigaChat-20B-A3B](https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct) -- [X] [Trillion-7B-preview](https://huggingface.co/trillionlabs/Trillion-7B-preview) -- [x] [Ling models](https://huggingface.co/collections/inclusionAI/ling-67c51c85b34a7ea0aba94c32) -- [x] [LFM2 models](https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38) -- [x] [Hunyuan models](https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7) -- [x] [BailingMoeV2 (Ring/Ling 2.0) models](https://huggingface.co/collections/inclusionAI/ling-v2-68bf1dd2fc34c306c1fa6f86) - -#### Multimodal - -- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2) -- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava) -- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5) -- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V) -- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM) -- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL) -- [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM) -- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2) -- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny) -- [x] [GLM-EDGE](https://huggingface.co/models?search=glm-edge) -- [x] [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d) -- [x] [LFM2-VL](https://huggingface.co/collections/LiquidAI/lfm2-vl-68963bbc84a610f7638d5ffa) - -
- -
-Bindings - -- Python: [ddh0/easy-llama](https://github.com/ddh0/easy-llama) -- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python) -- Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp) -- Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp) -- JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp) -- JS/TS (Programmable Prompt Engine CLI): [offline-ai/cli](https://github.com/offline-ai/cli) -- JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm) -- Typescript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama) -- Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb) -- Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs) -- Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp) -- Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs) -- Rust (automated build from crates.io): [ShelbyJenkins/llm_client](https://github.com/ShelbyJenkins/llm_client) -- C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp) -- C#/VB.NET (more features - community license): [LM-Kit.NET](https://docs.lm-kit.com/lm-kit-net/index.html) -- Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s) -- Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj) -- React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn) -- Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp) -- Java: [QuasarByte/llama-cpp-jna](https://github.com/QuasarByte/llama-cpp-jna) -- Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig) -- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart) -- Flutter: [xuegao-tzx/Fllama](https://github.com/xuegao-tzx/Fllama) -- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggml-org/llama.cpp/pull/6326) -- Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp) -- Swift [srgtuszy/llama-cpp-swift](https://github.com/srgtuszy/llama-cpp-swift) -- Swift [ShenghaiWang/SwiftLlama](https://github.com/ShenghaiWang/SwiftLlama) -- Delphi [Embarcadero/llama-cpp-delphi](https://github.com/Embarcadero/llama-cpp-delphi) -- Go (no CGo needed): [hybridgroup/yzma](https://github.com/hybridgroup/yzma) -- Android: [llama.android](/examples/llama.android) - -
- -
-UIs - -*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)* - -- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT) -- [BonzAI App](https://apps.apple.com/us/app/bonzai-your-local-ai-agent/id6752847988) (proprietary) -- [cztomsik/ava](https://github.com/cztomsik/ava) (MIT) -- [Dot](https://github.com/alexpinel/Dot) (GPL) -- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT) -- [iohub/collama](https://github.com/iohub/coLLaMA) (Apache-2.0) -- [janhq/jan](https://github.com/janhq/jan) (AGPL) -- [johnbean393/Sidekick](https://github.com/johnbean393/Sidekick) (MIT) -- [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0) -- [KodiBot](https://github.com/firatkiral/kodibot) (GPL) -- [llama.vim](https://github.com/ggml-org/llama.vim) (MIT) -- [LARS](https://github.com/abgulati/LARS) (AGPL) -- [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL) -- [LlamaLib](https://github.com/undreamai/LlamaLib) (Apache-2.0) -- [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT) -- [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT) -- [LMStudio](https://lmstudio.ai/) (proprietary) -- [LocalAI](https://github.com/mudler/LocalAI) (MIT) -- [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL) -- [MindMac](https://mindmac.app) (proprietary) -- [MindWorkAI/AI-Studio](https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT) -- [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT) -- [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile) (Apache-2.0) -- [nat/openplayground](https://github.com/nat/openplayground) (MIT) -- [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all) (MIT) -- [ollama/ollama](https://github.com/ollama/ollama) (MIT) -- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL) -- [PocketPal AI](https://github.com/a-ghorbani/pocketpal-ai) (MIT) -- [psugihara/FreeChat](https://github.com/psugihara/FreeChat) (MIT) -- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal) (MIT) -- [pythops/tenere](https://github.com/pythops/tenere) (AGPL) -- [ramalama](https://github.com/containers/ramalama) (MIT) -- [semperai/amica](https://github.com/semperai/amica) (MIT) -- [withcatai/catai](https://github.com/withcatai/catai) (MIT) -- [Autopen](https://github.com/blackhole89/autopen) (GPL) - -
- -
-Tools - -- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from Hugging Face Hub and convert them to GGML -- [akx/ollama-dl](https://github.com/akx/ollama-dl) – download models from the Ollama library to be used directly with llama.cpp -- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption -- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage -- [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example) -- [unslothai/unsloth](https://github.com/unslothai/unsloth) – πŸ¦₯ exports/saves fine-tuned and trained models to GGUF (Apache-2.0) - -
- -
-Infrastructure - -- [Paddler](https://github.com/intentee/paddler) - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure -- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs -- [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly -- [llama-swap](https://github.com/mostlygeek/llama-swap) - transparent proxy that adds automatic model switching with llama-server -- [Kalavai](https://github.com/kalavai-net/kalavai-client) - Crowdsource end to end LLM deployment at any scale -- [llmaz](https://github.com/InftyAI/llmaz) - ☸️ Easy, advanced inference platform for large language models on Kubernetes. -- [LLMKube](https://github.com/defilantech/llmkube) - Kubernetes operator for llama.cpp with multi-GPU and Apple Silicon Metal - support" -
- -
-Games - -- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you. - -
- - -## Supported backends - -| Backend | Target devices | -| --- | --- | -| [Metal](docs/build.md#metal-build) | Apple Silicon | -| [BLAS](docs/build.md#blas-build) | All | -| [BLIS](docs/backend/BLIS.md) | All | -| [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU | -| [OpenVINO [In Progress]](docs/backend/OPENVINO.md) | Intel CPUs, GPUs, and NPUs | -| [MUSA](docs/build.md#musa) | Moore Threads GPU | -| [CUDA](docs/build.md#cuda) | Nvidia GPU | -| [HIP](docs/build.md#hip) | AMD GPU | -| [ZenDNN](docs/build.md#zendnn) | AMD CPU | -| [Vulkan](docs/build.md#vulkan) | GPU | -| [CANN](docs/build.md#cann) | Ascend NPU | -| [OpenCL](docs/backend/OPENCL.md) | Adreno GPU | -| [IBM zDNN](docs/backend/zDNN.md) | IBM Z & LinuxONE | -| [WebGPU [In Progress]](docs/build.md#webgpu) | All | -| [RPC](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) | All | -| [Hexagon [In Progress]](docs/backend/snapdragon/README.md) | Snapdragon | -| [VirtGPU](docs/backend/VirtGPU.md) | VirtGPU APIR | - -## Obtaining and quantizing models - -The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](https://huggingface.co/models?library=gguf&sort=trending) compatible with `llama.cpp`: - -- [Trending](https://huggingface.co/models?library=gguf&sort=trending) -- [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf) - -You can either manually download the GGUF file or directly use any `llama.cpp`-compatible models from [Hugging Face](https://huggingface.co/) or other model hosting sites, by using this CLI argument: `-hf /[:quant]`. For example: - -```sh -llama-cli -hf ggml-org/gemma-3-1b-it-GGUF +fatal error: unsupport data type=q1_0_g128 +Aborted ``` -By default, the CLI would download from Hugging Face, you can switch to other options with the environment variable `MODEL_ENDPOINT`. The `MODEL_ENDPOINT` must point to a Hugging Face compatible API endpoint. - -After downloading a model, use the CLI tools to run it locally - see below. - -`llama.cpp` requires the model to be stored in the [GGUF](https://github.com/ggml-org/ggml/blob/master/docs/gguf.md) file format. Models in other data formats can be converted to GGUF using the `convert_*.py` Python scripts in this repo. - -The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with `llama.cpp`: - -- Use the [GGUF-my-repo space](https://huggingface.co/spaces/ggml-org/gguf-my-repo) to convert to GGUF format and quantize model weights to smaller sizes -- Use the [GGUF-my-LoRA space](https://huggingface.co/spaces/ggml-org/gguf-my-lora) to convert LoRA adapters to GGUF format (more info: https://github.com/ggml-org/llama.cpp/discussions/10123) -- Use the [GGUF-editor space](https://huggingface.co/spaces/CISCai/gguf-editor) to edit GGUF meta data in the browser (more info: https://github.com/ggml-org/llama.cpp/discussions/9268) -- Use the [Inference Endpoints](https://ui.endpoints.huggingface.co/) to directly host `llama.cpp` in the cloud (more info: https://github.com/ggml-org/llama.cpp/discussions/9669) - -To learn more about model quantization, [read this documentation](tools/quantize/README.md) - -## [`llama-cli`](tools/cli) - -#### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality. - --
- Run in conversation mode - - Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding `-cnv` and specifying a suitable chat template with `--chat-template NAME` - - ```bash - llama-cli -m model.gguf - - # > hi, who are you? - # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today? - # - # > what is 1+1? - # Easy peasy! The answer to 1+1 is... 2! - ``` - -
- --
- Run in conversation mode with custom chat template - - ```bash - # use the "chatml" template (use -h to see the list of supported templates) - llama-cli -m model.gguf -cnv --chat-template chatml - - # use a custom template - llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:' - ``` - -
- --
- Constrain the output with a custom grammar - - ```bash - llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:' - - # {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"} - ``` - - The [grammars/](grammars/) folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](grammars/README.md). - - For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/ - -
- - -## [`llama-server`](tools/server) - -#### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs. +PR #1 adds three sets of changes to fix this: --
- Start a local HTTP server with default configuration on port 8080 +**`ggml/src/ggml-sycl/vecdotq.hpp`** β€” The core math kernel. Takes a Q1_0_g128 +weight block and a Q8_1 activation block, computes the dot product correctly +(applying the Q8_1 scale factor per block), and returns the result. This is the +innermost operation of every matrix multiply. - ```bash - llama-server -m model.gguf --port 8080 +**`ggml/src/ggml-sycl/mmvq.cpp`** β€” The dispatch layer. When the SYCL scheduler +needs to multiply a Q1_0_g128 weight matrix by an activation vector, this routes +the call to the correct kernel. Without this the scheduler has no path and aborts. - # Basic web UI can be accessed via browser: http://localhost:8080 - # Chat completion endpoint: http://localhost:8080/v1/chat/completions - ``` +**`ggml/src/ggml-sycl/convert.cpp`** β€” The type conversion path. When SYCL needs +to convert Q1_0_g128 weights to fp16 or fp32 (for KV cache and other operations), +these functions handle it. Without this a second crash occurs during graph setup. -
+**`common/chat-auto-parser-generator.cpp`** β€” A fix for the thinking suppression +bug. Bonsai 8B is built on Qwen3 which has chain-of-thought thinking baked into +its training. When `--reasoning off` is used, the server injects a suppression +block (`\n\n`) into the assistant turn. The original code used +`reasoning.start` as the delimiter even when thinking was disabled, causing the +parser to extract the suppression block as spurious `reasoning_content`. The fix +uses an empty delimiter when `enable_thinking` is false so the suppression block +is treated as a literal prefix and ignored. --
- Support multiple-users and parallel decoding +**Result:** All 37 Bonsai 8B layers offload to the Intel Arc GPU via SYCL/oneAPI +Level Zero, achieving ~46 tok/s generation β€” approximately 24x faster than +CPU-only inference on the same machine. - ```bash - # up to 4 concurrent requests, each with 4096 max context - llama-server -m model.gguf -c 16384 -np 4 - ``` +--- -
- --
- Enable speculative decoding - - ```bash - # the draft.gguf model should be a small variant of the target model.gguf - llama-server -m model.gguf -md draft.gguf - ``` - -
- --
- Serve an embedding model - - ```bash - # use the /embedding endpoint - llama-server -m model.gguf --embedding --pooling cls -ub 8192 - ``` - -
- --
- Serve a reranking model - - ```bash - # use the /reranking endpoint - llama-server -m model.gguf --reranking - ``` - -
- --
- Constrain all outputs with a grammar - - ```bash - # custom grammar - llama-server -m model.gguf --grammar-file grammar.gbnf - - # JSON - llama-server -m model.gguf --grammar-file grammars/json.gbnf - ``` - -
- - -## [`llama-perplexity`](tools/perplexity) - -#### A tool for measuring the [perplexity](tools/perplexity/README.md) [^1] (and other quality metrics) of a model over a given text. - --
- Measure the perplexity over a text file - - ```bash - llama-perplexity -m model.gguf -f file.txt - - # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ... - # Final estimate: PPL = 5.4007 +/- 0.67339 - ``` - -
- --
- Measure KL divergence - - ```bash - # TODO - ``` - -
- -[^1]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity) +## How to use this fork -## [`llama-bench`](tools/llama-bench) +### On macOS -#### Benchmark the performance of the inference for various parameters. +```bash +git clone https://github.com/Mintplex-Labs/prism-ml-llama.cpp +cd prism-ml-llama.cpp +cmake -B build && cmake --build build -j +``` --
- Run default benchmark +### On Linux β€” Intel Arc GPU (SYCL/oneAPI) - ```bash - llama-bench -m model.gguf +> **Note:** The macOS build command above will not enable GPU acceleration on Linux/Intel. +> Intel Arc users must use the oneAPI toolkit for full SYCL GPU offload. - # Output: - # | model | size | params | backend | threads | test | t/s | - # | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: | - # | qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | Metal,BLAS | 16 | pp512 | 5765.41 Β± 20.55 | - # | qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | Metal,BLAS | 16 | tg128 | 197.71 Β± 0.81 | - # - # build: 3e0ba0e60 (4229) - ``` +**Prerequisites:** +- Ubuntu 22.04 / 24.04 +- [Intel oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html) +- Intel Arc GPU with Level Zero driver -
+```bash +# Install Level Zero driver +sudo apt install -y intel-level-zero-gpu level-zero -## [`llama-simple`](examples/simple) +# Clone the repo +git clone https://github.com/Mintplex-Labs/prism-ml-llama.cpp +cd prism-ml-llama.cpp -#### A minimal example for implementing apps with `llama.cpp`. Useful for developers. +# Source oneAPI environment +source /opt/intel/oneapi/setvars.sh --force --
- Basic text completion +# Configure with SYCL enabled, Vulkan disabled +# (Vulkan lacks integer dot product support on Arc β€” use SYCL for full speed) +cmake -B build \ + -DGGML_SYCL=ON \ + -DGGML_VULKAN=OFF \ + -DCMAKE_C_COMPILER=icx \ + -DCMAKE_CXX_COMPILER=icpx \ + -DCMAKE_BUILD_TYPE=Release - ```bash - llama-simple -m model.gguf +# Build +cmake --build build -j$(nproc) +``` - # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of - ``` +## Download the Bonsai model -
+```bash +wget https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/Bonsai-8B.gguf \ + -O Bonsai-8B.gguf +``` +--- -## Contributing +## Run the model -- Contributors can open PRs -- Collaborators will be invited based on contributions -- Maintainers can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch -- Any help with managing issues, PRs and projects is very appreciated! -- See [good first issues](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions -- Read the [CONTRIBUTING.md](CONTRIBUTING.md) for more information -- Make sure to read this: [Inference at the edge](https://github.com/ggml-org/llama.cpp/discussions/205) -- A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532) +### llama-cli (macOS) -## Other documentation +```bash +./build/bin/llama-cli \ + -m Bonsai-8B.gguf \ + -p "Explain quantum computing in simple terms." \ + -n 256 \ + --temp 0.5 \ + --top-p 0.85 \ + --top-k 20 \ + -ngl 99 +``` -- [cli](tools/cli/README.md) -- [completion](tools/completion/README.md) -- [server](tools/server/README.md) -- [GBNF grammars](grammars/README.md) +### llama-cli (Linux β€” Intel Arc via SYCL) -#### Development documentation +```bash +source /opt/intel/oneapi/setvars.sh --force +export ONEAPI_DEVICE_SELECTOR=level_zero:0 +export GGML_SYCL_DEVICE=0 +export ZES_ENABLE_SYSMAN=1 -- [How to build](docs/build.md) -- [Running on Docker](docs/docker.md) -- [Build on Android](docs/android.md) -- [Performance troubleshooting](docs/development/token_generation_performance_tips.md) -- [GGML tips & tricks](https://github.com/ggml-org/llama.cpp/wiki/GGML-Tips-&-Tricks) +./build/bin/llama-cli \ + -m Bonsai-8B.gguf \ + -p "Explain quantum computing in simple terms." \ + -n 256 \ + --temp 0.5 \ + --top-p 0.85 \ + --top-k 20 \ + -ngl 99 +``` -#### Seminal papers and background on the models +### llama-server (macOS) -If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT: -- LLaMA: - - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) - - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) -- GPT-3 - - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) -- GPT-3.5 / InstructGPT / ChatGPT: - - [Aligning language models to follow instructions](https://openai.com/research/instruction-following) - - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155) +```bash +./build/bin/llama-server \ + -m Bonsai-8B.gguf \ + --host 0.0.0.0 \ + --port 8080 \ + -ngl 99 \ + --ctx-size 65536 +``` -## XCFramework -The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS, -and macOS. It can be used in Swift projects without the need to compile the -library from source. For example: -```swift -// swift-tools-version: 5.10 -// The swift-tools-version declares the minimum version of Swift required to build this package. +### llama-server (Linux β€” Intel Arc via SYCL) -import PackageDescription +Bonsai GGUFs embed a Qwen3 thinking-capable chat template. To suppress +chain-of-thought output, create a plain ChatML template file first: -let package = Package( - name: "MyLlamaPackage", - targets: [ - .executableTarget( - name: "MyLlamaPackage", - dependencies: [ - "LlamaFramework" - ]), - .binaryTarget( - name: "LlamaFramework", - url: "https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip", - checksum: "c19be78b5f00d8d29a25da41042cb7afa094cbf6280a225abe614b03b20029ab" - ) - ] -) +```bash +cat > chatml-nothink.jinja << 'EOF' +{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n'}}{% endfor %}{% if add_generation_prompt %}{{'<|im_start|>assistant\n'}}{% endif %} +EOF ``` -The above example is using an intermediate build `b5046` of the library. This can be modified -to use a different version by changing the URL and checksum. -## Completions -Command-line completion is available for some environments. +Then launch the server: -#### Bash Completion ```bash -$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash -$ source ~/.llama-completion.bash -``` -Optionally this can be added to your `.bashrc` or `.bash_profile` to load it -automatically. For example: -```console -$ echo "source ~/.llama-completion.bash" >> ~/.bashrc -``` +source /opt/intel/oneapi/setvars.sh --force +export ONEAPI_DEVICE_SELECTOR=level_zero:0 +export GGML_SYCL_DEVICE=0 +export ZES_ENABLE_SYSMAN=1 -## Dependencies +./build/bin/llama-server \ + -m Bonsai-8B.gguf \ + --host 0.0.0.0 \ + --port 8080 \ + -ngl 99 \ + --ctx-size 16384 \ + --reasoning off \ + --chat-template-file chatml-nothink.jinja +``` -- [yhirose/cpp-httplib](https://github.com/yhirose/cpp-httplib) - Single-header HTTP server, used by `llama-server` - MIT license -- [stb-image](https://github.com/nothings/stb) - Single-header image format decoder, used by multimodal subsystem - Public domain -- [nlohmann/json](https://github.com/nlohmann/json) - Single-header JSON library, used by various tools/examples - MIT License -- [miniaudio.h](https://github.com/mackron/miniaudio) - Single-header audio format decoder, used by multimodal subsystem - Public domain -- [subprocess.h](https://github.com/sheredom/subprocess.h) - Single-header process launching solution for C and C++ - Public domain +> **Note:** `--reasoning off` combined with `--chat-template-file` is required +> to suppress thinking mode. `--no-jinja` and `--reasoning-budget 0` alone +> are insufficient β€” the model generates `` tokens regardless due to +> training. + +--- + +## Performance (Intel Arc Pro B50) + +| Metric | Value | +|--------|-------| +| GPU | Intel Arc Pro B50 (BMG G21, 16 GB) | +| Backend | SYCL / oneAPI Level Zero | +| Model | Bonsai-8B Q1_0_g128 (1.08 GB) | +| Layers offloaded | 37 / 37 | +| Prompt eval (cold) | ~51.7 tok/s | +| Prompt eval (warm) | ~98.0 tok/s | +| Generation | ~41.0 tok/s | +| CPU-only baseline | ~1.2 tok/s prompt / ~1.9 tok/s generation | +| Speedup vs CPU | ~43x prompt / ~22x generation | +| 500 token response | ~12.9s total | + +--- + +## Troubleshooting + +**`fatal error: unsupport data type=q1_0_g128`** +You are using a build without the SYCL Q1_0_g128 kernels. Make sure you +built with `-DGGML_SYCL=ON` using the oneAPI compilers. See PR #1 for details. + +**Model generates `` blocks** +Use `--reasoning off` and `--chat-template-file chatml-nothink.jinja` as +shown above. The thinking template is embedded in the GGUF and requires +both flags to suppress. + +**Vulkan finds no GPU / slow inference** +Do not use Vulkan on Intel Arc for Bonsai models β€” Mesa ANV lacks +`VK_KHR_shader_integer_dot_product` for Battlemage, so Q1_0_g128 weights +fall back to CPU. Always use the SYCL build with `ONEAPI_DEVICE_SELECTOR=level_zero:0`.