Run GGUF language models locally with llama.cpp, CPU or GPU acceleration, and llama-swap.
The server exposes an OpenAI-compatible API and discovers models placed in
the configured install directory, which defaults to ~/ai/models.
- OpenAI-compatible chat and completion endpoints
- CPU mode plus optional Vulkan, ROCm, OpenVINO, or SYCL llama.cpp backends
- Automatic discovery of `.gguf` model files
- On-demand model loading and switching through llama-swap
- A systemd user service
- A `localai` command for service, model, update, and uninstall tasks
- Ubuntu, Debian, Fedora, RHEL, or another compatible x86-64 Linux system
- A working CPU install, or a supported GPU/runtime for your selected backend
- `sudo` access during installation
- Enough RAM and VRAM for the model and quantization you choose
The installer uses the known compatible releases llama.cpp b9672 and
llama-swap v226. The separate update script checks for newer releases.
The default llama.cpp backend is vulkan. For CPU-only machines or simple VM
testing, set LLAMA_CPP_BACKEND=cpu; CPU installs default to a smaller context
size (4096 instead of 16384) and no GPU offload.
The installer can install required packages with apt-get, dnf, or yum.
One-line install:
curl -fsSL https://hossbit.github.io/localai/install.sh | bash
Custom install directory:
curl -fsSL https://hossbit.github.io/localai/install.sh | LOCALAI_DIR="$HOME/my-ai" bash
Manual install:
git clone https://github.com/hossbit/local-ai-server.git
cd local-ai-server
chmod +x ./*.sh
./install-local-ai.sh
The installer asks where to install LocalAI:
LocalAI install directory [~/ai]:
Press Enter to use the default ~/ai. To choose the path without a prompt, set
LOCALAI_DIR:
LOCALAI_DIR=~/my-ai ./install-local-ai.sh
Or pass --dir:
./install-local-ai.sh --dir ~/my-ai
Choose a llama.cpp backend with LLAMA_CPP_BACKEND. The default is vulkan.
LLAMA_CPP_BACKEND=cpu ./install-local-ai.sh
LLAMA_CPP_BACKEND=vulkan ./install-local-ai.sh
LLAMA_CPP_BACKEND=rocm ./install-local-ai.sh
LLAMA_CPP_BACKEND=openvino ./install-local-ai.sh
LLAMA_CPP_BACKEND=sycl-fp16 ./install-local-ai.sh
LLAMA_CPP_BACKEND=sycl-fp32 ./install-local-ai.sh
The installer does not start the server automatically. If no .gguf files are
found in the models directory, it prints a warning because chat requests need a
model. Add at least one model, then use the service commands below to start and
check LocalAI.
To start it automatically when you log in:
systemctl --user enable --now localai
Place one or more .gguf files in:
~/ai/models
If you installed somewhere else, use that directory's models folder instead.
For example, with the Hugging Face CLI:
python3 -m pip install --user huggingface_hub
hf auth login
hf download bartowski/Qwen2.5-Coder-7B-Instruct-GGUF \
  Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
  --local-dir ~/ai/models
Some model repositories require a Hugging Face account and read token. See Hugging Face access tokens.
The model ID exposed by the API is the filename without .gguf. For example:
Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
becomes:
Qwen2.5-Coder-7B-Instruct-Q4_K_M
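The naming rule above can be sketched in a few lines of Python, which may be handy when a script needs to predict the ID a file will get before the server is running. The function names here are illustrative and not part of LocalAI, and the sketch assumes only files directly inside the models folder are considered.

```python
from pathlib import Path


def model_id(filename: str) -> str:
    """Return the API model ID for a .gguf file: its name without the suffix."""
    name = Path(filename).name
    if not name.endswith(".gguf"):
        raise ValueError(f"not a .gguf file: {filename}")
    return name[: -len(".gguf")]


def installed_model_ids(models_dir: Path) -> list[str]:
    """List model IDs for .gguf files directly inside models_dir.

    Assumption: subdirectories are not searched; the real discovery
    logic may differ.
    """
    return sorted(model_id(p.name) for p in models_dir.glob("*.gguf"))
```

For example, `installed_model_ids(Path.home() / "ai" / "models")` would list the IDs for a default install.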
Read the selected port:
PORT=$(cat ~/ai/conf/port)
For a custom install directory:
PORT=$(cat ~/my-ai/conf/port)
List available models:
curl "http://127.0.0.1:${PORT}/v1/models"
Send a chat request:
MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M"
curl "http://127.0.0.1:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL}\",
    \"messages\": [
      {\"role\": \"user\", \"content\": \"What is Linux?\"}
    ]
  }"
Python with the OpenAI SDK:
from pathlib import Path
from openai import OpenAI
port = Path.home().joinpath("ai/conf/port").read_text().strip()
client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="local")
response = client.chat.completions.create(
model="Qwen2.5-Coder-7B-Instruct-Q4_K_M",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
The local server does not validate api_key, but OpenAI client libraries
usually require a non-empty value.
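Where installing the OpenAI SDK is not an option, the same chat endpoint can be reached with the Python standard library alone. The following is a minimal sketch, not part of LocalAI: the helper names are illustrative, and it assumes the port file holds a bare integer and that the response follows the usual OpenAI chat completion shape.

```python
import json
import urllib.request
from pathlib import Path


def read_port(install_dir: Path) -> int:
    """Read the port recorded in <install_dir>/conf/port (assumed to be a bare integer)."""
    return int((install_dir / "conf" / "port").read_text().strip())


def chat_request(port: int, model: str, prompt: str) -> urllib.request.Request:
    """Build, but do not send, a POST request for /v1/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"http://127.0.0.1:{port}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the service running, `send(chat_request(read_port(Path.home() / "ai"), "Qwen2.5-Coder-7B-Instruct-Q4_K_M", "Hello!"))` should return the model's reply. The generous timeout allows for the first request, which may have to wait while the model loads.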
Most users only need these:
| Command | Purpose |
|---|---|
| `localai start` | Start the service. |
| `localai stop` | Unload loaded models, then stop the service. |
| `localai restart` | Restart the service. |
| `localai status` | Show service, process, API, and port status. |
| `localai check` | Check the API and model list. |
| `localai logs` | Follow LocalAI logs. |
| `localai models` | List installed .gguf models and show loaded state when the API is reachable. |
| `localai load MODEL` | Warm one model, for example `localai load Qwen2.5-Coder-7B-Instruct-Q4_K_M`. |
| `localai unload MODEL` | Release one loaded model. |
| `localai unload all` | Release all loaded models. |
| `localai update` | Update installed components. |
| `localai version` | Show component versions. |
| `localai uninstall` | Remove helper files; models are kept by default. |
Advanced forms:
| Command | Purpose |
|---|---|
| `localai check --chat` | Also send a tiny chat request. |
| `localai load all` | Warm every model; use only when you have enough memory. |
| `localai update --no-start` | Update and leave the service stopped. |
| `LLAMA_CPP_BACKEND=cpu localai update` | Switch backend during update. |
| `LOCALAI_CTX_SIZE=8192 LOCALAI_N_GPU_LAYERS=20 localai start` | Override runtime settings for one start. |
| `LOCALAI_FLASH_ATTN=1 LOCALAI_PARALLEL=2 localai start` | Enable optional llama-server tuning for one start. |
| `localai uninstall --remove-models` | Also remove downloaded models. |
| `localai uninstall --dir ~/my-ai` | Uninstall from a custom directory. |
| `localai uninstall --remove-llama-swap` | Also remove the per-user llama-swap binary. |
Shared defaults live in:
localai.conf # source default
~/ai/conf/localai.conf # installed copy
This file contains install paths, service names, port settings, and llama.cpp runtime defaults. Environment variables still override the config for one command.
bin/rebuild-config.sh creates conf/config.yaml from every .gguf file in
the install directory's models folder. It runs automatically whenever the
server starts.
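To make the idea of a generated config concrete, here is a rough Python sketch of turning a list of .gguf file names into a llama-swap style config. It is not the actual bin/rebuild-config.sh, whose output is not shown here. The `models`, `cmd`, and `ttl` keys and the `${PORT}` macro follow llama-swap's configuration format as I understand it, and the exact command line and the use of the 900-second idle timeout as `ttl` are assumptions.

```python
from pathlib import Path


def render_config(install_dir: Path, gguf_names: list[str], ttl: int = 900) -> str:
    """Render a llama-swap style config with one entry per .gguf file name.

    Assumptions: llama-server lives in <install_dir>/bin, models live in
    <install_dir>/models, and paths contain no spaces.
    """
    server = install_dir / "bin" / "llama-server"
    lines = ["models:"]
    for name in sorted(gguf_names):
        if not name.endswith(".gguf"):
            continue  # only .gguf files become model entries
        model_id = name[: -len(".gguf")]
        model_path = install_dir / "models" / name
        lines += [
            f'  "{model_id}":',
            # ${PORT} is left literal for llama-swap to substitute at launch
            f"    cmd: {server} --port ${{PORT}} -m {model_path}",
            f"    ttl: {ttl}",
        ]
    return "\n".join(lines) + "\n"
```

Regenerating the file on every start, as the real script does, means adding or removing a model only requires a restart rather than a manual config edit.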
Default runtime settings are:
- Vulkan and other GPU-capable backends: context size `16384`, GPU layers `8`
- CPU backend: context size `4096`, GPU layers `0`
- Threads: `6`
- KV cache: `q4_0`
- Jinja chat templates: enabled
- Flash attention, mlock, no-mmap, parallel, batch size, and ubatch size: disabled unless configured
- Idle model timeout: `900` seconds
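The defaults listed above can be summarized as a small lookup keyed by backend. This Python sketch only restates the documented values; the function and key names are illustrative, and the real defaults live in localai.conf.

```python
# Backends named in this README that use the GPU-oriented defaults.
GPU_BACKENDS = {"vulkan", "rocm", "openvino", "sycl-fp16", "sycl-fp32"}


def default_runtime(backend: str = "vulkan") -> dict:
    """Return the documented default runtime settings for a backend."""
    shared = {
        "threads": 6,
        "kv_cache_type": "q4_0",
        "jinja": True,
        "idle_timeout_seconds": 900,
    }
    if backend == "cpu":
        return {**shared, "ctx_size": 4096, "n_gpu_layers": 0}
    if backend in GPU_BACKENDS:
        return {**shared, "ctx_size": 16384, "n_gpu_layers": 8}
    raise ValueError(f"unknown backend: {backend}")
```

Calling `default_runtime("cpu")` shows at a glance how the CPU profile differs from the GPU one: only the context size and the GPU layer count change.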
Useful llama-server tuning variables:
| Variable | Effect |
|---|---|
| `LOCALAI_CTX_SIZE` | Sets `--ctx-size`. |
| `LOCALAI_N_GPU_LAYERS` | Sets `--n-gpu-layers`. |
| `LOCALAI_THREADS` | Sets `-t`. |
| `LOCALAI_CACHE_TYPE_K` / `LOCALAI_CACHE_TYPE_V` | Set KV cache quantization. |
| `LOCALAI_PARALLEL` | Adds `--parallel` when set. |
| `LOCALAI_BATCH_SIZE` | Adds `--batch-size` when set. |
| `LOCALAI_UBATCH_SIZE` | Adds `--ubatch-size` when set. |
| `LOCALAI_FLASH_ATTN` | Adds `--flash-attn` when set to 1. |
| `LOCALAI_JINJA` | Adds `--jinja` when set to 1; default is 1. |
| `LOCALAI_MLOCK` | Adds `--mlock` when set to 1. |
| `LOCALAI_NO_MMAP` | Adds `--no-mmap` when set to 1. |
| `LOCALAI_EXTRA_LLAMA_ARGS` | Appends extra single-line llama-server flags. |
Override any of these for one start with the start command form shown in the
service command table, or edit ~/ai/conf/localai.conf to make the setting
persistent.
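As a rough illustration of how these variables translate into llama-server flags, the Python sketch below maps an environment dictionary to an argument list. It is not the helper scripts' actual logic: the function name is illustrative, the `--cache-type-k` and `--cache-type-v` flag names are assumed from llama.cpp rather than stated in the variable table, and config-file defaults such as `LOCALAI_JINJA=1` are not applied.

```python
import shlex

# Variables whose value is passed through after the flag.
VALUE_FLAGS = {
    "LOCALAI_CTX_SIZE": "--ctx-size",
    "LOCALAI_N_GPU_LAYERS": "--n-gpu-layers",
    "LOCALAI_THREADS": "-t",
    "LOCALAI_CACHE_TYPE_K": "--cache-type-k",  # assumed flag name
    "LOCALAI_CACHE_TYPE_V": "--cache-type-v",  # assumed flag name
    "LOCALAI_PARALLEL": "--parallel",
    "LOCALAI_BATCH_SIZE": "--batch-size",
    "LOCALAI_UBATCH_SIZE": "--ubatch-size",
}

# Variables that add a bare flag only when set to "1".
SWITCH_FLAGS = {
    "LOCALAI_FLASH_ATTN": "--flash-attn",
    "LOCALAI_JINJA": "--jinja",
    "LOCALAI_MLOCK": "--mlock",
    "LOCALAI_NO_MMAP": "--no-mmap",
}


def llama_server_args(env: dict[str, str]) -> list[str]:
    """Translate LOCALAI_* variables into a llama-server argument list."""
    args: list[str] = []
    for var, flag in VALUE_FLAGS.items():
        value = env.get(var, "")
        if value:  # unset or empty variables add nothing
            args += [flag, value]
    for var, flag in SWITCH_FLAGS.items():
        if env.get(var) == "1":
            args.append(flag)
    # Extra flags are split with shell-style quoting rules.
    args += shlex.split(env.get("LOCALAI_EXTRA_LLAMA_ARGS", ""))
    return args
```

Passing `dict(os.environ)` (after `import os`) would show which flags the current shell's overrides produce under these assumptions.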
Check the configured port and models:
cat ~/ai/conf/port
curl "http://127.0.0.1:$(cat ~/ai/conf/port)/v1/models"
Replace ~/ai with your selected install directory if needed.
Check GPU detection:
~/ai/bin/llama-server --list-devices
Check logs:
tail -n 100 ~/ai/logs/llama-swap.log
If a Hugging Face download returns 401 Unauthorized:
hf auth logout
hf auth login
hf auth whoami
The helper scripts bind llama-swap to 127.0.0.1, so the API is available only
on the local machine by default. Do not expose it to a network without adding
authentication, TLS, and appropriate firewall rules.
This project is built on top of llama.cpp and llama-swap.
Special thanks to the maintainers and contributors of these projects.
LocalAI focuses on simplifying installation, configuration, model management, and service deployment for local LLM environments.

