Local AI Server for Linux

Run GGUF language models locally with llama.cpp and llama-swap, on the CPU or with GPU acceleration. The server exposes an OpenAI-compatible API and discovers models placed in the models folder of the install directory (~/ai/models by default).

What it provides

  • OpenAI-compatible chat and completion endpoints
  • CPU mode plus optional Vulkan, ROCm, OpenVINO, or SYCL llama.cpp backends
  • Automatic discovery of .gguf model files
  • On-demand model loading and switching through llama-swap
  • A systemd user service
  • A localai command for service, model, update, and uninstall tasks

Requirements

  • Ubuntu, Debian, Fedora, RHEL, or another compatible x86-64 Linux system
  • A working CPU install, or a supported GPU/runtime for your selected backend
  • sudo access during installation
  • Enough RAM and VRAM for the model and quantization you choose

The installer uses the known compatible releases llama.cpp b9672 and llama-swap v226. The separate update script checks for newer releases. The default llama.cpp backend is vulkan. For CPU-only machines or simple VM testing, use LLAMA_CPP_BACKEND=cpu; CPU installs use smaller defaults and no GPU offload.

The installer can install required packages with apt-get, dnf, or yum.

Install

One-line install:

curl -fsSL https://hossbit.github.io/localai/install.sh | bash

Custom install directory:

curl -fsSL https://hossbit.github.io/localai/install.sh | LOCALAI_DIR="$HOME/my-ai" bash

Manual install:

git clone https://github.com/hossbit/local-ai-server.git
cd local-ai-server
chmod +x ./*.sh
./install-local-ai.sh

The installer asks where to install LocalAI:

LocalAI install directory [~/ai]:

Press Enter to use the default ~/ai. To choose the path without a prompt, set LOCALAI_DIR:

LOCALAI_DIR=~/my-ai ./install-local-ai.sh

Or pass --dir:

./install-local-ai.sh --dir ~/my-ai

Choose a llama.cpp backend with LLAMA_CPP_BACKEND. The default is vulkan.

LLAMA_CPP_BACKEND=cpu ./install-local-ai.sh
LLAMA_CPP_BACKEND=vulkan ./install-local-ai.sh
LLAMA_CPP_BACKEND=rocm ./install-local-ai.sh
LLAMA_CPP_BACKEND=openvino ./install-local-ai.sh
LLAMA_CPP_BACKEND=sycl-fp16 ./install-local-ai.sh
LLAMA_CPP_BACKEND=sycl-fp32 ./install-local-ai.sh

The installer does not start the server automatically. If no .gguf files are found in the models directory, it prints a warning because chat requests need a model. Add at least one model, then use the service commands below to start and check LocalAI.

To start it now and have it start automatically each time you log in:

systemctl --user enable --now localai

Add a model

Place one or more .gguf files in:

~/ai/models

If you installed somewhere else, use that directory's models folder instead.

For example, with the Hugging Face CLI:

python3 -m pip install --user huggingface_hub
hf auth login

hf download bartowski/Qwen2.5-Coder-7B-Instruct-GGUF \
  Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
  --local-dir ~/ai/models

Some model repositories require a Hugging Face account and read token. See Hugging Face access tokens.

The model ID exposed by the API is the filename without .gguf. For example:

Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf

becomes:

Qwen2.5-Coder-7B-Instruct-Q4_K_M
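
The filename-to-ID rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the project's own discovery code; the function names are hypothetical, and the sketch assumes that only files directly inside the models folder are considered.

```python
from pathlib import Path


def model_id_for(filename: str) -> str:
    """Return the model ID for a .gguf filename (the name without .gguf)."""
    # Path.stem drops only the final suffix, so dots inside the name
    # (as in "Qwen2.5") are preserved.
    return Path(filename).stem


def installed_model_ids(models_dir: Path) -> list[str]:
    """List the model IDs implied by the .gguf files directly in models_dir.

    Assumption: subfolders are not searched; the real discovery may differ.
    """
    return sorted(model_id_for(p.name) for p in models_dir.glob("*.gguf"))
```

For a default install, installed_model_ids(Path.home() / "ai" / "models") should list the same IDs that the API reports, provided the assumption about subfolders holds.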

Use the server

Read the selected port:

PORT=$(cat ~/ai/conf/port)

For a custom install directory:

PORT=$(cat ~/my-ai/conf/port)

List available models:

curl "http://127.0.0.1:${PORT}/v1/models"
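
The same lookup can be done from Python with only the standard library. This is a minimal sketch: the function names are hypothetical, and it assumes the response follows the usual OpenAI list shape, a JSON object whose "data" array holds entries with an "id" field.

```python
import json
from pathlib import Path
from urllib.request import urlopen


def read_port(install_dir: Path) -> int:
    """Read the port number stored in <install_dir>/conf/port."""
    return int((install_dir / "conf" / "port").read_text().strip())


def ids_from_models_response(body: str) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models JSON body."""
    payload = json.loads(body)
    # Assumption: {"object": "list", "data": [{"id": "..."}, ...]}
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(install_dir: Path, timeout: float = 5.0) -> list[str]:
    """Query the local server and return the model IDs it reports."""
    port = read_port(install_dir)
    url = f"http://127.0.0.1:{port}/v1/models"
    with urlopen(url, timeout=timeout) as resp:
        return ids_from_models_response(resp.read().decode("utf-8"))
```

With the service running, list_models(Path.home() / "ai") returns the IDs for a default install; pass your own directory if you installed elsewhere.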

Send a chat request:

MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M"

curl "http://127.0.0.1:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL}\",
    \"messages\": [
      {\"role\": \"user\", \"content\": \"What is Linux?\"}
    ]
  }"
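
If the shell quoting in the curl call gets unwieldy, the request body can be built with a JSON encoder instead. The sketch below uses only the Python standard library; the function names are hypothetical, and it assumes the reply has the standard chat completion shape with the text at choices[0].message.content.

```python
import json
from urllib.request import Request, urlopen


def build_chat_request(port: int, model: str, prompt: str) -> Request:
    """Build a POST request for /v1/chat/completions with one user message."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"http://127.0.0.1:{port}/v1/chat/completions",
        # json.dumps handles quotes and newlines in the prompt safely.
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(port: int, model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the assistant's reply text."""
    # A generous timeout, since the first request may wait for a model load.
    with urlopen(build_chat_request(port, model, prompt), timeout=timeout) as resp:
        reply = json.loads(resp.read().decode("utf-8"))
    return reply["choices"][0]["message"]["content"]
```

Call chat(port, "Qwen2.5-Coder-7B-Instruct-Q4_K_M", "What is Linux?") with the port read from conf/port.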

Python with the OpenAI SDK:

from pathlib import Path
from openai import OpenAI

port = Path.home().joinpath("ai/conf/port").read_text().strip()
client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="local")

response = client.chat.completions.create(
    model="Qwen2.5-Coder-7B-Instruct-Q4_K_M",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)

The local server does not validate api_key, but OpenAI client libraries usually require a non-empty value.

Service and helper commands

Most users only need these:

Command               Purpose
localai start         Start the service.
localai stop          Unload loaded models, then stop the service.
localai restart       Restart the service.
localai status        Show service, process, API, and port status.
localai check         Check the API and model list.
localai logs          Follow LocalAI logs.
localai models        List installed .gguf models and show loaded state when the API is reachable.
localai load MODEL    Warm one model, for example localai load Qwen2.5-Coder-7B-Instruct-Q4_K_M.
localai unload MODEL  Release one loaded model.
localai unload all    Release all loaded models.
localai update        Update installed components.
localai version       Show component versions.
localai uninstall     Remove helper files; models are kept by default.

Advanced forms:

Command                                                      Purpose
localai check --chat                                         Also send a tiny chat request.
localai load all                                             Warm every model; use only when you have enough memory.
localai update --no-start                                    Update and leave the service stopped.
LLAMA_CPP_BACKEND=cpu localai update                         Switch backend during update.
LOCALAI_CTX_SIZE=8192 LOCALAI_N_GPU_LAYERS=20 localai start  Override runtime settings for one start.
LOCALAI_FLASH_ATTN=1 LOCALAI_PARALLEL=2 localai start        Enable optional llama-server tuning for one start.
localai uninstall --remove-models                            Also remove downloaded models.
localai uninstall --dir ~/my-ai                              Uninstall from a custom directory.
localai uninstall --remove-llama-swap                        Also remove the per-user llama-swap binary.

Configuration

Shared defaults live in:

localai.conf            # source default
~/ai/conf/localai.conf  # installed copy

This file contains install paths, service names, port settings, and llama.cpp runtime defaults. Environment variables still override the config for one command.

bin/rebuild-config.sh creates conf/config.yaml from every .gguf file in the install directory's models folder. It runs automatically whenever the server starts.
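
To give a feel for what such a generated file might contain, here is a rough Python sketch that renders one llama-swap entry per model file. It is not a port of bin/rebuild-config.sh, whose actual output may differ. It assumes llama-swap's documented layout of a top-level models map whose entries carry a cmd string (with the ${PORT} macro) and a ttl in seconds; the function name and the minimal flag set are hypothetical.

```python
import json
from pathlib import Path


def render_config(gguf_files: list[Path], llama_server: Path, ttl: int = 900) -> str:
    """Render a minimal llama-swap style config.yaml as text.

    Assumption: each model needs only `cmd` and `ttl`; a real config
    would also carry the tuning flags from localai.conf.
    """
    lines = ["models:"]
    for gguf in sorted(gguf_files):
        # ${PORT} is left literal so llama-swap can substitute it.
        cmd = f"{llama_server} --port ${{PORT}} -m {gguf}"
        # JSON strings are valid YAML double-quoted scalars, which keeps
        # model names with dots or unusual characters safe as keys.
        lines.append(f"  {json.dumps(gguf.stem)}:")
        lines.append(f"    cmd: {json.dumps(cmd)}")
        lines.append(f"    ttl: {ttl}")
    return "\n".join(lines) + "\n"
```

The sketch does not handle paths that contain spaces; those would need extra quoting inside cmd.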

Default runtime settings are:

  • Vulkan and other GPU-capable backends: context size 16384, GPU layers 8
  • CPU backend: context size 4096, GPU layers 0
  • Threads: 6
  • KV cache: q4_0
  • Jinja chat templates: enabled
  • Flash attention, mlock, no-mmap, parallel, batch size, and ubatch size: disabled unless configured
  • Idle model timeout: 900 seconds
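
The defaults listed above, and the way environment variables override them for a single command, can be modelled roughly as follows. This is a sketch for orientation only: the dictionary keys and function names are hypothetical, and treating every backend other than cpu as GPU-capable is an assumption.

```python
# Backend names taken from the install examples; grouping them all as
# "GPU-capable" is an assumption made for this sketch.
GPU_BACKENDS = {"vulkan", "rocm", "openvino", "sycl-fp16", "sycl-fp32"}


def default_settings(backend: str = "vulkan") -> dict:
    """Return the documented runtime defaults for a backend."""
    if backend == "cpu":
        ctx_size, n_gpu_layers = 4096, 0
    elif backend in GPU_BACKENDS:
        ctx_size, n_gpu_layers = 16384, 8
    else:
        raise ValueError(f"unknown backend: {backend!r}")
    return {
        "ctx_size": ctx_size,
        "n_gpu_layers": n_gpu_layers,
        "threads": 6,
        "kv_cache_type": "q4_0",
        "jinja": True,
        "idle_timeout_seconds": 900,
    }


def effective_settings(backend: str, env: dict) -> dict:
    """Apply LOCALAI_* overrides from env on top of the defaults."""
    settings = default_settings(backend)
    overrides = {
        "LOCALAI_CTX_SIZE": "ctx_size",
        "LOCALAI_N_GPU_LAYERS": "n_gpu_layers",
        "LOCALAI_THREADS": "threads",
    }
    for var, key in overrides.items():
        if env.get(var):  # unset or empty means "keep the default"
            settings[key] = int(env[var])
    return settings
```

For example, effective_settings("vulkan", {"LOCALAI_CTX_SIZE": "8192"}) keeps 8 GPU layers but lowers the context size to 8192.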

Useful llama-server tuning variables:

Variable                                     Effect
LOCALAI_CTX_SIZE                             Sets --ctx-size.
LOCALAI_N_GPU_LAYERS                         Sets --n-gpu-layers.
LOCALAI_THREADS                              Sets -t.
LOCALAI_CACHE_TYPE_K / LOCALAI_CACHE_TYPE_V  Set KV cache quantization.
LOCALAI_PARALLEL                             Adds --parallel when set.
LOCALAI_BATCH_SIZE                           Adds --batch-size when set.
LOCALAI_UBATCH_SIZE                          Adds --ubatch-size when set.
LOCALAI_FLASH_ATTN                           Adds --flash-attn when set to 1.
LOCALAI_JINJA                                Adds --jinja when set to 1; default is 1.
LOCALAI_MLOCK                                Adds --mlock when set to 1.
LOCALAI_NO_MMAP                              Adds --no-mmap when set to 1.
LOCALAI_EXTRA_LLAMA_ARGS                     Appends extra single-line llama-server flags.
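
As a reading aid, the variable-to-flag mapping in the table can be expressed as a small Python function. This is a sketch of the table, not the helper scripts' actual logic; the function name is hypothetical. The two cache-type variables are left out because the table does not name their flags, and the order of the resulting arguments is arbitrary.

```python
import shlex

# Variables whose value is passed through after the flag.
VALUE_FLAGS = {
    "LOCALAI_CTX_SIZE": "--ctx-size",
    "LOCALAI_N_GPU_LAYERS": "--n-gpu-layers",
    "LOCALAI_THREADS": "-t",
    "LOCALAI_PARALLEL": "--parallel",
    "LOCALAI_BATCH_SIZE": "--batch-size",
    "LOCALAI_UBATCH_SIZE": "--ubatch-size",
}

# Variables that add a bare flag when set to "1".
SWITCH_FLAGS = {
    "LOCALAI_FLASH_ATTN": "--flash-attn",
    "LOCALAI_JINJA": "--jinja",
    "LOCALAI_MLOCK": "--mlock",
    "LOCALAI_NO_MMAP": "--no-mmap",
}


def llama_server_args(env: dict) -> list[str]:
    """Translate LOCALAI_* variables into llama-server arguments."""
    args: list[str] = []
    for var, flag in VALUE_FLAGS.items():
        value = env.get(var, "")
        if value:
            args += [flag, value]
    for var, flag in SWITCH_FLAGS.items():
        # Jinja templates are on by default; the other switches are off.
        default = "1" if var == "LOCALAI_JINJA" else "0"
        if env.get(var, default) == "1":
            args.append(flag)
    # Extra flags are split the way a POSIX shell would split them.
    args += shlex.split(env.get("LOCALAI_EXTRA_LLAMA_ARGS", ""))
    return args
```

Passing dict(os.environ) shows which flags the current shell environment would add under this reading of the table.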

To override any of these for a single start, prefix localai start with the variable assignments, as in the advanced command forms above. To make a setting persistent, edit ~/ai/conf/localai.conf.

Troubleshooting

Check the configured port and models:

cat ~/ai/conf/port
curl "http://127.0.0.1:$(cat ~/ai/conf/port)/v1/models"

Replace ~/ai with your selected install directory if needed.
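
If curl reports a connection error, it can help to confirm whether anything is listening on the port at all before looking at models. A minimal standard-library sketch, with a hypothetical function name:

```python
import socket


def port_is_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A False result for the port in conf/port suggests the service is not running; a True result with a failing API call points at the server or model configuration instead.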

Check GPU detection:

~/ai/bin/llama-server --list-devices

Check logs:

tail -n 100 ~/ai/logs/llama-swap.log

If a Hugging Face download returns 401 Unauthorized:

hf auth logout
hf auth login
hf auth whoami

Security

The helper scripts bind llama-swap to 127.0.0.1, so the API is available only on the local machine by default. Do not expose it to a network without adding authentication, TLS, and appropriate firewall rules.

Credits

This project is built on top of llama.cpp and llama-swap.

Special thanks to the maintainers and contributors of these projects.

LocalAI focuses on simplifying installation, configuration, model management, and service deployment for local LLM environments.

Support

If this repo helped you, give it a star.
