Handle: llamacpp
URL: http://localhost:33831

LLM inference in C/C++. Allows you to bypass the Ollama release cycle when needed, for example to get access to the latest models or features.

Starting

The llamacpp Docker image is quite large due to its dependency on CUDA and other libraries. You might want to pull it ahead of time.

# [Optional] Pull the llamacpp image
# ahead of starting the service
harbor pull llamacpp

# Start the llama.cpp service
harbor up llamacpp

# Tail service logs
harbor logs llamacpp

# Open llamacpp Web UI
harbor open llamacpp
  • Harbor will automatically allocate GPU resources to the container if available, see Capabilities Detection; a quick check is shown after this list.
  • llamacpp will be connected to the aider, anythingllm, boost, chatui, cmdh, opint, optillm, plandex, traefik, and webui services when they are running together.
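
A quick way to see which capability Harbor detected (and therefore which image variant it will use) is to query the config, using the same key referenced in the Building from Source section below:

# Show the capabilities Harbor detected on this machine
harbor config get capabilities.default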

Models

You can find GGUF models to run on HuggingFace.

# Search for models from the terminal
harbor hf find gguf gemma-4

Downloading models:

# Pull from HuggingFace (with optional quantization tag)
harbor pull unsloth/gemma-4-31B-it-GGUF
harbor pull unsloth/gemma-4-31B-it-GGUF:Q4_K_M

# Pull from HuggingFace via harbor models
harbor models pull unsloth/gemma-4-31B-it-GGUF

Downloaded models are stored in the HuggingFace cache (~/.cache/huggingface) on your local machine.
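
If you want to see what is taking up space there, you can inspect the cache directly with plain shell commands (independent of Harbor):

# Total size of the HuggingFace cache
du -sh ~/.cache/huggingface

# Cached model repositories
ls ~/.cache/huggingface/hub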

Listing and removing models:

# List all locally available models (across Ollama, HF, and llama.cpp caches)
harbor models ls

# List as JSON for scripting
harbor models ls --json

# Remove a model
harbor models rm unsloth/gemma-4-31B-it-GGUF

# Remove only a specific quantization
harbor models rm unsloth/gemma-4-31B-it-GGUF:Q4_K_M

See the harbor models reference for full details on cross-source model management.

Configuring which model to run:

# Set model via HuggingFace URL (downloaded on next start if not cached)
harbor llamacpp model https://huggingface.co/user/repo/blob/main/file.gguf

# Or set a path to a local GGUF file
harbor llamacpp gguf /path/to/model.gguf

The server runs one model at a time in single-model mode and must be restarted to switch models.
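
For example, switching to another local GGUF is a config change followed by a restart (the path below is illustrative):

# Point the service at a different GGUF
harbor llamacpp gguf /path/to/another-model.gguf

# Restart so the new model is loaded
harbor down
harbor up llamacpp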

Multiple models (router mode)

TL;DR

harbor config set llamacpp.model.specifier ""
harbor up llamacpp # shows all your downloaded models

The official llama.cpp server can run in router mode to load and unload multiple models dynamically. In Harbor, this maps to starting the service without a fixed model specifier and using extra args to point to model sources.

Start in router mode

Router mode requires no -m/--hf-repo arguments. Clear the model specifier and restart the service:

# Clear the model specifier (router mode requires no -m / --hf-repo)
harbor config set llamacpp.model.specifier ""

# Or set directly in services/llamacpp/override.env
HARBOR_LLAMACPP_MODEL_SPECIFIER=""

Model sources (official docs → Harbor paths)

  • Cache (default): llama.cpp uses the HuggingFace cache to discover models. In Harbor this is mounted at /root/.cache/huggingface from HARBOR_HF_CACHE.
  • Models directory: place GGUFs under ./services/llamacpp/data/models and point the router to /app/data/models.
  • Preset file: place an INI file at ./services/llamacpp/data/models.ini and point the router to /app/data/models.ini.
# Use a models directory
harbor llamacpp args "--models-dir /app/data/models"

# Or use a preset file
harbor llamacpp args "--models-preset /app/data/models.ini"

# Optional router limits
harbor llamacpp args "--models-dir /app/data/models --models-max 4 --no-models-autoload"

If a model has multiple GGUF shards or an mmproj file for multimodal, place them in a subdirectory (the mmproj file name must start with mmproj).
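
For illustration, a sharded multimodal model in the models directory could be laid out like this (file names are hypothetical; only the mmproj prefix is significant):

services/llamacpp/data/models/
└── my-model/
    ├── my-model-00001-of-00002.gguf
    ├── my-model-00002-of-00002.gguf
    └── mmproj-my-model.gguf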

Routing and model lifecycle

  • POST endpoints route by the model field in the JSON body.
  • GET endpoints use the ?model=... query parameter.
  • Use /models to list known models and /models/load or /models/unload to manage them.
# List known models
curl http://localhost:33831/models

# Load a model
curl -X POST http://localhost:33831/models/load \
	-H "Content-Type: application/json" \
	-d '{"model":"ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"}'

# Unload a model
curl -X POST http://localhost:33831/models/unload \
	-H "Content-Type: application/json" \
	-d '{"model":"ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"}'

# Route a request to a specific model
curl http://localhost:33831/v1/chat/completions \
	-H "Content-Type: application/json" \
	-d '{"model":"ggml-org/gemma-3-4b-it-GGUF:Q4_K_M","messages":[{"role":"user","content":"Hello"}]}'

Configuration

You can provide additional arguments to the llama.cpp CLI via the HARBOR_LLAMACPP_EXTRA_ARGS variable. It can be set either with the Harbor CLI or in services/llamacpp/override.env.

# See llama.cpp server args
harbor run llamacpp --server --help

# Set the extra arguments
harbor llamacpp args '--max-tokens 1024 -ngl 100'

# Or set directly in services/llamacpp/override.env
HARBOR_LLAMACPP_EXTRA_ARGS="--max-tokens 1024 -ngl 100"

You can add llamacpp to default services in Harbor:

# Add llamacpp to the default services
# Will always start when running `harbor up`
harbor defaults add llamacpp

# Remove llamacpp from the default services
harbor defaults rm llamacpp

The following options are available via harbor config:

# Legacy llama.cpp cache path (models are now stored in the HF cache);
# can be absolute or relative to $(harbor home)
HARBOR_LLAMACPP_CACHE          ~/.cache/llama.cpp

# The port on the host machine where the llama.cpp service
# will be available
HARBOR_LLAMACPP_HOST_PORT      33831

# Docker images for each detected capability
HARBOR_LLAMACPP_IMAGE_CPU      ghcr.io/ggml-org/llama.cpp:server
HARBOR_LLAMACPP_IMAGE_NVIDIA   ghcr.io/ggml-org/llama.cpp:server-cuda
HARBOR_LLAMACPP_IMAGE_ROCM     ghcr.io/ggml-org/llama.cpp:server-rocm

To switch the base image, set one or more of these variables to your preferred image/tag. Harbor picks the variable that matches the detected capability (CPU, NVIDIA, or ROCm), so you can override just one target without affecting the others. For example, to keep the CPU/ROCm defaults while using a custom NVIDIA image:
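
# Override only the NVIDIA image; CPU and ROCm keep their defaults
harbor config set llamacpp.image.nvidia ghcr.io/your-org/llama.cpp:server-cuda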

Building from Source

Pre-built llama.cpp Docker images can lag behind GitHub releases. When a new model drops and the updated image isn't available yet, you can build directly from the llama.cpp repository.

Harbor uses the build capability and cross-files to overlay a build: section onto the existing llamacpp compose config. GPU-variant Dockerfiles are selected automatically based on detected hardware.

Enable and build

# Enable building from source
harbor llamacpp build on

# (Optional) Pin to a specific release tag, branch, or commit
harbor llamacpp build ref b5678

# Build the image (auto-detects GPU, picks correct Dockerfile)
harbor build llamacpp

# Start as usual
harbor up llamacpp

The build tags the image with the same name as the pre-built image, so the rest of Harbor (cross-service integrations, model config, etc.) works unchanged.
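
One way to verify this outside of Harbor is to check your local Docker images for the expected tag:

# The locally built image should carry the same ghcr.io/ggml-org/llama.cpp tag
docker images | grep 'ggml-org/llama.cpp'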

Switch back to pre-built images

# Disable build mode
harbor llamacpp build off

# Pull the official pre-built image to replace the local build
harbor pull llamacpp

# Start with the pre-built image
harbor up llamacpp

Configuration

# Check current build ref
harbor llamacpp build ref

# Set to a specific tag
harbor llamacpp build ref b5678

# Set to a branch
harbor llamacpp build ref master

# Set to a commit SHA
harbor llamacpp build ref abc123def

# Check current build capability status
harbor config get capabilities.default

The build uses the official Dockerfiles from the llama.cpp repository (.devops/llama-server.Dockerfile, .devops/llama-server-cuda.Dockerfile, .devops/llama-server-rocm.Dockerfile). If llama.cpp renames these files in the future, override them in the compose cross-files at services/compose.x.llamacpp.build.yml and its GPU variants.
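
If you need to do that, the relevant files live in the Harbor workspace (the glob below assumes the GPU-variant cross-files follow the same naming pattern):

# Locate the llama.cpp build cross-files
ls "$(harbor home)"/services/compose.x.llamacpp.build*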

llama.cpp with Strix Halo

AMD Strix Halo (gfx1151 / Radeon 8060S) is a unified-memory APU. Harbor can run llama.cpp on it through either ROCm or Vulkan. For Qwen3.6 MoE GGUF models, local testing showed the best single-stream decode performance with Vulkan/RADV.

Recommended path: Vulkan/RADV

Harbor detects Strix Halo as an AMD/ROCm target, so put the Vulkan image in the ROCm image slot and force the RADV ICD:

harbor config set llamacpp.image.rocm ghcr.io/ggml-org/llama.cpp:full-vulkan
harbor env llamacpp AMD_VULKAN_ICD RADV
harbor config set llamacpp.model.specifier ""

The full-vulkan image uses the llama.cpp wrapper entrypoint, so the server command starts with --server instead of llama-server:

harbor llamacpp args '--server -m /app/models/path/to/model.gguf -ngl 99 -fa 1 -c 16384 -np 1 --cache-ram 0 --no-mmproj'
harbor up llamacpp

Harbor mounts the host Hugging Face cache at /app/models inside the container. To convert a downloaded host path to a container path, replace the ~/.cache/huggingface prefix with /app/models.

Qwen3.6-35B-A3B Q8 example

harbor hf download unsloth/Qwen3.6-35B-A3B-GGUF Qwen3.6-35B-A3B-Q8_0.gguf

MODEL_PATH=$(harbor find Qwen3.6-35B-A3B-Q8_0.gguf | head -n1 | sed "s#^$HOME/.cache/huggingface#/app/models#")

harbor config set llamacpp.image.rocm ghcr.io/ggml-org/llama.cpp:full-vulkan
harbor env llamacpp AMD_VULKAN_ICD RADV
harbor config set llamacpp.model.specifier ""
harbor llamacpp args "--server -m ${MODEL_PATH} -ngl 99 -fa 1 -c 16384 -np 1 --cache-ram 0 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.00 --no-mmproj"

harbor up llamacpp

Local benchmark notes

These numbers were measured on a Ryzen AI Max+ 395 / Radeon 8060S / 128 GiB Strix Halo system with -ngl 99 and -fa 1.

Setup | Result
Qwen3.6-35B-A3B Q8_0, Vulkan/RADV, llama-bench -p 512 -n 128 -r 3 | tg128 ≈ 49 tok/s
Qwen3.6-35B-A3B Q8_0, Harbor OpenAI server, 256-token response | decode ≈ 48 tok/s
Qwen3.6-35B-A3B Q8_0, ROCm llama.cpp | tg128 ≈ 36 tok/s
Qwen3.6-35B-A3B Q4_0-class GGUF, Vulkan/RADV | tg128 ≈ 75 tok/s
Qwen3.6-35B-A3B IQ4_NL, Vulkan/RADV | tg128 ≈ 61 tok/s
Qwen3.6-35B-A3B Q4_K_M / UD-Q4_K_M, Vulkan/RADV | tg128 ≈ 56 tok/s

Q4_0 is the path that reproduces ~75 tok/s decode. Q8_0 is higher quality and measured around ~49 tok/s on this hardware. ROCm can have strong prefill, but Vulkan/RADV was faster for single-stream Qwen3.6 MoE decode.

ROCm option

The kyuz0/amd-strix-halo-toolboxes images are still useful for ROCm testing and include Strix Halo/gfx1151 fixes:

harbor config set llamacpp.image.rocm kyuz0/amd-strix-halo-toolboxes:rocm-7.2
harbor llamacpp args 'llama-server -fa 1 --no-mmap -ngl 99 -c 16384 --no-mmproj'

Use GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 with ROCm builds that otherwise only see the small BIOS VRAM window.
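
If you prefer to keep this in Harbor's config rather than exporting it manually, the same harbor env mechanism used for AMD_VULKAN_ICD above works here:

# Set unified memory for ROCm builds via Harbor
harbor env llamacpp GGML_CUDA_ENABLE_UNIFIED_MEMORY 1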

Kernel parameters

For 128 GiB systems, use a large GTT aperture so GPU-visible unified memory is not capped too low:

amdgpu.gttsize=122880 ttm.pages_limit=335544321 amd_iommu=off transparent_hugepage=never mitigations=off

Verify after reboot:

cat /sys/class/drm/card*/device/mem_info_gtt_total
# Expect about 128849018880 for a 120 GiB GTT aperture

Environment Variables

Follow Harbor's environment variables guide to set arbitrary variables for the llamacpp service.
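
For example, using the same mechanism shown in the Strix Halo section above (the variable name and value here are placeholders):

# Set an arbitrary environment variable for the llamacpp container
harbor env llamacpp MY_VARIABLE some-value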

llama.cpp CLIs and scripts

llama.cpp ships with a number of helper tools/CLIs, all of which can be accessed via the harbor exec llamacpp command (once the service is running).

# Show the list of available llama.cpp CLIs
harbor exec llamacpp ls

# See the help for one of the CLIs
harbor exec llamacpp ./scripts/llama-bench --help