Handle: llamacpp
URL: http://localhost:33831
LLM inference in C/C++. Lets you bypass the Ollama release cycle when needed, for example to get access to the latest models or features.
The llamacpp Docker image is quite large due to its dependency on CUDA and other libraries. You might want to pull it ahead of time.
# [Optional] Pull the llamacpp
# images ahead of starting the service
harbor pull llamacpp
# Start the llama.cpp service
harbor up llamacpp
# Tail service logs
harbor logs llamacpp
# Open llamacpp Web UI
harbor open llamacpp
- Harbor will automatically allocate GPU resources to the container if available, see Capabilities Detection.
- llamacpp will be connected to aider, anythingllm, boost, chatui, cmdh, opint, optillm, plandex, traefik, webui services when they are running together.
You can find GGUF models to run on HuggingFace here.
# Search for models from the terminal
harbor hf find gguf gemma-4
Downloading models:
# Pull from HuggingFace (with optional quantization tag)
harbor pull unsloth/gemma-4-31B-it-GGUF
harbor pull unsloth/gemma-4-31B-it-GGUF:Q4_K_M
# Pull from HuggingFace via harbor models
harbor models pull unsloth/gemma-4-31B-it-GGUF
Downloaded models are stored in the HuggingFace cache (~/.cache/huggingface) on your local machine.
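To double-check what ended up on disk, you can inspect the cache directly on the host. This is a quick sketch assuming the default HuggingFace hub layout under ~/.cache/huggingface/hub:
# List cached model repositories on the host
ls ~/.cache/huggingface/hub
# Check how much disk space the cache takes
du -sh ~/.cache/huggingface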
Listing and removing models:
# List all locally available models (across Ollama, HF, and llama.cpp caches)
harbor models ls
# List as JSON for scripting
harbor models ls --json
# Remove a model
harbor models rm unsloth/gemma-4-31B-it-GGUF
# Remove only a specific quantization
harbor models rm unsloth/gemma-4-31B-it-GGUF:Q4_K_M
See the harbor models reference for full details on cross-source model management.
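If you want to use the JSON output in scripts, something like the following works, assuming jq is installed on the host (the exact fields depend on Harbor's output format, so adjust as needed):
# Hypothetical example: count locally available models
harbor models ls --json | jq 'length'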
Configuring which model to run:
# Set model via HuggingFace URL (downloaded on next start if not cached)
harbor llamacpp model https://huggingface.co/user/repo/blob/main/file.gguf
# Or set a path to a local GGUF file
harbor llamacpp gguf /path/to/model.gguf
The server runs one model at a time in single-model mode and must be restarted to switch models.
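Switching models is therefore a two-step operation: point the service at the new model, then start it again. A minimal sketch reusing only the commands shown above (the repository and file names are placeholders):
# Point llamacpp at a different model
harbor llamacpp model https://huggingface.co/user/other-repo/blob/main/other-file.gguf
# Start again so the new model is picked up (downloaded on next start if not cached)
harbor up llamacpp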
TLDR;
harbor config set llamacpp.model.specifier ""
harbor up llamacpp # shows all your downloaded models
The official llama.cpp server can run in router mode to load and unload multiple models dynamically. In Harbor, this maps to starting the service without a fixed model specifier and using extra args to point to model sources.
Start in router mode
Router mode requires no -m/--hf-repo arguments. Clear the model specifier and restart the service:
# Clear the model specifier (router mode requires no -m / --hf-repo)
harbor config set llamacpp.model.specifier ""
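# Optionally verify the specifier is now empty
# (harbor config get is also used later in this guide)
harbor config get llamacpp.model.specifier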
# Or set directly in services/llamacpp/override.env
HARBOR_LLAMACPP_MODEL_SPECIFIER=""
Model sources (official docs → Harbor paths)
- Cache (default): llama.cpp uses the HuggingFace cache to discover models. In Harbor this is mounted at /root/.cache/huggingface from HARBOR_HF_CACHE.
- Models directory: place GGUFs under ./services/llamacpp/data/models and point the router to /app/data/models.
- Preset file: place an INI file at ./services/llamacpp/data/models.ini and point the router to /app/data/models.ini.
# Use a models directory
harbor llamacpp args "--models-dir /app/data/models"
# Or use a preset file
harbor llamacpp args "--models-preset /app/data/models.ini"
# Optional router limits
harbor llamacpp args "--models-dir /app/data/models --models-max 4 --no-models-autoload"
If a model has multiple GGUF shards or an mmproj file for multimodal, place them in a subdirectory (the mmproj file name must start with mmproj).
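For example, a sharded multimodal model could be laid out in its own subdirectory under the models directory (all file names below are purely illustrative):
# Target layout inside ./services/llamacpp/data/models:
#   my-vlm/
#     my-vlm-00001-of-00002.gguf
#     my-vlm-00002-of-00002.gguf
#     mmproj-my-vlm.gguf   (must start with mmproj)
mkdir -p ./services/llamacpp/data/models/my-vlm
cp /path/to/my-vlm-*.gguf /path/to/mmproj-my-vlm.gguf ./services/llamacpp/data/models/my-vlm/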
Routing and model lifecycle
- POST endpoints route by the model field in the JSON body.
- GET endpoints use the ?model=... query parameter.
- Use /models to list known models and /models/load or /models/unload to manage them.
# List known models
curl http://localhost:33831/models
# Load a model
curl -X POST http://localhost:33831/models/load \
-H "Content-Type: application/json" \
-d '{"model":"ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"}'
# Unload a model
curl -X POST http://localhost:33831/models/unload \
-H "Content-Type: application/json" \
-d '{"model":"ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"}'
# Route a request to a specific model
curl http://localhost:33831/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"ggml-org/gemma-3-4b-it-GGUF:Q4_K_M","messages":[{"role":"user","content":"Hello"}]}'You can provide additional arguments to the llama.cpp CLI via the HARBOR_LLAMACPP_EXTRA_ARGS. It can be set either with Harbor CLI or in services/llamacpp/override.env.
# See llama.cpp server args
harbor run llamacpp --server --help
# Set the extra arguments
harbor llamacpp args '--max-tokens 1024 -ngl 100'
# Or set directly in services/llamacpp/override.env
HARBOR_LLAMACPP_EXTRA_ARGS="--max-tokens 1024 -ngl 100"
You can add llamacpp to the default services in Harbor:
# Add llamacpp to the default services
# Will always start when running `harbor up`
harbor defaults add llamacpp
# Remove llamacpp from the default services
harbor defaults rm llamacpp
The following options are available via harbor config:
# Legacy llama.cpp cache path (models are now stored in HF cache)
# or relative to $(harbor home)
HARBOR_LLAMACPP_CACHE ~/.cache/llama.cpp
# The port on the host machine where the llama.cpp service
# will be available
HARBOR_LLAMACPP_HOST_PORT 33831
# Docker images for each detected capability
HARBOR_LLAMACPP_IMAGE_CPU ghcr.io/ggml-org/llama.cpp:server
HARBOR_LLAMACPP_IMAGE_NVIDIA ghcr.io/ggml-org/llama.cpp:server-cuda
HARBOR_LLAMACPP_IMAGE_ROCM ghcr.io/ggml-org/llama.cpp:server-rocm
To switch the base image, set one or more of these variables to your preferred image/tag. Harbor picks the variable that matches the detected capability (CPU, NVIDIA, or ROCm), so you can override just one target without affecting the others. For example, use harbor config set llamacpp.image.nvidia ghcr.io/your-org/llama.cpp:server-cuda to keep the CPU/ROCm defaults while using a custom NVIDIA image.
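As a concrete sketch, you could check the capability configuration Harbor uses (the same command appears in the build section below) and then override only that image slot; the custom image name here is a placeholder:
# Check which capability profile Harbor applies by default
harbor config get capabilities.default
# Override only the NVIDIA image; CPU and ROCm keep their defaults
harbor config set llamacpp.image.nvidia ghcr.io/your-org/llama.cpp:server-cuda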
Pre-built llama.cpp Docker images can lag behind GitHub releases. When a new model drops and the updated image isn't available yet, you can build directly from the llama.cpp repository.
Harbor uses the build capability and cross-files to overlay a build: section onto the existing llamacpp compose config. GPU-variant Dockerfiles are selected automatically based on detected hardware.
Enable and build
# Enable building from source
harbor llamacpp build on
# (Optional) Pin to a specific release tag, branch, or commit
harbor llamacpp build ref b5678
# Build the image (auto-detects GPU, picks correct Dockerfile)
harbor build llamacpp
# Start as usual
harbor up llamacpp
The build tags the image with the same name as the pre-built image, so the rest of Harbor (cross-service integrations, model config, etc.) works unchanged.
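If you want to confirm what will actually run, you can inspect the local image tags with plain Docker (assuming the Docker CLI is available on the host):
# The locally built image reuses the pre-built tag, e.g. ghcr.io/ggml-org/llama.cpp:server-cuda
docker images ghcr.io/ggml-org/llama.cpp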
Switch back to pre-built images
# Disable build mode
harbor llamacpp build off
# Pull the official pre-built image to replace the local build
harbor pull llamacpp
# Start with the pre-built image
harbor up llamacpp
Configuration
# Check current build ref
harbor llamacpp build ref
# Set to a specific tag
harbor llamacpp build ref b5678
# Set to a branch
harbor llamacpp build ref master
# Set to a commit SHA
harbor llamacpp build ref abc123def
# Check current build capability status
harbor config get capabilities.default
The build uses the official Dockerfiles from the llama.cpp repository (.devops/llama-server.Dockerfile, .devops/llama-server-cuda.Dockerfile, .devops/llama-server-rocm.Dockerfile). If llama.cpp renames these files in the future, override them in the compose cross-files at services/compose.x.llamacpp.build.yml and its GPU variants.
AMD Strix Halo (gfx1151 / Radeon 8060S) is a unified-memory APU. Harbor can run llama.cpp on it through either ROCm or Vulkan. For Qwen3.6 MoE GGUF models, local testing showed the best single-stream decode performance with Vulkan/RADV.
Recommended path: Vulkan/RADV
Harbor detects Strix Halo as an AMD/ROCm target, so put the Vulkan image in the ROCm image slot and force the RADV ICD:
harbor config set llamacpp.image.rocm ghcr.io/ggml-org/llama.cpp:full-vulkan
harbor env llamacpp AMD_VULKAN_ICD RADV
harbor config set llamacpp.model.specifier ""
The full-vulkan image uses the llama.cpp wrapper entrypoint, so the server command starts with --server instead of llama-server:
harbor llamacpp args '--server -m /app/models/path/to/model.gguf -ngl 99 -fa 1 -c 16384 -np 1 --cache-ram 0 --no-mmproj'
harbor up llamacpp
Harbor mounts the host Hugging Face cache at /app/models inside the container. To convert a downloaded host path to a container path, replace the ~/.cache/huggingface prefix with /app/models.
Qwen3.6-35B-A3B Q8 example
harbor hf download unsloth/Qwen3.6-35B-A3B-GGUF Qwen3.6-35B-A3B-Q8_0.gguf
MODEL_PATH=$(harbor find Qwen3.6-35B-A3B-Q8_0.gguf | head -n1 | sed "s#^$HOME/.cache/huggingface#/app/models#")
harbor config set llamacpp.image.rocm ghcr.io/ggml-org/llama.cpp:full-vulkan
harbor env llamacpp AMD_VULKAN_ICD RADV
harbor config set llamacpp.model.specifier ""
harbor llamacpp args "--server -m ${MODEL_PATH} -ngl 99 -fa 1 -c 16384 -np 1 --cache-ram 0 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.00 --no-mmproj"
harbor up llamacpp
Local benchmark notes
These numbers were measured on a Ryzen AI Max+ 395 / Radeon 8060S / 128 GiB Strix Halo system with -ngl 99 and -fa 1.
| Setup | Result |
|---|---|
| Qwen3.6-35B-A3B Q8_0, Vulkan/RADV, llama-bench -p 512 -n 128 -r 3 | tg128 ≈ 49 tok/s |
| Qwen3.6-35B-A3B Q8_0, Harbor OpenAI server, 256-token response | decode ≈ 48 tok/s |
| Qwen3.6-35B-A3B Q8_0, ROCm llama.cpp | tg128 ≈ 36 tok/s |
| Qwen3.6-35B-A3B Q4_0-class GGUF, Vulkan/RADV | tg128 ≈ 75 tok/s |
| Qwen3.6-35B-A3B IQ4_NL, Vulkan/RADV | tg128 ≈ 61 tok/s |
| Qwen3.6-35B-A3B Q4_K_M / UD-Q4_K_M, Vulkan/RADV | tg128 ≈ 56 tok/s |
Q4_0 is the path that reproduces ~75 tok/s decode. Q8_0 is higher quality and measured around ~49 tok/s on this hardware. ROCm can have strong prefill, but Vulkan/RADV was faster for single-stream Qwen3.6 MoE decode.
ROCm option
The kyuz0/amd-strix-halo-toolboxes images are still useful for ROCm testing and include Strix Halo/gfx1151 fixes:
harbor config set llamacpp.image.rocm kyuz0/amd-strix-halo-toolboxes:rocm-7.2
harbor llamacpp args 'llama-server -fa 1 --no-mmap -ngl 99 -c 16384 --no-mmproj'
Use GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 with ROCm builds that otherwise only see the small BIOS VRAM window.
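That variable can be set for the service the same way as the Vulkan ICD above:
# Enable unified memory for ROCm builds (only needed when GPU memory appears capped)
harbor env llamacpp GGML_CUDA_ENABLE_UNIFIED_MEMORY 1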
Kernel parameters
For 128 GiB systems, use a large GTT aperture so GPU-visible unified memory is not capped too low:
amdgpu.gttsize=122880 ttm.pages_limit=335544321 amd_iommu=off transparent_hugepage=never mitigations=off
Verify after reboot:
cat /sys/class/drm/card*/device/mem_info_gtt_total
# Expect about 128849018880 for a 120 GiB GTT aperture
Follow Harbor's environment variables guide to set arbitrary variables for the llamacpp service.
llama.cpp comes with a number of helper tools/CLIs, all of which can be accessed via the harbor exec llamacpp command (once the service is running).
# Show the list of available llama.cpp CLIs
harbor exec llamacpp ls
# See the help for one of the CLIs
harbor exec llamacpp ./scripts/llama-bench --help
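As a further sketch, you can point llama-bench at a GGUF that is already in the mounted cache; the path below is a placeholder, so substitute the container path of your model (see the cache mount notes above):
# Hypothetical example: benchmark a cached model from inside the container
harbor exec llamacpp ./scripts/llama-bench -m /root/.cache/huggingface/<path-to-model>.gguf -p 512 -n 128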