refactor: remove embedded llama.cpp, rely on Ollama for local inference by wmeddie · Pull Request #103 · XpressAI/xpressclaw

wmeddie · 2026-05-07T01:21:09Z

Summary

Remove embedded llama.cpp entirely. Local inference goes through Ollama only.

Stacked on #102 — base branch is fix/per-agent-llm-config-v2, not main. Land #102 first, then this.

Why

Two recurring problems with the embedded path:

Crashes. llama-cpp-2 invariant violations (KV-cache type mismatches, sampler lifetime issues) brought down the whole server process. Tracking these down required deep familiarity with llama.cpp internals and was a tax on every other change in the LLM stack.
Hardware-build complexity. Releases needed cross-compiling with the right combination of features per platform (metal for Apple Silicon, cuda for Linux+NVIDIA, plain CPU otherwise). CUDA in particular needed bespoke CUDA_PATH/RUSTFLAGS setup in build.sh because find_cuda_helper only searches <root>/lib64, missing Debian/Ubuntu multiarch layouts.

Ollama solves both: it ships its own platform-tuned builds, runs out-of-process so crashes don't take down the server, and we already speak its HTTP API.
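As a rough illustration of what "out-of-process over HTTP" means in practice, here is a minimal sketch using only the Rust standard library. It is not the project's LocalProvider; the function names are hypothetical, and it assumes Ollama's documented POST /api/pull endpoint with a JSON body carrying a model field. The point is that a dead or misbehaving Ollama process surfaces as an ordinary Err on the caller's side rather than a crash inside the server.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Builds a raw HTTP/1.1 request for Ollama's `POST /api/pull` endpoint.
/// Hypothetical helper; assumes `model` contains no characters that
/// would need JSON escaping.
fn build_pull_request(host: &str, model: &str) -> String {
    let body = format!(r#"{{"model":"{model}","stream":false}}"#);
    format!(
        "POST /api/pull HTTP/1.1\r\n\
         Host: {host}\r\n\
         Content-Type: application/json\r\n\
         Content-Length: {len}\r\n\
         Connection: close\r\n\
         \r\n\
         {body}",
        len = body.len()
    )
}

/// Sends a prepared request and returns the raw response text.
/// A refused connection or a broken stream comes back as `Err`,
/// so the calling process keeps running.
fn send(addr: &str, request: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

Calling send("127.0.0.1:11434", &build_pull_request("localhost:11434", "llama3.2")) would talk to a default local Ollama install, assuming it listens on its usual port 11434; the model tag here is only an example.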

ADR-023 captures the decision; ADR-011 is marked superseded.

What changed

Removed:

crates/xpressclaw-core/src/llm/llamacpp.rs (~1400 lines)
llama-cpp-2, hf-hub, encoding_rs workspace dependencies
local-llm, metal, cuda Cargo features (core, server, cli)
AgentLlmConfig.model_path field
The "local" arm in LlmRouter::materialize_provider
DownloadProgress / DownloadStatus / download_status route
use_embedded request flag and the wizard's download-progress UI
resolve_gguf_source and the post-setup GGUF download flow
nvcc/Metal detection in build.sh
--skip llamacpp filter in CI

Kept:

provider: ollama with HTTP proxy through LocalProvider
The reconciler's per-host Ollama pull loop (reconcile_models from #102), which is now the only model-acquisition path
Hardware detection / recommend_model — feeds the Ollama tag picker

Net: -1948 lines.
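To make the "per-host pull loop" idea concrete, the sketch below computes which model tags each Ollama host still needs. It is not the real reconcile_models: the function name, the map shapes, and the example tags are assumptions made for illustration, and the actual pull call is left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical planning step for a pull loop: for each host, list the
/// model tags that agents want but the host has not pulled yet.
/// Hosts with nothing missing are left out of the plan.
fn missing_models(
    wanted: &BTreeMap<String, BTreeSet<String>>,
    present: &BTreeMap<String, BTreeSet<String>>,
) -> BTreeMap<String, Vec<String>> {
    let mut plan = BTreeMap::new();
    for (host, tags) in wanted {
        let missing: Vec<String> = match present.get(host) {
            // Host is known: pull only what it lacks.
            Some(have) => tags.difference(have).cloned().collect(),
            // Host has reported nothing yet: pull everything it needs.
            None => tags.iter().cloned().collect(),
        };
        if !missing.is_empty() {
            plan.insert(host.clone(), missing);
        }
    }
    plan
}
```

A reconciler built this way would run the plan on every pass, so a model that failed to pull last time is simply retried on the next one.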

Future (not in this PR)

XpressAI has GGUF builds (e.g. Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF) that outperform base Qwen3.5/3.6. Using them after this PR requires publishing to Ollama Hub first. ADR-023's "Future" section captures it.

A POST /api/setup/pull-model with SSE progress would be a nice follow-up so the wizard can show "pulling X of Y MB" without reintroducing the embedded path.
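For a sense of what such a follow-up might emit, here is a small sketch of turning a byte count into a "pulling X of Y MB" Server-Sent Events frame. The endpoint does not exist yet, so everything here is hypothetical: the PullProgress type, the progress event name, and the choice of whole mebibytes rounded down. Only the SSE framing itself (an event line, a data line, then a blank line) follows the standard format.

```rust
/// One progress update for a model pull, in bytes.
/// Hypothetical type: not part of xpressclaw or Ollama.
struct PullProgress {
    completed: u64,
    total: u64,
}

const BYTES_PER_MB: u64 = 1024 * 1024;

impl PullProgress {
    /// Human-readable message in whole MB, rounded down.
    fn message(&self) -> String {
        format!(
            "pulling {} of {} MB",
            self.completed / BYTES_PER_MB,
            self.total / BYTES_PER_MB
        )
    }

    /// Encodes the update as one SSE frame: an `event:` line,
    /// a `data:` line, and a terminating blank line.
    fn to_sse_frame(&self) -> String {
        format!("event: progress\ndata: {}\n\n", self.message())
    }
}
```

The wizard could then subscribe with a browser EventSource and render each data line as it arrives, with no embedded download path involved.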

Migration

Existing xpressai.yaml files using provider: local will warn at router-build time and the agent will be skipped:

unknown provider type 'local'. Supported providers: openai, anthropic, ollama.

To migrate, edit the agent's llm block: set provider: ollama, set model: to the right Ollama tag, then run xpressclaw up. The reconciler pulls the model from Ollama on first start.
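The skip-with-warning behaviour can be pictured with the sketch below. It is a simplified stand-in, not the actual LlmRouter::materialize_provider: the enum, both function names, and the tuple-based agent list are hypothetical, and only the error text is taken from the message quoted above.

```rust
/// Hypothetical stand-in for the provider kinds the router still accepts.
#[derive(Debug, PartialEq)]
enum ProviderKind {
    OpenAi,
    Anthropic,
    Ollama,
}

/// Maps a config string to a provider kind. Anything else, including the
/// removed "local", is rejected with the documented message.
fn parse_provider(kind: &str) -> Result<ProviderKind, String> {
    match kind {
        "openai" => Ok(ProviderKind::OpenAi),
        "anthropic" => Ok(ProviderKind::Anthropic),
        "ollama" => Ok(ProviderKind::Ollama),
        other => Err(format!(
            "unknown provider type '{other}'. Supported providers: openai, anthropic, ollama."
        )),
    }
}

/// Builds the list of usable agents from (name, provider) pairs and collects
/// one warning per skipped agent instead of failing the whole build.
fn build_router(agents: &[(&str, &str)]) -> (Vec<(String, ProviderKind)>, Vec<String>) {
    let mut ready = Vec::new();
    let mut warnings = Vec::new();
    for (name, provider) in agents {
        match parse_provider(provider) {
            Ok(kind) => ready.push((name.to_string(), kind)),
            Err(msg) => warnings.push(format!("agent '{name}' skipped: {msg}")),
        }
    }
    (ready, warnings)
}
```

With one agent on ollama and one still on local, this yields one ready agent and one warning, which matches the "warn and skip" behaviour described in this section.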

Test plan

cargo fmt --all --check clean
cargo clippy --workspace --all-targets clean
cargo test --workspace: 401 passed (down 1 from #102 because the 6 ignored llamacpp tests are gone with the file)
npx svelte-check — 0 errors
Manual: setup wizard with provider=ollama and a model that isn't yet pulled — verify agent appears, reconciler pulls in background, agent starts replying once pull completes
Manual: edit existing config to provider: local — verify the agent is skipped with the documented error
Manual: build with ./build.sh and confirm no GPU SDK is required

The embedded llama.cpp path (provider=local, GGUF download via hf-hub, LazyLlamaCppProvider, metal/cuda/local-llm Cargo features) is removed. Ollama becomes the only supported local backend. ADR-023 captures the decision; ADR-011 is marked superseded.

Why: in-process llama.cpp invariant violations were taking down the server, and shipping per-platform builds (Metal, CUDA, CPU) added a lot of build-time complexity for releases. Ollama runs out-of-process with its own platform-tuned builds, and we already speak its HTTP API.

Removed:
- crates/xpressclaw-core/src/llm/llamacpp.rs
- llama-cpp-2, hf-hub, encoding_rs from workspace + core Cargo.toml
- local-llm, metal, cuda features (core, server, cli)
- AgentLlmConfig.model_path field
- The "local" arm in LlmRouter::materialize_provider
- DownloadProgress / DownloadStatus / download_status route
- use_embedded request flag
- resolve_gguf_source and the post-setup GGUF download flow
- nvcc/Metal detection in build.sh
- "--skip llamacpp" filter in CI
- Wizard's Built-in provider button + download progress UI

Kept:
- provider=ollama with HTTP proxy (LocalProvider)
- Reconciler's per-host Ollama pull loop (reconcile_models)
- Hardware detection + recommend_model for Ollama tag picking

Future (not in this PR): publish XpressAI custom GGUFs (e.g. Qwen3.6-27B-RYS-UD) to Ollama Hub so agents can pull them via the same provider=ollama path.

Stacked on #102. cargo fmt, clippy clean, 401 tests pass, svelte-check 0 errors.

Base automatically changed from fix/per-agent-llm-config-v2 to main May 7, 2026 01:23

wmeddie merged commit fac60ec into main May 7, 2026

wmeddie deleted the fix/remove-llamacpp branch May 7, 2026 05:06

wmeddie merged 1 commit into main from fix/remove-llamacpp
