refactor: remove embedded llama.cpp, rely on Ollama for local inference #103
Merged
The embedded llama.cpp path (provider=local, GGUF download via hf-hub, LazyLlamaCppProvider, metal/cuda/local-llm Cargo features) is removed. Ollama becomes the only supported local backend. ADR-023 captures the decision; ADR-011 is marked superseded.

Why: in-process llama.cpp invariant violations were taking down the server, and shipping per-platform builds (Metal, CUDA, CPU) added a lot of build-time complexity for releases. Ollama runs out-of-process with its own platform-tuned builds, and we already speak its HTTP API.

Removed:
- crates/xpressclaw-core/src/llm/llamacpp.rs
- llama-cpp-2, hf-hub, encoding_rs from workspace + core Cargo.toml
- local-llm, metal, cuda features (core, server, cli)
- AgentLlmConfig.model_path field
- The "local" arm in LlmRouter::materialize_provider
- DownloadProgress / DownloadStatus / download_status route
- use_embedded request flag
- resolve_gguf_source and the post-setup GGUF download flow
- nvcc/Metal detection in build.sh
- "--skip llamacpp" filter in CI
- Wizard's Built-in provider button + download progress UI

Kept:
- provider=ollama with HTTP proxy (LocalProvider)
- Reconciler's per-host Ollama pull loop (reconcile_models)
- Hardware detection + recommend_model for Ollama tag picking

Future (not in this PR): publish XpressAI custom GGUFs (e.g. Qwen3.6-27B-RYS-UD) to Ollama Hub so agents can pull them via the same provider=ollama path.

Stacked on #102. cargo fmt, clippy clean, 401 tests pass, svelte-check 0 errors.
Summary
Remove embedded llama.cpp entirely. Local inference goes through Ollama only.
Stacked on #102: the base branch is fix/per-agent-llm-config-v2, not main. Land #102 first, then this.
Why
Two recurring problems with the embedded path:
- llama-cpp-2 invariant violations (KV-cache type mismatches, sampler lifetime issues) brought down the whole server process. Tracking these down required deep familiarity with llama.cpp internals and was a tax on every other change in the LLM stack.
- Shipping per-platform builds (metal for Apple Silicon, cuda for Linux+NVIDIA, plain CPU otherwise) added build-time complexity to releases. CUDA in particular needed bespoke CUDA_PATH/RUSTFLAGS setup in build.sh because find_cuda_helper only searches <root>/lib64, missing Debian/Ubuntu multiarch layouts.
Ollama solves both: it ships its own platform-tuned builds, runs out-of-process so crashes don't take down the server, and we already speak its HTTP API.
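To illustrate why the out-of-process design isolates failures, here is a minimal sketch of talking to Ollama over plain HTTP using only the Rust standard library. It is not the project's LocalProvider code: the function names `build_pull_request` and `pull_model` are hypothetical, the host and model tag are placeholders, and the sketch assumes Ollama's `POST /api/pull` endpoint with a non-streaming response.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Build a raw HTTP/1.1 request for Ollama's `POST /api/pull` endpoint.
/// `host` is "address:port". Hypothetical helper: the model tag is assumed
/// to contain no characters that need JSON escaping.
fn build_pull_request(host: &str, model: &str) -> String {
    // `stream: false` asks for one response object instead of progress lines.
    let body = format!(r#"{{"model":"{}","stream":false}}"#, model);
    format!(
        "POST /api/pull HTTP/1.1\r\n\
         Host: {host}\r\n\
         Content-Type: application/json\r\n\
         Content-Length: {len}\r\n\
         Connection: close\r\n\
         \r\n\
         {body}",
        len = body.len()
    )
}

/// Send the pull request and return the raw HTTP response text.
/// If Ollama is down or crashes mid-request, the caller sees an
/// `io::Error` rather than a crash of its own process.
fn pull_model(host: &str, model: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(host)?;
    stream.write_all(build_pull_request(host, model).as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

A caller might run `pull_model("127.0.0.1:11434", "qwen3:8b")` against a default local Ollama install and treat any `Err` as "backend unavailable", which is the failure mode the embedded path could not offer.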
ADR-023 captures the decision; ADR-011 is marked superseded.
What changed
Removed:
- crates/xpressclaw-core/src/llm/llamacpp.rs (~1400 lines)
- llama-cpp-2, hf-hub, encoding_rs workspace dependencies
- local-llm, metal, cuda Cargo features (core, server, cli)
- AgentLlmConfig.model_path field
- "local" arm in LlmRouter::materialize_provider
- DownloadProgress / DownloadStatus / download_status route
- use_embedded request flag and the wizard's download-progress UI
- resolve_gguf_source and the post-setup GGUF download flow
- nvcc/Metal detection in build.sh
- --skip llamacpp filter in CI
Kept:
- provider: ollama with HTTP proxy through LocalProvider
- Reconciler's per-host Ollama pull loop (reconcile_models, from #102 "refactor: per-agent LLM config with logical-name routing"): this is now the only model-acquisition path
- Hardware detection + recommend_model: feeds the Ollama tag picker
Net: -1948 lines.
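The effect of dropping the "local" arm can be sketched as follows. This is an illustration under assumptions, not the actual LlmRouter code: the `Provider` enum, the free function `materialize_provider`, the `build_router` helper, and the warning text are all hypothetical stand-ins. It shows a config that still says provider: local producing an error, and the router skipping that agent while keeping the rest.

```rust
/// Hypothetical stand-in for a materialized provider handle.
#[derive(Debug, PartialEq)]
enum Provider {
    Ollama { model: String },
}

/// Resolve a provider name from config. With the embedded path removed,
/// "local" no longer resolves and yields an error instead.
fn materialize_provider(provider: &str, model: &str) -> Result<Provider, String> {
    match provider {
        "ollama" => Ok(Provider::Ollama { model: model.to_string() }),
        // Illustrative message only; the real warning text is not reproduced here.
        "local" => Err(format!(
            "provider 'local' was removed; use provider: ollama with an Ollama tag (was: {model})"
        )),
        other => Err(format!("unknown provider '{other}'")),
    }
}

/// Build the routing table from (agent, provider, model) triples,
/// warning about and skipping agents whose provider fails to resolve.
fn build_router(agents: &[(&str, &str, &str)]) -> Vec<(String, Provider)> {
    let mut routed = Vec::new();
    for &(name, provider, model) in agents {
        match materialize_provider(provider, model) {
            Ok(p) => routed.push((name.to_string(), p)),
            Err(e) => eprintln!("warning: skipping agent '{name}': {e}"),
        }
    }
    routed
}
```

Returning an error per agent, rather than failing the whole router build, keeps agents that were already migrated working while stale configs are being updated.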
Future (not in this PR)
XpressAI has GGUF builds (e.g. Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF) that outperform base Qwen3.5/3.6. Using them after this PR requires publishing to Ollama Hub first. ADR-023's "Future" section captures it.
A POST /api/setup/pull-model with SSE progress would be a nice follow-up so the wizard can show "pulling X of Y MB" without reintroducing the embedded path.
Migration
Existing xpressai.yaml files using provider: local will warn at router-build time, and the agent will be skipped.
To migrate, edit the agent's llm block: set provider: ollama, set the right model: tag, then run xpressclaw up. The reconciler pulls the model from Ollama on first start.
Test plan
- cargo fmt --all --check clean
- cargo clippy --workspace --all-targets clean
- cargo test --workspace: 401 passed (down 1 from #102 because the 6 ignored llamacpp tests are gone with the file)
- npx svelte-check: 0 errors
- With provider=ollama and a model that isn't yet pulled: verify agent appears, reconciler pulls in background, agent starts replying once pull completes
- With provider: local: verify the agent is skipped with the documented error
- Run ./build.sh and confirm no GPU SDK is required