refactor: remove embedded llama.cpp, rely on Ollama for local inference #103

Merged: wmeddie merged 1 commit into main from fix/remove-llamacpp on May 7, 2026

wmeddie commented May 7, 2026

Summary

Remove embedded llama.cpp entirely. Local inference goes through Ollama only.

Stacked on #102 — base branch is fix/per-agent-llm-config-v2, not main. Land #102 first, then this.

Why

Two recurring problems with the embedded path:

  1. Crashes. llama-cpp-2 invariant violations (KV-cache type mismatches, sampler lifetime issues) brought down the whole server process. Tracking these down required deep familiarity with llama.cpp internals and was a tax on every other change in the LLM stack.
  2. Hardware-build complexity. Releases needed cross-compiling with the right combination of features per platform (metal for Apple Silicon, cuda for Linux+NVIDIA, plain CPU otherwise). CUDA in particular needed bespoke CUDA_PATH/RUSTFLAGS setup in build.sh because find_cuda_helper only searches <root>/lib64, missing Debian/Ubuntu multiarch layouts.

Ollama solves both: it ships its own platform-tuned builds, runs out-of-process so crashes don't take down the server, and we already speak its HTTP API.
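As a rough illustration of what speaking Ollama's HTTP API involves, the sketch below builds the raw request for a model pull. It assumes Ollama's documented POST /api/pull endpoint with a JSON `model` field; `build_pull_request` and `json_escape` are hypothetical helpers for this example, not functions from xpressclaw, and the model tag in the usage note is only an example.

```rust
/// Escape the two characters that would break a JSON string literal here.
/// (Hypothetical helper; a real client would use a JSON library.)
fn json_escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Build a raw HTTP/1.1 request asking an Ollama host to pull `model`.
/// Assumption: the endpoint is POST /api/pull and the body field is `model`.
fn build_pull_request(host: &str, model: &str) -> String {
    let body = format!("{{\"model\":\"{}\",\"stream\":false}}", json_escape(model));
    format!(
        "POST /api/pull HTTP/1.1\r\nHost: {host}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}",
        body.len()
    )
}
```

Calling `build_pull_request("localhost:11434", "qwen3:8b")` yields a request that could be written to a `std::net::TcpStream` connected to Ollama's default port. Because the call crosses a process boundary, a failure on the Ollama side surfaces as an I/O or HTTP error rather than a server crash.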

ADR-023 captures the decision; ADR-011 is marked superseded.

What changed

Removed:

  • crates/xpressclaw-core/src/llm/llamacpp.rs (~1400 lines)
  • llama-cpp-2, hf-hub, encoding_rs workspace dependencies
  • local-llm, metal, cuda Cargo features (core, server, cli)
  • AgentLlmConfig.model_path field
  • The "local" arm in LlmRouter::materialize_provider
  • DownloadProgress / DownloadStatus / download_status route
  • use_embedded request flag and the wizard's download-progress UI
  • resolve_gguf_source and the post-setup GGUF download flow
  • nvcc/Metal detection in build.sh
  • --skip llamacpp filter in CI

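Kept:

  • provider=ollama with HTTP proxy (LocalProvider)
  • Reconciler's per-host Ollama pull loop (reconcile_models)
  • Hardware detection + recommend_model for Ollama tag picking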

Net: -1948 lines.

Future (not in this PR)

XpressAI has GGUF builds (e.g. Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF) that outperform base Qwen3.5/3.6. Using them after this PR requires publishing to Ollama Hub first. ADR-023's "Future" section captures it.

A POST /api/setup/pull-model with SSE progress would be a nice follow-up so the wizard can show "pulling X of Y MB" without reintroducing the embedded path.
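One way the progress stream for such an endpoint might be framed is sketched below. Nothing here exists in the codebase: `PullProgress`, `progress_message`, and `sse_event` are hypothetical names, and the choice of decimal megabytes is an assumption. Only the Server-Sent Events framing (an `event:` line, one `data:` line per line of payload, then a blank line) follows the standard format.

```rust
/// Hypothetical progress snapshot for one model pull, in bytes.
struct PullProgress {
    completed_bytes: u64,
    total_bytes: u64,
}

/// Render "pulling X of Y MB" using decimal megabytes, rounded down.
fn progress_message(p: &PullProgress) -> String {
    const MB: u64 = 1_000_000;
    format!("pulling {} of {} MB", p.completed_bytes / MB, p.total_bytes / MB)
}

/// Frame a payload as one Server-Sent Events record.
/// Each payload line gets its own `data:` field; a blank line ends the record.
fn sse_event(event: &str, data: &str) -> String {
    let mut out = format!("event: {event}\n");
    for line in data.lines() {
        out.push_str("data: ");
        out.push_str(line);
        out.push('\n');
    }
    out.push('\n');
    out
}
```

The wizard would then read these records from the response body and update its progress display, without any GGUF download logic living in the server itself.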

Migration

Existing xpressai.yaml files using provider: local will warn at router-build time and the agent will be skipped:

unknown provider type 'local'. Supported providers: openai, anthropic, ollama.

To migrate, edit the agent's llm block: set provider: ollama, set model: to the right Ollama tag, then run xpressclaw up. The reconciler pulls the model from Ollama on first start.
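The warn-and-skip behaviour for unknown providers can be sketched roughly as follows. This is a minimal illustration, not the actual LlmRouter::materialize_provider: `ProviderKind`, `parse_provider`, and `usable_agents` are hypothetical names, and only the error text is taken from this PR.

```rust
/// Hypothetical stand-in for the provider kinds accepted after this PR.
#[derive(Debug, PartialEq)]
enum ProviderKind {
    OpenAi,
    Anthropic,
    Ollama,
}

/// Map a config `provider:` string to a kind, or return the documented error.
/// Note there is no "local" arm any more, so it falls through to the error.
fn parse_provider(name: &str) -> Result<ProviderKind, String> {
    match name {
        "openai" => Ok(ProviderKind::OpenAi),
        "anthropic" => Ok(ProviderKind::Anthropic),
        "ollama" => Ok(ProviderKind::Ollama),
        other => Err(format!(
            "unknown provider type '{other}'. Supported providers: openai, anthropic, ollama."
        )),
    }
}

/// Keep agents whose provider resolves; warn about and skip the rest.
/// Input is a list of (agent name, provider string) pairs.
fn usable_agents<'a>(agents: &[(&'a str, &'a str)]) -> Vec<(&'a str, ProviderKind)> {
    let mut usable = Vec::new();
    for &(agent, provider) in agents {
        match parse_provider(provider) {
            Ok(kind) => usable.push((agent, kind)),
            Err(msg) => eprintln!("warning: skipping agent '{agent}': {msg}"),
        }
    }
    usable
}
```

With an input such as `[("researcher", "ollama"), ("legacy", "local")]`, only the first agent survives; the second is reported on stderr and dropped, so one stale config entry does not prevent the other agents from starting.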

Test plan

  • cargo fmt --all --check clean
  • cargo clippy --workspace --all-targets clean
  • cargo test --workspace: 401 passed (down 1 from #102 because the 6 ignored llamacpp tests are gone with the file)
  • npx svelte-check: 0 errors
  • Manual: setup wizard with provider=ollama and a model that isn't yet pulled — verify agent appears, reconciler pulls in background, agent starts replying once pull completes
  • Manual: edit existing config to provider: local — verify the agent is skipped with the documented error
  • Manual: build with ./build.sh and confirm no GPU SDK is required

The embedded llama.cpp path (provider=local, GGUF download via hf-hub,
LazyLlamaCppProvider, metal/cuda/local-llm Cargo features) is removed.
Ollama becomes the only supported local backend. ADR-023 captures the
decision; ADR-011 is marked superseded.

Why: in-process llama.cpp invariant violations were taking down the
server, and shipping per-platform builds (Metal, CUDA, CPU) added a lot
of build-time complexity for releases. Ollama runs out-of-process with
its own platform-tuned builds, and we already speak its HTTP API.

Removed:
- crates/xpressclaw-core/src/llm/llamacpp.rs
- llama-cpp-2, hf-hub, encoding_rs from workspace + core Cargo.toml
- local-llm, metal, cuda features (core, server, cli)
- AgentLlmConfig.model_path field
- The "local" arm in LlmRouter::materialize_provider
- DownloadProgress / DownloadStatus / download_status route
- use_embedded request flag
- resolve_gguf_source and the post-setup GGUF download flow
- nvcc/Metal detection in build.sh
- "--skip llamacpp" filter in CI
- Wizard's Built-in provider button + download progress UI

Kept:
- provider=ollama with HTTP proxy (LocalProvider)
- Reconciler's per-host Ollama pull loop (reconcile_models)
- Hardware detection + recommend_model for Ollama tag picking

Future (not in this PR): publish XpressAI custom GGUFs (e.g.
Qwen3.6-27B-RYS-UD) to Ollama Hub so agents can pull them via the same
provider=ollama path.

Stacked on #102. cargo fmt, clippy clean, 401 tests pass, svelte-check
0 errors.
Base automatically changed from fix/per-agent-llm-config-v2 to main May 7, 2026 01:23
@wmeddie wmeddie merged commit fac60ec into main May 7, 2026
@wmeddie wmeddie deleted the fix/remove-llamacpp branch May 7, 2026 05:06