refactor: remove embedded llama.cpp, require Ollama for local LLMs by wmeddie · Pull Request #101 · XpressAI/xpressclaw

wmeddie · 2026-05-06T10:36:48Z

Summary

This PR removes support for embedded llama.cpp (GGUF model downloading) from xpressclaw. Local LLM support now requires users to run Ollama separately.

Changes

- Removed the `local-llm` feature and all llama.cpp dependencies from xpressclaw-core
- Removed the `local_model_path` config option (users must use Ollama)
- Kept the `local_model` config option for the Ollama provider (still supported)
- Removed embedded model download support (GGUF downloading)
- Removed the `/download-status` endpoint
- Removed `resolve_gguf_source()` and related GGUF repo mapping code
- Updated the default LLM provider from `local` to `ollama` in the config template
- Removed the `local_model` pricing test and the `local_model_zero_cost` budget test

Migration Guide

Users who want to run local LLMs must now:

1. Install and run Ollama.
2. Pull a model.
3. Update the xpressclaw config to use the `ollama` provider.
4. Restart xpressclaw.

Benefits

| Before | After |
| --- | --- |
| ~6GB RAM for embedded llama.cpp | ~0.5GB for Ollama |
| No GPU offloading or passthrough (CPU-only) | Full GPU passthrough (CUDA/Metal/ROCm) |
| No model hot-restart | Ollama supports live model hot-restart |
| Complex build deps | Zero build deps (just run Ollama) |

Testing

✅ - passes
✅ - passes
✅ - passes

All crates build successfully without the `local-llm` feature.

Why: Docker caused persistent friction: networking, image pulls, MCP tool routing, session continuity, opaque agent workspaces. The spike replaces Docker with Wanix as the agent execution environment.

What:
- Removed bollard dependency and entire docker/ module
- Gutted reconciler to simple DB state management (no containers)
- Replaced harness-based processor with server-side agentic loop that streams from LLM and executes tools (memory, tasks) directly
- Added Wanix runtime assets (wanix.min.js, wanix.wasm) to frontend
- Created WanixPanel component and wanix-bridge TypeScript module
- Created wanix-server/test-boot.mjs proving Wanix kernel boots headlessly in Node with browser API shims (OPFS, DOM, etc.)
- Added AGENTS.md with full project documentation

Status: Wanix kernel boots in Node, filesystem mount needs work. The server-side agentic loop works with streaming + tool execution. Docker is completely gone from the codebase.

Wanix kernel runs in headless Chrome via puppeteer-core. The boot sequence allocates a tmpfs capability, mounts it, and binds it at workspace/ — giving agents a writable filesystem without Docker. Validated: file write/read, mkdir, nested directories, task creation. The wanix-server/index.mjs is the headless Wanix host process. Next: wire the xpressclaw agentic loop to execute tools through the Wanix instance via page.evaluate().

The agentic loop now routes filesystem tools (Read, Write, ListDir, MakeDir) through the Wanix headless server at localhost:9100. The Wanix server runs Wanix in headless Chrome and exposes an HTTP API for tool execution via page.evaluate().

To test:
1. Start wanix: node wanix-server/index.mjs
2. Start xpressclaw: ./target/release/xpressclaw up
3. Send a message asking the agent to create a file
4. The agent's Write tool call goes through Wanix

The xpressclaw server now starts the Wanix headless server (node wanix-server/index.mjs) as a managed child process. It starts automatically on server boot and is killed on shutdown. No manual commands needed — just start xpressclaw and Wanix is available for filesystem tool execution. Also fixed: agent always stores a response even when the last LLM turn was tool-only (no text).

The task dispatcher was still trying to connect to a Docker harness on port 0, which always failed. Replaced with the same agentic loop pattern as the conversation processor: stream from LLM router, accumulate tool call deltas, execute tools via Wanix, loop.

- Deleted docker/ directory (files were still on disk)
- Memory hooks (recall/remember/consolidate) now call LLM router directly instead of HarnessClient
- Removed container_id from agent JSON response
- Task dispatcher fully uses LLM router with tool execution
- harness.rs still exists as dead code (to be cleaned up)

The Wanix server now handles GET /app/{name}/{path} to serve files directly from the agent's workspace. The xpressclaw app proxy routes /apps/{app_id}/{path} to the Wanix server. Agent creates files → writes to Wanix workspace → publish_app saves to DB → sidebar shows app → clicking app serves from Wanix filesystem. No Docker containers needed to serve apps.

Small models sometimes call the same tool repeatedly in a loop (e.g. create_task with identical args on every turn). Now caches tool results by (name, args) and returns the cached result on duplicates. Also reduced max turns from 15 to 10.

When the model calls the same tool with identical arguments, it now gets an error message telling it the result was already returned and to move on. If an entire turn is nothing but duplicate calls, the loop breaks immediately. When max turns are exhausted without producing text, the agent stores a message explaining it ran out of turns.
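
A minimal sketch of the duplicate guard described in the two notes above, assuming tool arguments are compared as raw strings. `ToolCallGuard`, `ToolOutcome`, and `turn_is_all_duplicates` are hypothetical names for illustration, not identifiers from the xpressclaw codebase, and the notice wording is invented.

```rust
use std::collections::HashMap;

/// Outcome of routing one tool call through the duplicate guard.
#[derive(Debug, PartialEq)]
enum ToolOutcome {
    /// First time this (name, args) pair was seen; the tool actually ran.
    Fresh(String),
    /// Same (name, args) as an earlier call; the model is told to move on.
    Duplicate(String),
}

/// Remembers results keyed by (tool name, raw argument string).
#[derive(Default)]
struct ToolCallGuard {
    seen: HashMap<(String, String), String>,
}

impl ToolCallGuard {
    /// Runs `exec` only for unseen calls; duplicates get a notice instead.
    fn call<F>(&mut self, name: &str, args: &str, exec: F) -> ToolOutcome
    where
        F: FnOnce() -> String,
    {
        let key = (name.to_string(), args.to_string());
        if let Some(previous) = self.seen.get(&key) {
            // Assumed wording; the real message text is not given in the notes.
            return ToolOutcome::Duplicate(format!(
                "ALREADY DONE: {} was already called with these arguments (result: {}). Move on.",
                name, previous
            ));
        }
        let result = exec();
        self.seen.insert(key, result.clone());
        ToolOutcome::Fresh(result)
    }
}

/// True when a turn made at least one call and every call was a duplicate.
fn turn_is_all_duplicates(outcomes: &[ToolOutcome]) -> bool {
    !outcomes.is_empty()
        && outcomes
            .iter()
            .all(|o| matches!(o, ToolOutcome::Duplicate(_)))
}
```

An agent loop built this way would route every tool call of a turn through `call` and stop iterating once `turn_is_all_duplicates` returns true.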

Tool calls now require the model to produce reasoning text explaining WHY it's making the call. This text streams to the chat so users see the agent's thinking before each tool execution.

Three-layer loop protection:
1. No reasoning text → tool call BLOCKED, model told to explain itself
2. Duplicate reasoning → loop detected, break immediately
3. Duplicate tool args → returns cached result with "ALREADY DONE"

This prevents small models from endlessly repeating the same tool call. The reasoning text also makes the chat more informative: users see "I need to create a task for..." instead of just an icon.

The agentic loop now streams reasoning_content from the LLM as <think>...</think> wrapped chunks, which the frontend renders as collapsible thinking blocks. This shows the model's reasoning before tool calls. Text content transitions close the thinking block automatically.
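
The wrapping described above can be sketched as a small state machine. This is an illustration only: `Delta` and `ThinkWrapper` are hypothetical names, and the real loop presumably works on streamed provider chunks rather than this simplified enum.

```rust
/// One streamed fragment from the model (simplified, hypothetical shape).
#[derive(Debug)]
enum Delta<'a> {
    /// A piece of `reasoning_content`.
    Reasoning(&'a str),
    /// A piece of regular text content.
    Text(&'a str),
}

/// Tracks whether a <think> block is currently open.
#[derive(Default)]
struct ThinkWrapper {
    in_think: bool,
}

impl ThinkWrapper {
    /// Returns the chunk to forward to the frontend for this delta.
    fn push(&mut self, delta: Delta) -> String {
        let mut out = String::new();
        match delta {
            Delta::Reasoning(s) => {
                // Open the block on the first reasoning fragment only.
                if !self.in_think {
                    out.push_str("<think>");
                    self.in_think = true;
                }
                out.push_str(s);
            }
            Delta::Text(s) => {
                // A transition to text closes the thinking block.
                if self.in_think {
                    out.push_str("</think>");
                    self.in_think = false;
                }
                out.push_str(s);
            }
        }
        out
    }

    /// Closes a block left open when the stream ends on reasoning.
    fn finish(&mut self) -> &'static str {
        if self.in_think {
            self.in_think = false;
            "</think>"
        } else {
            ""
        }
    }
}
```

Calling `finish` at end of stream is an assumption of this sketch; it keeps the tags balanced when a turn ends on reasoning and is followed only by tool calls.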

- Text content that precedes tool calls is now buffered and emitted as a <think> block, so it renders as collapsible reasoning, not as regular chat text. Text-only turns (no tools) still stream normally.
- The BLOCKED prompt now asks for ONE concise sentence, not verbose explanation.
- Model's CoT (reasoning_content) still renders as separate think blocks from the model's thinking.

Appends instruction to every agent's system prompt: "When using tools, write ONE short sentence explaining your intent. No plans, no lists."

The reasoning gate was blocking tool calls when the model produced thinking in reasoning_content but no content text. Small models then got confused by the BLOCKED message and generated garbage. Fix: accept either content text OR reasoning_content as valid reasoning. Drop the BLOCKED retry flow entirely — just use duplicate detection to prevent loops. The system prompt now says "write ONE short sentence stating what you're about to do" with an example.

The duplicate reasoning check was breaking after 1 tool call because the model's CoT thinking produces similar hashes across turns. Now loop detection keys on the actual tool calls (name + args). Only breaks when the exact same set of tool calls repeats — which is the actual loop condition.
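
A sketch of loop detection keyed on the set of tool calls rather than on reasoning text. `LoopDetector` is a hypothetical name, and treating a repeat of any earlier turn (not only the immediately preceding one) as a loop is an assumption of this example.

```rust
use std::collections::{BTreeSet, HashSet};

/// Tracks the tool-call sets seen so far in one agent run.
#[derive(Default)]
struct LoopDetector {
    seen_turns: HashSet<BTreeSet<(String, String)>>,
}

impl LoopDetector {
    /// Records this turn's (name, args) calls and returns true if the
    /// identical set was already seen. Turns without tool calls never count.
    fn is_repeat(&mut self, calls: &[(&str, &str)]) -> bool {
        if calls.is_empty() {
            return false;
        }
        // A BTreeSet makes the key independent of call order within a turn.
        let key: BTreeSet<(String, String)> = calls
            .iter()
            .map(|(name, args)| (name.to_string(), args.to_string()))
            .collect();
        // `insert` returns false when the set was already present.
        !self.seen_turns.insert(key)
    }
}
```

Because the key ignores reasoning text entirely, similar chain-of-thought across turns no longer triggers a break; only identical tool calls do.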

The llama-server has limited connection slots. Reqwest's connection pooling kept connections open after streaming, blocking subsequent requests. Adding Connection: close ensures each request fully releases the connection before the next one.

Gemma 4 leaks special tokens like <channel|>, <turn|>, and literal "thought" text into the content field. These are now filtered from both the streaming output and the stored message.
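
A minimal sketch of such a filter, assuming plain substring removal. `strip_leaked_tokens` and `LEAKED_TOKENS` are hypothetical names. The sketch covers only the two marker tokens named above: how the literal "thought" text is recognized is not described, and removing that word everywhere would damage ordinary prose.

```rust
/// Marker tokens named in the note above.
const LEAKED_TOKENS: [&str; 2] = ["<channel|>", "<turn|>"];

/// Returns `content` with every leaked marker token removed.
fn strip_leaked_tokens(content: &str) -> String {
    let mut cleaned = content.to_string();
    for token in LEAKED_TOKENS.iter() {
        cleaned = cleaned.replace(*token, "");
    }
    cleaned
}
```

When applied to streaming output, a token split across two chunks would slip through this per-chunk version, so a real filter likely needs a small carry-over buffer; that detail is left out here.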

Tasks created via the create_task tool were only inserted into the tasks table but never enqueued in task_queue. The dispatcher only processes queued tasks, so created tasks sat in pending forever. Now create_task also calls queue.enqueue() so the task dispatcher picks it up and executes it.

LlmRouter::new() creates an empty router with no providers. The dispatcher needs build_from_config() to register the OpenAI provider with the correct base_url and API key.

Two proper fixes:
1. Background processor scanner: a 5-second polling loop that checks all conversations for unprocessed messages and spawns processors. This catches messages injected by the task dispatcher, connectors, or any source other than the HTTP handler. Same pattern as the reconciler: reliable polling, no event threading required.
2. System notification styling: messages from the task dispatcher (sender_id="system", message_type="task_wake") now render as small, centered, muted notifications instead of user message bubbles. The SYSTEM: prefix is stripped from display.

Two fixes:
1. Scanner race condition: the background scanner now only picks up unprocessed messages older than 10 seconds. This prevents racing with the HTTP handler which spawns processors immediately. Added oldest_unprocessed_age() to ConversationManager.
2. Task card in conversation: create_task now injects a system message with task JSON so the conversation page renders the task card immediately when a task is created, not just when it completes/fails.
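
The age gate from the scanner fix can be sketched as a single predicate. `scanner_should_claim` and `MIN_UNPROCESSED_AGE` are hypothetical names, and modelling the result of `oldest_unprocessed_age()` as `Option<Duration>` is an assumption, since its real signature is not shown.

```rust
use std::time::Duration;

/// Messages younger than this are left to the HTTP handler's own processor.
const MIN_UNPROCESSED_AGE: Duration = Duration::from_secs(10);

/// Decides whether the background scanner should spawn a processor.
/// `None` means the conversation has no unprocessed messages.
fn scanner_should_claim(oldest_unprocessed_age: Option<Duration>) -> bool {
    match oldest_unprocessed_age {
        Some(age) => age > MIN_UNPROCESSED_AGE,
        None => false,
    }
}
```

The grace period trades up to ten seconds of extra latency on injected messages for not double-processing messages that the HTTP handler is already handling.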

The conversation page renders task cards for messages with message_type='task_status'. The create_task tool was using 'task_created' which the frontend didn't recognize.

4096 tokens was too small — models with large context windows (262k) were hitting finish_reason=length before completing tool call arguments, especially for file writes with long content. 32768 gives enough room for multi-file tool calls while still being conservative. This should really be configurable per-model.

Task status messages in conversations are now updated in-place instead of creating new messages for each status change. The dispatcher looks for an existing task_status message with the same task_id and updates its content. Only creates a new message if none exists. This means the conversation shows one card per task that transitions: pending → in_progress → completed/failed.
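
The update-in-place behaviour amounts to an upsert keyed on task_id. The sketch below uses an in-memory `Vec` and a hypothetical `Message` struct for illustration; the actual dispatcher works against the database.

```rust
/// Simplified stand-in for a conversation message (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Message {
    message_type: String,
    task_id: Option<String>,
    content: String,
}

/// Updates the existing task_status message for `task_id`, or appends one.
fn upsert_task_status(messages: &mut Vec<Message>, task_id: &str, content: &str) {
    let existing = messages
        .iter()
        .position(|m| m.message_type == "task_status" && m.task_id.as_deref() == Some(task_id));
    match existing {
        // One card per task: overwrite its content in place.
        Some(index) => messages[index].content = content.to_string(),
        // First status for this task: create the card.
        None => messages.push(Message {
            message_type: "task_status".to_string(),
            task_id: Some(task_id.to_string()),
            content: content.to_string(),
        }),
    }
}
```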

Three changes for proper task card UX:
1. TaskCard component: polls the task API every 3s to get live status, subtask progress, and title updates. Stops polling when the task reaches a terminal state (completed/failed).
2. Task card appears after agent response: the create_task tool returns a marker, and the processor injects the card + broadcasts it via the event bus AFTER storing the agent's response. This ensures correct ordering in the conversation.
3. Event bus broadcast: the task card message is broadcast so the frontend sees it appear live without needing a page reload.

Two fixes:
1. Wanix process was piping stdout/stderr but nobody read them. When the pipe buffer filled (Chrome console output), the process blocked. Changed to null stdout + inherit stderr so output goes to the server's terminal without blocking.
2. Task dispatcher now has same fixes as conversation processor: garbage token filtering, tool intent prompt, loop detection via seen_tool_keys set.
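
For the stdio change, a sketch of how the child process could be configured with `std::process::Command`. `wanix_command` is a hypothetical helper name; only the null stdout and inherited stderr come from the note above.

```rust
use std::process::{Command, Stdio};

/// Builds the command for the headless Wanix host without spawning it.
fn wanix_command(script: &str) -> Command {
    let mut cmd = Command::new("node");
    cmd.arg(script)
        // Discard stdout so an unread pipe can never fill up and block the child.
        .stdout(Stdio::null())
        // Forward stderr to the server's own terminal.
        .stderr(Stdio::inherit());
    cmd
}
```

With `Stdio::piped()` the parent must keep draining the pipe; using `null` and `inherit` removes that obligation entirely.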

When the model leads with tool calls (no text content), the dispatcher was storing "(No response)" because full_content was empty. Now tool execution results are accumulated into full_content, so the task message shows what the agent actually did:

**Write**: Wrote 500 bytes to calculator.py
**search_memory**: No memories found.

Also added loop detection (same as conversation processor) to break when the same tool calls repeat.

The workspace tmpfs mount wasn't ready when the first tool call arrived. Now ensureWorkspace() runs before every tool/app request, mounting a tmpfs if the workspace doesn't exist yet. Idempotent.

The task dispatcher was giving the agent ALL tools including list_tasks, create_task, and search_memory. The agent spent all its turns checking task status instead of doing the actual work. Now tasks get a focused tool set: Write, Read, ListDir, MakeDir, and complete_task. No list_tasks, no create_task, no search_memory. Also: continuation prompt now includes the task title and description so the model knows what to build, and tells it not to call list_tasks. Per-tool call counter breaks after 3 calls to the same tool name.

…isting

Two critical fixes:
1. Task dispatcher now includes previous task conversation messages as LLM context. Previously it only sent [system, current_prompt] so the model had no memory of what it already did across turns, causing it to repeat MakeDir and ListDir endlessly.
2. Wanix ListDir with empty/"." path now resolves to "workspace" instead of "workspace/" or "workspace/." which failed.
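
The ListDir path fix can be illustrated with a small normalizer. The real code lives in the Wanix server (JavaScript); this Rust version is only a sketch, `resolve_workspace_path` is a hypothetical name, and the handling of non-empty paths (joining under `workspace/`) is an assumption.

```rust
/// Maps a tool-supplied path onto the agent workspace root.
fn resolve_workspace_path(path: &str) -> String {
    // Drop surrounding whitespace and slashes so "src/" and "/src" agree.
    let trimmed = path.trim().trim_matches('/');
    if trimmed.is_empty() || trimmed == "." {
        // Empty and "." both mean the workspace root itself.
        "workspace".to_string()
    } else {
        format!("workspace/{}", trimmed)
    }
}
```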

The TaskCard was showing "0/21 subtasks" because it used the global task counts from the API instead of counting the actual subtask list. Now counts tasks from sub.tasks array directly, so only real subtasks (with matching parent_task_id) are shown.

The health check was showing "Reconnecting..." after a single 3s timeout, which happens regularly when the LLM is processing a long request. Now uses /api/health (lightweight) instead of /api/agents, 5s timeout, and requires 3 consecutive failures before showing the overlay. Prevents flickering during normal operation.
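
The debounce logic can be sketched as a consecutive-failure counter. The actual check runs in the frontend; this Rust version only illustrates the rule, and `HealthMonitor` is a hypothetical name.

```rust
/// Counts consecutive failed health checks.
struct HealthMonitor {
    consecutive_failures: u32,
    threshold: u32,
}

impl HealthMonitor {
    fn new(threshold: u32) -> Self {
        HealthMonitor {
            consecutive_failures: 0,
            threshold,
        }
    }

    /// Records one check result and returns whether the
    /// "Reconnecting..." overlay should be visible afterwards.
    fn record(&mut self, ok: bool) -> bool {
        if ok {
            // Any success resets the streak and hides the overlay.
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        }
        self.consecutive_failures >= self.threshold
    }
}
```

With a threshold of 3, a single slow response while the LLM is busy no longer flips the overlay on.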

Created the Docker image for running pi-agent inside xpressclaw:
- containers/pi-agent/Dockerfile: Node 20 + pi + mcpfs + Go
- containers/pi-agent/entrypoint.sh: starts mcpfs mount then pi RPC
- containers/pi-agent/AGENTS.md: workspace context for the agent
- containers/pi-agent/extensions/xpressclaw-provider/: pi extension that registers the local llama-server as an OpenAI-compatible provider

Validated: pi boots in the container, loads the extension, connects to the local llama-server via the custom provider, and streams RPC events (thinking_delta, text_delta, tool calls) over stdout.

RPC protocol: send {"type":"prompt","message":"..."} on stdin, receive JSONL events on stdout.

Adds an opt-in backend that spawns pi-coding-agent inside a c2w Linux VM (WASM) and talks to it over JSONL on stdin/stdout, replacing the Rust-native agent loop when `pi.enabled = true`.

- `agents/pi_rpc.rs`: subprocess client for c2w-net + pi WASM, JSONL prompt/event protocol, text/thinking delta parsing.
- `routes/mcp_server.rs`: streamable-HTTP MCP endpoint at `/mcp` so the pi container's mcpfs can mount xpressclaw tasks/memory as files.
- `config::PiConfig`: wasm_path, c2w_net, wasmtime_shim, NAT-gateway URLs (192.168.127.254), LLM model defaults.
- `processor::run_pi_agent` and `dispatcher::call_agent_pi`: branch on config at the top of the loop; old Rust path remains as fallback.
- `scripts/build-pi-agent-wasm.sh`: docker build + c2w convert.
- `wasm-agents/wasmtime-shim`: bash shim that injects --env flags c2w-net doesn't forward (WASMTIME_EXTRA_ENV).
- `entrypoint.sh`: default LLM_PROVIDER=xpressclaw, LLM_MODEL=local, XPRESSCLAW_URL=http://192.168.127.254:8935.

End-to-end verified manually: echo JSONL prompt → c2w-net -invoke pi-agent.wasm --net=socket → pi streams thinking/text deltas from the host llama-server via the xpressclaw custom provider extension.

Makes xpressclaw fully testable on the container2wasm backend. Pi is now the default (`config.pi.enabled = true`).

Persistent pool
- `PiPool` caches one pi WASM subprocess per agent_id and reuses it across prompts. Amortizes the ~30s Bochs boot over the whole session. Dead processes auto-evict; errors evict too so the next prompt gets a fresh container.
- `shutdown_all` wired into the server's shutdown signal.

Tool-execution events → messages
- `PiTurnResult` now carries a Vec<PiToolCall> with params and results (parsed from tool_execution_start / tool_execution_end).
- Processor persists each tool call as a `tool_call` message and emits it over the conversation SSE stream.
- "Created task" MCP results produce a `task_status` card, matching the old Rust loop's TASK_CREATED marker.

Dispatcher live streaming
- `start_dispatcher` now takes the pi_pool and event_bus; task runs stream text deltas to the linked conversation (if any) AND incrementally update the task message content, so the task UI shows real-time progress instead of only the final blob.

Terminal / Logs tab (tmux-style)
- New `pi_terminal` module broadcasts every pi stdout/stderr line on a per-agent channel with a 500-line tail replay.
- `/api/agents/{id}/terminal` SSE route streams PiTerminalLine events.
- `LogsTab.svelte` rewritten as a live terminal view: dark bg, monospace, stdout/stderr colour split, timestamped, autoscrolling, auto-reconnecting EventSource.

wanix-server removed
- Server no longer spawns `node wanix-server/index.mjs`.
- `find_wanix_server` helper deleted.
- wanix-server/ source tree untouched for now (can be removed wholesale in a follow-up cleanup).

Snapshotting is still a stub: the current bundle is baked into the WASM so there's nothing the agent can modify at runtime yet; snapshot work waits on c2w bind-mounts for ~/.pi/.
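
The 500-line tail replay mentioned for the terminal channel can be sketched as a bounded ring buffer. `TailBuffer` is a hypothetical name; the real `pi_terminal` module also broadcasts lines to live subscribers, which is omitted here.

```rust
use std::collections::VecDeque;

/// Keeps only the most recent `capacity` lines for late subscribers.
struct TailBuffer {
    lines: VecDeque<String>,
    capacity: usize,
}

impl TailBuffer {
    fn new(capacity: usize) -> Self {
        TailBuffer {
            lines: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Appends a line, dropping the oldest one once the buffer is full.
    fn push(&mut self, line: String) {
        if self.capacity == 0 {
            return;
        }
        if self.lines.len() == self.capacity {
            self.lines.pop_front();
        }
        self.lines.push_back(line);
    }

    /// Snapshot of the buffered tail, oldest line first.
    fn replay(&self) -> Vec<String> {
        self.lines.iter().cloned().collect()
    }
}
```

A new subscriber would first receive `replay()` from a `TailBuffer::new(500)` and then the live stream, so opening the Logs tab late still shows the boot output.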

Three test-blockers fixed so `xpressclaw up` works without manual config:
- `PiProcess::spawn` now resolves `wasm_path` and `wasmtime_shim` against cwd, $XPRESSCLAW_REPO, exe-dir, and the dev-tree root. Returns a clear error if missing instead of crashing inside c2w-net.
- `c2w-net` looked up via $PATH and ~/.local/bin, ~/bin, ~/.cargo/bin before erroring out.
- Three remaining `Config { ..Default::default() }` sites in setup.rs (GGUF download completion, async-download path, add_agent) now preserve `old_config.pi` instead of silently resetting to defaults when the LLM router is rebuilt mid-session.

When an agent transitions to desired=running, the reconciler now spawns the pi WASM container in the background so the Logs tab shows the boot immediately and the first user message has a hot process waiting for it. Stop transitions evict the cached process. Lifecycle moves into the reconciler: single source of truth for agent → pi-process correspondence. Removes the lazy spawn-on-first-prompt surprise.

Indented blocks in doc comments are interpreted as doctest code. The box-drawing characters (──, ▶, │, └) tripped the lexer with 'unknown start of token: \u{2500}'. Wrapped each block in ```text``` fences and replaced the unicode glyphs with ASCII so the diagram still renders cleanly in rustdoc and on GitHub.

… LLMs

- Removed local-llm feature and all llama.cpp dependencies from xpressclaw-core
- Removed local_model_path config option (users must now use Ollama)
- Kept local_model config option for Ollama provider (still supported)
- Removed embedded model download support (GGUF downloading)
- Removed /download-status endpoint
- Removed resolve_gguf_source() and related GGUF repo mapping code
- Removed local-llm cfg guards from state.rs, setup.rs, and router.rs
- Updated default LLM provider from 'local' to 'ollama' in config template
- Removed local_model pricing test and local_model_zero_cost budget test
- All crates build successfully without local-llm feature

wmeddie added 30 commits April 12, 2026 08:17

fix: system prompt enforces concise one-sentence tool reasoning

ac324d8

fix: filter Gemma internal tokens from content output

be169e2

fix: task dispatcher uses build_from_config for LLM router

d11393b

fix: task card uses message_type 'task_status' to match frontend

54dfc40

fix: Wanix workspace auto-creates on first tool request

0b0ea8f

wmeddie added 11 commits April 12, 2026 20:10

fix(ci): clippy -D warnings + cargo fmt

6d3ded4

fix(ci): cargo fmt — wrap long fn signature

4ecb68e

wmeddie closed this May 7, 2026

wmeddie deleted the spike/wanix-agents branch May 7, 2026 00:34

wmeddie wants to merge 41 commits into main from spike/wanix-agents
