refactor: remove embedded llama.cpp, require Ollama for local LLMs #101

Closed
wmeddie wants to merge 41 commits into main from spike/wanix-agents

@wmeddie wmeddie commented May 6, 2026

Summary

This PR removes support for embedded llama.cpp (GGUF model downloading) from xpressclaw. Local LLM support now requires users to run Ollama separately.

Changes

  • Removed the `local-llm` feature and all llama.cpp dependencies from xpressclaw-core
  • Removed the `local_model_path` config option (users must use Ollama)
  • Kept the `local_model` config option for the Ollama provider (still supported)
  • Removed embedded model download support (GGUF downloading)
  • Removed the `/download-status` endpoint
  • Removed `resolve_gguf_source()` and related GGUF repo mapping code
  • Updated the default LLM provider from `local` to `ollama` in the config template
  • Removed local model pricing tests and zero-cost budget tests

Migration Guide

Users who want to run local LLMs must now:

  1. Install and run Ollama:

  2. Pull a model:

  3. Update xpressclaw config ():

  4. Restart xpressclaw:
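
Before restarting, it can help to confirm that something is listening where Ollama is expected. The following is a minimal sketch using only the Rust standard library. It assumes Ollama's default address `127.0.0.1:11434`; the helper name `ollama_reachable` is hypothetical and not part of xpressclaw.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical helper: returns true if a TCP listener accepts connections
/// at `addr` (for a default Ollama install, "127.0.0.1:11434").
/// It does not verify that the listener is really Ollama.
fn ollama_reachable(addr: &str, timeout: Duration) -> bool {
    match addr.parse::<SocketAddr>() {
        // `connect_timeout` fails if nothing answers within `timeout`.
        Ok(sock) => TcpStream::connect_timeout(&sock, timeout).is_ok(),
        // Not a valid "ip:port" string.
        Err(_) => false,
    }
}
```

Calling `ollama_reachable("127.0.0.1:11434", Duration::from_millis(500))` only checks the port, so a passing result still needs a real request to confirm the model is loaded.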

Benefits

| Before | After |
|---|---|
| ~6GB RAM for embedded llama.cpp | ~0.5GB for Ollama |
| No GPU offloading (CPU-only) | Full GPU passthrough (CUDA/Metal/ROCm) |
| No GPU passthrough | Full GPU passthrough |
| No model hot-restart | Ollama supports live model hot-restart |
| Complex build deps | Zero build deps (just run Ollama) |

Testing

  • ✅ - passes
  • ✅ - passes
  • ✅ - passes

All crates build successfully without the `local-llm` feature.

wmeddie added 30 commits April 12, 2026 08:17
Why: Docker caused persistent friction — networking, image pulls, MCP
tool routing, session continuity, opaque agent workspaces. The spike
replaces Docker with Wanix as the agent execution environment.

What:
- Removed bollard dependency and entire docker/ module
- Gutted reconciler to simple DB state management (no containers)
- Replaced harness-based processor with server-side agentic loop
  that streams from LLM and executes tools (memory, tasks) directly
- Added Wanix runtime assets (wanix.min.js, wanix.wasm) to frontend
- Created WanixPanel component and wanix-bridge TypeScript module
- Created wanix-server/test-boot.mjs proving Wanix kernel boots
  headlessly in Node with browser API shims (OPFS, DOM, etc.)
- Added AGENTS.md with full project documentation

Status: Wanix kernel boots in Node, filesystem mount needs work.
The server-side agentic loop works with streaming + tool execution.
Docker is completely gone from the codebase.
Wanix kernel runs in headless Chrome via puppeteer-core. The boot
sequence allocates a tmpfs capability, mounts it, and binds it at
workspace/ — giving agents a writable filesystem without Docker.

Validated: file write/read, mkdir, nested directories, task creation.
The wanix-server/index.mjs is the headless Wanix host process.

Next: wire the xpressclaw agentic loop to execute tools through
the Wanix instance via page.evaluate().
The agentic loop now routes filesystem tools (Read, Write, ListDir,
MakeDir) through the Wanix headless server at localhost:9100. The
Wanix server runs Wanix in headless Chrome and exposes an HTTP API
for tool execution via page.evaluate().

To test:
1. Start wanix: node wanix-server/index.mjs
2. Start xpressclaw: ./target/release/xpressclaw up
3. Send a message asking the agent to create a file
4. The agent's Write tool call goes through Wanix
The xpressclaw server now starts the Wanix headless server
(node wanix-server/index.mjs) as a managed child process. It
starts automatically on server boot and is killed on shutdown.

No manual commands needed — just start xpressclaw and Wanix
is available for filesystem tool execution.

Also fixed: agent always stores a response even when the last
LLM turn was tool-only (no text).
The task dispatcher was still trying to connect to a Docker harness
on port 0, which always failed. Replaced with the same agentic loop
pattern as the conversation processor: stream from LLM router,
accumulate tool call deltas, execute tools via Wanix, loop.
- Deleted docker/ directory (files were still on disk)
- Memory hooks (recall/remember/consolidate) now call LLM router
  directly instead of HarnessClient
- Removed container_id from agent JSON response
- Task dispatcher fully uses LLM router with tool execution
- harness.rs still exists as dead code (to be cleaned up)
The Wanix server now handles GET /app/{name}/{path} to serve files
directly from the agent's workspace. The xpressclaw app proxy routes
/apps/{app_id}/{path} to the Wanix server.

Agent creates files → writes to Wanix workspace → publish_app saves
to DB → sidebar shows app → clicking app serves from Wanix filesystem.

No Docker containers needed to serve apps.
Small models sometimes call the same tool repeatedly in a loop
(e.g. create_task with identical args on every turn). Now caches
tool results by (name, args) and returns the cached result on
duplicates. Also reduced max turns from 15 to 10.
When the model calls the same tool with identical arguments, it now
gets an error message telling it the result was already returned and
to move on. If an entire turn is nothing but duplicate calls, the
loop breaks immediately.
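
The duplicate handling described in these two commits can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `ToolCache`, `Outcome`, and the message wording are hypothetical, and arguments are compared as raw JSON strings.

```rust
use std::collections::HashMap;

/// Result of handling one tool call (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    /// First time this (name, args) pair is seen; the tool was executed.
    Fresh(String),
    /// Same (name, args) seen before; the model is told to move on.
    Duplicate(String),
}

/// Remembers results keyed by (tool name, raw argument string).
#[derive(Default)]
struct ToolCache {
    seen: HashMap<(String, String), String>,
}

impl ToolCache {
    /// Runs `exec` only for unseen calls; otherwise reports a duplicate.
    fn call<F>(&mut self, name: &str, args: &str, exec: F) -> Outcome
    where
        F: FnOnce() -> String,
    {
        let key = (name.to_string(), args.to_string());
        if let Some(prev) = self.seen.get(&key) {
            return Outcome::Duplicate(format!(
                "Already returned: {}. Do not repeat this call.",
                prev
            ));
        }
        let result = exec();
        self.seen.insert(key, result.clone());
        Outcome::Fresh(result)
    }
}

/// A turn made up only of duplicates is the signal to stop looping.
fn turn_is_all_duplicates(outcomes: &[Outcome]) -> bool {
    !outcomes.is_empty() && outcomes.iter().all(|o| matches!(o, Outcome::Duplicate(_)))
}
```

Comparing raw argument strings treats `{"a":1}` and `{ "a": 1 }` as different calls; normalizing the JSON first would be a stricter variant.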

When max turns are exhausted without producing text, the agent stores
a message explaining it ran out of turns.
Tool calls now require the model to produce reasoning text explaining
WHY it's making the call. This text streams to the chat so users see
the agent's thinking before each tool execution.

Three-layer loop protection:
1. No reasoning text → tool call BLOCKED, model told to explain itself
2. Duplicate reasoning → loop detected, break immediately
3. Duplicate tool args → returns cached result with "ALREADY DONE"

This prevents small models from endlessly repeating the same tool
call. The reasoning text also makes the chat more informative —
users see "I need to create a task for..." instead of just an icon.
The agentic loop now streams reasoning_content from the LLM as
<think>...</think> wrapped chunks, which the frontend renders as
collapsible thinking blocks. This shows the model's reasoning
before tool calls.

Text content transitions close the thinking block automatically.
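
One way to picture the wrapping is a small state machine that opens a `<think>` block on the first reasoning chunk and closes it when regular text arrives. The sketch below assumes a simplified delta shape; `Delta` and `ThinkWrapper` are hypothetical names, not the types used in the agentic loop.

```rust
/// One streamed delta from the LLM (hypothetical shape).
enum Delta<'a> {
    Reasoning(&'a str),
    Text(&'a str),
}

/// Tracks whether a <think> block is currently open.
#[derive(Default)]
struct ThinkWrapper {
    open: bool,
}

impl ThinkWrapper {
    /// Converts one delta into the chunk that would be streamed to the frontend.
    fn push(&mut self, delta: Delta<'_>) -> String {
        match delta {
            // First reasoning chunk opens the block.
            Delta::Reasoning(s) if !self.open => {
                self.open = true;
                format!("<think>{}", s)
            }
            Delta::Reasoning(s) => s.to_string(),
            // A transition to text closes the block automatically.
            Delta::Text(s) if self.open => {
                self.open = false;
                format!("</think>{}", s)
            }
            Delta::Text(s) => s.to_string(),
        }
    }

    /// Closes a block left open at the end of the stream.
    fn finish(&mut self) -> &'static str {
        if self.open {
            self.open = false;
            "</think>"
        } else {
            ""
        }
    }
}
```

Feeding `Reasoning("plan ")`, `Reasoning("more")`, `Text("Done.")` and then calling `finish()` yields the concatenated stream `<think>plan more</think>Done.`.
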
- Text content that precedes tool calls is now buffered and emitted
  as a <think> block, so it renders as collapsible reasoning — not
  as regular chat text. Text-only turns (no tools) still stream normally.
- The BLOCKED prompt now asks for ONE concise sentence, not a verbose
  explanation.
- The model's CoT (reasoning_content) still renders as its own think
  blocks, separate from this buffered text.
Appends instruction to every agent's system prompt: "When using tools,
write ONE short sentence explaining your intent. No plans, no lists."
The reasoning gate was blocking tool calls when the model produced
thinking in reasoning_content but no content text. Small models then
got confused by the BLOCKED message and generated garbage.

Fix: accept either content text OR reasoning_content as valid
reasoning. Drop the BLOCKED retry flow entirely — just use
duplicate detection to prevent loops. The system prompt now says
"write ONE short sentence stating what you're about to do" with
an example.
The duplicate reasoning check was breaking after 1 tool call because
the model's CoT thinking produces similar hashes across turns. Now
loop detection keys on the actual tool calls (name + args). Only
breaks when the exact same set of tool calls repeats — which is the
actual loop condition.
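
Keying loop detection on the tool calls themselves might look like the sketch below. It assumes a repeat of any earlier turn counts as a loop and that arguments are compared as raw strings; `LoopDetector` and `TurnKey` are hypothetical names.

```rust
use std::collections::{BTreeSet, HashSet};

/// Order-independent signature of the tool calls made in one turn.
type TurnKey = BTreeSet<(String, String)>;

#[derive(Default)]
struct LoopDetector {
    seen_turns: HashSet<TurnKey>,
}

impl LoopDetector {
    /// Returns true when this exact set of (name, args) calls was already seen.
    fn is_repeat(&mut self, calls: &[(&str, &str)]) -> bool {
        // Turns without tool calls cannot form a tool loop.
        if calls.is_empty() {
            return false;
        }
        let key: TurnKey = calls
            .iter()
            .map(|(name, args)| (name.to_string(), args.to_string()))
            .collect();
        // `insert` returns false if the key was already present.
        !self.seen_turns.insert(key)
    }
}
```

Because the key ignores the model's reasoning text, two turns with different thinking but identical calls are still treated as the same turn.
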
The llama-server has limited connection slots. Reqwest's connection
pooling kept connections open after streaming, blocking subsequent
requests. Adding Connection: close ensures each request fully
releases the connection before the next one.
Gemma 4 leaks special tokens like <channel|>, <turn|>, and literal
"thought" text into the content field. These are now filtered from
both the streaming output and the stored message.
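
A minimal sketch of such a filter is shown below. It covers only the two tokens named in this commit message; the real list is likely longer, the handling of the literal "thought" text is not reproduced, and `strip_leaked_tokens` is a hypothetical name.

```rust
/// Tokens named in the commit message; the real list may be longer.
const LEAKED_TOKENS: [&str; 2] = ["<channel|>", "<turn|>"];

/// Removes every occurrence of the known leaked tokens from `text`.
fn strip_leaked_tokens(text: &str) -> String {
    let mut cleaned = text.to_string();
    for &token in LEAKED_TOKENS.iter() {
        cleaned = cleaned.replace(token, "");
    }
    cleaned
}
```

Applied per streamed chunk, this simple version would miss a token split across two chunks, so a streaming filter would need to buffer partial matches.
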
Tasks created via the create_task tool were only inserted into the
tasks table but never enqueued in task_queue. The dispatcher only
processes queued tasks, so created tasks sat in pending forever.

Now create_task also calls queue.enqueue() so the task dispatcher
picks it up and executes it.
LlmRouter::new() creates an empty router with no providers.
The dispatcher needs build_from_config() to register the OpenAI
provider with the correct base_url and API key.
Two proper fixes:

1. Background processor scanner: a 5-second polling loop that checks
   all conversations for unprocessed messages and spawns processors.
   This catches messages injected by the task dispatcher, connectors,
   or any source other than the HTTP handler. Same pattern as the
   reconciler — reliable polling, no event threading required.

2. System notification styling: messages from the task dispatcher
   (sender_id="system", message_type="task_wake") now render as
   small, centered, muted notifications instead of user message
   bubbles. The SYSTEM: prefix is stripped from display.
Two fixes:

1. Scanner race condition: the background scanner now only picks up
   unprocessed messages older than 10 seconds. This prevents racing
   with the HTTP handler which spawns processors immediately. Added
   oldest_unprocessed_age() to ConversationManager.

2. Task card in conversation: create_task now injects a system message
   with task JSON so the conversation page renders the task card
   immediately when a task is created, not just when it completes/fails.
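
The age gate from the first fix can be sketched as below. The 10-second threshold comes from the commit message; `age_of` and `scanner_should_pick_up` are hypothetical helpers standing in for whatever `oldest_unprocessed_age()` feeds into.

```rust
use std::time::{Duration, SystemTime};

/// Minimum age before the background scanner may claim a message.
const MIN_UNPROCESSED_AGE: Duration = Duration::from_secs(10);

/// How long a message created at `created_at` has been waiting at `now`.
fn age_of(created_at: SystemTime, now: SystemTime) -> Duration {
    // A timestamp in the future (clock skew) is treated as age zero.
    now.duration_since(created_at).unwrap_or(Duration::ZERO)
}

/// The scanner only spawns a processor once the message is old enough,
/// leaving fresh messages to the HTTP handler.
fn scanner_should_pick_up(age: Duration) -> bool {
    age > MIN_UNPROCESSED_AGE
}
```

The threshold trades latency for safety: messages injected by the dispatcher wait up to roughly 10 seconds plus one polling interval before a processor picks them up.
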
The conversation page renders task cards for messages with
message_type='task_status'. The create_task tool was using
'task_created' which the frontend didn't recognize.
4096 tokens was too small — models with large context windows
(262k) were hitting finish_reason=length before completing tool
call arguments, especially for file writes with long content.

32768 gives enough room for multi-file tool calls while still
being conservative. This should really be configurable per-model.
Task status messages in conversations are now updated in-place
instead of creating new messages for each status change. The
dispatcher looks for an existing task_status message with the
same task_id and updates its content. Only creates a new message
if none exists.
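
The update-or-create behaviour can be illustrated with an in-memory list standing in for the messages table. The `Message` struct and `upsert_task_status` function are hypothetical; the real code works against the database.

```rust
/// Minimal message record (hypothetical; the real schema lives in the DB).
#[derive(Debug, Clone, PartialEq)]
struct Message {
    message_type: String,
    task_id: Option<String>,
    content: String,
}

/// Updates the existing task_status message for `task_id`, or appends one.
/// Returns true if a new message was created.
fn upsert_task_status(messages: &mut Vec<Message>, task_id: &str, content: &str) -> bool {
    let existing = messages.iter().position(|m| {
        m.message_type == "task_status" && m.task_id.as_deref() == Some(task_id)
    });
    match existing {
        Some(idx) => {
            // Same card, new status.
            messages[idx].content = content.to_string();
            false
        }
        None => {
            messages.push(Message {
                message_type: "task_status".to_string(),
                task_id: Some(task_id.to_string()),
                content: content.to_string(),
            });
            true
        }
    }
}
```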

This means the conversation shows one card per task that
transitions: pending → in_progress → completed/failed.
Three changes for proper task card UX:

1. TaskCard component: polls the task API every 3s to get live
   status, subtask progress, and title updates. Stops polling
   when the task reaches a terminal state (completed/failed).

2. Task card appears after agent response: the create_task tool
   returns a marker, and the processor injects the card + broadcasts
   it via the event bus AFTER storing the agent's response. This
   ensures correct ordering in the conversation.

3. Event bus broadcast: the task card message is broadcast so the
   frontend sees it appear live without needing a page reload.
Two fixes:

1. Wanix process was piping stdout/stderr but nobody read them.
   When the pipe buffer filled (Chrome console output), the process
   blocked. Changed to null stdout + inherit stderr so output goes
   to the server's terminal without blocking.

2. Task dispatcher now has same fixes as conversation processor:
   garbage token filtering, tool intent prompt, loop detection
   via seen_tool_keys set.
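
The stdio change in the first fix corresponds to the following `std::process` configuration. This is a sketch rather than the actual spawn code: the function name `wanix_command` is hypothetical, and the command is only built here, not spawned.

```rust
use std::process::{Command, Stdio};

/// Builds (but does not spawn) the child command with stdio settings
/// that cannot block on an unread pipe.
fn wanix_command(script: &str) -> Command {
    let mut cmd = Command::new("node");
    cmd.arg(script)
        // Discard stdout so a full pipe buffer can never stall the child.
        .stdout(Stdio::null())
        // Let stderr flow straight to the parent's terminal.
        .stderr(Stdio::inherit());
    cmd
}
```

With `Stdio::piped()` the parent must keep draining the pipe; `null` and `inherit` remove that obligation entirely.
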
When the model leads with tool calls (no text content), the dispatcher
was storing "(No response)" because full_content was empty. Now tool
execution results are accumulated into full_content, so the task
message shows what the agent actually did:

  **Write**: Wrote 500 bytes to calculator.py
  **search_memory**: No memories found.

Also added loop detection (same as conversation processor) to
break when the same tool calls repeat.
The workspace tmpfs mount wasn't ready when the first tool call
arrived. Now ensureWorkspace() runs before every tool/app request,
mounting a tmpfs if the workspace doesn't exist yet. Idempotent.
The task dispatcher was giving the agent ALL tools including
list_tasks, create_task, and search_memory. The agent spent all
its turns checking task status instead of doing the actual work.

Now tasks get a focused tool set: Write, Read, ListDir, MakeDir,
and complete_task. No list_tasks, no create_task, no search_memory.

Also: continuation prompt now includes the task title and description
so the model knows what to build, and tells it not to call list_tasks.
Per-tool call counter breaks after 3 calls to the same tool name.
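
A per-tool counter of this kind could look like the sketch below. It assumes the loop breaks as soon as a tool reaches its third call, which is one reading of "after 3 calls"; `ToolCallCounter` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Limit taken from the commit message.
const MAX_CALLS_PER_TOOL: u32 = 3;

#[derive(Default)]
struct ToolCallCounter {
    counts: HashMap<String, u32>,
}

impl ToolCallCounter {
    /// Records one call to `tool`; returns true once the limit is reached.
    fn record(&mut self, tool: &str) -> bool {
        let n = self.counts.entry(tool.to_string()).or_insert(0);
        *n += 1;
        *n >= MAX_CALLS_PER_TOOL
    }
}
```

Unlike the (name, args) check, this counts by tool name alone, so it also stops loops where the arguments vary slightly on each call.
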
…isting

Two critical fixes:

1. Task dispatcher now includes previous task conversation messages
   as LLM context. Previously it only sent [system, current_prompt]
   so the model had no memory of what it already did across turns —
   causing it to repeat MakeDir and ListDir endlessly.

2. Wanix ListDir with empty/"." path now resolves to "workspace"
   instead of "workspace/" or "workspace/." which failed.
wmeddie added 11 commits April 12, 2026 20:10
The TaskCard was showing "0/21 subtasks" because it used the global
task counts from the API instead of counting the actual subtask list.
Now counts tasks from sub.tasks array directly, so only real subtasks
(with matching parent_task_id) are shown.
The health check was showing "Reconnecting..." after a single 3s
timeout, which happens regularly when the LLM is processing a long
request. Now uses /api/health (lightweight) instead of /api/agents,
5s timeout, and requires 3 consecutive failures before showing
the overlay. Prevents flickering during normal operation.
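
The debouncing logic is small enough to sketch. The real check runs in the frontend; the version below only shows the counting rule in Rust, with `HealthTracker` as a hypothetical name.

```rust
/// Number of consecutive failed probes before the overlay appears.
const FAILURES_BEFORE_OVERLAY: u32 = 3;

#[derive(Default)]
struct HealthTracker {
    consecutive_failures: u32,
}

impl HealthTracker {
    /// Feeds one probe result; returns whether the overlay should be visible.
    fn observe(&mut self, probe_ok: bool) -> bool {
        if probe_ok {
            // Any success resets the streak.
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
        }
        self.consecutive_failures >= FAILURES_BEFORE_OVERLAY
    }
}
```
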
Created the Docker image for running pi-agent inside xpressclaw:
- containers/pi-agent/Dockerfile — Node 20 + pi + mcpfs + Go
- containers/pi-agent/entrypoint.sh — starts mcpfs mount then pi RPC
- containers/pi-agent/AGENTS.md — workspace context for the agent
- containers/pi-agent/extensions/xpressclaw-provider/ — pi extension
  that registers the local llama-server as an OpenAI-compatible provider

Validated: pi boots in the container, loads the extension, connects
to the local llama-server via the custom provider, and streams RPC
events (thinking_delta, text_delta, tool calls) over stdout.

RPC protocol: send {"type":"prompt","message":"..."} on stdin,
receive JSONL events on stdout.
Adds an opt-in backend that spawns pi-coding-agent inside a c2w
Linux VM (WASM) and talks to it over JSONL on stdin/stdout, replacing
the Rust-native agent loop when `pi.enabled = true`.

- `agents/pi_rpc.rs`: subprocess client for c2w-net + pi WASM, JSONL
  prompt/event protocol, text/thinking delta parsing.
- `routes/mcp_server.rs`: streamable-HTTP MCP endpoint at `/mcp` so the
  pi container's mcpfs can mount xpressclaw tasks/memory as files.
- `config::PiConfig`: wasm_path, c2w_net, wasmtime_shim, NAT-gateway
  URLs (192.168.127.254), LLM model defaults.
- `processor::run_pi_agent` and `dispatcher::call_agent_pi`: branch on
  config at the top of the loop; old Rust path remains as fallback.
- `scripts/build-pi-agent-wasm.sh`: docker build + c2w convert.
- `wasm-agents/wasmtime-shim`: bash shim that injects --env flags c2w-net
  doesn't forward (WASMTIME_EXTRA_ENV).
- `entrypoint.sh`: default LLM_PROVIDER=xpressclaw, LLM_MODEL=local,
  XPRESSCLAW_URL=http://192.168.127.254:8935.

End-to-end verified manually: echo JSONL prompt → c2w-net -invoke
pi-agent.wasm --net=socket → pi streams thinking/text deltas from the
host llama-server via the xpressclaw custom provider extension.
Makes xpressclaw fully testable on the container2wasm backend. Pi is
now the default (`config.pi.enabled = true`).

Persistent pool
- `PiPool` caches one pi WASM subprocess per agent_id and reuses it
  across prompts. Amortizes the ~30s Bochs boot over the whole
  session. Dead processes auto-evict; errors evict too so the next
  prompt gets a fresh container.
- `shutdown_all` wired into the server's shutdown signal.
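
The caching and eviction rules of the pool can be sketched generically. This is not the `PiPool` implementation: the `Worker` trait and `Pool` type are hypothetical stand-ins, and the real pool manages subprocesses rather than plain values.

```rust
use std::collections::HashMap;

/// Stand-in for a pi WASM subprocess handle (hypothetical).
trait Worker {
    fn is_alive(&self) -> bool;
}

/// One cached worker per agent_id.
struct Pool<W: Worker> {
    workers: HashMap<String, W>,
}

impl<W: Worker> Pool<W> {
    fn new() -> Self {
        Pool { workers: HashMap::new() }
    }

    /// Returns the cached worker for `agent_id`, spawning one if missing or dead.
    fn get_or_spawn<F: FnOnce() -> W>(&mut self, agent_id: &str, spawn: F) -> &mut W {
        let dead = self.workers.get(agent_id).map_or(false, |w| !w.is_alive());
        if dead {
            // Dead processes are evicted before a replacement is spawned.
            self.workers.remove(agent_id);
        }
        self.workers.entry(agent_id.to_string()).or_insert_with(spawn)
    }

    /// Called after an error so the next prompt gets a fresh worker.
    fn evict(&mut self, agent_id: &str) {
        self.workers.remove(agent_id);
    }
}
```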

Tool-execution events → messages
- `PiTurnResult` now carries a Vec<PiToolCall> with params and
  results (parsed from tool_execution_start / tool_execution_end).
- Processor persists each tool call as a `tool_call` message and
  emits it over the conversation SSE stream.
- "Created task" MCP results produce a `task_status` card, matching
  the old Rust loop's TASK_CREATED marker.

Dispatcher live streaming
- `start_dispatcher` now takes the pi_pool and event_bus; task runs
  stream text deltas to the linked conversation (if any) AND
  incrementally update the task message content, so the task UI
  shows real-time progress instead of only the final blob.

Terminal / Logs tab (tmux-style)
- New `pi_terminal` module broadcasts every pi stdout/stderr line
  on a per-agent channel with a 500-line tail replay.
- `/api/agents/{id}/terminal` SSE route streams PiTerminalLine events.
- `LogsTab.svelte` rewritten as a live terminal view: dark bg,
  monospace, stdout/stderr colour split, timestamped, autoscrolling,
  auto-reconnecting EventSource.
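
The 500-line tail replay amounts to a bounded buffer that drops the oldest line when full. A minimal sketch, with `TailBuffer` as a hypothetical name and the broadcast channel itself left out:

```rust
use std::collections::VecDeque;

/// Keeps the most recent `capacity` lines for replay to late subscribers.
struct TailBuffer {
    capacity: usize,
    lines: VecDeque<String>,
}

impl TailBuffer {
    fn new(capacity: usize) -> Self {
        TailBuffer { capacity, lines: VecDeque::with_capacity(capacity) }
    }

    /// Appends a line, discarding the oldest one once the buffer is full.
    fn push(&mut self, line: &str) {
        if self.capacity == 0 {
            return;
        }
        if self.lines.len() == self.capacity {
            self.lines.pop_front();
        }
        self.lines.push_back(line.to_string());
    }

    /// Snapshot of the buffered lines, oldest first.
    fn replay(&self) -> Vec<String> {
        self.lines.iter().cloned().collect()
    }
}
```

A new SSE subscriber would first receive `replay()` and then the live lines, so the Logs tab shows recent context immediately.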

wanix-server removed
- Server no longer spawns `node wanix-server/index.mjs`.
- `find_wanix_server` helper deleted.
- wanix-server/ source tree untouched for now (can be removed wholesale
  in a follow-up cleanup).

Snapshotting is still a stub — the current bundle is baked into
the WASM so there's nothing the agent can modify at runtime yet;
snapshot work waits on c2w bind-mounts for ~/.pi/.
Three test-blockers fixed so `xpressclaw up` works without manual
config:

- `PiProcess::spawn` now resolves `wasm_path` and `wasmtime_shim`
  against cwd, $XPRESSCLAW_REPO, exe-dir, and the dev-tree root.
  Returns a clear error if missing instead of crashing inside c2w-net.
- `c2w-net` is now looked up via $PATH and ~/.local/bin, ~/bin, ~/.cargo/bin
  before erroring out.
- Three remaining `Config { ..Default::default() }` sites in setup.rs
  (GGUF download completion, async-download path, add_agent) now
  preserve `old_config.pi` instead of silently resetting to defaults
  when the LLM router is rebuilt mid-session.
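
The multi-location lookup in the first item can be sketched as a search over candidate base directories. `resolve_asset` is a hypothetical helper; the caller would pass cwd, `$XPRESSCLAW_REPO`, the exe directory, and the dev-tree root in the order it prefers.

```rust
use std::path::{Path, PathBuf};

/// Returns the first `base/relative` that exists on disk,
/// or an error message listing every base that was searched.
fn resolve_asset(relative: &Path, bases: &[PathBuf]) -> Result<PathBuf, String> {
    for base in bases {
        let candidate = base.join(relative);
        if candidate.exists() {
            return Ok(candidate);
        }
    }
    let tried: Vec<String> = bases.iter().map(|b| b.display().to_string()).collect();
    Err(format!(
        "{} not found; searched: {}",
        relative.display(),
        tried.join(", ")
    ))
}
```

Returning the searched locations in the error is what turns a crash inside the child process into an actionable message.
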
When an agent transitions to desired=running, the reconciler now
spawns the pi WASM container in the background so the Logs tab
shows the boot immediately and the first user message has a hot
process waiting for it. Stop transitions evict the cached process.

Lifecycle moves into the reconciler — single source of truth for
agent → pi-process correspondence. Removes the lazy spawn-on-first-
prompt surprise.
Indented blocks in doc comments are interpreted as doctest code.
The box-drawing characters (──, ▶, │, └) tripped the lexer with
'unknown start of token: \u{2500}'. Wrapped each block in
```text``` fences and replaced the unicode glyphs with ASCII so
the diagram still renders cleanly in rustdoc and on GitHub.
… LLMs

- Removed local-llm feature and all llama.cpp dependencies from xpressclaw-core
- Removed local_model_path config option (users must now use Ollama)
- Kept local_model config option for Ollama provider (still supported)
- Removed embedded model download support (GGUF downloading)
- Removed /download-status endpoint
- Removed resolve_gguf_source() and related GGUF repo mapping code
- Removed local-llm cfg guards from state.rs, setup.rs, and router.rs
- Updated default LLM provider from 'local' to 'ollama' in config template
- Removed local_model pricing test and local_model_zero_cost budget test
- All crates build successfully without local-llm feature
@wmeddie wmeddie closed this May 7, 2026
@wmeddie wmeddie deleted the spike/wanix-agents branch May 7, 2026 00:34