hal0 turns a Linux box — ideally a Ryzen AI Max+ 395 with 128 GB of unified memory — into a polished, OpenAI-compatible inference appliance. A unified runtime, hardware-aware slots, prewired chat UI, signed self-update. One command installs the lot.
It is not another llama-server wrapper. v0.2 ships AMD's Lemonade
Server as the unified inference runtime behind a thin hal0 capability
layer. Every workload — chat, embed, rerank, STT, TTS, image gen —
runs as a real, named slot in capabilities.toml; the API surface
covers the full OpenAI-compatible matrix; the dashboard is for
operating the box, not chatting with it.
curl -fsSL https://hal0.dev/install.sh | bashStatus: v0.2.0 — first release on the Lemonade runtime. Six per-modality toolbox containers, the
hal0-slot@.servicetemplate, and the legacy Provider stack all retired in favour of onehal0-lemonade.servicesupervising a singlelemonddaemon. v0.1.x installs do not auto-upgrade — seedocs/v0.2-upgrade.mdfor the back-up + wipe
- reinstall procedure and the
hal0 registry importrecovery path. First run lands a bundle picker (hal0-Lite/Default/Pro/Max+LMX-Omni-52B-Halo) —capabilities.tomlships empty by design, no silent default. Expect rough edges: APIs may shift, KV% for GPU slots reads—in v0.2 (see ADR-0008 §Costs), and we don't promise upgrade compatibility across0.2.xtags. SeePLAN.md§1 for what ships now and the path to v1.0.
Strix Halo native, but not Strix-Halo-only. The probe is UMA-aware
on Strix Halo and falls back to portable parsers (/proc/cpuinfo,
/proc/meminfo, lspci) on every other host. The platform field on
/api/hardware resolves to one of strix-halo, wsl2, lxc,
proxmox-kvm, kvm, bare-metal-{nvidia,amd,intel}-gpu, or
bare-metal-cpu-only, and the dashboard only labels memory as
"unified" when it actually is. The slot-fit warnings size against the
real unified pool, not a BAR carve-out. The XDNA NPU is a first-class
device — opt-in even at Pro/Max tier, packing three model roles into
one process via the FLM trio (see below). 128 GB Ryzen AI Max+ 395 is
the reference deployment — not a hopeful port.
One inference runtime, six modalities. Lemonade Server's per-type LRU policy lets chat, embed, rerank, STT, TTS, and image generation share one process pool. Concurrent chat + embed + image on Strix Halo measured at 1.10s wall-clock under spike #2 with zero evictions; chat
- embed alone sustains the same ~258 tok/s baseline the v0.1 toolbox
stack hit, with
--parallel 1 --threads Nmandatory in the daemon config to keep the Vulkan dispatch from oversubscribing CPU cores (memory:hal0_lemonade_threads_deadlock).
NPU multi-role via the FLM trio. On Strix Halo with FastFlowLM
installed, a single flm serve process hosts chat + transcription +
embedding concurrently on the one AMDXDNA hardware context — ~2 GB
NPU memory, gemma3:1b at 40 tok/s + Whisper-V3-Turbo + Embedding-Gemma
all coresident. hal0 exposes this as three slots (agent, stt-npu,
embed-npu) that all back the same FLM child via direct port dispatch.
See ADR-0009 for
why.
Dispatcher with single-flight + OmniRouter tool-calling. Registry-aware
routing across local slots and external upstreams (OpenRouter,
Anthropic, OpenAI, custom). v0.2 adds a client-side OmniRouter loop:
8 OpenAI tool-calling tools (image generation/editing, TTS,
transcription, vision, embed, rerank, route_to_chat) dispatched
through hal0 from any chat slot whose model carries the tool-calling
label. The set is filtered per request — a tool only appears in the
prompt if its target slot is enabled and its model carries the
required labels.
Reliability bar. Atomic env file writes. Schema-validated TOML.
Structured error envelopes ({"error":{"code":"slot.not_ready",...}}).
Cosign-verified self-update with one-flag rollback. Per-type LRU
concurrency with active-inference protection — a serving slot cannot
be evicted out from under a streaming request.
- OpenAI-compatible
/v1/*API — chat, completions, embeddings, rerank, audio transcriptions, audio speech, image generations, image edits, models. Drop-in for any OpenAI SDK; point your client athttp://localhost:8080/v1and go. - Slots — each named target in
capabilities.tomlcarries a Lemonade-vocabtype(llm | embedding | reranking | transcription | tts | image), adevice(gpu-rocm | gpu-vulkan | cpu | npu), amodel, plusenabledand optionaldefault. Six seeded slots (primary,embed,rerank,stt,tts,img) plus three NPU slots (agent,stt-npu,embed-npu) when FastFlowLM is installed. User-added slots viahal0 slot add NAME --type TYPE --model MODEL. - Lemonade as the unified runtime — one
lemondprocess, loopback-only on:13305, supervised byhal0-lemonade.service. Cache + config at/var/lib/hal0/lemonade/. The hal0 capability layer is the only user-facing inference surface; Lemonade is treated as an internal runtime, never exposed off-box. - Bundle picker on first run —
capabilities.tomlships empty by design. The dashboard's first load surfaces four hardware-anchored tiers (hal0-Lite≥16 GB /Default≥32 GB /Pro≥64 GB /Max≥100 GB) plus the vendor-blessedLMX-Omni-52B-Halokit. Tiers that don't fit the detected unified RAM grey out with a tooltip. See ADR-0010 for the no-silent-default rationale. - Hardware-aware probe — detects GPU / NPU / unified memory,
writes
/etc/hal0/hardware.json, surfaces VRAM/RAM fit warnings inline in the slot form and the bundle picker. - Dispatcher — registry-aware routing, cold-cache prefetch, upstream fallback (OpenRouter, Anthropic, OpenAI, custom OpenAI-shaped endpoints). Mix local + remote per-model in one config.
- Dashboard — Vue 3 + Tailwind 4 UI for slot/model management,
hardware-aware configuration, live logs, and system health. SSE-backed
status + log tail. Lemonade
/logs/streamfolded into the Journal panel. Settings → Lemonade admin panel surfaces the daemon's/internal/configfor inspection. Dark by default. - OpenWebUI prewired — chat at
:3001, zero config. The installer writesopenwebui.envpointing at the local hal0 API. - OmniRouter (8 tools) —
generate_image,edit_image,text_to_speech,transcribe_audio,analyze_image(vision),embed_text,rerank_documents,route_to_chat. Dispatched client-side from chat slots; dynamically filtered per request. - Image generation, day one —
POST /v1/images/generationsviasd-cpp(Lemonade-bundled). Bundle manifests pre-pick SDXL Turbo / Flux-2-Klein-9B as fits the tier. - First-run wizard + bundle picker — bundle pick (or "Skip — configure manually") → models download in background.
- Atomic self-update with rollback —
hal0 update --channel stable|nightly. Cosign-verified tarballs swap a/usr/lib/hal0/currentsymlink;--rollbackreverts. - One-line install —
curl -fsSL https://hal0.dev/install.sh | bash(--models-dir=PATHorHAL0_MODELS_DIR=PATHredirects model pulls off/var/lib/hal0/models). The bootstrap fetches the release manifest, sha256-verifies the tarball, cosign-verifies the signature against the workflow OIDC identity, then hands off toinstaller/install.sh. v0.1.x installs are detected and refused with explicit backup/wipe instructions (seedocs/v0.2-upgrade.md).
hal0 ships two MCP servers and one bundled agent app. The MCP
servers (/mcp/admin for slot / model / capability / config / hardware
/ log admin and /mcp/memory for Cognee-backed long-term memory) are
reachable by any MCP-speaking client — Claude Code, future RAG
services, external scripts. The bundled agent is single-pick at install:
pi-coder (CLI shape, installed from Hal0ai/pi-mono fork via
@earendil-works/pi-coding-agent on npm) or Hermes-Agent (service
shape, installed via the hal0-owned hal0-hermes wrapper around
upstream hermes). Pick one via the first-run wizard or hal0 agent install <name>; swap atomically with --switch. Capital-D destructive MCP
calls (model_pull, slot_delete, config_write, etc.) gate through
a header bell + inbox modal in the dashboard, with CLI parity via
hal0 agent approvals {list,approve,deny}. See
docs/api/mcp.md and
docs/api/agents.md.
v0.2 routes every inference workload through Lemonade Server. The table below lists the Lemonade recipe each capability uses, plus the hardware device it targets.
| Capability | Lemonade recipe | Device | Notes |
|---|---|---|---|
| chat + embed + rerank | llamacpp |
Vulkan / ROCm / CPU | --parallel 1 --threads N mandatory in lemond config |
| chat + STT + embed (NPU) | flm:npu |
AMD XDNA (opt-in) | FLM trio: one process, --asr 1 --embed 1, ~2 GB NPU mem |
| transcription | whisper.cpp |
Vulkan / CPU | Replaces v0.1 Moonshine for the CPU/GPU path |
| TTS | kokoro:cpu |
CPU | [CPU] chip + tooltip in dashboard; GPU TTS deferred to v0.3 |
| image | sd-cpp |
ROCm / Vulkan | SD Turbo / Flux-2-Klein-9B selectable per bundle tier |
The NPU path is opt-in: a FastFlowLM .deb install (manual on Linux —
the Lemonade auto-installer is Windows-only as of v0.2) unlocks the
three FLM slots. With FLM present, the bundle picker's Pro and Max
tiers surface the trio as a toggle; the trio only auto-enables when a
bundled agent is being installed in the same first-run flow.
The hal0 toolbox containers (hal0-toolbox-vulkan / rocm / flm /
moonshine / kokoro / comfyui) and the hal0-slot@.service
template that supervised them are retired. v0.2's lemond daemon
takes their place. See ADR-0008
for the migration rationale.
Linux + systemd is the only hard requirement
(installer/install.sh:86). macOS and Windows
are not in scope for v1.
| Tier | Hardware | Status |
|---|---|---|
| First-class | AMD Ryzen AI Max+ 395 ("Strix Halo") with iGPU + XDNA NPU + 128 GB unified | Reference deployment. All published perf numbers come from this box. |
| First-class | AMD Ryzen AI Max 385 / 390 with 64 GB unified | Same path; small + mid tiers fit, 70B Q4 with shorter context. |
| Supported | NVIDIA RTX 30/40/50 (10–32 GB) | CUDA-backed llamacpp via Lemonade. Same slot lifecycle, dedicated VRAM instead of UMA. |
| Supported | AMD Radeon RX 7000 / discrete (16–24 GB) | ROCm via Lemonade's llamacpp recipe; Vulkan fallback. |
| Fallback | CPU-only x86_64 | Lemonade's CPU path. Usable for tiny models / smoke tests, not the headline experience. |
hal0/
├── src/hal0/ # Python package (FastAPI API + capability layer + Lemonade client + CLI)
│ ├── lemonade/ # HTTP client + catalog_sync + metrics_shim + log_proxy
│ └── omni_router/ # client-side tool-calling loop + tool definitions
├── ui/ # React 18 + TypeScript + Vite + Tailwind 4 dashboard (v3, in flight)
├── ui-vue.bak/ # v0.2.1 Vue 3 dashboard preserved verbatim for reference
├── installer/ # install.sh (writes /var/lib/hal0/lemonade/config.json + hal0-lemonade.service)
├── tests/ # pytest suite (α unit, β integration, γ release-gate)
├── docs/ # user docs (mirror of hal0.dev/docs); dev docs under docs/internal/
└── PLAN.md # roadmap (current cut: v0.2; path to v1.0)
Models live under /var/lib/hal0/models/<recipe>/<capability>/ — the
canonical app-visible tree Lemonade's extra_models_dir points at.
Per-leaf symlinks redirect to /mnt/ai-models/ when a separate
storage volume is in use. See PLAN.md §6.1 for the layout.
# backend
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
hal0 serve --reload
# frontend
cd ui
npm install
npm run devThe API lives at http://127.0.0.1:8080, the dashboard at the Vite dev
server URL (usually http://127.0.0.1:5173).
Run hal0 doctor any time to re-check pre-flight (systemd / python /
disk / ports). hal0 model pull <ref> streams models from Hugging Face
into the registry under the user.* namespace, and hal0 uninstall [--keep-data] tears down a running install (thin wrapper over
installer/uninstall.sh).
Per ADR-0012 (supersedes
ADR-0001), hal0 ships with no built-in auth and no bundled TLS.
hal0-api binds 0.0.0.0:8080 and treats every request as
authenticated by virtue of network reachability — the right posture
for a homelab appliance on a trusted LAN.
If hal0 is reachable from anything you don't physically control,
front it with an upstream reverse proxy that owns auth + TLS — Traefik,
nginx, Cloudflare Tunnel, or your own preferred edge. See
docs/operate/auth.mdx for example configs
of each.
If hal0 runs inside a Proxmox LXC, the container only sees its own
cgroup slice of memory — other tenants, ZFS ARC, and the host kernel
draw from the same physical DIMMs as GPU GTT but are invisible from
inside. To surface that, drop a read-only PVEAuditor API token into
the dashboard's Settings → "Proxmox integration" panel. Once saved,
the unified-memory bar swaps to the physical host's DIMM total and
adds a muted "Proxmox host" segment for other-tenant + kernel
pressure. Token is sensitive and stored 0600 at
/etc/hal0/proxmox.json; the API never echoes it back. Bare-metal and
VM installs leave the panel off and the dashboard stays quiet.
No dates — items are direction. The closer to the left, the closer to running on your box. Full version at hal0.dev/roadmap.
- Lemonade Server unified inference runtime — six per-modality
toolbox containers + the
hal0-slot@.servicetemplate retire;hal0-lemonade.servicesupervises onelemonddaemon - FLM trio NPU packing — chat + ASR + embed coresident on one
AMDXDNA hardware context via
--asr 1 --embed 1 - OmniRouter client-side tool-calling — 8 tools, dynamic
per-request filtering,
route_to_chatcross-slot delegation - First-run bundle picker —
hal0-Lite/Default/Pro/Max+LMX-Omni-52B-Halo, hardware-anchored, no silent default - Per-type LRU concurrency — six type budgets (LLM, embedding, reranking, transcription, tts, image), nuclear-evict only on unrecognised load errors
- Settings → Lemonade admin panel —
/internal/configsnapshot +/internal/setatomic config writes - Journal panel — Lemonade
/logs/streamfolded into Logs tab - Metrics shim — per-slot TTFT + tok/s + prompt_tokens via
/v1/stats; FLM-native KV% on NPU slots hal0 registry import— one-shot v0.1.x → v0.2 registry recovery from the backup tarball- Carried forward from v0.1.x: OpenAI-compatible
/v1/*, portable hardware probe, capability slots overlay + orchestrator, dispatcher with single-flight + cold-cache prefetch, bundled OpenWebUI on:3001, auth (password + tokens) in FastAPI, cosign-keyless self-update with rollback, bundled agents (pi-coder / Hermes-Agent)- MCP admin + Cognee memory MCP
- GPU-accelerated TTS — kokoro-vulkan or successor; closes the
[CPU]chip on the voice slot card - KV% for GPU slots — Lemonade upstream or hal0-built llama-server
swap-in; v0.2 ships
— - Phase 8 polish + advanced memory — Cognee graph + Memify pipeline enabled, MCP client side of hal0 (agents reach external MCP servers), federated memory across local + remote sources
- Benchmarks & presets UI — in-dashboard tok/s + latency runs, plus curated loadout presets you can flash onto a fresh install
- AUR PKGBUILD & Ubuntu PPA — native distro packages on top of the install script; pacman and apt as first-class install paths
hal0.localmDNS auto-discovery polish- Light mode toggle
- Multi-host federation — a slot mesh across LAN boxes — primary
on the Strix Halo, embed on the workstation, all behind one
/v1/*surface - Fine-tune & LoRA hot-swap — attach and rotate LoRAs against a warm base model without unloading the underlying weights
- Per-model rate limits & budgets — cost-style accounting for local inference — cap a chatty agent without taking the whole box down
- Voice mode end-to-end — Lemonade's
/realtimeWebSocket + agent loop stitched into a hands-free streaming conversation - ChatOps adapters — Slack and Matrix bridges as extensions — talk to hal0 from the rooms you already live in
Apache 2.0. See LICENSE.
The contribution model is still being decided
(PLAN.md §16). File issues for discussion; PRs aren't
being accepted from outside contributors yet. See
CONTRIBUTING.md for the test tiers and the
eventual flow.