A fully local AI voice backend for the M5Stack StackChan
desktop robot. It replaces the xiaozhi.me cloud so the robot hears, thinks, and
speaks entirely on your own hardware: no cloud account, no API keys, no audio
leaving your network.
The stock StackChan firmware (the xiaozhi-esp32 AI agent) connects to a cloud
backend for speech-to-text, the language model, and text-to-speech. warble
implements that same protocol locally, so you point the robot at your machine and
nothing else changes on the device. It handles the full voice conversation
(listen, transcribe, reply, speak) and drives the robot's animated face.
StackChan robot (CoreS3, xiaozhi firmware)
|
| Wi-Fi / WebSocket: you speak (audio up), it replies (audio + face down)
v
warble server (Go) - WebSocket :8000, OTA :8003
|
+--> whisper.cpp speech-to-text :8082
+--> Silero VAD voice detection :8005
+--> Piper text-to-speech :8001
+--> Ollama language model :11434
The server runs the whole turn locally: whisper.cpp transcribes what you say, Silero VAD detects when you stop, Ollama generates the reply, and Piper speaks it. The server also sets the robot's face from an emotion tag in each reply.
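The order of a turn described above can be sketched in Go. This is a minimal illustration, not warble's actual code: the `Stages` struct, `RunTurn`, and `FakeStages` are hypothetical names, and the sketch assumes the audio has already been cut into one utterance by the VAD step. The fake stages stand in for whisper.cpp, Ollama, and Piper so the example runs without any service.

```go
package main

import (
	"fmt"
	"strings"
)

// Stages holds one function per backend. These signatures are hypothetical;
// the real server's types are not shown in this README.
type Stages struct {
	Transcribe func(audio []byte) (string, error)  // speech-to-text (whisper.cpp)
	Generate   func(prompt string) (string, error) // language model (Ollama)
	Synthesize func(text string) ([]byte, error)   // text-to-speech (Piper)
}

// RunTurn runs one conversation turn on an utterance that voice detection
// has already closed (the speaker has stopped talking).
func RunTurn(s Stages, utterance []byte) (string, []byte, error) {
	heard, err := s.Transcribe(utterance)
	if err != nil {
		return "", nil, fmt.Errorf("transcribe: %w", err)
	}
	if strings.TrimSpace(heard) == "" {
		return "", nil, nil // nothing intelligible was heard; skip the turn
	}
	reply, err := s.Generate(heard)
	if err != nil {
		return "", nil, fmt.Errorf("generate: %w", err)
	}
	speech, err := s.Synthesize(reply)
	if err != nil {
		return "", nil, fmt.Errorf("synthesize: %w", err)
	}
	return reply, speech, nil
}

// FakeStages returns canned stages so the sketch runs offline.
func FakeStages() Stages {
	return Stages{
		Transcribe: func(audio []byte) (string, error) { return "hello", nil },
		Generate:   func(prompt string) (string, error) { return "[happy] You said: " + prompt, nil },
		Synthesize: func(text string) ([]byte, error) { return []byte(text), nil },
	}
}

func main() {
	reply, speech, err := RunTurn(FakeStages(), []byte{1, 2, 3})
	fmt.Println(reply, len(speech), err)
}
```

In this sketch the emotion tag is left in the reply; removing it before synthesis would be a separate step between `Generate` and `Synthesize`.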
A real conversation captured with ./warble transcript. What it heard, the reply, and the mood driving the robot's face all ran on a local machine. No cloud, no account.
New here? Follow the step-by-step Getting Started guide. It walks through everything below in plain language.
warble runs on Linux or macOS. (Windows is not supported: it's bash-based and talks to Unix serial ports for flashing the robot.)
- An M5Stack StackChan (CoreS3) running its xiaozhi-esp32 AI-agent firmware.
- A Linux or macOS computer on the same Wi-Fi/LAN to run warble, with Docker installed and running (Docker Engine on Linux, Docker Desktop on macOS).
- Room for the AI model. A small model like gemma3:4b runs on a modest machine; bigger models want more RAM/VRAM and ideally a GPU. On macOS, run Ollama natively for GPU acceleration (the setup script handles this).
- Disk for models: whisper from ~150 MB (base) to ~3 GB (large-v3), a Piper voice (~60 MB), and the AI model.
One command does everything: checks Docker (and offers to install it), fetches models, configures, builds, starts, and tells you when it's ready.
git clone https://github.com/rebelthor/warble
cd warble
./warble start

At the end it offers to point your robot at this computer (plug it in via USB and confirm). Then open the AI Agent app on the robot and talk. You can also do this step later, or redo it, any time:
./warble connect # point the robot at this computer
./warble disconnect # send the robot back to the M5Stack cloud

Pointing the robot reprograms it over USB and needs esptool on this computer
(pipx install esptool, or pip install esptool).
One command, a few verbs:
./warble status # plain-language health check (running? ready?)
./warble transcript # live conversation: what it heard (USER) and replied (BOT)
./warble logs server # raw server logs (or ./warble logs for everything)
./warble stop # stop everything
./warble start # start again (safe to re-run any time)
./warble restart # bounce the running services

If ./warble status says nothing is running, make sure Docker is running
(on macOS, open Docker Desktop; it does not auto-start after a reboot), then run
./warble start again.
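For a rough idea of what "running" means at the network level, the Go sketch below tries a TCP connection to each port drawn in the architecture diagram. It is not how ./warble status works internally (this README does not show that), and it assumes every port is reachable on 127.0.0.1 from the host, which depends on how the containers publish their ports. The `services` list and `Listening` are hypothetical names.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// Ports as drawn in the architecture diagram. Assumption: all of them are
// published on the local host; check docker-compose.yml for the real mapping.
var services = []struct{ Name, Addr string }{
	{"warble WebSocket", "127.0.0.1:8000"},
	{"warble OTA", "127.0.0.1:8003"},
	{"whisper.cpp", "127.0.0.1:8082"},
	{"Silero VAD", "127.0.0.1:8005"},
	{"Piper", "127.0.0.1:8001"},
	{"Ollama", "127.0.0.1:11434"},
}

// Listening reports whether something accepts TCP connections on addr.
// An open port only shows the process is up, not that its model is loaded.
func Listening(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	for _, s := range services {
		state := "down"
		if Listening(s.Addr) {
			state = "up"
		}
		fmt.Printf("%-18s %-16s %s\n", s.Name, s.Addr, state)
	}
}
```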
Two layers, split by how often they change.
Live (edit the file, takes effect within ~1 second, no restart):
| File | What |
|---|---|
| server/config/prompt.txt | The robot's personality / instructions. Keep the emotion-tag line (see below). |
| server/config/runtime.json | piper_voice, ollama_model, and sampling (temperature, top_p, num_predict). |
Startup (set in .env, then ./warble start to apply): the AI model location,
whisper model + language, the Piper voice. See .env.example.
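One way a live layer like this can work is to poll the file's modification time and re-parse it when it changes. The Go sketch below shows that approach for runtime.json. It is an illustration under assumptions: the README lists the key names but not the file layout, so the flat `Runtime` struct is a guess (the real file may nest the sampling values), and `ParseRuntime` and `Watch` are hypothetical names. The voice name in `main` is only a sample value.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// Runtime mirrors the keys the README lists for runtime.json.
// Assumption: the sampling values are top-level keys.
type Runtime struct {
	PiperVoice  string  `json:"piper_voice"`
	OllamaModel string  `json:"ollama_model"`
	Temperature float64 `json:"temperature"`
	TopP        float64 `json:"top_p"`
	NumPredict  int     `json:"num_predict"`
}

// ParseRuntime decodes the JSON settings.
func ParseRuntime(data []byte) (Runtime, error) {
	var r Runtime
	err := json.Unmarshal(data, &r)
	return r, err
}

// Watch checks path once per interval and calls apply when the file's
// modification time advances and the content parses. It returns when stop
// is closed.
func Watch(path string, every time.Duration, stop <-chan struct{}, apply func(Runtime)) {
	var last time.Time
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
		}
		info, err := os.Stat(path)
		if err != nil || !info.ModTime().After(last) {
			continue
		}
		last = info.ModTime()
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		// A half-saved or invalid file is skipped; the previous settings stay active.
		if r, err := ParseRuntime(data); err == nil {
			apply(r)
		}
	}
}

func main() {
	r, err := ParseRuntime([]byte(`{"piper_voice":"en_US-lessac-medium","ollama_model":"gemma3:4b","temperature":0.7}`))
	fmt.Println(r.OllamaModel, r.Temperature, err)
}
```

Polling once per second would match the "within ~1 second" behavior described above, at the cost of one `stat` call per tick; a file-system notification library is the usual alternative.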
warble is language-neutral; three settings pick the language:
- whisper model + WHISPER_LANG in .env (e.g. de, ro, or auto).
- Piper voice: fetch one for your language (browse piper voices) and set PIPER_VOICE.
- Prompt: write server/config/prompt.txt in your language.
Every reply begins with one tag from
[neutral|happy|laughing|angry|sad|crying|doubtful]. The server removes it before
speaking and sends a matching face to the robot. It is a firmware requirement, not
decoration: keep that instruction in any custom prompt (prompt.example.txt shows
the exact wording). If the model omits it, the server uses neutral.
- Prebuilt images. ./warble start pulls prebuilt images from ghcr.io/rebelthor/warble-* (fast) and falls back to building locally if they can't be pulled. Force a local build with ./warble start --build. Pin a version with WARBLE_VERSION in .env. Publishing images: see docs/releasing.md.
- Manual Docker control and the macOS/Linux Ollama details: see the comments in docker-compose.yml and .env.example.
- Run without Docker (install whisper.cpp / Piper / Ollama yourself and use the bin/warble supervisor): see docs/operations.md.
- Protocol details: server/PROTOCOL.md.
warble's own code is MIT. It orchestrates external programs you install; see
THIRD_PARTY_LICENSES.md. One boundary matters:
Piper (piper-tts) is GPL-3.0. warble talks to it over HTTP, so warble stays
MIT. The prebuilt warble-piper image bundles Piper, so that image is distributed
under GPL-3.0 (labeled accordingly; its corresponding source is
piper-shim/Dockerfile plus upstream piper1-gpl). The other three images
(warble-server, warble-whisper, warble-vad) are MIT.
Protocol implemented from the xiaozhi-esp32
docs (WebSocket).
Issues and PRs welcome. See CONTRIBUTING.md.
