warble

A fully local AI voice backend for the M5Stack StackChan desktop robot. It replaces the xiaozhi.me cloud so the robot hears, thinks, and speaks entirely on your own hardware: no cloud account, no API keys, no audio leaving your network.

The stock StackChan firmware (the xiaozhi-esp32 AI agent) connects to a cloud backend for speech-to-text, the language model, and text-to-speech. warble implements that same protocol locally, so you point the robot at your machine and nothing else changes on the device. It handles the full voice conversation (listen, transcribe, reply, speak) and drives the robot's animated face.

StackChan robot (CoreS3, xiaozhi firmware)
      |
      |  Wi-Fi / WebSocket: you speak (audio up), it replies (audio + face down)
      v
warble server (Go)  -  WebSocket :8000, OTA :8003
      |
      +--> whisper.cpp   speech-to-text   :8082
      +--> Silero VAD    voice detection  :8005
      +--> Piper         text-to-speech   :8001
      +--> Ollama        language model   :11434

The server runs the whole turn locally: whisper.cpp transcribes what you say, Silero VAD detects when you stop, Ollama generates the reply, and Piper speaks it. The server also sets the robot's face from an emotion tag in each reply.
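As a quick sanity check of the layout above, the sketch below probes each backend port from the diagram. It assumes the services listen on localhost (the diagram gives ports, not hosts) and that nc (netcat) is installed. The helper names `services` and `check_services` are made up for this example; ./warble status remains the supported health check.

```shell
# Backend names and ports, copied from the diagram above.
services() {
  cat <<'EOF'
whisper.cpp 8082
silero-vad 8005
piper 8001
ollama 11434
EOF
}

# Report whether each port accepts a TCP connection.
# HOST defaults to localhost, which is an assumption of this sketch.
check_services() {
  host=${1:-localhost}
  services | while read -r name port; do
    # </dev/null keeps nc from consuming the loop's input
    if nc -z -w 2 "$host" "$port" </dev/null >/dev/null 2>&1; then
      printf '%-12s :%-5s reachable\n' "$name" "$port"
    else
      printf '%-12s :%-5s not reachable\n' "$name" "$port"
    fi
  done
}

# Usage: check_services            (probe localhost)
#        check_services HOST       (probe another machine on the LAN)
```

A port that answers only proves something is listening there, not that the model behind it has finished loading.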

A real conversation captured with ./warble transcript: the transcription, the reply, and the mood driving the robot's face were all produced on a local machine. No cloud, no account.

New here? Follow the step-by-step Getting Started guide. It walks through everything below in plain language.

What you need

warble runs on Linux or macOS. Windows is not supported: warble is bash-based and talks to Unix serial ports when flashing the robot.

An M5Stack StackChan (CoreS3) running its xiaozhi-esp32 AI-agent firmware.
A Linux or macOS computer on the same Wi-Fi/LAN to run warble, with Docker installed and running (Docker Engine on Linux, Docker Desktop on macOS).
Room for the AI model. A small model like gemma3:4b runs on a modest machine; bigger models want more RAM/VRAM and ideally a GPU. On macOS, run Ollama natively for GPU acceleration (the setup script handles this).
Disk for models: whisper from ~150 MB (base) to ~3 GB (large-v3), a Piper voice (~60 MB), and the AI model.

Quick start

One command does everything: checks Docker (and offers to install it), fetches models, configures, builds, starts, and tells you when it's ready.

git clone https://github.com/rebelthor/warble
cd warble
./warble start

At the end it offers to point your robot at this computer (plug it in via USB and confirm). Then open the AI Agent app on the robot and talk. You can also do this step later, or redo it, any time:

./warble connect       # point the robot at this computer
./warble disconnect    # send the robot back to the M5Stack cloud

Pointing the robot reprograms it over USB and needs esptool on this computer (pipx install esptool, or pip install esptool).

Everyday use

One command, a few verbs:

./warble status        # plain-language health check (running? ready?)
./warble transcript    # live conversation: what it heard (USER) and replied (BOT)
./warble logs server   # raw server logs (or ./warble logs for everything)
./warble stop          # stop everything
./warble start         # start again (safe to re-run any time)
./warble restart       # bounce the running services

If ./warble status says nothing is running, make sure Docker is running (on macOS, open Docker Desktop; it does not auto-start after a reboot), then run ./warble start again.

Configuration

Two layers, split by how often they change.

Live (edit the file, takes effect within ~1 second, no restart):

| File | What it controls |
| --- | --- |
| `server/config/prompt.txt` | The robot's personality / instructions. Keep the emotion-tag line (see below). |
| `server/config/runtime.json` | `piper_voice`, `ollama_model`, and sampling (`temperature`, `top_p`, `num_predict`). |
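Because the live files are re-read within about a second, a half-saved file could in principle be picked up mid-edit. One way to avoid that, sketched below, is to write the new contents to a temporary file in the same directory and rename it into place. The flat JSON layout, the sampling values, the voice name, and the `write_runtime` helper are all assumptions for illustration. Only the key names come from the table above, so check the shipped runtime.json for the real structure.

```shell
# write_runtime FILE MODEL VOICE
# Writes a runtime.json-style file atomically: temp file first, then rename.
# Assumption: a flat JSON object; the real file may arrange these keys differently.
# MODEL and VOICE are inserted verbatim, so they must not contain double quotes.
write_runtime() {
  file=$1; model=$2; voice=$3
  tmp="$file.tmp.$$"
  cat > "$tmp" <<EOF
{
  "ollama_model": "$model",
  "piper_voice": "$voice",
  "temperature": 0.7,
  "top_p": 0.9,
  "num_predict": 128
}
EOF
  # a rename within one directory swaps the file in a single step
  mv "$tmp" "$file"
}

# Usage (example values):
#   write_runtime server/config/runtime.json gemma3:4b en_US-lessac-medium
```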

Startup (set in .env, then ./warble start to apply): the AI model location, whisper model + language, the Piper voice. See .env.example.

Use a different language

warble is language-neutral; three settings pick the language:

whisper model and WHISPER_LANG: set both in .env (WHISPER_LANG takes a language code such as de or ro, or auto).
Piper voice: fetch one for your language (browse the Piper voices) and set PIPER_VOICE.
Prompt: write server/config/prompt.txt in your language.
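The first two settings live in .env, so switching language amounts to changing a couple of KEY=value lines. The helper below is a sketch of doing that from the shell. `set_env_var` is a made-up name, and the German voice de_DE-thorsten-medium is only an example value, so substitute a voice you have actually fetched. The whisper model variable is left out because its name is not given here; see .env.example.

```shell
# set_env_var KEY VALUE FILE
# Replaces an existing KEY=... line in FILE, or appends one if it is missing.
# The updated line always ends up at the bottom of the file.
set_env_var() {
  key=$1; value=$2; file=$3
  tmp="$file.tmp.$$"
  if [ -f "$file" ]; then
    # keep every line except a previous assignment of KEY
    grep -v "^$key=" "$file" > "$tmp" || true
  else
    : > "$tmp"
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
}

# Usage (example values):
#   set_env_var WHISPER_LANG de .env
#   set_env_var PIPER_VOICE de_DE-thorsten-medium .env
#   ./warble start    # startup settings apply on the next start
```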

The emotion tag

Every reply begins with one tag from [neutral|happy|laughing|angry|sad|crying|doubtful]. The server removes the tag before speaking and sends a matching face to the robot. The tag is a firmware requirement, not decoration: keep that instruction in any custom prompt (prompt.example.txt shows the exact wording). If the model omits the tag, the server uses neutral.
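The rule above (strip a leading tag, fall back to neutral) can be illustrated in a few lines of shell. This is only a sketch of the described behavior, not the server's code, which is written in Go. `split_reply` is a made-up helper, and two details are assumptions: whitespace after the tag is treated as optional, and an unrecognized tag is left in the text.

```shell
# The seven emotions listed above.
EMOTIONS="neutral happy laughing angry sad crying doubtful"

# split_reply REPLY
# Prints two lines: the emotion, then the text to speak.
split_reply() {
  reply=$1
  for tag in $EMOTIONS; do
    case $reply in
      "[$tag]"*)
        # drop the leading tag, then any spaces that follow it
        text=${reply#"[$tag]"}
        text=${text#"${text%%[! ]*}"}
        printf '%s\n%s\n' "$tag" "$text"
        return 0
        ;;
    esac
  done
  # no known tag at the start: fall back to neutral, keep the text as is
  printf '%s\n%s\n' neutral "$reply"
}

# Usage:
#   split_reply "[happy] Nice to see you!"
```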

Advanced

Prebuilt images. ./warble start pulls prebuilt images from ghcr.io/rebelthor/warble-* (fast) and falls back to building locally if they can't be pulled. Force a local build with ./warble start --build. Pin a version with WARBLE_VERSION in .env. Publishing images: see docs/releasing.md.
Manual Docker control and the macOS/Linux Ollama details: see the comments in docker-compose.yml and .env.example.
Run without Docker (install whisper.cpp / Piper / Ollama yourself and use the bin/warble supervisor): see docs/operations.md.
Protocol details: server/PROTOCOL.md.
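The pull-then-build behavior in the first item is a common fallback pattern. The sketch below shows its shape with the two commands passed in as arguments. `pull_or_build` is a made-up helper, and using docker compose pull and docker compose build as the two commands is an assumption about what ./warble start runs internally.

```shell
# pull_or_build PULL_CMD BUILD_CMD
# Tries PULL_CMD first; if it fails, falls back to BUILD_CMD.
pull_or_build() {
  pull_cmd=$1; build_cmd=$2
  # the unquoted expansions split each command string into words on purpose
  if $pull_cmd; then
    echo "pulled"
  elif $build_cmd; then
    echo "built"
  else
    echo "failed"
    return 1
  fi
}

# Usage (assumed commands):
#   pull_or_build "docker compose pull" "docker compose build"
```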

Components and licenses

warble's own code is MIT. It orchestrates external programs you install; see THIRD_PARTY_LICENSES.md. One boundary matters: Piper (piper-tts) is GPL-3.0. warble talks to it over HTTP, so warble stays MIT. The prebuilt warble-piper image bundles Piper, so that image is distributed under GPL-3.0 (labeled accordingly; its corresponding source is piper-shim/Dockerfile plus upstream piper1-gpl). The other three images (warble-server, warble-whisper, warble-vad) are MIT.

The WebSocket protocol is implemented from the xiaozhi-esp32 docs.

Contributing

Issues and PRs welcome. See CONTRIBUTING.md.
