| title | Quickstart |
|---|---|
| description | From zero to first voice turn in 15 minutes. |
Get Dotty talking in 15 minutes. This is the single opinionated happy path -- see SETUP.md for build-from-source and alternative configurations.
| Item | Notes |
|---|---|
| M5Stack CoreS3 + StackChan servo kit | The robot. See hardware.md for details. |
| Linux or macOS host with Docker | Runs all four server-side containers. Any distro works. No GPU required for the default stack — see Server hardware below. |
| 2.4 GHz WiFi | The ESP32-S3 does not support 5 GHz. |
The default stack is CPU-only — no GPU is required. The voice pipeline
ships with FunASR SenseVoiceSmall for
ASR and Piper (LocalPiper) for TTS, both
of which run comfortably on a modern multi-core x86-64 or Apple Silicon CPU.
| Scenario | Needs a GPU? | Notes |
|---|---|---|
| Default (FunASR ASR + LocalPiper TTS + a cloud LLM via OpenRouter/OpenAI-compatible key) | No | Any 64-bit Linux/macOS host with Docker and ~4 GB free RAM. This is the Quickstart happy path. |
| WhisperLocal ASR instead of FunASR | Yes | faster-whisper float16 needs CUDA. This is the only reason the Quickstart compose file carries a runtime: nvidia block. |
| Self-hosting the LLM locally (Ollama / llama-swap instead of a cloud key) | Recommended | VRAM scales with the model: roughly 5 GB for an 8B model and 18 GB for a 30B. See run-fully-local.md and llama-swap-concurrent-models.md. CPU-only inference works but is slow. |
You do not have to touch the GPU config manually. make setup (step 4)
auto-detects the NVIDIA Docker runtime: if it's present, setup selects
WhisperLocal on the GPU; if it's absent, setup selects FunASR on the CPU
and strips the runtime: nvidia / NVIDIA_* blocks out of the rendered
docker-compose.yml. The # --- BEGIN/END CUDA BLOCK --- markers in
docker-compose.yml.template exist for exactly this.
If you render the compose file by hand instead of running
make setup, delete the two marked sections (# --- BEGIN CUDA BLOCK --- … # --- END CUDA BLOCK --- and # --- BEGIN CUDA ENV --- … # --- END CUDA ENV ---) on a host without nvidia-container-toolkit. Leaving them in is what produces the could not select device driver "nvidia" error from Docker.
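If you would rather script the manual clean-up than edit by hand, the marker-based removal described above can be sketched in Python. This is a minimal illustration, not the logic make setup actually runs: the function name is hypothetical, and it assumes each marker sits on its own line exactly as quoted above.

```python
# Marker pairs quoted in the text above. Matching is done on stripped lines,
# which assumes each marker occupies a line of its own in the template.
MARKERS = {
    "# --- BEGIN CUDA BLOCK ---": "# --- END CUDA BLOCK ---",
    "# --- BEGIN CUDA ENV ---": "# --- END CUDA ENV ---",
}


def strip_cuda_sections(template_text: str) -> str:
    """Return the text with every marked CUDA section removed, markers included."""
    kept = []
    pending_end = None  # END marker being waited for, or None outside a section
    for line in template_text.splitlines(keepends=True):
        stripped = line.strip()
        if pending_end is None:
            if stripped in MARKERS:
                pending_end = MARKERS[stripped]  # enter a section, drop the marker
            else:
                kept.append(line)
        elif stripped == pending_end:
            pending_end = None  # leave the section, drop the marker
    if pending_end is not None:
        raise ValueError(f"missing marker: {pending_end}")
    return "".join(kept)
```

Run it over the contents of docker-compose.yml.template and write the result to docker-compose.yml. The ValueError guards against a template where a BEGIN marker has lost its END.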
Download the latest release from
GitHub Releases
(look for a tag starting with fw-v). Grab all six binaries:
bootloader.bin, partition-table.bin, ota_data_initial.bin,
stack-chan.bin, generated_assets.bin, and human_face_detect.espdl.
Install esptool and flash over USB-C:
pip install esptool
python -m esptool --chip esp32s3 -b 460800 \
--before default_reset --after hard_reset \
write_flash --flash_mode dio --flash_size 16MB --flash_freq 80m \
0x0 bootloader.bin \
0x8000 partition-table.bin \
0xd000 ota_data_initial.bin \
0x20000 stack-chan.bin \
0xa60000 generated_assets.bin \
0xe70000 human_face_detect.espdl

Flashing the bootloader (0x0) and partition table (0x8000) is
required — skip them and the device keeps whatever partition layout
the previous firmware left behind. That layout won't match these
images, and the robot boot-loops on No bootable app partitions.
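As a pre-flash sanity check, the offsets in the command above can be compared with the sizes of the downloaded files. The sketch below only uses the six offsets and the 16 MB flash size from that command. The function name is hypothetical, and treating "the next image's offset" as each image's upper bound is a simplifying assumption, not the real partition table.

```python
FLASH_SIZE = 16 * 1024 * 1024  # matches --flash_size 16MB

# (offset, file name) pairs copied from the esptool command above.
LAYOUT = [
    (0x0, "bootloader.bin"),
    (0x8000, "partition-table.bin"),
    (0xD000, "ota_data_initial.bin"),
    (0x20000, "stack-chan.bin"),
    (0xA60000, "generated_assets.bin"),
    (0xE70000, "human_face_detect.espdl"),
]


def find_overlaps(sizes: dict[str, int]) -> list[str]:
    """Report every image that would run past the next offset or the end of flash."""
    problems = []
    ordered = sorted(LAYOUT)
    for index, (offset, name) in enumerate(ordered):
        # Upper bound: the next image's start, or the end of flash for the last one.
        limit = ordered[index + 1][0] if index + 1 < len(ordered) else FLASH_SIZE
        end = offset + sizes[name]
        if end > limit:
            problems.append(f"{name} ends at {end:#x}, past {limit:#x}")
    return problems
```

One way to build sizes is {name: os.path.getsize(name) for _, name in LAYOUT}, run from the directory holding the release files. An empty result means no image spills into its neighbour.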
Verify checksums against SHA256SUMS.txt in the release if desired.
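The checksum step can be sketched as below. It assumes SHA256SUMS.txt uses the usual sha256sum layout (a hex digest, whitespace, then the file name, optionally prefixed with *); the release may format it differently, so check before relying on this. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large images are never fully loaded in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_release(directory: Path, sums_name: str = "SHA256SUMS.txt") -> dict[str, bool]:
    """Map each file listed in the sums file to True if its hash matches."""
    results = {}
    for line in (directory / sums_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes binary-mode entries with '*'
        target = directory / name
        # A listed file that is missing counts as a failed check.
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

Calling verify_release(Path(".")) in the download directory returns one entry per listed file. Any False value means the file is missing or does not match and should be downloaded again before flashing.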
git clone --recursive https://github.com/BrettKinny/dotty-stackchan.git
cd dotty-stackchan
cp .env.example .env

Edit .env and set OPENROUTER_API_KEY=<YOUR_API_KEY> (or any
OpenAI-compatible key). You can skip this if you're running fully local
— either via Ollama (single binary, simple) or via llama-swap (Docker,
supports multiple resident models). See
cookbook/run-fully-local.md and
cookbook/llama-swap-concurrent-models.md.
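A quick way to confirm the key was actually filled in, rather than left as the placeholder, is sketched below. The parser is deliberately simplified (KEY=VALUE lines only, no export prefixes or multi-line values) and is an assumption about how .env is laid out. The function names are hypothetical.

```python
from pathlib import Path


def read_env(path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def api_key_is_set(path=".env", name="OPENROUTER_API_KEY") -> bool:
    """True when the key exists, is non-empty, and is not the doc placeholder."""
    value = read_env(path).get(name, "")
    return bool(value) and value != "<YOUR_API_KEY>"
```

If you run fully local, a False result here is expected and harmless.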
The shipped .config.yaml selects PiVoiceLLM as the default LLM
provider, which runs the dotty-pi container (the pi coding agent)
on the same Docker host. One alternate provider — OpenAICompat
(any OpenAI-compatible cloud or local endpoint) — is available via
selected_module.LLM in data/.config.yaml.
make setup

The interactive wizard prompts for your server IP, robot name, timezone, and LLM provider. It downloads the ASR and TTS models (~100 MB), substitutes placeholders in config files, and starts the Docker containers.
Verify everything is healthy:
make doctor

All checks should pass (green). If any fail, see troubleshooting.md.
All four server-side services run as Docker containers on the same host.
docker compose up -d starts the main xiaozhi-esp32-server container, using the
docker-compose.yml that make setup renders from docker-compose.yml.template
once your placeholders are substituted. The other three containers are brought up separately:
- dotty-pi (the voice-tool brain): see dotty-pi/README.md for build and run instructions.
- dotty-behaviour (perception bus + admin dashboard): see dotty-behaviour/README.md for build and run instructions. The scripts/deploy-behaviour.sh helper deploys it.
- bridge.py (admin dashboard service, :8081): runs as a container on the same host (bridge/Dockerfile + bridge/docker-compose.yml, deployed via scripts/deploy-bridge-unraid.sh).
No separate host, no systemd bridge unit, no SSH to a second machine.
- Power on the robot (USB-C or battery).
- On the device screen, navigate to Settings > Advanced Options.
- Enter the OTA URL: http://<XIAOZHI_HOST>:8003/xiaozhi/ota/
- The robot connects via WebSocket and shows a face.
Tap the screen to enter voice mode and say "Hello Dotty!"
You should see:
| LED colour | State |
|---|---|
| Green | Listening -- you are speaking |
| Orange | Thinking -- waiting for LLM response |
| Blue | Talking -- playing the response |
The face expression changes to match the response emoji. First-turn latency is roughly 5 seconds, dominated by the LLM round-trip.
- Change the persona -- give Dotty a different personality.
- Swap the voice -- try a different TTS voice.
- Run fully local -- Ollama compose profile, zero cloud dependencies.
- Run two local models concurrently -- keep a small voice model and a big "think" model both resident via llama-swap's matrix DSL.
- Disable Kid Mode -- for unrestricted use.
- Architecture overview -- full data flow.
- Kid Mode -- on by default, what it enforces.
This repo uses placeholders in place of real IPs, usernames, and filesystem paths. Substitute these everywhere before deploying:
| Placeholder | Meaning |
|---|---|
| <XIAOZHI_HOST> | LAN IP of the server running all Docker containers. The robot reaches this on WiFi, so it must be a LAN IP, not a Tailscale/VPN IP. |
| <XIAOZHI_USER> | SSH user for the server (whatever your distro defaults to: root, ubuntu, dietpi, etc.). |
| <XIAOZHI_HOSTNAME> | Hostname or Tailscale name of the server (optional, IP works for everything). |
| <XIAOZHI_PATH> | Path on the server where you clone/install this repo (e.g. /opt/xiaozhi-server/ or /srv/xiaozhi-server/). |
| <YOUR_NAME> | Your name / org, used in the persona prompt in .config.yaml. |
| <ROBOT_NAME> | Name the robot introduces itself as, referenced in the persona prompt in .config.yaml. Any string; pick whatever you want. The default example uses the hardware name ("StackChan"). |
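Because <XIAOZHI_HOST> must be a LAN address rather than a Tailscale/VPN one, it can help to classify the value before pasting it into the robot. The sketch below checks against the RFC 1918 private ranges. The 100.64.0.0/10 range is included because Tailscale allocates its addresses from it; that detail is background knowledge, not something this repo states. The function name and return labels are hypothetical.

```python
import ipaddress

# RFC 1918 private IPv4 ranges, i.e. typical home and office LAN addresses.
LAN_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]
# Shared (CGNAT) address space; Tailscale addresses come from this range.
CGNAT_RANGE = ipaddress.ip_network("100.64.0.0/10")


def classify_host(value: str) -> str:
    """Return 'lan', 'cgnat', 'other', or 'not-an-ip' for a candidate host value."""
    try:
        address = ipaddress.ip_address(value)
    except ValueError:
        return "not-an-ip"  # a hostname or a typo, not a literal IP
    if any(address in network for network in LAN_RANGES):
        return "lan"
    if address in CGNAT_RANGE:
        return "cgnat"
    return "other"
```

Only a "lan" result is a safe bet for the OTA URL. A "cgnat" result most likely means a Tailscale address was copied by mistake.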
Port numbers (8000, 8003, 8081, 8090) are product-generic and should not be changed unless you also reconfigure the respective services.
Files you will definitely need to edit before first run:
- .config.yaml -- replace <XIAOZHI_HOST> and customise the prompt: block.
- docker-compose.yml -- set TZ to your timezone.
All four containers run on the single Docker host (<XIAOZHI_HOST>):
| Container | Purpose | Port |
|---|---|---|
| xiaozhi-esp32-server | Voice pipeline: ASR, TTS, WebSocket to StackChan | 8000 (WS), 8003 (OTA/HTTP) |
| dotty-pi | pi coding agent, the voice-tool brain | internal (via docker exec) |
| dotty-behaviour | Perception bus + ambient consumers + calendar | 8090 |
| bridge.py | Admin dashboard | 8081 |
Container volume mounts for xiaozhi-esp32-server:
| Host path | Container path | Purpose |
|---|---|---|
| data/.config.yaml | /opt/xiaozhi-esp32-server/data/.config.yaml | Config override (read-only mount) |
| models/SenseVoiceSmall/ | /opt/xiaozhi-esp32-server/models/SenseVoiceSmall/ | ASR weights |
| models/piper/ | /opt/xiaozhi-esp32-server/models/piper/ | Piper TTS voice models (.onnx + .json) |
| tmp/ | /opt/xiaozhi-esp32-server/tmp/ | Scratch |
| custom-providers/pi_voice/ | /opt/xiaozhi-esp32-server/core/providers/llm/pi_voice/ | PiVoiceLLM provider (directory mount) |
| custom-providers/openai_compat/ | /opt/xiaozhi-esp32-server/core/providers/llm/openai_compat/ | OpenAICompat alternate provider |
| custom-providers/edge_stream/edge_stream.py | /opt/xiaozhi-esp32-server/core/providers/tts/edge_stream.py | Streaming EdgeTTS provider (file mount) |
| custom-providers/piper_local/piper_local.py | /opt/xiaozhi-esp32-server/core/providers/tts/piper_local.py | Local Piper TTS provider (file mount) |
| custom-providers/asr/fun_local.py | /opt/xiaozhi-esp32-server/core/providers/asr/fun_local.py | Patched FunASR: adds a language config key so SenseVoiceSmall can be pinned to English |
The full file inventory lives in architecture.md.
| What | URL | Who calls it |
|---|---|---|
| OTA (enter into StackChan settings) | http://<XIAOZHI_HOST>:8003/xiaozhi/ota/ | The robot on boot |
| WebSocket | ws://<XIAOZHI_HOST>:8000/xiaozhi/v1/ | The robot after OTA handshake |
| Perception / ambient events | http://<XIAOZHI_HOST>:8090 | xiaozhi-server → dotty-behaviour |
| Admin dashboard | http://<XIAOZHI_HOST>:8081/ui | Humans (LAN-only HTMX UI) |
| Bridge health | http://<XIAOZHI_HOST>:8081/health | Humans, monitoring |
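Since every URL in the table differs only in the host, substituting <XIAOZHI_HOST> can be done once in code. The sketch below simply restates the table. The function name and dictionary keys are illustrative, while the ports and paths are the ones listed above.

```python
def endpoints(host: str) -> dict[str, str]:
    """Build the service URLs from the table above for one server host."""
    return {
        "ota": f"http://{host}:8003/xiaozhi/ota/",     # entered on the robot
        "websocket": f"ws://{host}:8000/xiaozhi/v1/",  # used after the OTA handshake
        "perception": f"http://{host}:8090",           # xiaozhi-server -> dotty-behaviour
        "dashboard": f"http://{host}:8081/ui",         # admin UI for humans
        "health": f"http://{host}:8081/health",        # bridge health probe
    }
```

For example, endpoints("192.168.1.50")["ota"] gives the string to type into Settings > Advanced Options. The address 192.168.1.50 is a made-up example.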
All containers use restart: unless-stopped. Ensure dockerd starts at
boot on your distro. Use docker compose restart or
docker restart <container> for transient restarts rather than docker compose down, which stops and removes the containers, so nothing is left to
auto-restart on reboot until you run docker compose up -d again.
# Tail xiaozhi-server logs (voice pipeline)
ssh <XIAOZHI_USER>@<XIAOZHI_HOST> 'docker logs -f xiaozhi-esp32-server'
# Tail dotty-behaviour logs (perception + dashboard)
ssh <XIAOZHI_USER>@<XIAOZHI_HOST> 'docker logs -f dotty-behaviour'
# Tail dotty-pi logs (brain container)
ssh <XIAOZHI_USER>@<XIAOZHI_HOST> 'docker logs -f dotty-pi'
# Restart voice pipeline after config change
ssh <XIAOZHI_USER>@<XIAOZHI_HOST> 'cd <XIAOZHI_PATH> && docker compose restart'
# Admin dashboard
open http://<XIAOZHI_HOST>:8081/ui
# Bridge health
curl http://<XIAOZHI_HOST>:8081/health

The default TTS is LocalPiper (offline, runs inside the container). To change the Piper voice, edit TTS.LocalPiper.voice and the corresponding model_path / config_path in data/.config.yaml. To switch to cloud EdgeTTS instead, set selected_module.TTS: EdgeTTS and edit TTS.EdgeTTS.voice (any Microsoft Edge Neural voice ID works, e.g. en-US-AvaNeural). Restart the container after changes.
Edit personas/dotty_voice.md (loaded by the pi agent on the PiVoiceLLM path) and restart the relevant container. The prompt: key in data/.config.yaml is also injected as a secondary system message. Full instructions: cookbook/change-persona.md.
Adjust VAD.SileroVAD.min_silence_duration_ms in data/.config.yaml. Default: 700 ms. Lower values cut the speaker off sooner; higher values wait longer for slow speakers.
For the PiVoiceLLM path (default): see dotty-pi/README.md for the model selection rules — in particular, the llama-swap matrix DSL constraint that prevents the voice-model set from being evicted. For the OpenAICompat path: edit LLM.OpenAICompat.model (or repoint url / api_key) in data/.config.yaml and docker compose restart. Note: there is no live in-flight model-swap on either path — smart-mode model-swap is v2 scope and not wired (the instant hot-swap once provided by the removed Tier1Slim provider is gone).
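The dotted names used in these tweaks (selected_module.TTS, TTS.EdgeTTS.voice, VAD.SileroVAD.min_silence_duration_ms, and so on) each describe a path through the nested mappings in data/.config.yaml. The sketch below shows how such a path resolves, assuming the file has already been loaded into plain Python dictionaries by a YAML library of your choice. The helper names and the sample values other than the 700 ms default are hypothetical.

```python
from copy import deepcopy


def get_path(config: dict, dotted: str):
    """Follow a dotted key such as 'VAD.SileroVAD.min_silence_duration_ms'."""
    node = config
    for part in dotted.split("."):
        node = node[part]  # raises KeyError if any level is missing
    return node


def with_path(config: dict, dotted: str, value) -> dict:
    """Return a copy of config with the dotted key set; the input is left untouched."""
    updated = deepcopy(config)
    *parents, leaf = dotted.split(".")
    node = updated
    for part in parents:
        node = node.setdefault(part, {})  # create intermediate levels as needed
    node[leaf] = value
    return updated


# Illustrative fragment only; the real file holds many more keys.
sample = {
    "selected_module": {"TTS": "LocalPiper"},
    "VAD": {"SileroVAD": {"min_silence_duration_ms": 700}},
}
quicker = with_path(sample, "VAD.SileroVAD.min_silence_duration_ms", 500)
```

Returning a copy instead of mutating in place makes it easy to compare the old and new settings before writing anything back to disk.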
make doctor # health checks
make logs # tail server logs
curl http://<XIAOZHI_HOST>:8081/health # test the bridge/dashboard

See troubleshooting.md for common issues.
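For scripted monitoring, the curl health probe can be approximated in Python with the standard library. This sketch treats any HTTP 200 as healthy, which is an assumption: the format of the /health response body is not documented here, so it is not inspected. The function name and the opener parameter (there only to make the function testable) are hypothetical.

```python
import urllib.error
import urllib.request


def bridge_is_healthy(host: str, timeout: float = 3.0,
                      opener=urllib.request.urlopen) -> bool:
    """Return True if the bridge health URL answers with HTTP 200."""
    url = f"http://{host}:8081/health"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

A False result only says the bridge did not answer in time. make doctor and the container logs remain the places to find out why.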