title	Protocols
description	Xiaozhi WebSocket protocol, pi RPC transport, emotion frame format, and the HTTP APIs served by dotty-behaviour and bridge.py.

Protocols — what's on the wire

TL;DR

Xiaozhi WebSocket protocol: runs between the device and xiaozhi-server. Opus audio plus JSON control frames. Supports MCP over JSON-RPC 2.0 in-band. Canonical spec: github.com/78/xiaozhi-esp32/blob/main/docs/websocket.md.
Emotion channel: 21 upstream emotion identifiers; the server picks one from the LLM's leading emoji and emits a separate llm-type frame. This stack uses a 9-emoji subset.
MCP over WS: the device acts as an MCP server; xiaozhi-server calls tools/list and tools/call against it. Tool names use dotted namespaces like self.audio_speaker.set_volume.
pi RPC: PiClient and the dotty-pi agent exchange JSONL messages over the stdin/stdout of docker exec -i dotty-pi pi --mode rpc. This is the voice transport for the default PiVoiceLLM provider.
HTTP APIs: split across two services. dotty-behaviour (:8090) serves perception, vision, audio, and calendar endpoints; bridge.py (:8081) serves the admin dashboard /ui and admin routes.

Xiaozhi WebSocket

Transport. WebSocket, with TLS optional. Our deployment uses plain ws:// on the LAN. The device receives the URL in the OTA response at boot.

Handshake headers. The device sets Authorization, Protocol-Version, Device-Id, and Client-Id on the upgrade request.

Hello (device → server)

{
  "type": "hello",
  "version": 1,
  "features": {"mcp": true, "aec": true},
  "transport": "websocket",
  "audio_params": {
    "format": "opus",
    "sample_rate": 16000,
    "channels": 1,
    "frame_duration": 60
  }
}

Device must receive a hello response within 10 s or it treats the channel as failed.

Hello response (server → device)

{
  "type": "hello",
  "transport": "websocket",
  "session_id": "xxx",
  "audio_params": {"format": "opus", "sample_rate": 24000}
}

The server picks the downlink sample rate (24 kHz above; uplink is 16 kHz from the device).
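The handshake can be sketched in a few lines of Python. This is a minimal illustration built only from the two JSON shapes shown above; the function names `build_hello` and `parse_hello_response` are hypothetical and do not come from the firmware or from xiaozhi-server.

```python
import json


def build_hello(frame_duration_ms: int = 60) -> str:
    """Serialise a device hello with the fields shown in the example above."""
    return json.dumps({
        "type": "hello",
        "version": 1,
        "features": {"mcp": True, "aec": True},
        "transport": "websocket",
        "audio_params": {
            "format": "opus",
            "sample_rate": 16000,  # uplink rate sent by the device
            "channels": 1,
            "frame_duration": frame_duration_ms,
        },
    })


def parse_hello_response(raw: str) -> tuple[str, int]:
    """Return (session_id, downlink_sample_rate) from the server hello."""
    msg = json.loads(raw)
    if msg.get("type") != "hello" or msg.get("transport") != "websocket":
        raise ValueError("not a websocket hello response")
    return msg["session_id"], msg["audio_params"]["sample_rate"]
```

A real client would also enforce the 10 s deadline around the receive call; that part depends on the WebSocket library in use and is left out here.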

Message-type catalog

| Type | Direction | Purpose |
| --- | --- | --- |
| `hello` | device↔server | Handshake (see above) |
| `listen` | device→server | Mic state: `state: "start" \| "stop" \| "detect"`, `mode: "manual" \| "vad"` |
| `stt` | server→device | ASR result: `{"type":"stt","text":"…"}` |
| `tts` | server→device | TTS control: `state: "start" \| "stop" \| "sentence_start"` with optional `text` subtitle |
| `llm` | server→device | Emotion plus leading emoji: `{"type":"llm","emotion":"happy","text":"🙂"}` (see Emotion protocol) |
| `mcp` | both | MCP JSON-RPC payload wrapped in `{"type":"mcp","payload":{…}}` |
| `system` | server→device | Device control, e.g. `{"command":"reboot"}` |
| `alert` | server→device | Notification, e.g. `{"status":"Warning","message":"Battery low","emotion":"sad"}` |
| `abort` | device→server | e.g. `{"reason":"wake_word_detected"}` to interrupt a response |
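As a rough illustration of how a client might route the server→device text frames in the catalog, the sketch below dispatches on the `type` field. The function name `handle_text_frame` and the returned strings are hypothetical, chosen only to make the routing visible; real firmware drives the display and audio pipeline instead.

```python
import json


def handle_text_frame(raw: str) -> str:
    """Route one JSON control frame by its "type" field (illustrative only)."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "stt":
        return f"heard: {msg['text']}"
    if kind == "tts":
        state = msg["state"]
        if state == "sentence_start":
            # The subtitle text is optional on this frame.
            return f"subtitle: {msg.get('text', '')}"
        return f"tts {state}"
    if kind == "llm":
        return f"face: {msg['emotion']}"
    if kind == "mcp":
        # Requests carry a method; responses carry only result or error.
        return f"mcp: {msg['payload'].get('method', 'response')}"
    return f"unhandled: {kind}"
```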

Binary audio framing

Audio travels on the same WebSocket as binary frames. Three framings are defined; the device and server negotiate which one to use during the hello exchange.

Version 1 — raw Opus payload, no metadata.

Version 2 (BinaryProtocol2):

struct BinaryProtocol2 {
    uint16_t version;
    uint16_t type;           // 0 = Opus, 1 = JSON
    uint32_t reserved;
    uint32_t timestamp;      // milliseconds (used for AEC alignment)
    uint32_t payload_size;
    uint8_t  payload[];
} __attribute__((packed));

Version 3 (BinaryProtocol3):

struct BinaryProtocol3 {
    uint8_t  type;
    uint8_t  reserved;
    uint16_t payload_size;
    uint8_t  payload[];
} __attribute__((packed));

Default audio params. Opus, mono, 16 kHz uplink / 24 kHz downlink, 60 ms frame duration.
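The two packed headers above map directly onto Python's `struct` module. The sketch below assumes network (big-endian) byte order and assumes the `version` field of BinaryProtocol2 carries the value 2; neither is stated in this document, so check both against the canonical spec before relying on them. The helper names are hypothetical.

```python
import struct

# Assumption: multi-byte fields are big-endian (network order).
BP2_HEADER = struct.Struct(">HHIII")  # version, type, reserved, timestamp, payload_size
BP3_HEADER = struct.Struct(">BBH")    # type, reserved, payload_size


def pack_v2(payload: bytes, timestamp_ms: int, frame_type: int = 0) -> bytes:
    """Prefix a payload with a BinaryProtocol2 header (type 0 = Opus, 1 = JSON)."""
    # Assumption: the version field is set to 2 for this framing.
    return BP2_HEADER.pack(2, frame_type, 0, timestamp_ms, len(payload)) + payload


def unpack_v2(frame: bytes) -> tuple[int, int, bytes]:
    """Return (type, timestamp_ms, payload) from a BinaryProtocol2 frame."""
    version, frame_type, _reserved, timestamp_ms, size = BP2_HEADER.unpack_from(frame)
    if version != 2:
        raise ValueError(f"unexpected version {version}")
    payload = frame[BP2_HEADER.size:BP2_HEADER.size + size]
    if len(payload) != size:
        raise ValueError("truncated frame")
    return frame_type, timestamp_ms, payload


def pack_v3(payload: bytes, frame_type: int = 0) -> bytes:
    """Prefix a payload with the 4-byte BinaryProtocol3 header."""
    return BP3_HEADER.pack(frame_type, 0, len(payload)) + payload


def unpack_v3(frame: bytes) -> tuple[int, bytes]:
    """Return (type, payload) from a BinaryProtocol3 frame."""
    frame_type, _reserved, size = BP3_HEADER.unpack_from(frame)
    payload = frame[BP3_HEADER.size:BP3_HEADER.size + size]
    if len(payload) != size:
        raise ValueError("truncated frame")
    return frame_type, payload
```

With these layouts the version 2 header is 16 bytes and the version 3 header is 4 bytes, which is the main practical difference between the two framings.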

Keepalive and closure

The spec does not mandate a keepalive. Closure is driven by device CloseAudioChannel() or server disconnect; the firmware returns to idle.

Emotion protocol

From xiaozhi.dev/en/docs/development/emotion/.

Full upstream emotion catalog (21 identifiers)

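| Emoji | Identifier |
| --- | --- |
| 😶 | `neutral` |
| 🙂 | `happy` |
| 😆 | `laughing` |
| 😂 | `funny` |
| 😔 | `sad` |
| 😠 | `angry` |
| 😭 | `crying` |
| 😍 | `loving` |
| 😳 | `embarrassed` |
| 😲 | `surprised` |
| 😱 | `shocked` |
| 🤔 | `thinking` |
| 😉 | `winking` |
| 😎 | `cool` |
| 😌 | `relaxed` |
| 🤤 | `delicious` |
| 😘 | `kissy` |
| 😏 | `confident` |
| 😴 | `sleepy` |
| 😜 | `silly` |
| 🙄 | `confused` |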

Wire format

Server emits a dedicated llm-type frame:

{"session_id":"xxx","type":"llm","emotion":"happy","text":"🙂"}

text contains the emoji character; emotion contains the identifier. The TTS frame that follows has the emoji stripped from its text so the speaker doesn't try to read it aloud.

Default emoji allowlist

The persona prompt and xiaozhi-server's top-level prompt: block enforce the following 9-emoji subset:

😊 😆 😢 😮 🤔 😠 😐 😍 😴

A smaller set gives more predictable face animations and fewer corner cases in the xiaozhi emoji stripper.

Two-layer enforcement

Persona prompt (personas/dotty_voice.md) — asks for a leading emoji.
xiaozhi-server top-level prompt: — also asks for a leading emoji.

(A third bridge-side _ensure_emoji_prefix fallback existed in the retired ZeroClaw voice path; it is not present in the current PiVoiceLLM path.)

MCP tools over WS

From github.com/78/xiaozhi-esp32/blob/main/docs/mcp-protocol.md.

Advertisement

The device signals MCP support by setting features.mcp to true in its hello message. The server then queries the device for its tool list.

`tools/list` request (server → device)

{
  "session_id": "…",
  "type": "mcp",
  "payload": {
    "jsonrpc": "2.0",
    "method": "tools/list",
    "params": {"cursor": "", "withUserTools": false},
    "id": 2
  }
}

`tools/list` response (device → server)

{
  "session_id": "…",
  "type": "mcp",
  "payload": {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {
      "tools": [
        {"name": "self.get_device_status", "description": "…", "inputSchema": {…}}
      ],
      "nextCursor": "…"
    }
  }
}
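Because the response carries a `nextCursor`, a caller may need several round trips to collect the full tool list. The sketch below shows one way to loop, under the assumption that an empty or missing `nextCursor` marks the last page. `list_all_tools`, the `send` callback, and `fake_device` are hypothetical names; `fake_device` is a stand-in so the example runs without hardware.

```python
from typing import Callable


def list_all_tools(
    send: Callable[[dict], dict], session_id: str, with_user_tools: bool = False
) -> list[dict]:
    """Follow nextCursor until the device reports no further pages."""
    tools: list[dict] = []
    cursor = ""
    request_id = 1
    while True:
        reply = send({
            "session_id": session_id,
            "type": "mcp",
            "payload": {
                "jsonrpc": "2.0",
                "method": "tools/list",
                "params": {"cursor": cursor, "withUserTools": with_user_tools},
                "id": request_id,
            },
        })
        result = reply["payload"]["result"]
        tools.extend(result["tools"])
        # Assumption: an empty or absent nextCursor means this was the last page.
        cursor = result.get("nextCursor", "")
        if not cursor:
            return tools
        request_id += 1


# Stand-in device with two pages of tools, for demonstration only.
_PAGES = {
    "": (["self.get_device_status"], "page2"),
    "page2": (["self.audio_speaker.set_volume"], ""),
}


def fake_device(request: dict) -> dict:
    names, next_cursor = _PAGES[request["payload"]["params"]["cursor"]]
    return {
        "session_id": request["session_id"],
        "type": "mcp",
        "payload": {
            "jsonrpc": "2.0",
            "id": request["payload"]["id"],
            "result": {"tools": [{"name": n} for n in names], "nextCursor": next_cursor},
        },
    }
```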

`tools/call` request

{
  "session_id": "…",
  "type": "mcp",
  "payload": {
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
      "name": "self.audio_speaker.set_volume",
      "arguments": {"volume": 50}
    },
    "id": 3
  }
}

Success / error response

{"jsonrpc":"2.0","id":3,"result":{"content":[{"type":"text","text":"true"}],"isError":false}}
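A caller has to wrap the request in the `mcp` envelope and then tell three outcomes apart: a successful result, a tool-level failure (`isError` set to true), and a protocol-level JSON-RPC 2.0 `error` object. The sketch below is one hypothetical way to do that; `build_tool_call` and `unwrap_tool_result` are illustrative names, not part of xiaozhi-server.

```python
def build_tool_call(session_id: str, name: str, arguments: dict, request_id: int) -> dict:
    """Wrap a tools/call request in the mcp envelope shown above."""
    return {
        "session_id": session_id,
        "type": "mcp",
        "payload": {
            "jsonrpc": "2.0",
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments},
            "id": request_id,
        },
    }


def unwrap_tool_result(payload: dict) -> str:
    """Return the text content of a tools/call response, or raise on failure."""
    if "error" in payload:
        # JSON-RPC 2.0 error object: {"code": ..., "message": ...}
        err = payload["error"]
        raise RuntimeError(f"MCP error {err.get('code')}: {err.get('message')}")
    result = payload["result"]
    text = "".join(
        part["text"] for part in result["content"] if part.get("type") == "text"
    )
    if result.get("isError"):
        raise RuntimeError(text)
    return text
```

For the success response above, `unwrap_tool_result` returns the string `"true"`.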

Tool visibility — public vs user-only

McpServer::AddTool — regular tool, exposed to tools/list by default. Available to the AI.
McpServer::AddUserOnlyTool — hidden from the default tools/list. Requires withUserTools: true. For privileged actions the LLM shouldn't trigger (e.g. reboot).

See hardware.md for the default 11-tool MCP surface.

pi RPC — PiVoiceLLM transport

The PiVoiceLLM provider communicates with the dotty-pi agent via pi RPC mode: JSONL messages exchanged over the stdin/stdout of a docker exec invocation.

xiaozhi-server
  └─ PiClient
       └─ docker exec -i dotty-pi pi --mode rpc …
                             │           ▲
                    JSONL request        │
                    (stdin)              │ JSONL response
                                        │ (stdout, streamed)

Each turn is a single JSONL object written to stdin; the agent streams JSONL response chunks back on stdout. Only TTS-bound text chunks are forwarded to xiaozhi-server — tool call details stay internal to the agent loop. The agent exits cleanly after each turn; PiClient re-invokes docker exec for the next turn.

The dotty-pi agent loads the dotty-pi-ext extension at startup, which registers the five voice tools (memory_lookup, remember, think_hard, take_photo, play_song). Tool results never appear in the TTS stream.
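The per-turn exchange can be sketched with `subprocess`. This is not PiClient's code: the JSONL field names (`type`, `text`, `name`) are hypothetical because the message schema is not given here, and a small Python script stands in for the `docker exec -i dotty-pi pi --mode rpc` command so the example runs anywhere.

```python
import json
import subprocess
import sys


def run_turn(argv: list[str], request: dict) -> list[dict]:
    """Write one JSONL request to stdin and collect JSONL chunks until stdout closes."""
    with subprocess.Popen(
        argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True, encoding="utf-8"
    ) as proc:
        proc.stdin.write(json.dumps(request) + "\n")
        proc.stdin.flush()
        # A real client would forward each chunk as it arrives;
        # this sketch simply collects them until the agent exits.
        return [json.loads(line) for line in proc.stdout if line.strip()]


def tts_text(chunks: list[dict]) -> str:
    """Keep only text chunks; anything else stays out of the TTS stream."""
    # Assumption: text chunks are marked with type == "text".
    return "".join(c["text"] for c in chunks if c.get("type") == "text")


# Stand-in for the agent: reads one request, emits a tool chunk and a text chunk.
FAKE_AGENT_SCRIPT = (
    "import sys, json\n"
    "req = json.loads(sys.stdin.readline())\n"
    "print(json.dumps({'type': 'tool', 'name': 'remember'}), flush=True)\n"
    "print(json.dumps({'type': 'text', 'text': 'You said: ' + req['text']}), flush=True)\n"
)
FAKE_AGENT = [sys.executable, "-c", FAKE_AGENT_SCRIPT]
```

Because the agent exits after each turn, reading stdout to end-of-file is enough to detect the end of the response; a long-lived agent would need an explicit end-of-turn marker instead.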

HTTP APIs

Server-side HTTP is split across two services. All payloads are JSON unless noted.

dotty-behaviour — perception, vision, audio, calendar (:8090)

dotty-behaviour is a FastAPI service (port 8090, same Docker host) that owns the ambient behaviour layer.

| Endpoint | Purpose |
| --- | --- |
| `POST /api/perception/event` | xiaozhi → dotty-behaviour perception relay (face, sound, state events) |
| `POST /api/vision/explain` | VLM describe-image call |
| `POST /api/audio/explain` | Audio event explanation |
| `POST /api/voice/take_photo` | Voice-triggered camera snapshot + VLM describe |
| `GET /api/calendar/*` | Calendar context queries |

POST /api/perception/event is the primary inbound path for firmware event frames forwarded by EventTextMessageHandler in custom-providers/xiaozhi-patches/textMessageHandlerRegistry.py:

{
  "name": "<face_detected|face_lost|sound_event|state_changed|dance_started|dance_ended|chat_status|…>",
  "data": {"…": "…"},
  "device_id": "<xiaozhi device-id>",
  "session_id": "<xiaozhi session id>",
  "ts": 1715000000.0
}

Response: {"ok": true}. dotty-behaviour broadcasts the event to all perception listeners and updates per-device state (dotty-behaviour/perception/state.py). See architecture.md for the 11 consumer classes (the running set is config-gated).

bridge.py — dashboard and admin (:8081)

bridge.py is a FastAPI service (port 8081, same Docker host) that serves the admin dashboard. Its voice and perception relay roles were retired in issue #36 (2026-05-19); it survives as the dashboard service.

| Endpoint | Purpose |
| --- | --- |
| `GET /ui` | Admin dashboard web UI |
| `POST /admin/*` | Admin mutations (toggle, kid-mode, smart-mode, play-asset, etc.) |
| `GET /health` | Liveness probe; returns `{"ok": true}` |

POST /api/voice/escalate is also defined on bridge.py but is non-functional in the current stack — the ZeroClaw voice dispatch layer it depended on was retired in #36, and the only consumer (the Tier1Slim provider) was removed in the 2026-05-29 alignment pass. See docs/cutover-behaviour.md for the historical runbook.
