diff --git a/CHANGELOG.md b/CHANGELOG.md index a187812..3b2658f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,19 @@ status quo, `2` once it is not backwards compatible. Entries are grouped under the spec's current class. Every pull request that alters normative content adds an entry here. +## OVOS-AUDIO-1 — Audio Output Service + +### 2 + +- The audio output service: the rendering pipeline (dialog-transformer + chain, TTS synthesis, TTS-transformer chain, playback queue), the + sequential playback queue shared by speech (`ovos.utterance.speak`) and + sound effects (`ovos.audio.queue` / `ovos.audio.play_sound`), the + remote-client rendering mode (`ovos.utterance.speak.b64` → + `ovos.audio.speech`), output lifecycle signals + (`ovos.audio.output.started` / `.ended`), the speaking-status query + (`ovos.audio.is_speaking`), stop integration (`ovos.audio.stop`, + `ovos.stop`), and the `listen`-triggered `ovos.mic.listen` follow-up. ## OVOS-PERSONA-1 — Persona Pipeline Plugin ### 2 diff --git a/appendix/divergences.md b/appendix/divergences.md index 328aecf..33c7477 100644 --- a/appendix/divergences.md +++ b/appendix/divergences.md @@ -195,6 +195,21 @@ defined by any spec** and should be removed or replaced: - **`ovos.utterance.speak`** (PIPELINE-1 §9.6). The NL output exit point; symmetric to `ovos.utterance.handle`. No current equivalent — TTS trigger is currently implicit. +- **`ovos.utterance.speak.b64`** (AUDIO-1 §3.4). Variant of + `ovos.utterance.speak` for remote-client delivery: the audio + output service runs the same TTS pipeline but emits synthesised + audio as base64 via `ovos.audio.speech` instead of queuing for + local playback. Used by bridges serving satellites without TTS + (BRIDGE-1 §4.2.4). +- **`ovos.audio.speech`** (AUDIO-1 §4.3). Base64-encoded + synthesised audio broadcast; emitted in response to + `ovos.utterance.speak.b64`. Carries a `listen` flag. Remote + clients (e.g. 
satellites relayed by a bridge) decode and play + the audio themselves. +- **`ovos.audio.queue`** / **`ovos.audio.play_sound`** (AUDIO-1 + §4.1, §4.2). Sound-effect playback topics. Payloads accept + either a `uri` or inline base64 `audio` field, enabling + cross-host audio delivery without shared filesystem access. - **`ovos.intent.list` / `ovos.intent.describe`** (INTENT-4 §10). Introspection topics served from the orchestrator's passive registration index. diff --git a/appendix/rationale.md b/appendix/rationale.md index 72bb31f..7d0aaa5 100644 --- a/appendix/rationale.md +++ b/appendix/rationale.md @@ -680,6 +680,25 @@ and selects; the skill stops. Stop is one of the few cases in the spec set where the pipeline / skill split is not substitutable. + +### 4.9 Audio output service (AUDIO-1) + +**Sentence segmentation as a latency-reduction technique (AUDIO-1 §3.2).** +When a TTS engine synthesises a long utterance as a single unit, the +user must wait for the entire synthesis to complete before hearing +anything. An implementation can reduce perceived latency by splitting +the utterance at sentence boundaries, synthesising each sentence +independently, and enqueuing each segment as soon as it is ready — +so the first sentence begins playing while later sentences are still +being synthesised. + +This is an internal implementation strategy: no other bus participant +observes whether the TTS engine segments or not. The visible contract +is unchanged — `ovos.audio.output.started` fires when the first +audio begins, `ovos.audio.output.ended` fires when the last audio +completes. The `listen` flag is honoured after all audio for the +originating utterance has played, regardless of how many internal +segments were used. 
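The segmentation strategy described in §4.9 above can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the patch and not OVOS code: `split_sentences`, `synthesise_segments`, and the `synth` callable are hypothetical names, and the regular expression is a naive boundary rule that a real implementation would likely replace with a language-aware splitter.

```python
import re
from typing import Callable, Iterator, List

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# Abbreviations such as "e.g." would be split incorrectly by this rule.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(utterance: str) -> List[str]:
    """Split an utterance into sentence-sized segments, dropping empties."""
    parts = _SENTENCE_END.split(utterance.strip())
    return [part for part in parts if part]


def synthesise_segments(
    utterance: str, synth: Callable[[str], bytes]
) -> Iterator[bytes]:
    """Yield audio for each sentence as soon as it has been synthesised.

    `synth` stands in for a TTS engine call (hypothetical). Because this is
    a generator, the caller can enqueue the first segment for playback
    before later segments have been synthesised.
    """
    for sentence in split_sentences(utterance):
        yield synth(sentence)
```

A caller would iterate over `synthesise_segments(...)` and enqueue each yielded segment immediately. The externally visible signals stay the same as for unsegmented synthesis.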
 ### 4.10 Common query pipeline plugin (COMMON-QUERY-1)
 
 Common query answers factual questions by holding a timed contest
diff --git a/audio-out.md b/audio-out.md
new file mode 100644
index 0000000..ccbfc30
--- /dev/null
+++ b/audio-out.md
@@ -0,0 +1,418 @@
+# Audio Output Service Specification
+
+**Spec ID:** OVOS-AUDIO-1 · **Version:** 2 · **Status:** Draft
+
+This specification defines the **audio output service** — the
+pipeline's output-side counterpart that consumes natural-language
+responses and renders them as audio. It covers two rendering modes
+(`ovos.utterance.speak` for local playback and
+`ovos.utterance.speak.b64` for remote-client delivery), a sequential
+playback queue for speech and sound effects, fire-and-forget instant
+sounds, and the output lifecycle signals that bookend audio playback.
+
+It builds on three companion specifications:
+
+- the *Utterance Lifecycle and Pipeline Specification*
+  (OVOS-PIPELINE-1) — the pipeline iteration, the `Match` and
+  dispatch contract, the handler-lifecycle trio, and the
+  `ovos.utterance.speak` output exit point;
+- the *Bus Message Specification* (OVOS-MSG-1) — the envelope,
+  routing keys, session carrier, and derivations every Message
+  defined here travels in;
+- the *Transformer Injection Point Specification*
+  (OVOS-TRANSFORM-1) — the dialog-transformer and TTS-transformer
+  chains that run before and after TTS synthesis.
+
+The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**,
+and **MAY** are used as in RFC 2119.
+
+---
+
+## 1. Scope
+
+This specification defines:
+
+- **the audio output service role** (§2) — the component that
+  receives natural-language responses and renders them as audio;
+- **the rendering pipeline** (§3) — two rendering modes sharing the
+  same TTS pipeline: `ovos.utterance.speak` enqueues for local
+  playback; `ovos.utterance.speak.b64` emits synthesised audio as
+  base64 for remote clients instead;
+- **the playback model** (§4) — the scheduled queue for TTS
+  speech and queued sounds, and fire-and-forget instant sounds
+  for immediate playback;
+- **output lifecycle signals** (§5) — the start/end markers that
+  bookend audio playback;
+- **stop integration** (§6) — how the audio service responds to
+  stop signals;
+- **bus surface** (§7);
+- **conformance** (§8).
+
+It does **not** define:
+
+- **the internal machinery of TTS synthesis** — how a TTS plugin
+  converts text to audio, including model inference, voice
+  selection, and audio formatting, is entirely the plugin's
+  business. The spec fixes only the observable bus contract;
+- **the transformer plugin internals** — dialog and TTS
+  transformer chains are defined by OVOS-TRANSFORM-1; this spec
+  only fixes when they run in the output pipeline;
+- **the audio-input pipeline** — microphone capture, wake-word
+  detection, and speech-to-text are separate services covered by
+  other specifications;
+- **hardware access** — how the service accesses audio output
+  hardware is a deployment concern;
+- **volume control, audio routing, or hardware abstraction** —
+  these are deployment-level concerns;
+- **music and media playback** — long-form audio is managed by a
+  separate media-playback service. This spec covers TTS speech and
+  sound effects only.
+
+---
+
+## 2. The audio output service role
+
+The **audio output service** is the component that receives
+natural-language response text from the pipeline and renders it as
+audible output. It:
+
+- subscribes to `ovos.utterance.speak` (OVOS-PIPELINE-1 §9.6) and
+  `ovos.utterance.speak.b64` (§3.4) and processes each through the
+  same TTS rendering pipeline (§3), differing only in output stage;
+- maintains a **scheduled playback queue** (§4.1) for TTS speech
+  and queued sounds, ensuring that audio is played back in order
+  without overlapping;
+- plays **instant sounds** (§4.2) immediately on receipt,
+  independently of the scheduled queue and without stopping it;
+- emits **output lifecycle signals** (§5) around each playback
+  session;
+- responds to **stop signals** (§6) by clearing the queue and
+  terminating in-progress playback.
+
+A deployment **MAY** have no audio output service. The pipeline
+and handler lifecycle are unaffected by its absence.
+
+The handler does not block on audio output; playback may occur after
+`ovos.utterance.handled` has fired (PIPELINE-1 §6.1).
+
+---
+
+## 3. Rendering pipeline
+
+Both `ovos.utterance.speak` and `ovos.utterance.speak.b64` pass
+through the same TTS pipeline. They differ only in the output stage:
+
+```
+ovos.utterance.speak          ovos.utterance.speak.b64
+        │                              │
+        ▼                              ▼
+ [dialog transformers]         [dialog transformers]     ← TRANSFORM-1 §3.5
+        │                              │
+        ▼                              ▼
+   TTS synthesis                 TTS synthesis
+        │                              │
+        ▼                              ▼
+  [tts transformers]            [tts transformers]       ← TRANSFORM-1 §3.6
+        │                              │
+        ▼                              ▼
+  scheduled queue               ovos.audio.speech (§4.3)
+  → local playback              (b64 for remote client)
+```
+
+All rendering stages execute in the audio output service, which MAY
+run in the same process as the utterance orchestrator or separately.
+
+### 3.1 Dialog transformer stage
+
+Before TTS synthesis, the utterance text is passed through the
+**dialog-transformer chain** (OVOS-TRANSFORM-1 §3.5) hosted by the
+audio output service. Each transformer plugin in the chain receives
+the text and the Message context and MAY mutate either.
+
+The transformed text replaces the original `utterance` field for
+all downstream stages.
+
+### 3.2 TTS synthesis
+
+The audio output service synthesises the utterance text into audio.
+Language is taken from `data.lang` in the received Message
+(PIPELINE-1 §9.6); when absent, the service resolves it from the
+session (OVOS-SESSION-1 §3.2).
+
+When synthesis fails, the service **SHOULD** attempt a fallback.
+Selection and fallback logic are deployment concerns.
+
+For `ovos.utterance.speak`, the synthesised audio is enqueued for
+local playback (§4). For `ovos.utterance.speak.b64`, the synthesised
+audio is emitted as `ovos.audio.speech` (§3.4) instead — it is not
+enqueued and does not play locally.
+
+> **Note (non-normative):** See appendix §4.9 for a discussion of
+> sentence-segmentation as a latency-reduction technique.
+
+### 3.3 TTS transformer stage
+
+After synthesis, the audio data and Message context are passed
+through the **TTS-transformer chain** (OVOS-TRANSFORM-1 §3.6)
+hosted by the audio output service. Each transformer plugin MAY mutate the audio data.
+
+The transformed audio replaces the original for playback.
+
+### 3.4 Remote-client rendering mode — `ovos.utterance.speak.b64`
+
+The audio output service **MUST** subscribe to
+`ovos.utterance.speak.b64`. A Message on this topic carries the same
+`utterance` text as `ovos.utterance.speak` and passes through the
+same dialog-transformer, TTS-synthesis, and TTS-transformer stages
+(§3.1–§3.3). The output stage differs: instead of enqueueing for
+local playback, the service **MUST** emit `ovos.audio.speech` (§4.3)
+with the synthesised audio encoded as base64. The audio is not
+enqueued and does not play on the local device.
+
+The `listen` flag (§4.4) applies: if the originating Message carries
+`listen: true`, the service **MUST** emit `ovos.mic.listen` after
+emitting `ovos.audio.speech`.
+
+---
+
+## 4. Playback model
+
+The audio output service has one scheduled queue and a separate
+instant-sound mechanism:
+
+- **Scheduled playback queue** (§4.1) — sequential, one-at-a-time
+  playback for TTS speech and queued sound effects. Audio plays in
+  FIFO order without overlapping.
+- **Instant sounds** (§4.2) — fire-and-forget playback that starts
+  immediately on receipt. Instant sounds are independent of the
+  queue: they play over whatever is currently scheduled, MAY overlap
+  each other, and are not stoppable.
+
+### 4.1 Scheduled playback queue
+
+This queue holds TTS speech (from `ovos.utterance.speak`, §3.2)
+and queued sounds (from `ovos.audio.queue`, below).
+
+**Session scope.** The audio output service MUST only enqueue items
+whose `context.session.session_id` matches a session it is
+configured to serve locally. A service co-located with the
+orchestrator on a single device SHOULD serve only
+`session_id: "default"` (**OVOS-SESSION-2 §5**) and MUST NOT
+enqueue audio for named sessions — those sessions belong to remote
+participants and their audio is delivered via
+`ovos.utterance.speak.b64` / `ovos.audio.speech` (§3.4, §4.3).
+
+**Discipline:**
+- **FIFO**. Items are dequeued in the order they were enqueued.
+- **Sequential**. Each item plays to completion before the next
+  item begins.
+- **Clearable**. On a stop signal (§6), the queue is emptied of
+  all pending items and any in-progress playback is terminated.
+
+**Queued sounds** use topic `ovos.audio.queue`:
+
+| Field | Type | Required | Meaning |
+|-------|------|----------|---------|
+| `uri` | string | no | URI referencing the audio data. |
+| `audio` | string | no | Base64-encoded audio data, used when the audio source is on a different host (alternative to `uri`). |
+| `listen` | bool | no | When `true`, re-opens the user input channel after this item plays (§4.4). |
+
+Exactly one of `uri` or `audio` MUST be present.
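The "exactly one of `uri` or `audio`" rule for `ovos.audio.queue` payloads can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the payload is taken to arrive as a plain `dict`, `resolve_sound_source` is a hypothetical helper name, and raising `ValueError` on an invalid payload is an illustrative choice, since the spec text does not say how a service reacts to one.

```python
import base64
from typing import Any, Dict, Tuple, Union


def resolve_sound_source(data: Dict[str, Any]) -> Tuple[str, Union[str, bytes]]:
    """Return ("uri", <uri string>) or ("audio", <decoded bytes>).

    Enforces that exactly one of the two source fields is present.
    Hypothetical helper, not part of any OVOS API.
    """
    has_uri = "uri" in data
    has_audio = "audio" in data
    # Equal flags mean either both fields or neither field was supplied.
    if has_uri == has_audio:
        raise ValueError("exactly one of 'uri' or 'audio' must be present")
    if has_uri:
        return ("uri", data["uri"])
    # validate=True rejects characters outside the base64 alphabet.
    return ("audio", base64.b64decode(data["audio"], validate=True))
```

The optional `listen` field is deliberately ignored here; it does not affect which audio source is selected.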
+
+### 4.2 Instant sounds
+
+Instant sounds are played via `ovos.audio.play_sound`. They start
+immediately on receipt, play over any audio currently in progress
+from the scheduled queue, MAY overlap each other, and are **not**
+affected by stop signals (§6).
+
+**Play-sound topic** `ovos.audio.play_sound`:
+
+| Field | Type | Required | Meaning |
+|-------|------|----------|---------|
+| `uri` | string | no | URI referencing the audio data. |
+| `audio` | string | no | Base64-encoded audio data, used when the audio source is on a different host (alternative to `uri`). |
+
+Exactly one of `uri` or `audio` MUST be present.
+
+### 4.3 Synthesised audio delivery — `ovos.audio.speech`
+
+`ovos.audio.speech` is emitted by the audio output service when
+processing an `ovos.utterance.speak.b64` Message (§3.4). It carries
+the synthesised audio as base64; the receiving client is responsible
+for decoding and playing it.
+
+| Field | Type | Required | Meaning |
+|-------|------|----------|---------|
+| `audio` | string | yes | Base64-encoded synthesised audio. |
+| `listen` | bool | no | When `true`, the client SHOULD re-open its microphone after playback. |
+
+The session is identified via `context.session` as usual. A bridge
+(OVOS-BRIDGE-1 §4.2.4) subscribes by `session_id` or `destination`
+and relays this message to the client.
+
+### 4.4 Listen flag
+
+The `listen` field on `ovos.utterance.speak` is defined by
+OVOS-PIPELINE-1 §9.6. When a received Message carries `listen: true`,
+the audio output service **MUST** emit `ovos.mic.listen` after all
+audio for that utterance has completed and after
+`ovos.audio.output.ended` (§5.2).
+
+On a stop-initiated end (§6), `ovos.mic.listen` is **NOT** emitted
+regardless of the `listen` flag.
+
+---
+
+## 5. Output lifecycle signals
+
+The audio output service emits lifecycle signals around playback
+to notify other components of audio state.
+
+### 5.1 Playback start
+
+When the first item in a playback session begins (queue was empty,
+first item dequeued), the audio output service **MUST** emit:
+
+`ovos.audio.output.started`
+
+Payload:
+
+No payload. The session is identified by `context.session.session_id`
+of this Message.
+
+A playback session runs from the first item's start until the queue
+is empty and the last item completes. `ovos.audio.output.started`
+fires once per idle→active transition.
+
+### 5.2 Playback end
+
+When the queue becomes empty and the last item has completed
+playback, the audio output service **MUST** emit:
+
+`ovos.audio.output.ended`
+
+Payload:
+
+No payload. The session is identified by `context.session.session_id`
+of this Message.
+
+Components that subscribed to `ovos.audio.output.started` use this
+signal to restore state.
+
+If the last completed item carried `listen: true` (§4.4), the audio
+output service emits `ovos.mic.listen` **after** `ovos.audio.output.ended`.
+On a stop-initiated end, `ovos.mic.listen` is not emitted (§4.4).
+
+### 5.3 Speaking-status query
+
+A component MAY query whether the audio output service is
+currently speaking by emitting:
+
+`ovos.audio.is_speaking`
+
+Request payload: none. To scope the query to a specific session,
+the requester sets `context.session.session_id` in the request
+Message; the service answers for that session only. An absent or
+`"default"` `session_id` asks about the device-local default session
+(OVOS-SESSION-1 §3.1); it is not a wildcard over all sessions.
+
+The service replies with:
+
+```json
+{ "speaking": true }
+```
+
+| Field | Type | Required | Meaning |
+|-------|------|----------|---------|
+| `speaking` | bool | yes | Whether audio is currently playing for the session identified by `context.session.session_id` of the request. |
+
+---
+
+## 6. Stop integration
+
+When the audio output service receives a stop signal, it:
+
+1. **clears** the scheduled playback queue of all pending items;
+2. **terminates** any in-progress scheduled playback;
+3. **emits** `ovos.audio.output.ended` if a playback session was
+   active.
+
+Instant sounds (§4.2) are not affected by stop signals — they play
+to completion regardless.
+
+The stop signal topics are:
+
+| Topic | Purpose |
+|-------|---------|
+| `ovos.audio.stop` | Stop audio output. |
+| `ovos.stop` | Universal stop broadcast (OVOS-STOP-1). |
+
+Both signals carry `context.session.session_id` (OVOS-MSG-1 §4).
+The audio output service **MAY** scope its response to that session.
+
+---
+
+## 7. Bus surface
+
+| Topic | Direction | Purpose |
+|-------|-----------|---------|
+| `ovos.utterance.speak` | handler → audio | Natural-language response text for TTS + local playback (PIPELINE-1 §9.6). |
+| `ovos.utterance.speak.b64` | handler/bridge → audio | Natural-language response text for TTS + remote delivery via `ovos.audio.speech` (§3.4). |
+| `ovos.audio.queue` | any component → audio | Queue a sound for scheduled playback (§4.1). |
+| `ovos.audio.play_sound` | any component → audio | Play a sound immediately (§4.2). |
+| `ovos.audio.stop` | any component → audio | Stop audio playback and clear queue (§6). |
+| `ovos.audio.is_speaking` | any component → audio | Query whether audio is currently playing (§5.3). |
+| `ovos.audio.output.started` | audio → broadcast | Playback session started (§5.1). |
+| `ovos.audio.output.ended` | audio → broadcast | Playback session ended (§5.2). |
+| `ovos.audio.speech` | audio → broadcast | Synthesised audio as base64 for remote clients (§4.3). |
+| `ovos.mic.listen` | audio → broadcast | Request microphone re-open after `listen: true` (§4.4). |
+
+---
+
+## 8. Conformance
+
+### An audio output service **MUST**:
+
+- subscribe to `ovos.utterance.speak` and process each Message
+  through the TTS rendering pipeline for local playback (§3);
+- subscribe to `ovos.utterance.speak.b64` and process each Message
+  through the same TTS pipeline, emitting `ovos.audio.speech`
+  instead of enqueueing for local playback (§3.4);
+- maintain a scheduled playback queue that plays one item at a
+  time in FIFO order (§4.1);
+- support queued sound playback via `ovos.audio.queue` (§4.1);
+- play instant sounds immediately on `ovos.audio.play_sound` without
+  queuing or stopping scheduled playback (§4.2);
+- emit `ovos.audio.output.started` when a playback session begins
+  (§5.1);
+- emit `ovos.audio.output.ended` when a playback session ends (§5.2);
+- clear the scheduled queue and terminate playback on stop signals (§6);
+- emit `ovos.mic.listen` after playback when the last item carries
+  `listen: true` (§4.4);
+- suppress `ovos.mic.listen` when playback ends due to a stop signal (§4.4, §6).
+
+### An audio output service **SHOULD**:
+
+- pass utterance text through the dialog-transformer chain before
+  TTS synthesis (§3.1);
+- pass the synthesised audio through the TTS-transformer chain
+  before enqueueing (§3.3).
+
+### An audio output service **MAY**:
+
+- scope stop responses to the `context.session.session_id` in the stop signal (§6).
+
+---
+
+## See also
+
+- *Utterance Lifecycle and Pipeline Specification* (OVOS-PIPELINE-1)
+  — the pipeline iteration, `ovos.utterance.speak`, and `ovos.utterance.handled`.
+- *Bus Message Specification* (OVOS-MSG-1) — the envelope and
+  derivations used for all bus communication.
+- *Transformer Injection Point Specification* (OVOS-TRANSFORM-1) —
+  the dialog-transformer and TTS-transformer chains that plug into
+  the rendering pipeline.
+- *Stop Pipeline Plugin Specification* (OVOS-STOP-1) — the universal
+  `ovos.stop` broadcast that the audio output service responds to.
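As a reading aid for the signal ordering in §4.4, §5, and §6 of the new spec, the playback-session rules can be modelled as a small state machine. This is a sketch under stated assumptions, not an implementation from the patch: the `PlaybackSession` class, its method names, and the `emit` callback are hypothetical, actual audio playback is left out, and only the order of emitted topics is modelled.

```python
from collections import deque
from typing import Callable, Deque, Optional, Tuple

Item = Tuple[str, bool]  # (audio reference, listen flag)


class PlaybackSession:
    """Tracks one scheduled queue and emits lifecycle topics via `emit`."""

    def __init__(self, emit: Callable[[str], None]) -> None:
        self._emit = emit
        self._queue: Deque[Item] = deque()
        self._current: Optional[Item] = None  # item currently playing

    def enqueue(self, audio: str, listen: bool = False) -> None:
        self._queue.append((audio, listen))
        if self._current is None:
            # idle -> active transition: fires once per playback session
            self._emit("ovos.audio.output.started")
            self._current = self._queue.popleft()

    def item_finished(self) -> None:
        """Call when the current item has played to completion."""
        if self._current is None:
            return
        _, listen = self._current
        if self._queue:
            self._current = self._queue.popleft()  # FIFO, sequential
            return
        self._current = None
        self._emit("ovos.audio.output.ended")
        if listen:
            # only after output.ended, and only on a natural end
            self._emit("ovos.mic.listen")

    def stop(self) -> None:
        """Clear pending items and end the session without a listen follow-up."""
        self._queue.clear()
        if self._current is not None:
            self._current = None
            self._emit("ovos.audio.output.ended")
```

Instant sounds are intentionally absent from this model: per §4.2 and §6 they bypass the queue and are unaffected by stop signals.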