From 9188c044fe46bc51ce8e17dbb884576e919aa4bb Mon Sep 17 00:00:00 2001 From: Lorna Armstrong Date: Wed, 18 Mar 2026 10:48:09 +0000 Subject: [PATCH 1/2] Restructure and add missing messages --- docs/private/voice-agent-api.mdx | 514 +++++++++++++++++++------------ 1 file changed, 311 insertions(+), 203 deletions(-) diff --git a/docs/private/voice-agent-api.mdx b/docs/private/voice-agent-api.mdx index 869444a..08a5e3c 100644 --- a/docs/private/voice-agent-api.mdx +++ b/docs/private/voice-agent-api.mdx @@ -14,133 +14,140 @@ description: Early access to the Voice Agent API — a turn-based API built for ## Introduction -The Voice Agent API is a turn-based API built for voice agents. It is designed for developers building low-latency integrations between speech and LLMs — with turn detection, speaker awareness, and segment-based output built in so you can focus on your agent logic. +The Voice Agent API is a WebSocket API for building voice agents. Stream audio in and receive speaker-labelled, turn-based transcription back — clean, punctuated, and ready to pass directly to an LLM. ---- - -## What it does - -The Voice Agent API is a turn-based API. Rather than a stream of word-level events, speech is grouped into segments — and turn detection determines when a speaker has finished, triggering fast finalisation of those segments. - -You receive: - -- `StartOfTurn` — when a speaker begins talking -- `AddPartialSegment` — interim transcript updates as they speak -- `AddSegment` — the final, complete transcript for that turn -- `EndOfTurn` — when the turn is complete - -When a turn ends, you receive an `AddSegment` containing the finalised utterance. 
In multi-speaker scenarios, a single message may contain segments from multiple speakers, returned in time order: - -```json -{ - "message": "AddSegment", - "segments": [ - { - "speaker_id": "S1", - "is_active": true, - "timestamp": "2025-01-01T12:00:00.000+00:00", - "language": "en", - "text": "Welcome to Speechmatics.", - "is_eou": true, - "metadata": { - "start_time": 0.84, - "end_time": 1.56 - } - }, - { - "speaker_id": "S2", - "is_active": true, - "timestamp": "2025-01-01T12:00:02.000+00:00", - "language": "en", - "text": "Thank you for testing the Voice Agent API.", - "is_eou": true, - "metadata": { - "start_time": 2.10, - "end_time": 3.80 - } - } - ], - "metadata": { - "start_time": 0.84, - "end_time": 3.80, - "processing_time": 0.25 - } -} -``` +Turn detection runs server-side. Choose a [profile](#profiles) based on your use case and the API handles when to finalise each speaker's turn. -Each segment's `text` field is clean, punctuated, and ready to use. When a message contains multiple segments, you'll need to concatenate them. The SDK reconstructs the exchange using `speaker_id` and `is_active` — non-active speakers (outside your focus list) are marked as `[background]`: - -```python -' '.join([f"@{s.speaker_id}{'' if s.is_active else ' [background]'}: {s.text}" for s in segments]) -``` - -Which produces: - -``` -@S1: Hello there. @S2 [background]: It was yesterday. @S1: How are you getting on? -``` - -No accumulating partials, no stitching words together, no guessing when the speaker has finished. The turn detection handles all of that, so your agent can respond as fast as possible. +To jump straight into code, see working examples in [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer) for both Python and JavaScript. --- ## Profiles -Profiles are pre-tuned configurations for voice agents. 
Each profile sets the right defaults for turn detection, latency, and endpointing — no need to configure the API settings yourself. +Profiles are pre-configured turn detection modes. Each profile sets the right defaults for your use case — you choose one when connecting, and the server handles the rest. -Choose the profile that best fits your use case: +| Profile | Turn detection | Best for | +|---------|---------------|----------| +| `agile` | VAD-based silence detection | Speed-first use cases | +| `adaptive` | Adapts to speaker pace and hesitation | General conversational agents | +| `smart` | `adaptive` + ML acoustic turn prediction | High-stakes conversations | +| `external` | Manual — you trigger turn end | Push-to-talk, custom VAD, LLM-driven | ### `agile` **Endpoint:** `/v2/agent/agile` -Lowest end-of-speech to final latency. Uses voice activity detection to finalise turns as quickly as possible. +Uses voice activity detection (VAD) to detect silence and finalise turns as quickly as possible. The lowest latency profile. -**Best for:** Use cases where response speed is the top priority. +**Best for:** Use cases where response speed is the top priority and occasional mid-speech finalisations are acceptable. -**Trade-off:** May produce more finalised segments mid-speaker, which can result in additional downstream LLM calls. - ---- +**Trade-off:** Because it relies on silence, it may finalise a turn while the speaker is still mid-sentence — for example, during a natural pause. This can result in additional downstream LLM calls. ### `adaptive` **Endpoint:** `/v2/agent/adaptive` -Adapts to each speaker over the course of a conversation. Waits longer for slow speakers or those who hesitate frequently. Works with all languages. +Adapts to each speaker's pace over the course of a conversation. It adjusts the turn-end threshold based on speech rate and disfluencies (e.g. hesitations, filler words), waiting longer for speakers who tend to pause mid-thought. 
**Best for:** General conversational voice agents. -**Trade-off:** Latency is not consistently the fastest. Disfluency/hesitation detection is English-only — other languages use speech-rate adaptation only. - ---- +**Trade-off:** Latency varies by speaker. Disfluency detection is English-only — other languages fall back to speech-rate adaptation. ### `smart` **Endpoint:** `/v2/agent/smart` -Builds on `adaptive` and additionally analyses vocal tone to improve turn completion. The most conservative profile. +Builds on `adaptive` with an additional ML model that analyses acoustic cues to predict whether a speaker has genuinely finished their turn. The most conservative profile — least likely to interrupt. -**Best for:** High-stakes conversations where interrupting the user is costly (finance, healthcare, legal). +**Best for:** High-stakes conversations where cutting off the user is costly — finance, healthcare, legal. **Trade-off:** Higher latency than `adaptive`. Supported languages: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, Vietnamese. ---- - ### `external` **Endpoint:** `/v2/agent/external` -You control when a turn ends. Send a `ForceEndOfUtterance` message to trigger finalisation — the server will return a combined segment of everything spoken up to that point. +Turn detection is fully manual. The server accumulates audio and transcript until you send a `ForceEndOfUtterance` message, at which point it finalises everything spoken up to that point and emits an `AddSegment`. + +**Best for:** Push-to-talk interfaces, custom VAD pipelines, or setups where an LLM decides when to respond. + +**Trade-off:** You are responsible for all turn detection logic. -**Best for:** Push-to-talk, custom VAD, or LLM-driven turn detection. 
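In practice, the `external` profile pairs naturally with a push-to-talk control. A minimal sketch in plain Python (the class and its message queue are hypothetical helpers; only the `ForceEndOfUtterance` payload itself comes from this API):

```python
import json

class ExternalTurnController:
    """Illustrative push-to-talk turn control for the `external` profile.

    Queues outgoing control messages; a real client would send each
    string over the open WebSocket connection.
    """

    def __init__(self):
        self.outbox = []

    def on_release(self):
        # On button release, tell the server the turn is over. The server
        # then finalises everything heard so far (AddSegment + EndOfTurn).
        self.outbox.append(json.dumps({"message": "ForceEndOfUtterance"}))

ctl = ExternalTurnController()
ctl.on_release()  # e.g. wired to a push-to-talk button-up event
```

The same hook could equally be driven by a custom VAD or an LLM deciding the user has finished.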
+--- + +## Session Flow + +Every session follows the same structure: connect, start recognition, stream audio, receive turn events, close. + +```mermaid +sequenceDiagram + participant C as Client + participant S as Server + + C->>S: Connect to endpoint with profile via WebSocket + C->>S: StartRecognition + S-->>C: RecognitionStarted + + loop Audio Stream + C->>S: Audio frames (binary) + S-->>C: AudioAdded + S-->>C: StartOfTurn + S-->>C: AddPartialSegment + S-->>C: AddSegment + S-->>C: EndOfTurn + + opt Optional — speaker events + S-->>C: SpeakerStarted / SpeakerEnded + S-->>C: SessionMetrics / SpeakerMetrics + end + + opt Optional — mid-session controls + C->>S: ForceEndOfUtterance (external profile only) + C->>S: UpdateSpeakerFocus + C->>S: GetSpeakers + S-->>C: SpeakersResult + end + end + + C->>S: EndOfStream + S-->>C: EndOfTranscript +``` + +**Client → Server** + +| Message | When to send | +|---------|-------------| +| [`StartRecognition`](#startrecognition) | First message after connecting. Starts the session and passes configuration. | +| Audio frames | Binary WebSocket frames containing raw PCM audio, sent continuously while audio is available. | +| [`ForceEndOfUtterance`](#forceendofutterance) | `external` profile only. Signals that the current turn is complete. | +| [`UpdateSpeakerFocus`](#updatespeakerfocus) | Any time during the session. Changes which speakers are in focus. | +| [`GetSpeakers`](#getspeakers) | Any time during the session. Requests voice identifiers for enrolled speakers. | +| [`EndOfStream`](#endofstream) | When there is no more audio to send. | -**Trade-off:** Most complex to implement — you are responsible for turn detection logic. +**Server → Client** + +| Message | When it's emitted | +|---------|------------------| +| [`RecognitionStarted`](#standard-messages) | Session is ready. | +| [`StartOfTurn`](#startofturn) | A speaker begins a new turn. | +| [`AddPartialSegment`](#addpartialsegment) | Interim transcript update. 
Replaces the previous partial. | +| [`AddSegment`](#addsegment) | Final transcript for the turn. Send this to your LLM. | +| [`EndOfTurn`](#endofturn) | Turn is complete. Your agent can now respond. | +| [`EndOfTranscript`](#standard-messages) | All audio processed. Emitted after `EndOfStream`. | --- -## Getting started +## Getting Started + +### 1. Connect -### Authentication +Open a WebSocket connection to the preview endpoint. To do this, you must specify the [profile](#profiles) to use: + +``` +wss://preview.rt.speechmatics.com/v2/agent/ +``` + +### 2. Authenticate Authenticate every connection using one of the following: @@ -153,113 +160,171 @@ Authenticate every connection using one of the following: See [Authentication](/get-started/authentication) for details including temporary keys. -### Endpoint +### 3. Start the session -The Voice Agent API is available at the preview endpoint. Choose a [profile](#profiles) based on your use case: +Send [`StartRecognition`](#startrecognition) as your first message: +```json +{ + "message": "StartRecognition", + "transcription_config": { + "language": "en" + } +} ``` -wss://preview.rt.speechmatics.com/v2/agent/ -``` +For all configuration options, see [Configuration](#configuration). +The server responds with `RecognitionStarted` when the session is ready. You should wait for this message before sending audio. -For example, to use the `adaptive` profile: -``` -wss://preview.rt.speechmatics.com/v2/agent/adaptive -``` +### 4. Stream audio and handle responses +Send audio as binary WebSocket frames. Turn events will arrive in real time as the API processes speech — see [Session Flow](#session-flow) for the full message sequence. + +--- -### Session flow +## Configuration -1. Open the WebSocket connection. -2. Send `StartRecognition` as the first JSON message. -3. Stream raw PCM audio as binary frames. -4. Send `EndOfStream` when audio is finished. -5. Read server messages until `EndOfTranscript`. -6. Close the connection. 
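As a concrete companion to the steps above, the opening `StartRecognition` message can be assembled programmatically. A Python sketch (the helper and its defaults are illustrative; only the field names come from this page):

```python
def start_recognition_payload(language="en", sample_rate=16000, known_speakers=None):
    """Build a StartRecognition message from options documented on this page."""
    transcription_config = {
        "language": language,
        "enable_partials": True,
        "diarization": "speaker",
    }
    if known_speakers:
        # known_speakers requires speaker diarization to be enabled.
        transcription_config["speaker_diarization_config"] = {
            "known_speakers": known_speakers
        }
    return {
        "message": "StartRecognition",
        "audio_format": {
            "type": "raw",
            "encoding": "pcm_s16le",
            "sample_rate": sample_rate,
        },
        "transcription_config": transcription_config,
    }

payload = start_recognition_payload(sample_rate=8000)
```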
+Configuration is passed in [`StartRecognition`](#startrecognition) and is split across two levels of the payload: `audio_format` (top-level) and `transcription_config`.
+
+**`audio_format`**
+
+| Field | Notes |
+|-------|-------|
+| `type` | Must be `raw` |
+| `encoding` | Must be `pcm_s16le` (16-bit signed little-endian PCM) |
+| `sample_rate` | Must be `8000` or `16000` |
+
+**`transcription_config`**
+
+| Field | Default | Notes |
+|-------|---------|-------|
+| `language` | `en` | All supported languages |
+| `output_locale` | — | Output locale (e.g. `en-US`) |
+| `additional_vocab` | — | Custom vocabulary entries |
+| `punctuation_overrides` | — | Custom punctuation rules |
+| `domain` | — | Domain-specific model (e.g. `medical`) |
+| `enable_entities` | `false` | Entity detection |
+| `enable_partials` | `true` | Emit partial segments during speech |
+| `diarization` | `speaker` | Speaker diarization; `none` to disable |
+| `volume_threshold` | — | Minimum audio volume to process |
+
+**`transcription_config.speaker_diarization_config`**
+
+Note: The following settings require `diarization` to be set to `speaker`.
+
+| Field | Default | Notes |
+|-------|---------|-------|
+| `max_speakers` | — | Maximum number of speakers to track |
+| `speaker_sensitivity` | — | Sensitivity of speaker separation |
+| `prefer_current_speaker` | — | Bias toward the most recently active speaker |
+| `known_speakers` | — | Pre-enrolled speaker identifiers for cross-session recognition (see [Speaker ID](#speaker-id)) |
+
+**Not supported — will be rejected if present**
+
+| Field | Notes |
+|-------|-------|
+| `translation_config` | Not supported on this endpoint |
+| `audio_events_config` | Not supported on this endpoint |
+
+---
+
+## API Reference - Client Messages
 
 ### StartRecognition
 
-Send this as the first message after connecting:
+The first message you send after connecting. Starts the recognition session and passes configuration. The server responds with `RecognitionStarted`. 
```json { "message": "StartRecognition", + "audio_format": { + "type": "raw", + "encoding": "pcm_s16le", + "sample_rate": 16000 + }, "transcription_config": { "language": "en" } } ``` -### Configuration reference +For all configuration options, see [Configuration](#configuration). -**Configurable settings (`transcription_config`)** +### EndOfStream -| Setting | Default | Notes | -|---------|---------|-------| -| `language` | `en` | All supported languages | -| `output_locale` | - | Client can specify an output locale (e.g. `en-US`) | -| `additional_vocab` | - | Custom vocabulary entries | -| `punctuation_overrides` | - | Punctuation overrides | -| `domain` | - | Client can specify a domain (e.g. `medical`) | -| `enable_entities` | `false` | Enable entity detection | -| `enable_partials` | `true` | Enable partials in output | -| `diarization` | `speaker` | Supports `none` or `speaker` only | -| `speaker_diarization_config.max_speakers` | - | Limit speaker count | -| `speaker_diarization_config.speaker_sensitivity` | - | Diarization sensitivity | -| `speaker_diarization_config.prefer_current_speaker` | - | Hold on to current speaker | -| `speaker_diarization_config.speakers` | - | Known speakers | -| `volume_threshold` | - | Audio filtering | - -**Not configurable (`transcription_config`)** - -| Setting | Notes | -|---------|-------| -| `operating_point` | Managed per profile | -| `max_delay` | Managed per profile | -| `max_delay_mode` | Managed per profile | -| `streaming_mode` | Always enabled | -| `conversation_config` | Managed by profile / Voice SDK | -| `audio_filtering_config` | Managed by profile | -| `transcript_filtering_config` | Managed by profile | -| `channel_diarization_labels` | Not available | - -**Payload-level settings** - -| Setting | Configurable? 
| Notes | -|---------|--------------|-------| -| `audio_format` | Yes | Client declares encoding and sample rate | -| `translation_config` | No* | Not supported — rejected if present in the payload | -| `audio_events_config` | No* | Not supported — rejected if present in the payload | -| `message_control` | No | Adjust which messages are forwarded (hidden) | - -### Code examples - -Full working examples in Python and JavaScript are available in the [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer). +Send when you have finished streaming audio. The server finalises any remaining transcript and then emits `EndOfTranscript`. + +`last_seq_no` is the sequence number of the last audio frame you sent. +```json +{ + "message": "EndOfStream", + "last_seq_no": 1234 +} +``` + +### ForceEndOfUtterance + +Only applies to the `external` profile. Immediately ends the current turn — the server finalises all audio received so far and emits a single `AddSegment` containing the complete transcript for that turn, followed by `EndOfTurn`. + +Use this wherever your application decides a turn is complete: on button release (push-to-talk), on VAD silence, or on an LLM signal. + +```json +{ + "message": "ForceEndOfUtterance" +} +``` + +### UpdateSpeakerFocus + +Updates which speakers are in focus, mid-session. Takes effect immediately. See [Speaker Focus](#speaker-focus) for full details. + +```json +{ + "message": "UpdateSpeakerFocus", + "speaker_focus": { + "focus_speakers": ["S1"], + "ignore_speakers": [], + "focus_mode": "retain" + } +} +``` + +### GetSpeakers + +Requests voice identifiers for all speakers diarized so far in the session. The server responds with a `SpeakersResult` message. See [Speaker ID](#speaker-id) for full details. 
+
+```json
+{
+  "message": "GetSpeakers"
+}
+```
 
 ---
 
-## Server messages
+## API Reference - Server Messages
 
 ### Standard messages
 
-Standard RT messages are emitted alongside Voice Agent API messages. See the [API reference](/api-ref) for full details.
+The following standard RT messages are emitted alongside Voice Agent API messages. See the [API reference](/api-ref) for full payload details.
 
-- `RecognitionStarted`
-- `AddPartialTranscript`
-- `AddTranscript`
-- `EndOfUtterance`
-- `EndOfTranscript`
-- `Info`
-- `Warning`
-- `Error`
+| Message | When it's emitted |
+|---------|------------------|
+| `AudioAdded` | Acknowledges an audio frame; includes the `seq_no` of the chunk added |
+| `RecognitionStarted` | Session is ready; emitted in response to `StartRecognition` |
+| `AddPartialTranscript` | Word-level partial transcript update (lower-level than `AddPartialSegment`) |
+| `AddTranscript` | Word-level final transcript (lower-level than `AddSegment`) |
+| `EndOfUtterance` | Silence threshold reached; precedes turn finalisation |
+| `EndOfTranscript` | All audio processed; emitted after `EndOfStream` |
+| `Info` | Non-critical informational message from the server |
+| `Warning` | Non-fatal issue (e.g. unsupported config ignored) |
+| `Error` | Fatal error; connection will close |
 
 ### Voice Agent API messages
 
-These messages are only emitted when using a voice profile (`/v2/agent/`).
+These messages are only emitted when using a voice agent profile (`/v2/agent/`).
 
 #### `StartOfTurn`
 
-Emitted when a speaker begins a new turn.
+Emitted when a speaker begins a new turn. Use this to signal to your agent that it should stop speaking if it currently is.
 
 ```json
 {
@@ -268,9 +333,12 @@ Emitted when a speaker begins a new turn.
 }
 ```
 
+**Fields:**
+- `turn_id` — monotonically increasing integer; pairs with the corresponding `EndOfTurn`
+
 #### `EndOfTurn`
 
-Emitted when a turn is complete.
+Emitted when turn detection decides the speaker has finished. This is the trigger for your agent to respond. 
The finalised transcript for the turn is in the preceding `AddSegment`. ```json { @@ -283,9 +351,13 @@ Emitted when a turn is complete. } ``` +**Fields:** +- `turn_id` — matches the `StartOfTurn` for this turn +- `metadata.start_time` / `metadata.end_time` — audio time range for the turn, in seconds from session start + #### `AddPartialSegment` -Interim transcript updates emitted as the speaker talks. Each new partial replaces the previous one. +Interim transcript update emitted continuously while the speaker is talking. Each new `AddPartialSegment` replaces the previous one — do not concatenate them. ```json { @@ -314,7 +386,9 @@ Interim transcript updates emitted as the speaker talks. Each new partial replac #### `AddSegment` -The final, complete transcript for a turn. Emitted at `EndOfTurn`. This is the stable output to send to your LLM. +The final, complete transcript for a turn. Emitted just before `EndOfTurn`. This is the stable output to pass to your LLM — do not use `AddPartialSegment` for this. + +In multi-speaker scenarios, a single `AddSegment` may contain segments from multiple speakers, returned in time order. ```json { @@ -341,22 +415,25 @@ The final, complete transcript for a turn. Emitted at `EndOfTurn`. This is the s } ``` -**Key fields:** -- `speaker_id` — speaker label (e.g. `S1`, `S2`) -- `is_active` — whether this speaker is in your focus list (see [Speaker focus](#speaker-focus)) -- `is_eou` — `true` on final segments -- `start_time` / `end_time` — time in seconds relative to session start -- `processing_time` (message-level `metadata`) — transcription latency in seconds +**Segment fields:** +- `speaker_id` — speaker label (e.g. 
`S1`, `S2`, or a custom label if using [Speaker ID](#speaker-id)) +- `is_active` — `true` if this speaker is in your current focus list; `false` if they are a background speaker (see [Speaker Focus](#speaker-focus)) +- `is_eou` — `true` on final segments, `false` on partials +- `text` — clean, punctuated transcript text +- `metadata.start_time` / `metadata.end_time` — time range of this segment in seconds from session start + +**Message-level fields:** +- `metadata.processing_time` — transcription latency in seconds for this message #### `SpeakerStarted` / `SpeakerEnded` -Emitted when a specific speaker starts or stops speaking. Useful for multi-party conversations. +Emitted when a specific speaker starts or stops being heard. These are voice activity events — they fire based on detected speech, independently of turn boundaries. ```json { "message": "SpeakerStarted", - "is_active": true, "speaker_id": "S1", + "is_active": true, "time": 0.84, "metadata": { "start_time": 0.84, "end_time": 0.84 } } @@ -365,21 +442,23 @@ Emitted when a specific speaker starts or stops speaking. 
Useful for multi-party
 
 ```json
 {
   "message": "SpeakerEnded",
-  "is_active": false,
   "speaker_id": "S1",
+  "is_active": true,
   "time": 3.24,
   "metadata": { "start_time": 0.84, "end_time": 3.24 }
 }
 ```
 
-**Key fields:**
-- `time` — seconds of audio from session start when the speaker activity occurred
-- `metadata.start_time` — when that speaker started their current speaking interval
-- `metadata.end_time` (`SpeakerEnded` only) — when that speaker stopped speaking
+**Fields:**
+- `speaker_id` — the speaker whose activity changed
+- `is_active` — whether this speaker is in your current focus list
+- `time` — seconds from session start when the activity was detected
+- `metadata.start_time` — when this speaker started their current speaking interval
+- `metadata.end_time` — when this speaker stopped speaking (`SpeakerEnded` only)
 
-#### `SessionMetrics` / `SpeakerMetrics`
+#### `SessionMetrics`
 
-`SessionMetrics` is emitted every 5 seconds and at the end of the session. `SpeakerMetrics` is emitted each time a speaker speaks a word.
+Emitted every 5 seconds and once at the end of the session.
 
 ```json
 {
   "message": "SessionMetrics",
@@ -391,6 +470,10 @@ Emitted when a specific speaker starts or stops speaking. Useful for multi-party
 }
 ```
 
+#### `SpeakerMetrics`
+
+Emitted each time a speaker produces a recognised word.
+
 ```json
 {
   "message": "SpeakerMetrics",
@@ -405,49 +488,67 @@ Emitted when a specific speaker starts or stops speaking. Useful for multi-party
 }
 ```
 
+#### `SpeakersResult`
+
+Emitted in response to a `GetSpeakers` message.
+
+```json
+{
+  "message": "SpeakersResult",
+  "speakers": [
+    { "label": "S1", "speaker_identifiers": [""] },
+    { "label": "S2", "speaker_identifiers": [""] }
+  ]
+}
+```
+
 ---
 
-## Speaker focus
+## Features
+
+The Voice Agent API introduces key features built with voice agents in mind. These include:
+
+### Speaker Focus
 
-You can update speaker focus mid-session using `UpdateSpeakerFocus`. This is a Voice Agent API feature — sending it in standard RT mode has no effect.
+Speaker focus lets you control which speakers' output your agent acts on. By default, all detected speakers are active and their transcripts are included in `AddSegment` output.
 
-Diarization is enabled by default when using the Voice Agent API. Speaker IDs (`S1`, `S2`, etc.) are assigned automatically and persist across the session.
+Speaker IDs (`S1`, `S2`, etc.) are assigned automatically when diarization is enabled, and persist for the lifetime of the session. Send `UpdateSpeakerFocus` at any point during the session to change who is in focus — the new config takes effect immediately and replaces the previous one.
 
 ```json
 {
   "message": "UpdateSpeakerFocus",
   "speaker_focus": {
     "focus_speakers": ["S1"],
-    "ignore_speakers": [],
+    "ignore_speakers": ["S3"],
     "focus_mode": "retain"
   }
 }
 ```
 
-**`focus_mode` options:**
-
-- `retain` — non-focused speakers remain in output as passive speakers (`is_active: false`)
-- `ignore` — non-focused speakers are excluded from output entirely
+**Fields:**
 
-The new config replaces the existing config immediately.
+- `focus_speakers` — speaker IDs to treat as active. Their segments appear with `is_active: true`.
+- `ignore_speakers` — speaker IDs to exclude entirely. Their speech is dropped and does not affect turn detection. 
+- `focus_mode` — what happens to speakers who are neither in `focus_speakers` nor `ignore_speakers`: + - `retain` — they remain in the output as passive speakers (`is_active: false`) + - `ignore` — they are excluded from the output entirely ---- +### Speaker ID -## Speaker ID +Speaker ID lets you recognise the same person across separate sessions. At the end of a session, you can retrieve voice identifiers for each speaker and store them. In future sessions, pass those identifiers into `StartRecognition` and the system will tag matching speakers with a consistent label rather than a generic `S1`, `S2`. -Speaker identifiers let you recognise known speakers across sessions. Once you have identifiers for a speaker, you can pass them into future sessions so the system tags them with a consistent label rather than a generic `S1`, `S2`. +#### Getting identifiers -### Getting identifiers — `GetSpeakers` +Send `GetSpeakers` at any point during a session to retrieve identifiers for all diarized speakers so far. The server responds with a `SpeakersResult` message. -Send `GetSpeakers` during a session to request identifiers for all diarized speakers so far: - -```json -{ - "message": "GetSpeakers" -} -``` - -The server responds with a `SpeakersResult` message: +`SpeakersResult` response: ```json { @@ -459,11 +560,11 @@ The server responds with a `SpeakersResult` message: } ``` -Store the `speaker_identifiers` values — these are opaque tokens that represent the speaker's voice profile. +Store the `speaker_identifiers` values. These are opaque tokens tied to a speaker's voice profile — treat them as credentials and store them securely. -### Using identifiers in future sessions +#### Using identifiers in future sessions -Pass stored identifiers into `StartRecognition` via `known_speakers`. You can assign any label you like: +Pass stored identifiers into `StartRecognition` via `transcription_config.known_speakers`. 
You can assign any label:
 
 ```json
 {
@@ -478,18 +579,25 @@ Pass stored identifiers into `StartRecognition` via `known_speakers`. You can as
 }
 ```
 
-When those speakers are detected, segments will be tagged with `"Alice"` or `"Bob"` instead of generic labels. Any unrecognised speakers are still assigned generic labels (`S1`, `S2`, etc.).
+When those speakers are detected, their segments will carry `"Alice"` or `"Bob"` as the `speaker_id` instead of generic labels. Any unrecognised speakers are still assigned generic labels (`S1`, `S2`, etc.).
+
+---
+
+## Code Examples
+
+For working code examples in Python and JavaScript, see the [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer).
 
 ---
 
 ## Feedback
 
-This is a preview and your feedback shapes what goes to GA. We'd love to hear from you — whether that's something that didn't work as expected, a profile that behaved differently than you anticipated, or a feature you'd want before we ship this more broadly.
+This is a preview and your feedback shapes what goes to GA (General Availability).
+We'd love to hear from you — tell us what works well, which features you use, whether something didn't work as expected, a profile that behaved differently than you anticipated, or a feature you'd want before we ship this more broadly.
 
 Specific areas of interest:
 
-- integration experience (documentation, SDKs, API messages/metadata)
-- Accuracy/Latency (including data capture if it's relevant (e.g. phone numbers, spell outs of names/account numbers)
+- Integration experience (documentation, SDKs, API messages/metadata)
+- Accuracy and latency (including data capture where relevant, e.g. 
phone numbers, spell outs of names/account numbers) - Turn detection and experience with different profiles - Any missing capabilities which would make your product better - What would stop you using this in production From 1dd76a54da7d456f85ee6ee60beff8dbbaa2ee49 Mon Sep 17 00:00:00 2001 From: Lorna Armstrong Date: Thu, 19 Mar 2026 08:03:25 +0000 Subject: [PATCH 2/2] Restructure and Expand Message Coverage --- docs/private/voice-agent-api.mdx | 211 ++++++++++++++++++++----------- 1 file changed, 134 insertions(+), 77 deletions(-) diff --git a/docs/private/voice-agent-api.mdx b/docs/private/voice-agent-api.mdx index 08a5e3c..04bc085 100644 --- a/docs/private/voice-agent-api.mdx +++ b/docs/private/voice-agent-api.mdx @@ -18,31 +18,21 @@ The Voice Agent API is a WebSocket API for building voice agents. Stream audio i Turn detection runs server-side. Choose a [profile](#profiles) based on your use case and the API handles when to finalise each speaker's turn. -To jump straight into code, see working examples in [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer) for both Python and JavaScript. +**Looking for code examples?** See working examples in [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer) for Python and JavaScript. --- ## Profiles -Profiles are pre-configured turn detection modes. Each profile sets the right defaults for your use case — you choose one when connecting, and the server handles the rest. +Profiles are pre-configured turn detection modes. Each profile sets the right defaults for your use case — you choose one when connecting, include it in your endpoint URL, and the server handles the rest. 
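Because the profile is carried in the URL, a small helper can catch typos before a connection is attempted. A Python sketch (the helper is hypothetical; the base URL and profile names are the ones documented on this page):

```python
PREVIEW_BASE = "wss://preview.rt.speechmatics.com/v2/agent/"
PROFILES = ("adaptive", "agile", "smart", "external")

def endpoint_for(profile: str) -> str:
    """Return the preview WebSocket URL for a profile, failing fast on typos."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return PREVIEW_BASE + profile
```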
| Profile | Turn detection | Best for | |---------|---------------|----------| -| `agile` | VAD-based silence detection | Speed-first use cases | | `adaptive` | Adapts to speaker pace and hesitation | General conversational agents | +| `agile` | VAD-based silence detection | Speed-first use cases | | `smart` | `adaptive` + ML acoustic turn prediction | High-stakes conversations | | `external` | Manual — you trigger turn end | Push-to-talk, custom VAD, LLM-driven | -### `agile` - -**Endpoint:** `/v2/agent/agile` - -Uses voice activity detection (VAD) to detect silence and finalise turns as quickly as possible. The lowest latency profile. - -**Best for:** Use cases where response speed is the top priority and occasional mid-speech finalisations are acceptable. - -**Trade-off:** Because it relies on silence, it may finalise a turn while the speaker is still mid-sentence — for example, during a natural pause. This can result in additional downstream LLM calls. - ### `adaptive` **Endpoint:** `/v2/agent/adaptive` @@ -53,6 +43,16 @@ Adapts to each speaker's pace over the course of a conversation. It adjusts the **Trade-off:** Latency varies by speaker. Disfluency detection is English-only — other languages fall back to speech-rate adaptation. +### `agile` + +**Endpoint:** `/v2/agent/agile` + +Uses voice activity detection (VAD) to detect silence and finalise turns as quickly as possible. The lowest latency profile. + +**Best for:** Use cases where response speed is the top priority and occasional mid-speech finalisations are acceptable. + +**Trade-off:** Because it relies on silence, it may finalise a turn while the speaker is still mid-sentence — for example, during a natural pause. This can result in additional downstream LLM calls. 
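If those extra finalisations are costly downstream, application code can buffer finalised turns and merge any separated by only a short gap before calling the LLM. A Python sketch (the threshold and tuple shape are application choices, not API settings; only `start_time`/`end_time` come from `AddSegment` metadata):

```python
def merge_quick_turns(turns, gap_threshold=0.6):
    """Merge finalised turns whose silence gap is below `gap_threshold` seconds.

    `turns` is a list of (text, start_time, end_time) tuples taken from
    AddSegment metadata. Purely illustrative client-side smoothing.
    """
    merged = []
    for text, start, end in turns:
        if merged and start - merged[-1][2] < gap_threshold:
            # Gap is short: treat this as a continuation of the previous turn.
            prev_text, prev_start, _ = merged[-1]
            merged[-1] = (prev_text + " " + text, prev_start, end)
        else:
            merged.append((text, start, end))
    return merged
```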
+ ### `smart` **Endpoint:** `/v2/agent/smart` @@ -91,17 +91,26 @@ sequenceDiagram loop Audio Stream C->>S: Audio frames (binary) S-->>C: AudioAdded + + S-->>C: SpeechStarted S-->>C: StartOfTurn S-->>C: AddPartialSegment + + opt Turn prediction (adaptive, smart profiles) + S-->>C: EndOfTurnPrediction + S-->>C: SmartTurnPrediction (smart only) + end + S-->>C: AddSegment S-->>C: EndOfTurn + S-->>C: SpeechEnded - opt Optional — speaker events + opt Speaker activity S-->>C: SpeakerStarted / SpeakerEnded S-->>C: SessionMetrics / SpeakerMetrics end - opt Optional — mid-session controls + opt Mid-session controls C->>S: ForceEndOfUtterance (external profile only) C->>S: UpdateSpeakerFocus C->>S: GetSpeakers @@ -113,27 +122,7 @@ sequenceDiagram S-->>C: EndOfTranscript ``` -**Client → Server** - -| Message | When to send | -|---------|-------------| -| [`StartRecognition`](#startrecognition) | First message after connecting. Starts the session and passes configuration. | -| Audio frames | Binary WebSocket frames containing raw PCM audio, sent continuously while audio is available. | -| [`ForceEndOfUtterance`](#forceendofutterance) | `external` profile only. Signals that the current turn is complete. | -| [`UpdateSpeakerFocus`](#updatespeakerfocus) | Any time during the session. Changes which speakers are in focus. | -| [`GetSpeakers`](#getspeakers) | Any time during the session. Requests voice identifiers for enrolled speakers. | -| [`EndOfStream`](#endofstream) | When there is no more audio to send. | - -**Server → Client** - -| Message | When it's emitted | -|---------|------------------| -| [`RecognitionStarted`](#standard-messages) | Session is ready. | -| [`StartOfTurn`](#startofturn) | A speaker begins a new turn. | -| [`AddPartialSegment`](#addpartialsegment) | Interim transcript update. Replaces the previous partial. | -| [`AddSegment`](#addsegment) | Final transcript for the turn. Send this to your LLM. | -| [`EndOfTurn`](#endofturn) | Turn is complete. 
Your agent can now respond. | -| [`EndOfTranscript`](#standard-messages) | All audio processed. Emitted after `EndOfStream`. | +For a full reference of all messages, see [Messages Overview](#messages-overview). --- @@ -173,6 +162,7 @@ Send [`StartRecognition`](#startrecognition) as your first message: } ``` For all configuration options, see [Configuration](#configuration). + The server responds with `RecognitionStarted` when the session is ready. You should wait for this message before sending audio. @@ -227,11 +217,81 @@ Note: The following require `diarization: speaker` to be set. --- +## Messages Overview + +All messages exchanged during a Voice Agent API session. For payload details, see the API Reference sections. + +### Client → Server + +| Message | When to send | +|---------|-------------| +| [`StartRecognition`](#startrecognition) | First message after connecting. Starts the session and passes configuration. | +| Audio frames | Binary WebSocket frames containing raw PCM audio, sent continuously. | +| [`ForceEndOfUtterance`](#forceendofutterance) | `external` profile only. Triggers immediate turn finalisation. | +| [`UpdateSpeakerFocus`](#updatespeakerfocus) | Any time during the session. Changes which speakers are in focus. | +| [`GetSpeakers`](#getspeakers) | Any time during the session. Requests voice identifiers for diarized speakers. | +| [`EndOfStream`](#endofstream) | When there is no more audio to send. 
| + +### Server → Client + +**Core turn events** — the messages your agent logic acts on + +| Message | Profile | When it's emitted | +|---------|---------|------------------| +| [`StartOfTurn`](#startofturn) | All | A speaker begins a new turn | +| [`AddPartialSegment`](#addpartialsegment) | All | Interim transcript update; each replaces the previous | +| [`AddSegment`](#addsegment) | All | Final transcript for the turn — pass this to your LLM | +| [`EndOfTurn`](#endofturn) | All | Turn complete; your agent can now respond | + +**Turn prediction** — early signals you can use to prepare a response + +| Message | Profile | When it's emitted | +|---------|---------|------------------| +| [`EndOfTurnPrediction`](#endofturnprediction) | `adaptive`, `smart` | The model predicts the current turn will end soon | +| [`SmartTurnPrediction`](#smartturnprediction) | `smart` only | High-confidence acoustic prediction of turn completion | + +**Speech and speaker activity** + +| Message | Profile | When it's emitted | +|---------|---------|------------------| +| [`SpeechStarted`](#speechstarted--speechended) | All | Voice activity detected in the audio stream | +| [`SpeechEnded`](#speechstarted--speechended) | All | Voice activity stopped | +| [`SpeakerStarted`](#speakerstarted--speakerended) | All | A specific diarized speaker began talking | +| [`SpeakerEnded`](#speakerstarted--speakerended) | All | A specific diarized speaker stopped talking | +| [`SpeakersResult`](#speakersresult) | All | Response to `GetSpeakers` | + +**Session lifecycle** + +| Message | When it's emitted | +|---------|------------------| +| `RecognitionStarted` | Session ready; emitted in response to `StartRecognition` | +| `AudioAdded` | Audio frame acknowledged | +| `EndOfTranscript` | Session closing; emitted by the proxy after `EndOfStream` | + +**Metrics and diagnostics** + +| Message | When it's emitted | +|---------|------------------| +| [`SessionMetrics`](#sessionmetrics) | Session stats; emitted 
every 5 seconds and at session end |
+| [`SpeakerMetrics`](#speakermetrics) | Per-speaker word count and volume; emitted on each recognised word |
+
+**Shared messages with the RT API** — see the [RT API Reference](/api-ref) for full payload details.
+
+| Message | When it's emitted |
+|---------|------------------|
+| `EndOfUtterance` | Silence threshold reached; precedes turn finalisation |
+| `Info` | Non-critical informational message |
+| `Warning` | Non-fatal issue (e.g. unsupported config field ignored) |
+| `Error` | Fatal error; connection will close |
+
+---
+
 ## API Reference - Client Messages

-### StartRecognition
+#### StartRecognition

-The first message you send after connecting. Starts the recognition session and passes configuration. The server responds with `RecognitionStarted`.
+The first message you send after connecting. Starts the recognition session and passes configuration.
+The server responds with `RecognitionStarted`.

 ```json
 {
@@ -249,7 +309,7 @@ The first message you send after connecting. Starts the recognition session and

 For all configuration options, see [Configuration](#configuration).

-### EndOfStream
+#### EndOfStream

 Send when you have finished streaming audio. The server finalises any remaining transcript and then emits `EndOfTranscript`.

@@ -261,7 +321,7 @@ Send when you have finished streaming audio. The server finalises any remaining
 }
 ```

-### ForceEndOfUtterance
+#### ForceEndOfUtterance

 Only applies to the `external` profile. Immediately ends the current turn — the server finalises all audio received so far and emits a single `AddSegment` containing the complete transcript for that turn, followed by `EndOfTurn`.

@@ -273,7 +333,7 @@ Use this wherever your application decides a turn is complete: on button release
 }
 ```

-### UpdateSpeakerFocus
+#### UpdateSpeakerFocus

 Updates which speakers are in focus, mid-session. Takes effect immediately.
See [Speaker Focus](#speaker-focus) for full details. @@ -288,7 +348,7 @@ Updates which speakers are in focus, mid-session. Takes effect immediately. See } ``` -### GetSpeakers +#### GetSpeakers Requests voice identifiers for all speakers diarized so far in the session. The server responds with a `SpeakersResult` message. See [Speaker ID](#speaker-id) for full details. @@ -302,27 +362,9 @@ Requests voice identifiers for all speakers diarized so far in the session. The ## API Reference - Server Messages -### Standard messages - -The following standard RT messages are emitted alongside Voice Agent API messages. See the [API reference](/api-ref) for full payload details. +This section covers Voice Agent API-specific messages only. For shared messages (`RecognitionStarted`, `AudioAdded`, `AddPartialTranscript`, `AddTranscript`, `EndOfUtterance`, `EndOfTranscript`, `Info`, `Warning`, `Error`), see the [RT API reference](/api-ref). -| Message | When it's emitted | -|---------|------------------| -| `AudioAdded` | | -| `RecognitionStarted` | Session is ready; emitted in response to `StartRecognition` | -| `AddPartialTranscript` | Word-level partial transcript update (lower-level than `AddPartialSegment`) | -| `AddTranscript` | Word-level final transcript (lower-level than `AddSegment`) | -| `EndOfUtterance` | Silence threshold reached; precedes turn finalisation | -| `EndOfTranscript` | All audio processed; emitted after `EndOfStream` | -| `Info` | Non-critical informational message from the server | -| `Warning` | Non-fatal issue (e.g. unsupported config ignored) | -| `Error` | Fatal error; connection will close | - -### Voice Agent API messages - -These messages are only emitted when using a voice agent profile (`/v2/agent/`). - -#### `StartOfTurn` +#### StartOfTurn Emitted when a speaker begins a new turn. Use this to signal to your agent that it should stop speaking if it currently is. @@ -336,7 +378,7 @@ Emitted when a speaker begins a new turn. 
Use this to signal to your agent that **Fields:** - `turn_id` — monotonically increasing integer; pairs with the corresponding `EndOfTurn` -#### `EndOfTurn` +#### EndOfTurn Emitted when turn detection decides the speaker has finished. This is the trigger for your agent to respond. The finalised transcript for the turn is in the preceding `AddSegment`. @@ -355,7 +397,7 @@ Emitted when turn detection decides the speaker has finished. This is the trigge - `turn_id` — matches the `StartOfTurn` for this turn - `metadata.start_time` / `metadata.end_time` — audio time range for the turn, in seconds from session start -#### `AddPartialSegment` +#### AddPartialSegment Interim transcript update emitted continuously while the speaker is talking. Each new `AddPartialSegment` replaces the previous one — do not concatenate them. @@ -384,7 +426,7 @@ Interim transcript update emitted continuously while the speaker is talking. Eac } ``` -#### `AddSegment` +#### AddSegment The final, complete transcript for a turn. Emitted just before `EndOfTurn`. This is the stable output to pass to your LLM — do not use `AddPartialSegment` for this. @@ -425,7 +467,7 @@ In multi-speaker scenarios, a single `AddSegment` may contain segments from mult **Message-level fields:** - `metadata.processing_time` — transcription latency in seconds for this message -#### `SpeakerStarted` / `SpeakerEnded` +#### SpeakerStarted / SpeakerEnded Emitted when a specific speaker starts or stops being heard. These are voice activity events — they fire based on detected speech, independently of turn boundaries. @@ -456,7 +498,7 @@ Emitted when a specific speaker starts or stops being heard. These are voice act - `metadata.start_time` — when this speaker started their current speaking interval - `metadata.end_time` — when this speaker stopped speaking (`SpeakerEnded` only) -#### `SessionMetrics` +#### SessionMetrics Emitted every 5 seconds and once at the end of the session. 
@@ -470,7 +512,7 @@ Emitted every 5 seconds and once at the end of the session. } ``` -#### `SpeakerMetrics` +#### SpeakerMetrics Emitted each time a speaker produces a recognised word. @@ -490,7 +532,7 @@ Emitted each time a speaker produces a recognised word. #### SpeakersResult -Emitted as a response to a `GetSpeakers` message. +Emitted in response to `GetSpeakers`. Contains voice identifiers for all diarized speakers so far. See [Speaker ID](#speaker-id) for how to store and use these. ```json { @@ -502,18 +544,33 @@ Emitted as a response to a `GetSpeakers` message. } ``` +#### EndOfTurnPrediction ---- +Emitted by `adaptive` and `smart` profiles when the model predicts the current turn is about to end. Can be used to begin preparing a response before `EndOfTurn` arrives, reducing perceived latency. -## Features +:::note +todo - payload details. +::: + +#### SmartTurnPrediction + +Emitted by the `smart` profile only. A higher-confidence acoustic prediction of turn completion, based on the ML model that analyses vocal cues. + +:::note +todo - payload details. +::: + +#### SpeechStarted / SpeechEnded -The Voice Agent API introduces key features built with voice agents in mind. These include: -### **Speaker Focus** -This lets you control which speakers' output your agent acts on. By default, all detected speakers are active and and their transcripts are included in `AddSegment` output. - - Speaker IDs (`S1`, `S2`, etc.) are assigned automatically when diarization is enabled and persist for the lifetime of the session. - Send `UpdateSpeakerFocus` at any point during the session to change who is in focus - the new config takes place immediately and replaces the previous one. +Voice activity detection events. Emitted when speech is first detected in the audio stream (`SpeechStarted`) or stops (`SpeechEnded`). These fire independently of speaker identity and turn boundaries. +:::note +todo - payload details. +::: + +--- + +## Features ### Speaker Focus
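As a rough illustration of how focus shows up in output, here is a minimal sketch that joins `AddSegment` entries into one speaker-labelled string, tagging non-focused speakers as `[background]`. It assumes only the segment fields documented above (`speaker_id`, `is_active`, `text`), passed as plain dicts; the `format_exchange` helper name is hypothetical, not part of the API:

```python
def format_exchange(segments: list[dict]) -> str:
    """Join AddSegment entries into a single speaker-labelled line.

    Segments from speakers outside the focus list (is_active == False)
    are tagged [background], following the convention used elsewhere
    in these docs. `segments` is assumed to be the `segments` array of
    one AddSegment message, in time order.
    """
    parts = []
    for seg in segments:
        tag = "" if seg["is_active"] else " [background]"
        parts.append(f"@{seg['speaker_id']}{tag}: {seg['text']}")
    return " ".join(parts)
```

Fed two segments — an in-focus `S1` saying "Hello there." and a non-focused `S2` saying "It was yesterday." — this produces `@S1: Hello there. @S2 [background]: It was yesterday.`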