diff --git a/docs/private/voice-agent-api.mdx b/docs/private/voice-agent-api.mdx index 869444a..04bc085 100644 --- a/docs/private/voice-agent-api.mdx +++ b/docs/private/voice-agent-api.mdx @@ -14,133 +14,129 @@ description: Early access to the Voice Agent API — a turn-based API built for ## Introduction -The Voice Agent API is a turn-based API built for voice agents. It is designed for developers building low-latency integrations between speech and LLMs — with turn detection, speaker awareness, and segment-based output built in so you can focus on your agent logic. +The Voice Agent API is a WebSocket API for building voice agents. Stream audio in and receive speaker-labelled, turn-based transcription back — clean, punctuated, and ready to pass directly to an LLM. ---- - -## What it does +Turn detection runs server-side. Choose a [profile](#profiles) based on your use case and the API handles when to finalise each speaker's turn. -The Voice Agent API is a turn-based API. Rather than a stream of word-level events, speech is grouped into segments — and turn detection determines when a speaker has finished, triggering fast finalisation of those segments. +**Looking for code examples?** See working examples in [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer) for Python and JavaScript. -You receive: +--- -- `StartOfTurn` — when a speaker begins talking -- `AddPartialSegment` — interim transcript updates as they speak -- `AddSegment` — the final, complete transcript for that turn -- `EndOfTurn` — when the turn is complete +## Profiles -When a turn ends, you receive an `AddSegment` containing the finalised utterance. In multi-speaker scenarios, a single message may contain segments from multiple speakers, returned in time order: +Profiles are pre-configured turn detection modes. 
Each profile sets the right defaults for your use case — you choose one when connecting, include it in your endpoint URL, and the server handles the rest. -```json -{ - "message": "AddSegment", - "segments": [ - { - "speaker_id": "S1", - "is_active": true, - "timestamp": "2025-01-01T12:00:00.000+00:00", - "language": "en", - "text": "Welcome to Speechmatics.", - "is_eou": true, - "metadata": { - "start_time": 0.84, - "end_time": 1.56 - } - }, - { - "speaker_id": "S2", - "is_active": true, - "timestamp": "2025-01-01T12:00:02.000+00:00", - "language": "en", - "text": "Thank you for testing the Voice Agent API.", - "is_eou": true, - "metadata": { - "start_time": 2.10, - "end_time": 3.80 - } - } - ], - "metadata": { - "start_time": 0.84, - "end_time": 3.80, - "processing_time": 0.25 - } -} -``` +| Profile | Turn detection | Best for | +|---------|---------------|----------| +| `adaptive` | Adapts to speaker pace and hesitation | General conversational agents | +| `agile` | VAD-based silence detection | Speed-first use cases | +| `smart` | `adaptive` + ML acoustic turn prediction | High-stakes conversations | +| `external` | Manual — you trigger turn end | Push-to-talk, custom VAD, LLM-driven | -Each segment's `text` field is clean, punctuated, and ready to use. When a message contains multiple segments, you'll need to concatenate them. The SDK reconstructs the exchange using `speaker_id` and `is_active` — non-active speakers (outside your focus list) are marked as `[background]`: +### `adaptive` -```python -' '.join([f"@{s.speaker_id}{'' if s.is_active else ' [background]'}: {s.text}" for s in segments]) -``` +**Endpoint:** `/v2/agent/adaptive` -Which produces: +Adapts to each speaker's pace over the course of a conversation. It adjusts the turn-end threshold based on speech rate and disfluencies (e.g. hesitations, filler words), waiting longer for speakers who tend to pause mid-thought. -``` -@S1: Hello there. @S2 [background]: It was yesterday. 
@S1: How are you getting on? -``` +**Best for:** General conversational voice agents. -No accumulating partials, no stitching words together, no guessing when the speaker has finished. The turn detection handles all of that, so your agent can respond as fast as possible. +**Trade-off:** Latency varies by speaker. Disfluency detection is English-only — other languages fall back to speech-rate adaptation. ---- +### `agile` -## Profiles +**Endpoint:** `/v2/agent/agile` -Profiles are pre-tuned configurations for voice agents. Each profile sets the right defaults for turn detection, latency, and endpointing — no need to configure the API settings yourself. +Uses voice activity detection (VAD) to detect silence and finalise turns as quickly as possible. The lowest latency profile. -Choose the profile that best fits your use case: +**Best for:** Use cases where response speed is the top priority and occasional mid-speech finalisations are acceptable. -### `agile` +**Trade-off:** Because it relies on silence, it may finalise a turn while the speaker is still mid-sentence — for example, during a natural pause. This can result in additional downstream LLM calls. -**Endpoint:** `/v2/agent/agile` +### `smart` -Lowest end-of-speech to final latency. Uses voice activity detection to finalise turns as quickly as possible. +**Endpoint:** `/v2/agent/smart` -**Best for:** Use cases where response speed is the top priority. +Builds on `adaptive` with an additional ML model that analyses acoustic cues to predict whether a speaker has genuinely finished their turn. The most conservative profile — least likely to interrupt. -**Trade-off:** May produce more finalised segments mid-speaker, which can result in additional downstream LLM calls. +**Best for:** High-stakes conversations where cutting off the user is costly — finance, healthcare, legal. ---- +**Trade-off:** Higher latency than `adaptive`. 
Supported languages: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, Vietnamese. -### `adaptive` +### `external` -**Endpoint:** `/v2/agent/adaptive` +**Endpoint:** `/v2/agent/external` -Adapts to each speaker over the course of a conversation. Waits longer for slow speakers or those who hesitate frequently. Works with all languages. +Turn detection is fully manual. The server accumulates audio and transcript until you send a `ForceEndOfUtterance` message, at which point it finalises everything spoken up to that point and emits an `AddSegment`. -**Best for:** General conversational voice agents. +**Best for:** Push-to-talk interfaces, custom VAD pipelines, or setups where an LLM decides when to respond. -**Trade-off:** Latency is not consistently the fastest. Disfluency/hesitation detection is English-only — other languages use speech-rate adaptation only. +**Trade-off:** You are responsible for all turn detection logic. --- -### `smart` +## Session Flow -**Endpoint:** `/v2/agent/smart` +Every session follows the same structure: connect, start recognition, stream audio, receive turn events, close. -Builds on `adaptive` and additionally analyses vocal tone to improve turn completion. The most conservative profile. +```mermaid +sequenceDiagram + participant C as Client + participant S as Server -**Best for:** High-stakes conversations where interrupting the user is costly (finance, healthcare, legal). + C->>S: Connect to endpoint with profile via WebSocket + C->>S: StartRecognition + S-->>C: RecognitionStarted -**Trade-off:** Higher latency than `adaptive`. Supported languages: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, Vietnamese. 
+    loop Audio Stream
+        C->>S: Audio frames (binary)
+        S-->>C: AudioAdded
----
+        S-->>C: SpeechStarted
+        S-->>C: StartOfTurn
+        S-->>C: AddPartialSegment
-### `external`
+        opt Turn prediction (adaptive, smart profiles)
+            S-->>C: EndOfTurnPrediction
+            S-->>C: SmartTurnPrediction (smart only)
+        end
-**Endpoint:** `/v2/agent/external`
+        S-->>C: AddSegment
+        S-->>C: EndOfTurn
+        S-->>C: SpeechEnded
+
+        opt Speaker activity
+            S-->>C: SpeakerStarted / SpeakerEnded
+            S-->>C: SessionMetrics / SpeakerMetrics
+        end
-You control when a turn ends. Send a `ForceEndOfUtterance` message to trigger finalisation — the server will return a combined segment of everything spoken up to that point.
+        opt Mid-session controls
+            C->>S: ForceEndOfUtterance (external profile only)
+            C->>S: UpdateSpeakerFocus
+            C->>S: GetSpeakers
+            S-->>C: SpeakersResult
+        end
+    end
-**Best for:** Push-to-talk, custom VAD, or LLM-driven turn detection.
+    C->>S: EndOfStream
+    S-->>C: EndOfTranscript
+```
-**Trade-off:** Most complex to implement — you are responsible for turn detection logic.
+For a full reference of all messages, see [Messages Overview](#messages-overview).

 ---

-## Getting started
+## Getting Started
+
+### 1. Connect
+
+Open a WebSocket connection to the preview endpoint, appending your chosen [profile](#profiles) to the path:
+
+```
+wss://preview.rt.speechmatics.com/v2/agent/
+```

-### Authentication
+### 2. Authenticate

 Authenticate every connection using one of the following:

@@ -153,113 +149,224 @@ Authenticate every connection using one of the following:

 See [Authentication](/get-started/authentication) for details including temporary keys.

-### Endpoint
+### 3. Start the session

-The Voice Agent API is available at the preview endpoint.
Choose a [profile](#profiles) based on your use case: +Send [`StartRecognition`](#startrecognition) as your first message: +```json +{ + "message": "StartRecognition", + "transcription_config": { + "language": "en" + } +} ``` -wss://preview.rt.speechmatics.com/v2/agent/ -``` +For all configuration options, see [Configuration](#configuration). -For example, to use the `adaptive` profile: +The server responds with `RecognitionStarted` when the session is ready. You should wait for this message before sending audio. -``` -wss://preview.rt.speechmatics.com/v2/agent/adaptive -``` +### 4. Stream audio and handle responses + +Send audio as binary WebSocket frames. Turn events will arrive in real time as the API processes speech — see [Session Flow](#session-flow) for the full message sequence. + +--- + +## Configuration + +Configuration is passed in [`StartRecognition`](#startrecognition) and is split across two levels of the payload: `audio_format` (top-level) and `transcription_config`. + +**`audio_format`** + +| Field | Notes | +|-------|-------| +| `type` | Must be `raw` | +| `encoding` | Must be `pcm_s16le` (16-bit signed little-endian PCM) | +| `sample_rate` | Must be `8000` or `16000` | + +**`transcription_config`** + +| Field | Default | Notes | +|-------|---------|-------| +| `language` | `en` | All supported languages | +| `output_locale` | — | Output locale (e.g. `en-US`) | +| `additional_vocab` | — | Custom vocabulary entries | +| `punctuation_overrides` | — | Custom punctuation rules | +| `domain` | — | Domain-specific model (e.g. `medical`) | +| `enable_entities` | `false` | Entity detection | +| `enable_partials` | `true` | Emit partial segments during speech | +| `diarization` | `speaker` | Speaker diarization; `none` to disable | +| `volume_threshold` | — | Minimum audio volume to process | + +**`transcription_config.speaker_diarization_config`** + +Note: The following require `diarization: speaker` to be set. 
+| Field | Default | Notes | +|-------|---------|-------| +| `max_speakers` | — | Maximum number of speakers to track | +| `speaker_sensitivity` | — | Sensitivity of speaker separation | +| `prefer_current_speaker` | — | Bias toward the most recently active speaker | +| `known_speakers` | — | Pre-enrolled speaker identifiers for cross-session recognition (see [Speaker ID](#speaker-id)) | + +**Not supported — will be rejected if present** + +| Field | Notes | +|-------|-------| +| `translation_config` | Not supported on this endpoint | +| `audio_events_config` | Not supported on this endpoint | + +--- + +## Messages Overview + +All messages exchanged during a Voice Agent API session. For payload details, see the API Reference sections. + +### Client → Server + +| Message | When to send | +|---------|-------------| +| [`StartRecognition`](#startrecognition) | First message after connecting. Starts the session and passes configuration. | +| Audio frames | Binary WebSocket frames containing raw PCM audio, sent continuously. | +| [`ForceEndOfUtterance`](#forceendofutterance) | `external` profile only. Triggers immediate turn finalisation. | +| [`UpdateSpeakerFocus`](#updatespeakerfocus) | Any time during the session. Changes which speakers are in focus. | +| [`GetSpeakers`](#getspeakers) | Any time during the session. Requests voice identifiers for diarized speakers. | +| [`EndOfStream`](#endofstream) | When there is no more audio to send. | + +### Server → Client -### Session flow +**Core turn events** — the messages your agent logic acts on -1. Open the WebSocket connection. -2. Send `StartRecognition` as the first JSON message. -3. Stream raw PCM audio as binary frames. -4. Send `EndOfStream` when audio is finished. -5. Read server messages until `EndOfTranscript`. -6. Close the connection. 
+| Message | Profile | When it's emitted |
+|---------|---------|------------------|
+| [`StartOfTurn`](#startofturn) | All | A speaker begins a new turn |
+| [`AddPartialSegment`](#addpartialsegment) | All | Interim transcript update; each replaces the previous |
+| [`AddSegment`](#addsegment) | All | Final transcript for the turn — pass this to your LLM |
+| [`EndOfTurn`](#endofturn) | All | Turn complete; your agent can now respond |

-### StartRecognition
+**Turn prediction** — early signals you can use to prepare a response

-Send this as the first message after connecting:
+| Message | Profile | When it's emitted |
+|---------|---------|------------------|
+| [`EndOfTurnPrediction`](#endofturnprediction) | `adaptive`, `smart` | The model predicts the current turn will end soon |
+| [`SmartTurnPrediction`](#smartturnprediction) | `smart` only | High-confidence acoustic prediction of turn completion |
+
+**Speech and speaker activity**
+
+| Message | Profile | When it's emitted |
+|---------|---------|------------------|
+| [`SpeechStarted`](#speechstarted--speechended) | All | Voice activity detected in the audio stream |
+| [`SpeechEnded`](#speechstarted--speechended) | All | Voice activity stopped |
+| [`SpeakerStarted`](#speakerstarted--speakerended) | All | A specific diarized speaker began talking |
+| [`SpeakerEnded`](#speakerstarted--speakerended) | All | A specific diarized speaker stopped talking |
+| [`SpeakersResult`](#speakersresult) | All | Response to `GetSpeakers` |
+
+**Session lifecycle**
+
+| Message | When it's emitted |
+|---------|------------------|
+| `RecognitionStarted` | Session ready; emitted in response to `StartRecognition` |
+| `AudioAdded` | Audio frame acknowledged |
+| `EndOfTranscript` | Session closing; emitted by the server after `EndOfStream` |
+
+**Metrics and diagnostics**
+
+| Message | When it's emitted |
+|---------|------------------|
+| [`SessionMetrics`](#sessionmetrics) | Session stats; emitted every 5 seconds and at 
session end |
+| [`SpeakerMetrics`](#speakermetrics) | Per-speaker word count and volume; emitted on each recognised word |
+
+**Shared messages with the RT API** — see the [RT API Reference](/api-ref) for full payload details.
+
+| Message | When it's emitted |
+|---------|------------------|
+| `EndOfUtterance` | Silence threshold reached; precedes turn finalisation |
+| `Info` | Non-critical informational message |
+| `Warning` | Non-fatal issue (e.g. unsupported config field ignored) |
+| `Error` | Fatal error; connection will close |
+
+---
+
+## API Reference - Client Messages
+
+#### StartRecognition
+
+The first message you send after connecting. Starts the recognition session and passes configuration.
+The server responds with `RecognitionStarted`.

 ```json
 {
   "message": "StartRecognition",
+  "audio_format": {
+    "type": "raw",
+    "encoding": "pcm_s16le",
+    "sample_rate": 16000
+  },
   "transcription_config": {
     "language": "en"
   }
 }
 ```

-### Configuration reference
+For all configuration options, see [Configuration](#configuration).

-**Configurable settings (`transcription_config`)**
+#### EndOfStream

-| Setting | Default | Notes |
-|---------|---------|-------|
-| `language` | `en` | All supported languages |
-| `output_locale` | - | Client can specify an output locale (e.g. `en-US`) |
-| `additional_vocab` | - | Custom vocabulary entries |
-| `punctuation_overrides` | - | Punctuation overrides |
-| `domain` | - | Client can specify a domain (e.g. 
`medical`) | -| `enable_entities` | `false` | Enable entity detection | -| `enable_partials` | `true` | Enable partials in output | -| `diarization` | `speaker` | Supports `none` or `speaker` only | -| `speaker_diarization_config.max_speakers` | - | Limit speaker count | -| `speaker_diarization_config.speaker_sensitivity` | - | Diarization sensitivity | -| `speaker_diarization_config.prefer_current_speaker` | - | Hold on to current speaker | -| `speaker_diarization_config.speakers` | - | Known speakers | -| `volume_threshold` | - | Audio filtering | - -**Not configurable (`transcription_config`)** - -| Setting | Notes | -|---------|-------| -| `operating_point` | Managed per profile | -| `max_delay` | Managed per profile | -| `max_delay_mode` | Managed per profile | -| `streaming_mode` | Always enabled | -| `conversation_config` | Managed by profile / Voice SDK | -| `audio_filtering_config` | Managed by profile | -| `transcript_filtering_config` | Managed by profile | -| `channel_diarization_labels` | Not available | - -**Payload-level settings** - -| Setting | Configurable? | Notes | -|---------|--------------|-------| -| `audio_format` | Yes | Client declares encoding and sample rate | -| `translation_config` | No* | Not supported — rejected if present in the payload | -| `audio_events_config` | No* | Not supported — rejected if present in the payload | -| `message_control` | No | Adjust which messages are forwarded (hidden) | - -### Code examples - -Full working examples in Python and JavaScript are available in the [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer). +Send when you have finished streaming audio. The server finalises any remaining transcript and then emits `EndOfTranscript`. ---- +`last_seq_no` is the sequence number of the last audio frame you sent. 
+```json +{ + "message": "EndOfStream", + "last_seq_no": 1234 +} +``` -## Server messages +#### ForceEndOfUtterance -### Standard messages +Only applies to the `external` profile. Immediately ends the current turn — the server finalises all audio received so far and emits a single `AddSegment` containing the complete transcript for that turn, followed by `EndOfTurn`. -Standard RT messages are emitted alongside Voice Agent API messages. See the [API reference](/api-ref) for full details. +Use this wherever your application decides a turn is complete: on button release (push-to-talk), on VAD silence, or on an LLM signal. -- `RecognitionStarted` -- `AddPartialTranscript` -- `AddTranscript` -- `EndOfUtterance` -- `EndOfTranscript` -- `Info` -- `Warning` -- `Error` +```json +{ + "message": "ForceEndOfUtterance" +} +``` -### Voice Agent API messages +#### UpdateSpeakerFocus -These messages are only emitted when using a voice profile (`/v2/agent/`). +Updates which speakers are in focus, mid-session. Takes effect immediately. See [Speaker Focus](#speaker-focus) for full details. -#### `StartOfTurn` +```json +{ + "message": "UpdateSpeakerFocus", + "speaker_focus": { + "focus_speakers": ["S1"], + "ignore_speakers": [], + "focus_mode": "retain" + } +} +``` -Emitted when a speaker begins a new turn. +#### GetSpeakers + +Requests voice identifiers for all speakers diarized so far in the session. The server responds with a `SpeakersResult` message. See [Speaker ID](#speaker-id) for full details. + +```json +{ + "message": "GetSpeakers" +} +``` + +--- + +## API Reference - Server Messages + +This section covers Voice Agent API-specific messages only. For shared messages (`RecognitionStarted`, `AudioAdded`, `AddPartialTranscript`, `AddTranscript`, `EndOfUtterance`, `EndOfTranscript`, `Info`, `Warning`, `Error`), see the [RT API reference](/api-ref). + +#### StartOfTurn + +Emitted when a speaker begins a new turn. 
Use this to signal to your agent that it should stop speaking if it currently is. ```json { @@ -268,9 +375,12 @@ Emitted when a speaker begins a new turn. } ``` -#### `EndOfTurn` +**Fields:** +- `turn_id` — monotonically increasing integer; pairs with the corresponding `EndOfTurn` + +#### EndOfTurn -Emitted when a turn is complete. +Emitted when turn detection decides the speaker has finished. This is the trigger for your agent to respond. The finalised transcript for the turn is in the preceding `AddSegment`. ```json { @@ -283,9 +393,13 @@ Emitted when a turn is complete. } ``` -#### `AddPartialSegment` +**Fields:** +- `turn_id` — matches the `StartOfTurn` for this turn +- `metadata.start_time` / `metadata.end_time` — audio time range for the turn, in seconds from session start -Interim transcript updates emitted as the speaker talks. Each new partial replaces the previous one. +#### AddPartialSegment + +Interim transcript update emitted continuously while the speaker is talking. Each new `AddPartialSegment` replaces the previous one — do not concatenate them. ```json { @@ -312,9 +426,11 @@ Interim transcript updates emitted as the speaker talks. Each new partial replac } ``` -#### `AddSegment` +#### AddSegment + +The final, complete transcript for a turn. Emitted just before `EndOfTurn`. This is the stable output to pass to your LLM — do not use `AddPartialSegment` for this. -The final, complete transcript for a turn. Emitted at `EndOfTurn`. This is the stable output to send to your LLM. +In multi-speaker scenarios, a single `AddSegment` may contain segments from multiple speakers, returned in time order. ```json { @@ -341,22 +457,25 @@ The final, complete transcript for a turn. Emitted at `EndOfTurn`. This is the s } ``` -**Key fields:** -- `speaker_id` — speaker label (e.g. 
`S1`, `S2`) -- `is_active` — whether this speaker is in your focus list (see [Speaker focus](#speaker-focus)) -- `is_eou` — `true` on final segments -- `start_time` / `end_time` — time in seconds relative to session start -- `processing_time` (message-level `metadata`) — transcription latency in seconds +**Segment fields:** +- `speaker_id` — speaker label (e.g. `S1`, `S2`, or a custom label if using [Speaker ID](#speaker-id)) +- `is_active` — `true` if this speaker is in your current focus list; `false` if they are a background speaker (see [Speaker Focus](#speaker-focus)) +- `is_eou` — `true` on final segments, `false` on partials +- `text` — clean, punctuated transcript text +- `metadata.start_time` / `metadata.end_time` — time range of this segment in seconds from session start + +**Message-level fields:** +- `metadata.processing_time` — transcription latency in seconds for this message -#### `SpeakerStarted` / `SpeakerEnded` +#### SpeakerStarted / SpeakerEnded -Emitted when a specific speaker starts or stops speaking. Useful for multi-party conversations. +Emitted when a specific speaker starts or stops being heard. These are voice activity events — they fire based on detected speech, independently of turn boundaries. ```json { "message": "SpeakerStarted", - "is_active": true, "speaker_id": "S1", + "is_active": true, "time": 0.84, "metadata": { "start_time": 0.84, "end_time": 0.84 } } @@ -365,21 +484,23 @@ Emitted when a specific speaker starts or stops speaking. 
Useful for multi-party ```json { "message": "SpeakerEnded", - "is_active": false, "speaker_id": "S1", + "is_active": true, "time": 3.24, "metadata": { "start_time": 0.84, "end_time": 3.24 } } ``` -**Key fields:** -- `time` — seconds of audio from session start when the speaker activity occurred -- `metadata.start_time` — when that speaker started their current speaking interval -- `metadata.end_time` (`SpeakerEnded` only) — when that speaker stopped speaking +**Fields:** +- `speaker_id` — the speaker whose activity changed +- `is_active` — whether this speaker is in your current focus list +- `time` — seconds from session start when the activity was detected +- `metadata.start_time` — when this speaker started their current speaking interval +- `metadata.end_time` — when this speaker stopped speaking (`SpeakerEnded` only) -#### `SessionMetrics` / `SpeakerMetrics` +#### SessionMetrics -`SessionMetrics` is emitted every 5 seconds and at the end of the session. `SpeakerMetrics` is emitted each time a speaker speaks a word. +Emitted every 5 seconds and once at the end of the session. ```json { @@ -391,6 +512,10 @@ Emitted when a specific speaker starts or stops speaking. Useful for multi-party } ``` +#### SpeakerMetrics + +Emitted each time a speaker produces a recognised word. + ```json { "message": "SpeakerMetrics", @@ -405,49 +530,82 @@ Emitted when a specific speaker starts or stops speaking. Useful for multi-party } ``` ---- - -## Speaker focus - -You can update speaker focus mid-session using `UpdateSpeakerFocus`. This is a Voice Agent API feature — sending it in standard RT mode has no effect. +#### SpeakersResult -Diarization is enabled by default when using the Voice Agent API. Speaker IDs (`S1`, `S2`, etc.) are assigned automatically and persist across the session. +Emitted in response to `GetSpeakers`. Contains voice identifiers for all diarized speakers so far. See [Speaker ID](#speaker-id) for how to store and use these. 
```json { - "message": "UpdateSpeakerFocus", - "speaker_focus": { - "focus_speakers": ["S1"], - "ignore_speakers": [], - "focus_mode": "retain" - } + "message": "SpeakersResult", + "speakers": [ + { "label": "S1", "speaker_identifiers": [""] }, + { "label": "S2", "speaker_identifiers": [""] } + ] } ``` -**`focus_mode` options:** +#### EndOfTurnPrediction + +Emitted by `adaptive` and `smart` profiles when the model predicts the current turn is about to end. Can be used to begin preparing a response before `EndOfTurn` arrives, reducing perceived latency. -- `retain` — non-focused speakers remain in output as passive speakers (`is_active: false`) -- `ignore` — non-focused speakers are excluded from output entirely +:::note +todo - payload details. +::: -The new config replaces the existing config immediately. +#### SmartTurnPrediction + +Emitted by the `smart` profile only. A higher-confidence acoustic prediction of turn completion, based on the ML model that analyses vocal cues. + +:::note +todo - payload details. +::: + +#### SpeechStarted / SpeechEnded + +Voice activity detection events. Emitted when speech is first detected in the audio stream (`SpeechStarted`) or stops (`SpeechEnded`). These fire independently of speaker identity and turn boundaries. + +:::note +todo - payload details. +::: --- -## Speaker ID +## Features -Speaker identifiers let you recognise known speakers across sessions. Once you have identifiers for a speaker, you can pass them into future sessions so the system tags them with a consistent label rather than a generic `S1`, `S2`. +### Speaker Focus -### Getting identifiers — `GetSpeakers` +Speaker focus lets you control which speakers' output your agent acts on. By default, all detected speakers are active and their transcripts are included in `AddSegment` output. -Send `GetSpeakers` during a session to request identifiers for all diarized speakers so far: +Speaker IDs (`S1`, `S2`, etc.) 
are assigned automatically when diarization is enabled, and persist for the lifetime of the session. Send `UpdateSpeakerFocus` at any point during the session to change who is in focus — the new config takes effect immediately and replaces the previous one. ```json { - "message": "GetSpeakers" + "message": "UpdateSpeakerFocus", + "speaker_focus": { + "focus_speakers": ["S1"], + "ignore_speakers": ["S3"], + "focus_mode": "retain" + } } ``` -The server responds with a `SpeakersResult` message: +**Fields:** + +- `focus_speakers` — speaker IDs to treat as active. Their segments appear with `is_active: true`. +- `ignore_speakers` — speaker IDs to exclude entirely. Their speech is dropped and does not affect turn detection. +- `focus_mode` — what happens to speakers who are neither in `focus_speakers` nor `ignore_speakers`: + - `retain` — they remain in the output as passive speakers (`is_active: false`) + - `ignore` — they are excluded from the output entirely + +### Speaker ID + +Speaker ID lets you recognise the same person across separate sessions. At the end of a session, you can retrieve voice identifiers for each speaker and store them. In future sessions, pass those identifiers into `StartRecognition` and the system will tag matching speakers with a consistent label rather than a generic `S1`, `S2`. + +#### Getting identifiers + +Send `GetSpeakers` at any point during a session to retrieve identifiers for all diarized speakers so far. The server responds with a `SpeakersResult` message. + +`SpeakersResult` response: ```json { @@ -459,11 +617,11 @@ The server responds with a `SpeakersResult` message: } ``` -Store the `speaker_identifiers` values — these are opaque tokens that represent the speaker's voice profile. +Store the `speaker_identifiers` values. These are opaque tokens tied to a speaker's voice profile — treat them as credentials and store them securely. 
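To make the retrieval step concrete, here is a minimal Python sketch of turning a `SpeakersResult` payload into `known_speakers` entries for a later session. It is illustrative, not SDK code: the helper names and the `token-a` identifier are invented for this example, and the placement of `known_speakers` under `speaker_diarization_config` follows the Configuration table above.

```python
import json


def extract_known_speakers(speakers_result: dict, labels: dict) -> list:
    """Map a SpeakersResult payload to known_speakers entries,
    renaming generic labels (S1, S2, ...) to your own labels."""
    known = []
    for speaker in speakers_result.get("speakers", []):
        label = labels.get(speaker["label"], speaker["label"])
        known.append({
            "label": label,
            "speaker_identifiers": speaker["speaker_identifiers"],
        })
    return known


def build_start_recognition(known_speakers: list) -> str:
    """Build the StartRecognition message for a future session."""
    return json.dumps({
        "message": "StartRecognition",
        "transcription_config": {
            "language": "en",
            "diarization": "speaker",
            "speaker_diarization_config": {"known_speakers": known_speakers},
        },
    })


# A SpeakersResult received in a previous session (identifier is illustrative)
result = {
    "message": "SpeakersResult",
    "speakers": [{"label": "S1", "speaker_identifiers": ["token-a"]}],
}

known = extract_known_speakers(result, {"S1": "Alice"})
print(build_start_recognition(known))
```

Persisting the identifiers between the two sessions (securely, as noted above) is left to your application.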
-### Using identifiers in future sessions
+#### Using identifiers in future sessions

-Pass stored identifiers into `StartRecognition` via `known_speakers`. You can assign any label you like:
+Pass stored identifiers into `StartRecognition` via `transcription_config.speaker_diarization_config.known_speakers`. You can assign any label:

 ```json
 {
@@ -478,18 +636,25 @@ Pass stored identifiers into `StartRecognition` via `known_speakers`. You can as
 }
 ```

-When those speakers are detected, segments will be tagged with `"Alice"` or `"Bob"` instead of generic labels. Any unrecognised speakers are still assigned generic labels (`S1`, `S2`, etc.).
+When those speakers are detected, their segments will carry `"Alice"` or `"Bob"` as the `speaker_id` instead of generic labels. Any unrecognised speakers are still assigned generic labels (`S1`, `S2`, etc.).
+
+---
+
+## Code Examples
+
+For working code examples in Python and JavaScript, see the [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer).

 ---

 ## Feedback

-This is a preview and your feedback shapes what goes to GA. We'd love to hear from you — whether that's something that didn't work as expected, a profile that behaved differently than you anticipated, or a feature you'd want before we ship this more broadly.
+This is a preview and your feedback shapes what goes to GA (General Availability).
+We'd love to hear from you: tell us what works well, which features you use, whether something didn't work as expected, whether a profile behaved differently than you anticipated, or what you'd want before we ship this more broadly.

 Specific areas of interest:

-- integration experience (documentation, SDKs, API messages/metadata)
-- Accuracy/Latency (including data capture if it's relevant (e.g. phone numbers, spell outs of names/account numbers)
+- Integration experience (documentation, SDKs, API messages/metadata)
+- Accuracy and latency (including data capture where relevant, e.g. 
phone numbers, spell-outs of names/account numbers)
- Turn detection and experience with different profiles
- Any missing capabilities which would make your product better
- What would stop you using this in production