Description
Context
We're using the RT transcription API with speaker recognition (known speakers via speaker_diarization_config.speakers). We submit known speakers with their stored speaker_identifiers in StartRecognition, and read back speaker_identifiers from the SpeakersResult message at end of session.
This is working well for identification, but we've hit a question about managing speaker_identifiers over time.
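For reference, this is roughly the shape of the StartRecognition message we send. A minimal sketch only: the field names (speaker_diarization_config, speakers, speaker_identifiers, get_speakers) are the ones from our integration, but the exact schema, the placement of get_speakers, and the "stored_id_*" values are placeholders, not authoritative:

```python
import json

# Sketch of our StartRecognition payload. The "stored_id_*" strings stand in
# for the opaque base64 identifier values we persist between sessions.
start_recognition = {
    "message": "StartRecognition",
    "audio_format": {"type": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
    "transcription_config": {
        "language": "en",
        "diarization": "speaker",
        "speaker_diarization_config": {
            "speakers": [
                {"label": "Speaker A",
                 "speaker_identifiers": ["stored_id_a1", "stored_id_a2", "stored_id_a3"]},
                {"label": "Speaker B",
                 "speaker_identifiers": ["stored_id_b1"]},
            ],
            # Ask for a SpeakersResult message at end of session (placement of
            # this flag inside speaker_diarization_config is our assumption).
            "get_speakers": True,
        },
    },
}

payload = json.dumps(start_recognition)
```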
Observed Behaviour
When we submit known speakers with their stored speaker_identifiers, the SpeakersResult always returns:
- The same number of speaker_identifiers per speaker as we submitted
- Different byte values: the first ~97 bytes (which appear to be a format/header prefix) are identical, but the remaining voice-data bytes differ every session
- This happens even for speakers who did not speak during the session
Example: we submit 3 identifiers for Speaker A and 1 for Speaker B. We always get back exactly 3 for Speaker A and 1 for Speaker B, all with modified values.
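This is how we measured the ~97-byte shared prefix. A self-contained sketch with synthetic data: the identifiers we actually receive are opaque base64 strings, so here we fabricate two that share a 97-byte header and diverge immediately after, just to show the comparison:

```python
import base64

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the shared leading byte prefix of two decoded identifiers."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Synthetic stand-ins for a submitted and a returned identifier: a shared
# 97-byte header followed by differing "voice data" bytes.
header = bytes(97)
submitted = base64.b64encode(header + b"\x10abc")
returned = base64.b64encode(header + b"\x20xyz")

prefix = common_prefix_len(base64.b64decode(submitted), base64.b64decode(returned))
# With real session data we consistently see a prefix of ~97 bytes.
```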
The Problem
Because the returned identifiers are always re-encoded, we can't distinguish between:
- Unchanged identifiers — voice data that was submitted and passed through (no new audio for this speaker)
- Updated identifiers — voice data that was refined with new audio from the session
- New identifiers — a genuinely new voice embedding captured from session audio
This makes it impossible to maintain a reliable speaker identifier set over time. Specifically:
- We can't tell if a returned identifier is "better" than what we sent (should we replace?)
- We can't detect when a new identifier has been captured vs an existing one re-encoded
- If a speaker is misidentified and we correct it, we can't safely move identifiers to the correct speaker because we don't know which are real vs re-encoded copies of the wrong speaker's data
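To make the problem concrete: if pass-through identifiers came back byte-identical, simple set arithmetic would tell us which ones were new or updated. A hypothetical sketch of what we would like to be able to write (not current API behaviour):

```python
def classify_returned(sent: set[str], returned: set[str]) -> tuple[set[str], set[str]]:
    """What we could do if unchanged identifiers were returned verbatim:
    the intersection would be pass-throughs, the difference would be new or
    refined embeddings. Because every identifier is re-encoded per session,
    the intersection is always empty and this tells us nothing."""
    unchanged = sent & returned
    new_or_updated = returned - sent
    return unchanged, new_or_updated
```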
Use Case
We store multiple speaker_identifiers per speaker to improve recognition across different contexts (different microphones, in-person vs remote, etc.). When a user corrects a misidentification, we need to know which identifiers to move to the correct speaker profile.
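For context on the correction flow, this is the local bookkeeping we want to support. Everything here (SpeakerProfile, move_identifiers) is our own client-side code, not part of the API; it is exactly the "which identifiers do we move?" step that the per-session re-encoding blocks:

```python
from dataclasses import dataclass, field

@dataclass
class SpeakerProfile:
    """Our local store of identifiers per speaker (names are ours, not the API's)."""
    label: str
    identifiers: list[str] = field(default_factory=list)

def move_identifiers(src: SpeakerProfile, dst: SpeakerProfile, ids: list[str]) -> None:
    """Reassign specific identifiers after a user corrects a misidentification.
    This only works if we can tell which stored identifiers were actually
    derived from the misattributed audio, which the re-encoding obscures."""
    for i in ids:
        src.identifiers.remove(i)
        dst.identifiers.append(i)
```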
Questions
- Why are returned identifiers re-encoded? Is there a session-specific salt/nonce in the encoding, or are they genuinely refined each time?
- Is there a way to get identifiers returned unchanged so we can track which ones were updated vs passed through?
- When a known speaker speaks during a session, does the returned set include any new identifier derived from the session audio, or is it always the same count as submitted?
- What is the recommended strategy for maintaining a speaker's identifier set over time — should we replace stored identifiers with returned ones, or keep the originals?
Environment
- API: Real-time transcription WebSocket (calling the RT API directly, not via the Python SDK)
- Feature: speaker_diarization_config with speakers array and get_speakers: true
Any guidance would be really appreciated — this is blocking our speaker profile management feature.