
[QUESTION] SpeakersResult returns re-encoded speaker_identifiers — how to track changes? #98

@andycop

Description


Context

We're using the RT transcription API with speaker recognition (known speakers via speaker_diarization_config.speakers). We submit known speakers with their stored speaker_identifiers in StartRecognition, and read back speaker_identifiers from the SpeakersResult message at end of session.

This is working well for identification, but we've hit a question about managing speaker_identifiers over time.
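For reference, the speaker-recognition part of our StartRecognition message looks roughly like this. This is a sketch from memory, not a spec-accurate payload: `speaker_diarization_config`, `speakers`, `speaker_identifiers`, and `get_speakers` are the fields we actually use, but the surrounding field names (`label`, the exact nesting of `get_speakers`) and all values are illustrative.

```python
import json

# Sketch of our StartRecognition message (field names other than
# speaker_diarization_config / speakers / speaker_identifiers /
# get_speakers are assumptions; identifier values are placeholders
# for the opaque strings returned by a previous SpeakersResult).
start_recognition = {
    "message": "StartRecognition",
    "transcription_config": {
        "language": "en",
        "diarization": "speaker",
        "speaker_diarization_config": {
            "speakers": [
                {
                    "label": "Speaker A",
                    "speaker_identifiers": ["<id-a-1>", "<id-a-2>", "<id-a-3>"],
                },
                {
                    "label": "Speaker B",
                    "speaker_identifiers": ["<id-b-1>"],
                },
            ],
            "get_speakers": True,
        },
    },
}

print(json.dumps(start_recognition, indent=2))
```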

Observed Behaviour

When we submit known speakers with their stored speaker_identifiers, the SpeakersResult always returns:

  • The same number of speaker_identifiers per speaker as we submitted
  • Different byte values — the first ~97 bytes (which appear to be a format/header prefix) are identical, but the remaining voice-data bytes differ every session
  • This happens even for speakers who did not speak during the session

Example: we submit 3 identifiers for Speaker A and 1 for Speaker B. We always get back exactly 3 for Speaker A and 1 for Speaker B, all with modified values.
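This is how we measured the shared prefix. A minimal sketch of the diagnostic, assuming the identifiers are base64-encoded strings (that matched what we saw, but it is an assumption about the encoding; the demo data below is synthetic):

```python
import base64

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the leading byte run shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def compare_identifiers(submitted: str, returned: str) -> int:
    """Decode two identifier strings (assumed base64) and report
    how many leading bytes they have in common."""
    return common_prefix_len(base64.b64decode(submitted),
                             base64.b64decode(returned))

# Synthetic demo: two blobs sharing a 97-byte "header" prefix,
# mimicking what we observe between submitted and returned values.
header = bytes(range(97))
sent = base64.b64encode(header + b"original voice data").decode()
got = base64.b64encode(header + b"re-encoded voicedata").decode()
print(compare_identifiers(sent, got))  # 97 in this synthetic example
```

Running this over real submitted/returned pairs is what gave us the "~97 identical bytes, then divergence" observation above.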

The Problem

Because the returned identifiers are always re-encoded, we can't distinguish between:

  1. Unchanged identifiers — voice data that was submitted and passed through (no new audio for this speaker)
  2. Updated identifiers — voice data that was refined with new audio from the session
  3. New identifiers — a genuinely new voice embedding captured from session audio

This makes it impossible to maintain a reliable speaker identifier set over time. Specifically:

  • We can't tell if a returned identifier is "better" than what we sent (should we replace?)
  • We can't detect when a new identifier has been captured vs an existing one re-encoded
  • If a speaker is misidentified and we correct it, we can't safely move identifiers to the correct speaker, because we don't know which ones are genuine captures and which are re-encoded copies of the wrong speaker's data

Use Case

We store multiple speaker_identifiers per speaker to improve recognition across different contexts (different microphones, in-person vs remote, etc.). When a user corrects a misidentification, we need to know which identifiers to move to the correct speaker profile.
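Concretely, this is the bookkeeping we are trying to maintain (a sketch with hypothetical names; nothing here is from the API). The `origin_session` field is exactly what we cannot populate reliably today, because we can't tell a genuinely new identifier from a re-encoded copy of one we submitted:

```python
from dataclasses import dataclass, field

@dataclass
class StoredIdentifier:
    value: str           # opaque speaker_identifier string from the API
    origin_session: str  # session in which the voice data was captured
    context: str         # e.g. "headset", "conference-room", "remote"

@dataclass
class SpeakerProfile:
    name: str
    identifiers: list[StoredIdentifier] = field(default_factory=list)

def reassign(identifier: StoredIdentifier,
             wrong: SpeakerProfile, right: SpeakerProfile) -> None:
    """Move a misattributed identifier to the correct speaker profile.
    Only safe if `identifier` is a genuine capture of the right
    speaker's voice -- which we currently cannot determine."""
    wrong.identifiers.remove(identifier)
    right.identifiers.append(identifier)

# Hypothetical correction flow after a user flags a misidentification.
alice = SpeakerProfile("Alice", [StoredIdentifier("<id-1>", "s-42", "headset")])
bob = SpeakerProfile("Bob")
reassign(alice.identifiers[0], alice, bob)
print(len(alice.identifiers), len(bob.identifiers))  # 0 1
```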

Questions

  1. Why are returned identifiers re-encoded? Is there a session-specific salt/nonce in the encoding, or are they genuinely refined each time?
  2. Is there a way to get identifiers returned unchanged so we can track which ones were updated vs passed through?
  3. When a known speaker speaks during a session, does the returned set include any new identifier derived from the session audio, or is it always the same count as submitted?
  4. What is the recommended strategy for maintaining a speaker's identifier set over time — should we replace stored identifiers with returned ones, or keep the originals?

Environment

  • API: Real-time transcription WebSocket (we call the RT API directly rather than via the Python SDK)
  • Feature: speaker_diarization_config with speakers array and get_speakers: true

Any guidance would be really appreciated — this is blocking our speaker profile management feature.
