Description
Context
We're using the RT transcription API with speaker recognition (known speakers via speaker_diarization_config.speakers). We submit known speakers with their stored speaker_identifiers in StartRecognition, and read back speaker_identifiers from the SpeakersResult message at end of session.
This is working well for identification, but we've hit a question about managing speaker_identifiers over time.
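For reference, this is roughly the shape of the StartRecognition message we send. A minimal sketch only: the field names (speaker_diarization_config, speakers, speaker_identifiers, get_speakers) are the ones from our integration, but the exact schema, the placement of get_speakers, and the "stored_id_*" values are placeholders, not authoritative:

```python
import json

# Sketch of our StartRecognition payload. The "stored_id_*" strings stand in
# for the opaque base64 identifier values we persist between sessions.
start_recognition = {
    "message": "StartRecognition",
    "audio_format": {"type": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
    "transcription_config": {
        "language": "en",
        "diarization": "speaker",
        "speaker_diarization_config": {
            "speakers": [
                {"label": "Speaker A",
                 "speaker_identifiers": ["stored_id_a1", "stored_id_a2", "stored_id_a3"]},
                {"label": "Speaker B",
                 "speaker_identifiers": ["stored_id_b1"]},
            ],
            # Ask for a SpeakersResult message at end of session (placement of
            # this flag inside speaker_diarization_config is our assumption).
            "get_speakers": True,
        },
    },
}

payload = json.dumps(start_recognition)
```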
Observed Behaviour
When we submit known speakers with their stored speaker_identifiers, the SpeakersResult always returns:
- The same number of speaker_identifiers per speaker as we submitted
- Different byte values: the first ~97 bytes (which appear to be a format/header prefix) are identical, but the remaining voice-data bytes differ every session
- This happens even for speakers who did not speak during the session
Example: we submit 3 identifiers for Speaker A and 1 for Speaker B. We always get back exactly 3 for Speaker A and 1 for Speaker B, all with modified values.
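This is how we measured the ~97-byte shared prefix. A self-contained sketch with synthetic data: the identifiers we actually receive are opaque base64 strings, so here we fabricate two that share a 97-byte header and diverge immediately after, just to show the comparison:

```python
import base64

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the shared leading byte prefix of two decoded identifiers."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Synthetic stand-ins for a submitted and a returned identifier: a shared
# 97-byte header followed by differing "voice data" bytes.
header = bytes(97)
submitted = base64.b64encode(header + b"\x10abc")
returned = base64.b64encode(header + b"\x20xyz")

prefix = common_prefix_len(base64.b64decode(submitted), base64.b64decode(returned))
# With real session data we consistently see a prefix of ~97 bytes.
```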
The Problem
Because the returned identifiers are always re-encoded, we can't distinguish between:
- Unchanged identifiers — voice data that was submitted and passed through (no new audio for this speaker)
- Updated identifiers — voice data that was refined with new audio from the session
- New identifiers — a genuinely new voice embedding captured from session audio
This makes it impossible to maintain a reliable speaker identifier set over time. Specifically:
- We can't tell if a returned identifier is "better" than what we sent (should we replace?)
- We can't detect when a new identifier has been captured vs an existing one re-encoded
- If a speaker is misidentified and we correct it, we can't safely move identifiers to the correct speaker because we don't know which are real vs re-encoded copies of the wrong speaker's data
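To make the problem concrete: if pass-through identifiers came back byte-identical, simple set arithmetic would tell us which ones were new or updated. A hypothetical sketch of what we would like to be able to write (not current API behaviour):

```python
def classify_returned(sent: set[str], returned: set[str]) -> tuple[set[str], set[str]]:
    """What we could do if unchanged identifiers were returned verbatim:
    the intersection would be pass-throughs, the difference would be new or
    refined embeddings. Because every identifier is re-encoded per session,
    the intersection is always empty and this tells us nothing."""
    unchanged = sent & returned
    new_or_updated = returned - sent
    return unchanged, new_or_updated
```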
Use Case
We store multiple speaker_identifiers per speaker to improve recognition across different contexts (different microphones, in-person vs remote, etc.). When a user corrects a misidentification, we need to know which identifiers to move to the correct speaker profile.
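For context on the correction flow, this is the local bookkeeping we want to support. Everything here (SpeakerProfile, move_identifiers) is our own client-side code, not part of the API; it is exactly the "which identifiers do we move?" step that the per-session re-encoding blocks:

```python
from dataclasses import dataclass, field

@dataclass
class SpeakerProfile:
    """Our local store of identifiers per speaker (names are ours, not the API's)."""
    label: str
    identifiers: list[str] = field(default_factory=list)

def move_identifiers(src: SpeakerProfile, dst: SpeakerProfile, ids: list[str]) -> None:
    """Reassign specific identifiers after a user corrects a misidentification.
    This only works if we can tell which stored identifiers were actually
    derived from the misattributed audio, which the re-encoding obscures."""
    for i in ids:
        src.identifiers.remove(i)
        dst.identifiers.append(i)
```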
Questions
- Why are returned identifiers re-encoded? Is there a session-specific salt/nonce in the encoding, or are they genuinely refined each time?
- Is there a way to get identifiers returned unchanged so we can track which ones were updated vs passed through?
- When a known speaker speaks during a session, does the returned set include any new identifier derived from the session audio, or is it always the same count as submitted?
- What is the recommended strategy for maintaining a speaker's identifier set over time — should we replace stored identifiers with returned ones, or keep the originals?
Environment
- API: Real-time transcription WebSocket (calling the RT API directly, not via the Python SDK)
- Feature: speaker_diarization_config with speakers array and get_speakers: true
Any guidance would be really appreciated — this is blocking our speaker profile management feature.