(speechmatics + inference): add VAD #5750

Merged: tinalenguyen merged 6 commits into main from tina/speechmatics-vad on May 18, 2026

Conversation

Member

@tinalenguyen tinalenguyen commented May 16, 2026

When set to EXTERNAL mode, the Speechmatics STT needs finalize() to be called to flush the partial transcripts and mark the end of speech. We accept a VAD, similar to Mistral's plugin, or initialize one so it works right out of the box.

We mirror the plugin in the inference code; we also accept/initialize a VAD. Inference is already set to handle the VAD event.
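
For readers unfamiliar with the pattern, here is a minimal sketch of VAD-driven finalization: an end-of-speech event triggers finalize() on the STT stream. Every name in it (`VADEventType`, `VADEvent`, `FakeSTTStream`, `forward_vad_events`) is a hypothetical stand-in written for this illustration, not the plugin's or LiveKit's actual classes.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class VADEventType(Enum):
    # Hypothetical stand-in for a VAD event enum.
    START_OF_SPEECH = auto()
    END_OF_SPEECH = auto()


@dataclass
class VADEvent:
    type: VADEventType


class FakeSTTStream:
    """Hypothetical stream that only counts how often finalize() is called."""

    def __init__(self) -> None:
        self.finalize_calls = 0

    async def finalize(self) -> None:
        # A real stream would ask the server to flush partial transcripts here.
        self.finalize_calls += 1


async def forward_vad_events(
    events: "asyncio.Queue[Optional[VADEvent]]", stream: FakeSTTStream
) -> None:
    """Call stream.finalize() each time the VAD reports end of speech."""
    while True:
        event = await events.get()
        if event is None:  # sentinel: no more events
            return
        if event.type is VADEventType.END_OF_SPEECH:
            await stream.finalize()


async def demo() -> int:
    events: "asyncio.Queue[Optional[VADEvent]]" = asyncio.Queue()
    stream = FakeSTTStream()
    for ev in (
        VADEvent(VADEventType.START_OF_SPEECH),
        VADEvent(VADEventType.END_OF_SPEECH),
        None,
    ):
        events.put_nowait(ev)
    await forward_vad_events(events, stream)
    return stream.finalize_calls
```

Running `asyncio.run(demo())` feeds one start and one end event through the loop, so finalize() is called exactly once.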

@chenghao-mou chenghao-mou requested a review from a team May 16, 2026 05:30

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

if is_given(parsed_language) and not is_given(language):
    language = parsed_language

is_speechmatics, vad = _resolve_vad_for_model(model, vad if is_given(vad) else None)
Contributor

🟡 _resolve_vad_for_model conflates vad=None (opt-out) with vad=NOT_GIVEN (auto-load), making it impossible to opt out of VAD for Speechmatics models

When a user explicitly passes vad=None to the inference STT(model="speechmatics/enhanced", vad=None), the intent (mirroring the direct Speechmatics plugin's documented contract at livekit-plugins/livekit-plugins-speechmatics/livekit/plugins/speechmatics/stt.py:258-260) is to opt out of auto-loaded VAD. However, at line 474, is_given(None) returns True (since None is not a NotGiven instance), so vad_instance is passed as None to _resolve_vad_for_model. Inside that function (line 220), the condition is_speechmatics and vad_instance is None triggers Silero auto-loading regardless, making opt-out impossible. The Speechmatics plugin correctly distinguishes these cases using not is_given(vad) at stt.py:277.

Prompt for agents
The _resolve_vad_for_model function needs to distinguish between vad=NOT_GIVEN (auto-load Silero for Speechmatics) and vad=None (explicit opt-out). Currently the calling code at line 474 converts both to None before calling the function. One approach: pass a sentinel or boolean flag to _resolve_vad_for_model indicating whether the user explicitly provided a vad value. For example, add an auto_load: bool parameter that is True only when vad was NOT_GIVEN. In _resolve_vad_for_model, only auto-load Silero when is_speechmatics and auto_load is True. When auto_load is False and vad_instance is None, skip the auto-load. This mirrors the direct Speechmatics plugin's logic at speechmatics/stt.py:277 which uses not is_given(vad) to decide auto-loading.
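
A minimal sketch of the distinction the review describes, assuming a sentinel type separate from None. `NotGiven`, `NOT_GIVEN`, `is_given`, and `resolve_vad` are hypothetical stand-ins written for this illustration, and plain strings stand in for VAD instances. It shows the reviewer's suggested behavior only, not what the PR does.

```python
from typing import Optional, Union


class NotGiven:
    """Hypothetical sentinel type: the caller did not pass this argument."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = NotGiven()


def is_given(value: object) -> bool:
    # None counts as "given"; only the sentinel means "omitted".
    return not isinstance(value, NotGiven)


def resolve_vad(
    model: str, vad: Union[str, None, NotGiven] = NOT_GIVEN
) -> Optional[str]:
    """Return the VAD to use; strings stand in for VAD instances."""
    if not model.startswith("speechmatics/"):
        return None  # other models: any VAD is ignored
    if not is_given(vad):
        return "auto-loaded-vad"  # omitted argument: auto-load a default
    return vad  # explicit instance, or explicit None to opt out
```

With this shape, `resolve_vad("speechmatics/enhanced")` auto-loads, while `resolve_vad("speechmatics/enhanced", None)` returns None because the check happens before the sentinel is collapsed into None.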

Member Author

@tinalenguyen tinalenguyen May 16, 2026

as of right now, speechmatics inference STT models need a VAD to run here. it is possible they will support server-side endpointing in the future. this differs from the speechmatics stt plugin approach, which already exposes finalize(), letting users flush end-of-speech on their own

    and not is_given(min_endpointing_delay)
    and not _user_provided_turn_handling
):
    endpointing["min_delay"] = 0.0
Contributor

min_endpointing_delay is deprecated; should we check whether the user specified min_delay in turn_handling instead? also, should we move this to a separate PR, since it's not related to the speechmatics plugin?

Member Author

makes sense, i also thought that the stt capability would play well with indicating STTs that need VAD. i removed those changes here for now though

@tinalenguyen tinalenguyen changed the title from "(speechmatics): add VAD and server_endpointing capability" to "(speechmatics + inference): add VAD" on May 17, 2026
Comment on lines +207 to +231
def _resolve_vad_for_model(
    model: NotGivenOr[STTModels | str],
    vad_instance: vad.VAD | None,
) -> vad.VAD | None:
    is_speechmatics = (
        is_given(model) and isinstance(model, str) and model.startswith("speechmatics/")
    )
    if vad_instance is not None and not is_speechmatics:
        logger.warning(
            "`vad` will be ignored: model %r handles endpointing server-side.",
            model,
        )
        return None
    if is_speechmatics and vad_instance is None:
        try:
            from livekit.plugins.silero import VAD as SileroVAD
        except ImportError as e:
            raise ImportError(
                "livekit-plugins-silero is required: model "
                f"{model!r} does not handle endpointing server-side."
            ) from e
        vad_instance = SileroVAD.load()
    return vad_instance


Contributor

In the case where AgentSession has VAD wouldn't this mean we have 2 VAD instances?

Member Author

yes, to use just 1 would require the user to store it and pass the same instance

maybe it would be helpful for the user to have separate settings for stt and session level vad, but as of right now the session vad can't be connected to stt

Contributor

@longcw longcw left a comment

lgtm! one nit:

Comment on lines +726 to +730
vad_task.cancel()
try:
    await vad_task
except asyncio.CancelledError:
    pass
Contributor

nit: use utils.cancel_and_wait
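
For context, a helper of this kind folds the cancel / await / swallow-CancelledError sequence into one call. The sketch below is a hypothetical stand-in written with plain asyncio to show the idea; it is not the library's actual implementation, and its name and signature are assumptions.

```python
import asyncio


async def cancel_and_wait(*tasks: "asyncio.Task[object]") -> None:
    """Cancel each task, then wait until all of them have finished."""
    for task in tasks:
        task.cancel()
    # return_exceptions=True returns each task's CancelledError as a result
    # instead of raising it. Note that it also hides any other exception
    # the tasks raised while shutting down.
    await asyncio.gather(*tasks, return_exceptions=True)


async def demo() -> bool:
    task = asyncio.create_task(asyncio.sleep(3600))
    await asyncio.sleep(0)  # give the task a chance to start
    await cancel_and_wait(task)
    return task.cancelled()
```

The call site then shrinks to a single `await cancel_and_wait(vad_task)`, with no try/except at each cancellation point.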

if ws.closed:
    return
try:
    await ws.send_str(json.dumps({"type": "session.finalize"}))
Contributor

one question: what happens if the VAD fires EOS on noise? will the STT return an empty or a random transcript?

Member Author

i believe the STT will return an empty one ("")

@tinalenguyen tinalenguyen merged commit a3df48f into main May 18, 2026
24 checks passed
@tinalenguyen tinalenguyen deleted the tina/speechmatics-vad branch May 18, 2026 03:31