(speechmatics + inference): add VAD #5750
| if is_given(parsed_language) and not is_given(language): | ||
| language = parsed_language | ||
| is_speechmatics, vad = _resolve_vad_for_model(model, vad if is_given(vad) else None) |
There was a problem hiding this comment.
🟡 _resolve_vad_for_model conflates vad=None (opt-out) with vad=NOT_GIVEN (auto-load), making it impossible to opt out of VAD for Speechmatics models
When a user explicitly passes vad=None to the inference STT(model="speechmatics/enhanced", vad=None), the intent (mirroring the direct Speechmatics plugin's documented contract at livekit-plugins/livekit-plugins-speechmatics/livekit/plugins/speechmatics/stt.py:258-260) is to opt out of auto-loaded VAD. However, at line 474, is_given(None) returns True (since None is not a NotGiven instance), so vad_instance is passed as None to _resolve_vad_for_model. Inside that function (line 220), the condition is_speechmatics and vad_instance is None triggers Silero auto-loading regardless, making opt-out impossible. The Speechmatics plugin correctly distinguishes these cases using not is_given(vad) at stt.py:277.
Prompt for agents
The _resolve_vad_for_model function needs to distinguish between vad=NOT_GIVEN (auto-load Silero for Speechmatics) and vad=None (explicit opt-out). Currently the calling code at line 474 converts both to None before calling the function. One approach: pass a sentinel or boolean flag to _resolve_vad_for_model indicating whether the user explicitly provided a vad value. For example, add an auto_load: bool parameter that is True only when vad was NOT_GIVEN. In _resolve_vad_for_model, only auto-load Silero when is_speechmatics and auto_load is True. When auto_load is False and vad_instance is None, skip the auto-load. This mirrors the direct Speechmatics plugin's logic at speechmatics/stt.py:277 which uses not is_given(vad) to decide auto-loading.
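The suggestion above can be sketched in isolation. This is a minimal, self-contained illustration, not the PR's code: `NotGiven`, `NOT_GIVEN`, and `is_given` are stand-ins for the library's sentinel utilities, `FakeVAD` is a hypothetical placeholder for a loaded Silero VAD, `resolve_vad` is a hypothetical name, and `"other/model"` is a made-up model string. Whether opt-out should be allowed at all for the inference path is a separate design question.

```python
from __future__ import annotations

from typing import Optional, Union


class NotGiven:
    """Stand-in for the library's NOT_GIVEN sentinel type (assumption)."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = NotGiven()


def is_given(value: object) -> bool:
    # None counts as "given": only the sentinel means "argument omitted".
    return not isinstance(value, NotGiven)


class FakeVAD:
    """Hypothetical placeholder for a VAD instance such as Silero's."""


def resolve_vad(
    model: str,
    vad: Union[FakeVAD, None, NotGiven] = NOT_GIVEN,
) -> Optional[FakeVAD]:
    is_speechmatics = model.startswith("speechmatics/")
    # Decide auto-loading BEFORE collapsing NOT_GIVEN into None.
    auto_load = not is_given(vad)
    vad_instance = vad if isinstance(vad, FakeVAD) else None

    if not is_speechmatics:
        # Other models endpoint server-side, so any VAD is ignored.
        return None
    if vad_instance is None and auto_load:
        # Stands in for SileroVAD.load() in the real code.
        return FakeVAD()
    # Either an explicit instance, or None for an explicit opt-out.
    return vad_instance
```

With this shape, omitting `vad` auto-loads a VAD for `speechmatics/` models, while `vad=None` returns `None` and skips the auto-load.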
There was a problem hiding this comment.
as of right now, speechmatics inference STT models need a VAD in order to run here. it is possible they will support server-side endpointing in the future. this differs from the speechmatics stt plugin approach, which already exposes finalize() so that users can flush end-of-speech on their own
| and not is_given(min_endpointing_delay) | ||
| and not _user_provided_turn_handling | ||
| ): | ||
| endpointing["min_delay"] = 0.0 |
There was a problem hiding this comment.
min_endpointing_delay is deprecated, so should we check whether the user specified min_delay in turn_handling instead? also, should we move this to a separate pr, since it's not related to the speechmatics plugin?
There was a problem hiding this comment.
makes sense, i also thought that the stt capability would play well with indicating STTs that need VAD. i removed those changes here for now though
| def _resolve_vad_for_model( | ||
| model: NotGivenOr[STTModels | str], | ||
| vad_instance: vad.VAD | None, | ||
| ) -> vad.VAD | None: | ||
| is_speechmatics = ( | ||
| is_given(model) and isinstance(model, str) and model.startswith("speechmatics/") | ||
| ) | ||
| if vad_instance is not None and not is_speechmatics: | ||
| logger.warning( | ||
| "`vad` will be ignored: model %r handles endpointing server-side.", | ||
| model, | ||
| ) | ||
| return None | ||
| if is_speechmatics and vad_instance is None: | ||
| try: | ||
| from livekit.plugins.silero import VAD as SileroVAD | ||
| except ImportError as e: | ||
| raise ImportError( | ||
| "livekit-plugins-silero is required: model " | ||
| f"{model!r} does not handle endpointing server-side." | ||
| ) from e | ||
| vad_instance = SileroVAD.load() | ||
| return vad_instance | ||
There was a problem hiding this comment.
In the case where AgentSession has VAD wouldn't this mean we have 2 VAD instances?
There was a problem hiding this comment.
yes, to use just 1 would require the user to store it and pass the same instance
maybe it would be helpful for the user to have separate settings for stt and session level vad, but as of right now the session vad can't be connected to stt
| vad_task.cancel() | ||
| try: | ||
| await vad_task | ||
| except asyncio.CancelledError: | ||
| pass |
There was a problem hiding this comment.
nit: use utils.cancel_and_wait
| if ws.closed: | ||
| return | ||
| try: | ||
| await ws.send_str(json.dumps({"type": "session.finalize"})) |
There was a problem hiding this comment.
one question: what will happen if VAD fires EOS on noise, will the STT return an empty or a random transcript?
There was a problem hiding this comment.
i believe the STT will return an empty one ""
When set to EXTERNAL mode, the Speechmatics STT needs finalize() to be called to flush the partial transcripts and mark end of speech. We pass a VAD, similar to Mistral's plugin, or initialize one so it works right out of the box.
We mirror the plugin in the inference code; we also accept/initialize a VAD. Inference is already set to handle the VAD event.
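
The flow described here (VAD detects end of speech, then the client asks the server to finalize) can be sketched roughly as follows. The message payload is taken from the diff in this PR; `FakeWebSocket` and `finalize_on_end_of_speech` are hypothetical names used only so the example is self-contained, and the real code sends over the live inference websocket.

```python
import asyncio
import json


class FakeWebSocket:
    """Hypothetical stand-in exposing only what the sketch needs."""

    def __init__(self) -> None:
        self.closed = False
        self.sent: list[str] = []

    async def send_str(self, data: str) -> None:
        self.sent.append(data)


async def finalize_on_end_of_speech(ws: FakeWebSocket) -> bool:
    """Send a finalize message when the VAD reports end of speech.

    Returns False if the socket is already closed and nothing was sent.
    """
    if ws.closed:
        return False
    await ws.send_str(json.dumps({"type": "session.finalize"}))
    return True
```

Calling this from the VAD end-of-speech handler is what lets the server flush partial transcripts without server-side endpointing.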