feat(inworld): add STT plugin with voice profiling #1516
Ports the Python livekit-plugins-inworld STT implementation to TypeScript. Adds streaming (WebSocket) and batch (REST) modes, word-level timestamps, and typed VoiceProfile with emotion/accent/age/pitch/vocalStyle dimensions. Also registers InworldSTTModels in the inference STT type union.
await Promise.race([
  this.#resetWS.await,
  Promise.all([sendTask(), listenTask.result, wsMonitor]),
🔴 wsMonitor Task object is not thenable, making WebSocket close detection a no-op in Promise.all
On line 479, wsMonitor (a Task<void> object) is passed directly to Promise.all instead of wsMonitor.result (a Promise<void>). The Task class at agents/src/utils.ts:492 does not implement a .then() method, so it is not thenable. Promise.all treats non-thenable values as immediately resolved, meaning the WebSocket close monitor never actually participates in error propagation.
This causes the stream to hang if the WebSocket closes unexpectedly while sendTask is blocked waiting for audio input on this.input.next() (line 344). Neither sendTask nor listenTask will detect the closure until new audio data arrives and ws.send() fails. In a silence scenario (no audio input), the stream hangs indefinitely.
Suggested change:
- Promise.all([sendTask(), listenTask.result, wsMonitor]),
+ Promise.all([sendTask(), listenTask.result, wsMonitor.result]),
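The behavior described in this comment can be reproduced in isolation. The sketch below is only an illustration: `FakeTask`, `failAfter`, `settle`, and `demo` are hypothetical names, and `FakeTask` is a stand-in that mimics the one property that matters here (a `.result` promise and no `.then()` method). It is not the actual `Task` class from `agents/src/utils.ts`.

```typescript
// Hypothetical stand-in for a task wrapper: it exposes its promise as
// `.result` but has no `.then()` method, so it is not thenable.
class FakeTask<T> {
  readonly result: Promise<T>;
  constructor(fn: () => Promise<T>) {
    this.result = fn();
  }
}

// Rejects after `ms` milliseconds, standing in for a WebSocket close monitor.
function failAfter(ms: number): Promise<void> {
  return new Promise((_, reject) => {
    setTimeout(() => reject(new Error('socket closed')), ms);
  });
}

// Reports whether Promise.all over `values` resolved or rejected.
async function settle(values: unknown[]): Promise<string> {
  try {
    await Promise.all(values);
    return 'resolved';
  } catch (err) {
    return `rejected: ${(err as Error).message}`;
  }
}

async function demo(): Promise<[string, string]> {
  const monitor = new FakeTask(() => failAfter(10));
  // Passing the wrapper object: Promise.all treats it as a plain value,
  // so it resolves right away and the pending rejection is never observed.
  const viaObject = await settle([monitor]);
  // Passing the underlying promise: the rejection propagates as intended.
  const viaResult = await settle([monitor.result]);
  return [viaObject, viaResult];
}
```

Calling `demo()` gives `'resolved'` for the wrapper object and a rejection for `.result`, which matches the reasoning behind the suggested one-line change.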
Description
Ports the Python livekit-plugins-inworld STT implementation to TypeScript, adding the missing STT capability to the existing Inworld plugin (which previously only had TTS).

Changes Made
- plugins/inworld/src/stt.ts: new STT class with both streaming (bidirectional WebSocket to wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional) and batch (REST POST /stt/v1/transcribe) modes; includes word-level timestamp mapping, periodic audio duration reporting, and exponential-backoff reconnection
- plugins/inworld/src/index.ts: exports the new STT and SpeechStream classes
- agents/src/inference/stt.ts: adds InworldSTTModels = 'inworld/inworld-stt-1' to the inference STT type union

When enableVoiceProfile is true (default), each transcript includes an acoustic VoiceProfile in SpeechData.metadata.voice_profile with typed fields for emotion, accent, age, pitch, and vocalStyle.

Pre-Review Checklist
Testing
- pnpm build, pnpm --filter @livekit/agents-plugin-inworld build
- pnpm -w lint
- pnpm -w format:write

Additional Notes
The VoiceProfile response schema is not publicly documented by Inworld. The interface uses known dimension names (emotion, accent, age, pitch, vocalStyle) based on their API resources page, with an index signature ([key: string]: unknown) to handle any undocumented fields. Word timestamps handle both startTime/endTime (streaming, seconds) and startTimeMs/endTimeMs (REST, milliseconds) naming conventions.

Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.
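For reviewers, the VoiceProfile shape and the dual timestamp handling described under Additional Notes could look roughly like the sketch below. This is not the code in plugins/inworld/src/stt.ts. The field value types, the `RawWord` and `TimedWord` shapes, the `word` field name, numeric timestamps, and the fallback to 0 when no timestamp is present are all assumptions made for illustration, since the response schema is not publicly documented.

```typescript
// Known dimension names plus an index signature for undocumented fields.
// Assumption: value types are left as `unknown` because the schema is undocumented.
interface VoiceProfile {
  emotion?: unknown;
  accent?: unknown;
  age?: unknown;
  pitch?: unknown;
  vocalStyle?: unknown;
  [key: string]: unknown;
}

// Hypothetical raw word entry covering both naming conventions.
interface RawWord {
  word: string; // assumed field name for the token text
  startTime?: number; // streaming: seconds
  endTime?: number; // streaming: seconds
  startTimeMs?: number; // REST: milliseconds
  endTimeMs?: number; // REST: milliseconds
}

// Hypothetical normalized form, always in seconds.
interface TimedWord {
  text: string;
  startSec: number;
  endSec: number;
}

// Prefer the seconds field; otherwise convert milliseconds; otherwise 0 (assumed default).
function toSeconds(sec?: number, ms?: number): number {
  if (typeof sec === 'number') return sec;
  if (typeof ms === 'number') return ms / 1000;
  return 0;
}

function normalizeWord(raw: RawWord): TimedWord {
  return {
    text: raw.word,
    startSec: toSeconds(raw.startTime, raw.startTimeMs),
    endSec: toSeconds(raw.endTime, raw.endTimeMs),
  };
}

// Example: an extra, undocumented key is still accepted by the index signature.
const exampleProfile: VoiceProfile = { emotion: 'calm', someFutureField: 1 };
```

Normalizing to one unit at the parsing boundary keeps the rest of the stream code independent of which transport produced the word.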