feat(inworld): add STT plugin with voice profiling#1516

Open
karan-dhir wants to merge 2 commits into
livekit:mainfrom
karan-dhir:inworld-stt-voice-profiling

@karan-dhir

Description

Ports the Python livekit-plugins-inworld STT implementation to TypeScript, adding the missing STT capability to the existing Inworld plugin (which previously only had TTS).

Changes Made

  • plugins/inworld/src/stt.ts — new STT class with both streaming (bidirectional WebSocket to wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional) and batch (REST POST /stt/v1/transcribe) modes; includes word-level timestamp mapping, periodic audio duration reporting, and exponential-backoff reconnection
  • plugins/inworld/src/index.ts — exports the new STT and SpeechStream classes
  • agents/src/inference/stt.ts — adds InworldSTTModels = 'inworld/inworld-stt-1' to the inference STT type union
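
For readers unfamiliar with the reconnection strategy named in the list above, here is a minimal sketch of exponential backoff with a capped delay. The names `BackoffOptions`, `backoffDelayMs`, and `withReconnect`, as well as the default values, are hypothetical and are not taken from plugins/inworld/src/stt.ts; the actual implementation may differ.

```typescript
// Hypothetical helper types and defaults, chosen for illustration only.
interface BackoffOptions {
  baseMs: number; // delay before the first retry
  maxMs: number; // upper bound on any single delay
  factor: number; // multiplier applied per failed attempt
}

const DEFAULT_BACKOFF: BackoffOptions = { baseMs: 500, maxMs: 10000, factor: 2 };

// Delay for a zero-based attempt index: baseMs * factor^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, opts: BackoffOptions = DEFAULT_BACKOFF): number {
  const raw = opts.baseMs * Math.pow(opts.factor, attempt);
  return Math.min(raw, opts.maxMs);
}

// Calls `connect` until it succeeds or `maxRetries` retries have failed,
// sleeping for an exponentially growing delay between attempts.
async function withReconnect<T>(
  connect: () => Promise<T>,
  maxRetries: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

The `sleep` parameter is injectable so that a test can replace the timer with a no-op.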

When enableVoiceProfile is true (default), each transcript includes an acoustic VoiceProfile in SpeechData.metadata.voice_profile with typed fields for emotion, accent, age, pitch, and vocalStyle.
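
As a rough illustration of how a consumer might read the profile from transcript metadata, the sketch below assumes a metadata bag keyed by `voice_profile`. The field types are not given in this description, so each dimension is typed as `unknown` here; `Metadata` and `getVoiceProfile` are hypothetical names, not exports of the plugin.

```typescript
// Assumption: the real field types are not stated in the PR description,
// so every dimension is left as `unknown` in this sketch.
interface VoiceProfile {
  emotion?: unknown;
  accent?: unknown;
  age?: unknown;
  pitch?: unknown;
  vocalStyle?: unknown;
  [key: string]: unknown; // tolerate undocumented fields
}

// Hypothetical stand-in for the metadata object carried on SpeechData.
type Metadata = Record<string, unknown>;

// Returns the voice profile if one is present and object-shaped, else undefined.
function getVoiceProfile(metadata: Metadata | undefined): VoiceProfile | undefined {
  const raw = metadata?.['voice_profile'];
  if (typeof raw !== 'object' || raw === null || Array.isArray(raw)) {
    return undefined;
  }
  return raw as VoiceProfile;
}
```

Callers would still need to narrow each field before use, since nothing in this sketch validates the individual dimensions.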

Pre-Review Checklist

  • Build passes: All builds (lint, typecheck, tests) pass locally
  • AI-generated code reviewed: Removed unnecessary comments and ensured code quality
  • Changes explained: All changes are properly documented and justified above
  • Scope appropriate: All changes relate to the PR title

Testing

  • Build passes (pnpm build, pnpm --filter @livekit/agents-plugin-inworld build)
  • Lint passes (pnpm -w lint)
  • Format passes (pnpm -w format:write)
  • Automated tests added/updated (voice profiling requires a live API key; unit test mocks pending)

Additional Notes

The VoiceProfile response schema is not publicly documented by Inworld. The interface uses known dimension names (emotion, accent, age, pitch, vocalStyle) based on their API resources page, with an index signature ([key: string]: unknown) to handle any undocumented fields. Word timestamps handle both startTime/endTime (streaming, seconds) and startTimeMs/endTimeMs (REST, milliseconds) naming conventions.
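
One way to reconcile the two timestamp conventions described above is to normalize both to seconds. The sketch below assumes the timestamps arrive as plain numbers and that the word text lives in a `word` field; `RawWord`, `TimedWord`, and `normalizeWord` are hypothetical names and may not match the code in this PR.

```typescript
// Assumed raw shape: streaming responses carry startTime/endTime in seconds,
// REST responses carry startTimeMs/endTimeMs in milliseconds.
interface RawWord {
  word: string; // assumption: name of the text field
  startTime?: number;
  endTime?: number;
  startTimeMs?: number;
  endTimeMs?: number;
}

interface TimedWord {
  text: string;
  startSec: number;
  endSec: number;
}

// Prefer the seconds field; otherwise convert milliseconds; otherwise fall back to 0.
function toSeconds(seconds: number | undefined, millis: number | undefined): number {
  if (seconds !== undefined) return seconds;
  if (millis !== undefined) return millis / 1000;
  return 0;
}

function normalizeWord(raw: RawWord): TimedWord {
  return {
    text: raw.word,
    startSec: toSeconds(raw.startTime, raw.startTimeMs),
    endSec: toSeconds(raw.endTime, raw.endTimeMs),
  };
}
```

Falling back to 0 when neither field is present is a design choice of this sketch; a stricter variant could throw or return `undefined` instead.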


Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.

Ports the Python livekit-plugins-inworld STT implementation to TypeScript.
Adds streaming (WebSocket) and batch (REST) modes, word-level timestamps,
and typed VoiceProfile with emotion/accent/age/pitch/vocalStyle dimensions.
Also registers InworldSTTModels in the inference STT type union.
@changeset-bot

changeset-bot Bot commented May 15, 2026

⚠️ No Changeset found

Latest commit: 33955ea

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.


await Promise.race([
  this.#resetWS.await,
  Promise.all([sendTask(), listenTask.result, wsMonitor]),

🔴 wsMonitor Task object is not thenable, making WebSocket close detection a no-op in Promise.all

On line 479, wsMonitor (a Task<void> object) is passed directly to Promise.all instead of wsMonitor.result (a Promise<void>). The Task class at agents/src/utils.ts:492 does not implement a .then() method, so it is not thenable. Promise.all treats non-thenable values as immediately resolved, meaning the WebSocket close monitor never actually participates in error propagation.

This causes the stream to hang if the WebSocket closes unexpectedly while sendTask is blocked waiting for audio input on this.input.next() (line 344). Neither sendTask nor listenTask will detect the closure until new audio data arrives and ws.send() fails. In a silence scenario (no audio input), the stream hangs indefinitely.
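
The behaviour described above can be reproduced in isolation. The sketch below uses a hypothetical `FakeTask` class as a minimal stand-in for the `Task` class (an object with a `result` promise and no `then` method); it is not the implementation in agents/src/utils.ts.

```typescript
// Hypothetical minimal stand-in for Task: exposes `result` but is not thenable.
class FakeTask<T> {
  readonly result: Promise<T>;
  constructor(run: () => Promise<T>) {
    this.result = run();
  }
}

// True if `value` has a callable `then`, which is what Promise.all checks for.
function isThenable(value: unknown): boolean {
  return (
    (typeof value === 'object' || typeof value === 'function') &&
    value !== null &&
    typeof (value as { then?: unknown }).then === 'function'
  );
}

async function demo(): Promise<string[]> {
  const log: string[] = [];
  // A task whose underlying promise never settles, like a monitor on a healthy socket.
  const monitor = new FakeTask<void>(() => new Promise<void>(() => {}));

  // The object itself is a plain value to Promise.all, so this resolves right away.
  await Promise.all([monitor]);
  log.push('Promise.all([task]) resolved without waiting');

  // The `result` promise is awaited for real; race it against a short timer.
  const outcome = await Promise.race([
    Promise.all([monitor.result]).then(() => 'settled'),
    new Promise<string>((resolve) => setTimeout(() => resolve('still pending'), 10)),
  ]);
  log.push(`Promise.all([task.result]) is ${outcome}`);
  return log;
}
```

Calling `demo()` returns a log showing that only the `.result` form actually waits on the task.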

Suggested change
- Promise.all([sendTask(), listenTask.result, wsMonitor]),
+ Promise.all([sendTask(), listenTask.result, wsMonitor.result]),