feat(whisper): add VAD-based streaming to WebSocket transcription endpoint #318

Open

basnijholt wants to merge 8 commits into main from websocket-vad-streaming

Conversation

@basnijholt (Owner) commented Jan 26, 2026

Summary

Add Voice Activity Detection (VAD) to the WebSocket endpoint /v1/audio/transcriptions/stream so remote clients receive partial transcriptions as speech segments complete, rather than waiting for the entire audio stream.

  • Reuses the existing VoiceActivityDetector from agent_cli/core/vad
  • Ports the VAD pattern from transcribe-daemon to the WebSocket handler
  • New query parameters for VAD configuration with sensible defaults
  • Backward compatible via use_vad=false query parameter

Note

Non-standard endpoint: The WebSocket streaming endpoint (/v1/audio/transcriptions/stream) is a custom extension, not part of the OpenAI API specification. OpenAI's standard transcription API is REST-only (POST /v1/audio/transcriptions). While WebSocket streaming for ASR is common in the industry (Deepgram, AssemblyAI, AWS Transcribe, Azure Speech), there is no universal standard format. This endpoint is provided for advanced use cases but may not be compatible with other OpenAI-compatible clients.

New Query Parameters

| Parameter           | Type  | Default | Description                                           |
|---------------------|-------|---------|-------------------------------------------------------|
| `use_vad`           | bool  | `true`  | Enable VAD for streaming partial results              |
| `vad_threshold`     | float | `0.3`   | Speech detection threshold (0.0–1.0)                  |
| `vad_silence_ms`    | int   | `1000`  | Silence duration (ms) to end a segment                |
| `vad_min_speech_ms` | int   | `250`   | Minimum speech duration (ms) to trigger transcription |

Message Protocol

Compatible with existing format:

{"type": "partial", "text": "first utterance", "is_final": false, "language": "en"}
{"type": "partial", "text": "second utterance", "is_final": false, "language": "en"}
{"type": "final", "text": "first utterance second utterance", "is_final": true}

Closes #263

Test plan

  • Run `pytest tests/` - all 909 tests pass
  • Run `pre-commit run --all-files` - all checks pass
  • Manual testing with a WebSocket client sending real audio

@basnijholt (Owner, Author)
Code Review

Overall, the implementation looks solid - clean separation of concerns and good reuse of the existing VoiceActivityDetector.

Minor Concerns

  1. Final message differs between modes

    In buffered mode, the final message includes segments and duration:

    {"type": "final", "text": "...", "segments": [...], "duration": 5.2, ...}

    In VAD mode, these are missing:

    {"type": "final", "text": "...", "language": "...", "is_final": true}

    This is probably intentional since VAD combines multiple independent transcriptions, but clients expecting duration may break. Consider adding a combined duration or documenting this difference.

  2. Silent segment failures

    _transcribe_segment() catches all exceptions and returns None:

    ```python
    except Exception:
        logger.exception("Failed to transcribe segment")
        return None
    ```

    This means that if a segment fails to transcribe, the client never knows - it just won't appear in the final text. Consider sending an error/warning message to the client when a segment fails, or at minimum noting in the final message how many segments were processed versus failed (a sketch follows this list).
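
A sketch of that suggestion (not the PR's code); `_transcribe_segment`, `logger`, `segments`, and the FastAPI-style `websocket.send_json` are assumed from the handler's context:

```python
async def transcribe_segments(websocket, segments) -> None:
    """Sketch: transcribe VAD segments, reporting failures to the client."""
    failed = 0
    texts: list[str] = []
    for segment in segments:
        try:
            texts.append(await _transcribe_segment(segment))
        except Exception:
            logger.exception("Failed to transcribe segment")
            failed += 1
            # Tell the client immediately instead of failing silently.
            await websocket.send_json(
                {"type": "warning", "message": "segment transcription failed"}
            )
    # Report processed vs. failed counts alongside the combined text.
    await websocket.send_json(
        {
            "type": "final",
            "text": " ".join(texts),
            "is_final": True,
            "segments_processed": len(texts),
            "segments_failed": failed,
        }
    )
```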

Questions

  • Should use_vad=true be the default? This changes behavior for existing clients. Maybe use_vad=false as default for backward compat, with documentation encouraging VAD mode?

@basnijholt force-pushed the websocket-vad-streaming branch 2 times, most recently from ce67d96 to 25bee0f on January 29, 2026 at 13:43
…onse

- Extract duplicate segment processing into inner process_segment() function
- Add duration tracking to VAD mode for consistent response format
- Add 3 new tests: single segment, multiple segments, VAD not available
- Remove unused .claude/REPORT.md from commit
@basnijholt force-pushed the websocket-vad-streaming branch from a4a8e77 to 1d9c197 on January 30, 2026 at 23:34
… loads

The module-level try/except was useless since VoiceActivityDetector
imports onnxruntime lazily in __init__, not at module load time.

- Remove ImportError handling since onnxruntime is a transitive dep of faster-whisper
- Let _transcribe_segment exceptions propagate instead of silently returning None
- Remove test for VAD unavailability (no longer a valid scenario)

Ensures onnxruntime (via the vad extra) is available for all whisper
backends (faster-whisper, mlx-whisper, whisper-transformers), making
VAD streaming work out of the box.

Reverts adding VAD as a dependency for all whisper backends.
onnxruntime adds ~138 MB, which is significant for mlx-whisper users.

Users who want VAD streaming should install agent-cli[vad] separately.

- _parse_eos(): parse EOS marker from data, returns (audio_chunk, is_eos)
- _wrap_pcm_as_wav(): wrap raw PCM in WAV format

Simplifies both _stream_with_vad and _stream_buffered.

Replace local _wrap_pcm_as_wav with pcm_to_wav from agent_cli.services.
Removes duplicate code and unused wave/io imports.
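
For reference, a stdlib sketch of what those two helpers do; the PR's real implementations (and the `pcm_to_wav` signature in agent_cli.services) are not shown here, and the EOS marker below is hypothetical:

```python
import io
import wave

EOS_MARKER = b"<EOS>"  # hypothetical value; the actual marker is not shown


def parse_eos(data: bytes) -> tuple[bytes, bool]:
    """Split an incoming frame into (audio_chunk, is_eos) -- sketch only."""
    if data.endswith(EOS_MARKER):
        return data[: -len(EOS_MARKER)], True
    return data, False


def wrap_pcm_as_wav(pcm: bytes, rate: int = 16_000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM bytes in a WAV container using the stdlib."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # bytes per sample (16-bit PCM)
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```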
Development

Successfully merging this pull request may close these issues.

Add streaming support for TTS and ASR backends
