feat(whisper): add VAD-based streaming to WebSocket transcription endpoint #318

basnijholt wants to merge 8 commits into main
Conversation
Code Review

Overall the implementation looks solid - clean separation of concerns and good reuse of existing code.

Minor Concerns
Questions
basnijholt force-pushed from ce67d96 to 25bee0f
…onse

- Extract duplicate segment processing into inner process_segment() function
- Add duration tracking to VAD mode for consistent response format
- Add 3 new tests: single segment, multiple segments, VAD not available
- Remove unused .claude/REPORT.md from commit
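As a hedged illustration, the extracted helper might look roughly like this (the handler shape, the `transcriber` callable, and the `vad.segments()` iterator are assumptions; only the inner `process_segment()` name comes from the commit message):

```python
# Rough sketch of the refactor described in the commit above; not the
# actual implementation from this PR.
async def _stream_with_vad(ws, transcriber, vad) -> str:
    pieces: list[str] = []
    total_duration = 0.0  # duration tracking for a consistent response format

    async def process_segment(segment: bytes, duration: float) -> None:
        # Single shared path for every completed speech segment:
        # transcribe it, accumulate duration, and emit a partial message.
        nonlocal total_duration
        text = await transcriber(segment)
        total_duration += duration
        pieces.append(text)
        await ws.send_json({"type": "partial", "text": text, "is_final": False})

    async for segment, duration in vad.segments():
        await process_segment(segment, duration)
    return " ".join(pieces)
```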
basnijholt force-pushed from a4a8e77 to 1d9c197
… loads

The module-level try/except was useless since VoiceActivityDetector imports onnxruntime lazily in __init__, not at module load time.
- Remove ImportError handling since onnxruntime is a transitive dep of faster-whisper
- Let _transcribe_segment exceptions propagate instead of silently returning None
- Remove test for VAD unavailability (no longer a valid scenario)
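For context, the lazy-import pattern the commit refers to looks like this (a minimal sketch; the real `VoiceActivityDetector` in `agent_cli/core/vad` takes more parameters and loads an ONNX model):

```python
class VoiceActivityDetector:
    """Minimal sketch of the lazy-import pattern described above."""

    def __init__(self, threshold: float = 0.3) -> None:
        # Importing here means a missing onnxruntime raises ImportError when a
        # detector is *instantiated*, not when the module is imported, so a
        # module-level try/except around the import never fires and is dead code.
        import onnxruntime

        self._ort = onnxruntime
        self.threshold = threshold
```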
Ensures onnxruntime (via vad extra) is available for all whisper backends (faster-whisper, mlx-whisper, whisper-transformers), making VAD streaming work out of the box.
Reverts adding VAD as a dependency for all whisper backends. onnxruntime adds ~138MB, which is significant for mlx-whisper users. Users who want VAD streaming should install agent-cli[vad] separately.
- _parse_eos(): parse EOS marker from data, returns (audio_chunk, is_eos)
- _wrap_pcm_as_wav(): wrap raw PCM in WAV format

Simplifies both _stream_with_vad and _stream_buffered.
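A rough sketch of what `_parse_eos()` might look like, assuming the client appends a trailing end-of-stream marker to its final frame (the actual marker format is not shown in this PR):

```python
_EOS_MARKER = b"<EOS>"  # hypothetical marker; the real value is not shown here


def _parse_eos(data: bytes) -> tuple[bytes, bool]:
    """Split an incoming WebSocket frame into (audio_chunk, is_eos).

    Strips a trailing end-of-stream marker, if present, and tells the caller
    whether to flush remaining audio and finalize the transcription.
    """
    if data.endswith(_EOS_MARKER):
        return data[: -len(_EOS_MARKER)], True
    return data, False
```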
Replace local _wrap_pcm_as_wav with pcm_to_wav from agent_cli.services. Removes duplicate code and unused wave/io imports.
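For reference, a stdlib-only equivalent of that helper (the real `pcm_to_wav` in `agent_cli.services` may have a different signature):

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM in a WAV container (sketch of the shared helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```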
Summary
Add Voice Activity Detection (VAD) to the WebSocket endpoint `/v1/audio/transcriptions/stream` so remote clients receive partial transcriptions as speech segments complete, rather than waiting for the entire audio stream.

- Reuses `VoiceActivityDetector` from `agent_cli/core/vad`
- Brings the VAD segmentation used by `transcribe-daemon` to the WebSocket handler
- Can be disabled with the `use_vad=false` query parameter

Note
Non-standard endpoint: The WebSocket streaming endpoint (`/v1/audio/transcriptions/stream`) is a custom extension, not part of the OpenAI API specification. OpenAI's standard transcription API is REST-only (POST /v1/audio/transcriptions). While WebSocket streaming for ASR is common in the industry (Deepgram, AssemblyAI, AWS Transcribe, Azure Speech), there is no universal standard format. This endpoint is provided for advanced use cases but may not be compatible with other OpenAI-compatible clients.

New Query Parameters
| Parameter | Default |
|---|---|
| `use_vad` | true |
| `vad_threshold` | 0.3 |
| `vad_silence_ms` | 1000 |
| `vad_min_speech_ms` | 250 |

Message Protocol
Compatible with the existing message format:
{"type": "partial", "text": "first utterance", "is_final": false, "language": "en"} {"type": "partial", "text": "second utterance", "is_final": false, "language": "en"} {"type": "final", "text": "first utterance second utterance", "is_final": true}Closes #263
Test plan
- `pytest tests/` - all 909 tests pass
- `pre-commit run --all-files` - all checks pass