feat(whisper): add VAD-based streaming to WebSocket transcription endpoint #318

Open

basnijholt wants to merge 8 commits into main from websocket-vad-streaming

Conversation

@basnijholt (Owner) commented Jan 26, 2026

Summary

Add Voice Activity Detection (VAD) to the WebSocket endpoint /v1/audio/transcriptions/stream so remote clients receive partial transcriptions as speech segments complete, rather than waiting for the entire audio stream.

  • Reuses the existing VoiceActivityDetector from agent_cli/core/vad
  • Ports the VAD pattern from transcribe-daemon to the WebSocket handler
  • New query parameters for VAD configuration with sensible defaults
  • Backward compatible via use_vad=false query parameter

Note

Non-standard endpoint: The WebSocket streaming endpoint (/v1/audio/transcriptions/stream) is a custom extension, not part of the OpenAI API specification. OpenAI's standard transcription API is REST-only (POST /v1/audio/transcriptions). While WebSocket streaming for ASR is common in the industry (Deepgram, AssemblyAI, AWS Transcribe, Azure Speech), there is no universal standard format. This endpoint is provided for advanced use cases but may not be compatible with other OpenAI-compatible clients.

New Query Parameters

| Parameter           | Type  | Default | Description                                           |
|---------------------|-------|---------|-------------------------------------------------------|
| `use_vad`           | bool  | `true`  | Enable VAD for streaming partial results              |
| `vad_threshold`     | float | `0.3`   | Speech detection threshold (0.0–1.0)                  |
| `vad_silence_ms`    | int   | `1000`  | Silence duration (ms) to end a segment                |
| `vad_min_speech_ms` | int   | `250`   | Minimum speech duration (ms) to trigger transcription |

Message Protocol

Compatible with existing format:

{"type": "partial", "text": "first utterance", "is_final": false, "language": "en"}
{"type": "partial", "text": "second utterance", "is_final": false, "language": "en"}
{"type": "final", "text": "first utterance second utterance", "is_final": true}

Closes #263

Test plan

  • Run `pytest tests/` - all 909 tests pass
  • Run `pre-commit run --all-files` - all checks pass
  • Manual testing with a WebSocket client sending real audio

@basnijholt (Owner, Author)
Code Review

Overall, the implementation looks solid - clean separation of concerns and good reuse of the existing VoiceActivityDetector.

Minor Concerns

  1. Final message differs between modes

    In buffered mode, the final message includes segments and duration:

    {"type": "final", "text": "...", "segments": [...], "duration": 5.2, ...}

    In VAD mode, these are missing:

    {"type": "final", "text": "...", "language": "...", "is_final": true}

    This is probably intentional since VAD combines multiple independent transcriptions, but clients expecting duration may break. Consider adding a combined duration or documenting this difference.

  2. Silent segment failures

    _transcribe_segment() catches all exceptions and returns None:

    ```python
    except Exception:
        logger.exception("Failed to transcribe segment")
        return None
    ```

    This means that if a segment fails to transcribe, the client never knows - it just won't appear in the final text. Consider sending an error/warning message to the client when a segment fails, or at minimum noting in the final message how many segments were processed versus failed (a sketch follows this list).
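
A sketch of that suggestion (not the PR's code); `_transcribe_segment`, `logger`, `segments`, and the FastAPI-style `websocket.send_json` are assumed from the handler's context:

```python
async def transcribe_segments(websocket, segments) -> None:
    """Sketch: transcribe VAD segments, reporting failures to the client."""
    failed = 0
    texts: list[str] = []
    for segment in segments:
        try:
            texts.append(await _transcribe_segment(segment))
        except Exception:
            logger.exception("Failed to transcribe segment")
            failed += 1
            # Tell the client immediately instead of failing silently.
            await websocket.send_json(
                {"type": "warning", "message": "segment transcription failed"}
            )
    # Report processed vs. failed counts alongside the combined text.
    await websocket.send_json(
        {
            "type": "final",
            "text": " ".join(texts),
            "is_final": True,
            "segments_processed": len(texts),
            "segments_failed": failed,
        }
    )
```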

Questions

  • Should use_vad=true be the default? This changes behavior for existing clients. Maybe use_vad=false as default for backward compat, with documentation encouraging VAD mode?

@basnijholt force-pushed the websocket-vad-streaming branch 2 times, most recently from ce67d96 to 25bee0f on January 29, 2026 at 13:43
…onse

- Extract duplicate segment processing into inner process_segment() function
- Add duration tracking to VAD mode for consistent response format
- Add 3 new tests: single segment, multiple segments, VAD not available
- Remove unused .claude/REPORT.md from commit
@basnijholt force-pushed the websocket-vad-streaming branch from a4a8e77 to 1d9c197 on January 30, 2026 at 23:34
… loads

The module-level try/except was useless since VoiceActivityDetector
imports onnxruntime lazily in __init__, not at module load time.

- Remove ImportError handling since onnxruntime is a transitive dep of faster-whisper
- Let _transcribe_segment exceptions propagate instead of silently returning None
- Remove test for VAD unavailability (no longer a valid scenario)

Ensures onnxruntime (via the vad extra) is available for all whisper
backends (faster-whisper, mlx-whisper, whisper-transformers), making
VAD streaming work out of the box.

Reverts adding VAD as a dependency for all whisper backends.
onnxruntime adds ~138 MB, which is significant for mlx-whisper users.

Users who want VAD streaming should install agent-cli[vad] separately.

- _parse_eos(): parse EOS marker from data, returns (audio_chunk, is_eos)
- _wrap_pcm_as_wav(): wrap raw PCM in WAV format

Simplifies both _stream_with_vad and _stream_buffered.

Replace local _wrap_pcm_as_wav with pcm_to_wav from agent_cli.services.
Removes duplicate code and unused wave/io imports.
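
For reference, a stdlib sketch of what those two helpers do; the PR's real implementations (and the `pcm_to_wav` signature in agent_cli.services) are not shown here, and the EOS marker below is hypothetical:

```python
import io
import wave

EOS_MARKER = b"<EOS>"  # hypothetical value; the actual marker is not shown


def parse_eos(data: bytes) -> tuple[bytes, bool]:
    """Split an incoming frame into (audio_chunk, is_eos) -- sketch only."""
    if data.endswith(EOS_MARKER):
        return data[: -len(EOS_MARKER)], True
    return data, False


def wrap_pcm_as_wav(pcm: bytes, rate: int = 16_000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM bytes in a WAV container using the stdlib."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # bytes per sample (16-bit PCM)
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```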
Development

Successfully merging this pull request may close these issues.

Add streaming support for TTS and ASR backends
