feat(vad): add POST /v1/vad/confidence — per-packet ONNX confidence endpoint#128
Open
chazmaniandinkle wants to merge 2 commits into
Adds a lightweight per-packet VAD endpoint for barge-in loops (Discord VC,
etc.) that do not need full utterance-level speech detection.
Why:
- /v1/vad accepts a WAV multipart upload and runs the full Silero torch path.
That's the right API for utterance VAD but overkill for 512-sample frames
arriving at 20ms intervals from a Discord SocketReader thread.
- The vendored pipecat SileroVADAnalyzer (vendor/pipecat_vad/) uses ONNX,
has no torch dependency, returns per-frame confidence in <5ms, and was
already present in main but had no HTTP surface.
Contract:
POST /v1/vad/confidence[?sample_rate=16000]
Body: raw int16 little-endian PCM (exactly 512 samples @ 16kHz or 256 @ 8kHz)
Response: {confidence: float, available: bool, latency_ms: float}
available=false (and confidence=0.0) when onnxruntime is not installed.
Empty body returns {confidence: 0.0, available: <bool>, latency_ms: 0.0}.
Also:
- Imports is_pipecat_vad_available + voice_confidence from vad.py (already in main)
- Listed in GET /capabilities endpoints manifest as 'vad_confidence'
- 12 tests covering happy path, unavailable path, empty body, sample_rate
param forwarding, capabilities manifest
Updates: fix mod3/http_api.py file header comment and endpoints dict.
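The contract above can be exercised from a client with only the standard library. This is a minimal sketch, not the adapter's actual code: the helper names, the base URL, and the `Content-Type` header are assumptions for illustration; only the path, the `sample_rate` query parameter, the frame sizes, and the int16 little-endian body come from the contract.

```python
import json
import math
import struct
import urllib.request

# Frame sizes stated in the contract: 512 samples @ 16 kHz, 256 @ 8 kHz.
FRAME_SAMPLES = {16000: 512, 8000: 256}


def expected_frame_bytes(sample_rate: int) -> int:
    """Body length in bytes for one frame (int16 = 2 bytes per sample)."""
    return FRAME_SAMPLES[sample_rate] * 2


def make_sine_frame(sample_rate: int = 16000, freq_hz: float = 440.0) -> bytes:
    """Build one test frame of raw int16 little-endian PCM."""
    n = FRAME_SAMPLES[sample_rate]
    samples = [
        int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    ]
    return struct.pack(f"<{n}h", *samples)


def build_confidence_request(base_url: str, pcm: bytes, sample_rate: int = 16000):
    """Prepare (but do not send) a POST to /v1/vad/confidence."""
    if pcm and len(pcm) != expected_frame_bytes(sample_rate):
        raise ValueError("body must be exactly one frame, or empty")
    url = f"{base_url}/v1/vad/confidence?sample_rate={sample_rate}"
    return urllib.request.Request(
        url,
        data=pcm,
        method="POST",
        # Assumption: the contract does not name a content type.
        headers={"Content-Type": "application/octet-stream"},
    )


def post_confidence(base_url: str, pcm: bytes, sample_rate: int = 16000) -> dict:
    """Send the frame and decode the JSON response."""
    req = build_confidence_request(base_url, pcm, sample_rate)
    with urllib.request.urlopen(req, timeout=1.0) as resp:
        return json.loads(resp.read())
```

Calling `post_confidence("http://localhost:8000", make_sine_frame())` against a running server (the host and port here are hypothetical) should return a dict with the `confidence`, `available`, and `latency_ms` keys.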
The model __call__ returns shape (batch, frames). With batch_size=1 and a single frame, out[0] has shape (1,), a 1D array. float() on a 1D array raises 'only 0-dimensional arrays can be converted to Python scalars', so voice_confidence() was silently returning 0.0 on every frame. VAD was effectively disabled despite reporting available=true.

Fix: float(np.squeeze(new_confidence).item()), which works for any output shape that holds a single element.

Root cause found via direct venv test after the HTTP endpoint showed 0.0 confidence on sine waves and noise. The error was swallowed by the bare except in voice_confidence().
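The scalar-extraction fix can be sketched in isolation as follows. The function names and logger are hypothetical, not the code in vad.py. The second helper shows one way to keep a fallback of 0.0 without hiding the failure the way a bare except does.

```python
import logging

import numpy as np

logger = logging.getLogger("vad_sketch")  # hypothetical logger name


def confidence_to_float(new_confidence) -> float:
    """Convert a single-element model output of any shape to a Python float.

    np.squeeze drops every length-1 axis, so shapes (), (1,) and (1, 1)
    all become 0-dimensional. .item() raises ValueError if more than one
    element remains, so a multi-frame output is not silently accepted.
    """
    return float(np.squeeze(np.asarray(new_confidence)).item())


def safe_confidence(new_confidence) -> float:
    """Fall back to 0.0 on an unexpected shape, but log the error."""
    try:
        return confidence_to_float(new_confidence)
    except ValueError:
        logger.exception("unexpected VAD output shape")
        return 0.0
```

Catching only `ValueError` and logging it means a shape mismatch still degrades to 0.0 but leaves a trace, which would have surfaced the original bug on the first frame.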
Why
- /v1/vad accepts a WAV multipart upload and runs the full Silero torch path: right for utterance VAD, overkill for 512-sample frames arriving at 20ms intervals from a Discord SocketReader thread.
- The vendored pipecat SileroVADAnalyzer (vendor/pipecat_vad/) uses ONNX, has no torch dependency, returns per-frame confidence in <5ms, and was already in main via voice_confidence() in vad.py, but had no HTTP surface.
What
- available=false (confidence=0.0) when onnxruntime is not installed: graceful degradation
- Empty body returns {confidence: 0.0, available: <bool>, latency_ms: 0.0} (used as a health probe by callers)
- Listed in GET /capabilities as vad_confidence
Callers
The Discord voice adapter (~/.hermes/hermes-agent/plugins/platforms/discord/adapter.py) is updated in parallel:
- _check_mod3_vad_available() now probes POST /v1/vad/confidence with an empty body instead of GET /health → modalities.vad (which was always false because the torch model isn't loaded at startup)
- _check_vad_bargein() now POSTs raw PCM bytes to /v1/vad/confidence instead of WAV-wrapping for /v1/vad
Tests
12 tests in tests/test_vad_confidence_endpoint.py:
- vad_confidence key present
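As a rough illustration of the caller-side change described under Callers, the two checks reduce to reading the response fields. This is a sketch under assumptions, not the adapter's code: the function names and the 0.6 threshold are hypothetical, and only the response keys come from the endpoint contract.

```python
import json


def parse_confidence_response(raw: bytes):
    """Return (confidence, available) from the endpoint's JSON body."""
    payload = json.loads(raw)
    confidence = float(payload.get("confidence", 0.0))
    available = bool(payload.get("available", False))
    return confidence, available


def is_vad_available(empty_body_response: bytes) -> bool:
    """Health probe: an empty-body POST still reports the available flag."""
    _, available = parse_confidence_response(empty_body_response)
    return available


def should_barge_in(frame_response: bytes, threshold: float = 0.6) -> bool:
    """Trigger barge-in only when VAD is available and confident enough.

    The threshold is a hypothetical value; the source does not state one.
    """
    confidence, available = parse_confidence_response(frame_response)
    return available and confidence >= threshold
```

Checking `available` before comparing the confidence matters because an unavailable backend also reports confidence 0.0, which would otherwise be indistinguishable from silence.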