feat(vad): add POST /v1/vad/confidence — per-packet ONNX confidence endpoint#128

Open
chazmaniandinkle wants to merge 2 commits into main from feat/vad-confidence-endpoint

@chazmaniandinkle
Contributor

Why

/v1/vad accepts a WAV multipart upload and runs the full Silero torch path — right for utterance VAD, overkill for 512-sample frames arriving at 20ms intervals from a Discord SocketReader thread.

The vendored pipecat SileroVADAnalyzer (vendor/pipecat_vad/) uses ONNX, has no torch dependency, returns per-frame confidence in <5ms, and was already in main via voice_confidence() in vad.py — but had no HTTP surface.

What

POST /v1/vad/confidence[?sample_rate=16000]
Body:     raw int16 little-endian PCM (512 samples @ 16kHz or 256 @ 8kHz)
Response: {"confidence": float, "available": bool, "latency_ms": float}
  • available=false (confidence=0.0) when onnxruntime is not installed — graceful degradation
  • Empty body → {confidence: 0.0, available: <bool>, latency_ms: 0.0} (used as a health probe by callers)
  • Listed in GET /capabilities as vad_confidence
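For reviewers who want to exercise the contract by hand, here is a minimal client-side sketch. It only packs one frame and builds the request; the base URL, the `Content-Type` header, and the helper names `pack_frame` and `build_confidence_request` are illustrative assumptions, not part of this PR.

```python
import struct
import urllib.request

# Frame sizes taken from the contract above: 512 samples @ 16 kHz, 256 @ 8 kHz.
FRAME_SAMPLES = {16000: 512, 8000: 256}


def pack_frame(samples, sample_rate=16000):
    """Pack a sequence of ints as raw int16 little-endian PCM for one frame."""
    expected = FRAME_SAMPLES[sample_rate]
    if len(samples) != expected:
        raise ValueError(f"expected {expected} samples, got {len(samples)}")
    return struct.pack(f"<{expected}h", *samples)


def build_confidence_request(base_url, pcm, sample_rate=16000):
    """Build (but do not send) a POST request for /v1/vad/confidence."""
    url = f"{base_url.rstrip('/')}/v1/vad/confidence?sample_rate={sample_rate}"
    return urllib.request.Request(
        url,
        data=pcm,
        method="POST",
        # Assumption: the endpoint accepts an opaque byte body with this type.
        headers={"Content-Type": "application/octet-stream"},
    )
```

Sending it would look like `urllib.request.urlopen(req, timeout=0.5)` followed by `json.loads` on the response body; the timeout value here is a guess and should be tuned for the barge-in loop.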

Callers

The Discord voice adapter (~/.hermes/hermes-agent/plugins/platforms/discord/adapter.py) is updated in parallel:

  • _check_mod3_vad_available() now probes POST /v1/vad/confidence with empty body instead of GET /health → modalities.vad (which was always false because the torch model isn't loaded at startup)
  • _check_vad_bargein() now POSTs raw PCM bytes to /v1/vad/confidence instead of WAV-wrapping for /v1/vad
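The empty-body probe can be interpreted along these lines. This is a sketch, not the adapter's actual code: the function name `vad_available_from_probe` is hypothetical, and it assumes the endpoint answers the probe with HTTP 200 and the JSON shape described under "What".

```python
import json


def vad_available_from_probe(status, body):
    """Return True only if the probe succeeded and reported available=true."""
    # Assumption: a healthy endpoint answers the empty-body probe with 200.
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        # Malformed or missing body: treat VAD as unavailable.
        return False
    return isinstance(payload, dict) and payload.get("available") is True
```

Treating every failure mode as "unavailable" keeps the caller on its non-VAD fallback path rather than raising inside the voice thread.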

Tests

12 tests in tests/test_vad_confidence_endpoint.py:

  • Happy path: confidence value, available=true, latency_ms present, sample_rate param forwarded
  • Unavailable path: available=false, confidence=0.0, empty body
  • Capabilities manifest: vad_confidence key present

…ndpoint

Adds a lightweight per-packet VAD endpoint for barge-in loops (Discord VC,
etc.) that do not need full utterance-level speech detection.

Why:
- /v1/vad accepts a WAV multipart upload and runs the full Silero torch path.
  That's the right API for utterance VAD but overkill for 512-sample frames
  arriving at 20ms intervals from a Discord SocketReader thread.
- The vendored pipecat SileroVADAnalyzer (vendor/pipecat_vad/) uses ONNX,
  has no torch dependency, returns per-frame confidence in <5ms, and was
  already present in main but had no HTTP surface.

Contract:
  POST /v1/vad/confidence[?sample_rate=16000]
  Body: raw int16 little-endian PCM (exactly 512 samples @ 16kHz or 256 @ 8kHz)
  Response: {confidence: float, available: bool, latency_ms: float}

  available=false (and confidence=0.0) when onnxruntime is not installed.
  Empty body returns {confidence: 0.0, available: <bool>, latency_ms: 0.0}.

Also:
- Imports is_pipecat_vad_available + voice_confidence from vad.py (already in main)
- Listed in GET /capabilities endpoints manifest as 'vad_confidence'
- 12 tests covering happy path, unavailable path, empty body, sample_rate
  param forwarding, capabilities manifest

Updates: fix mod3/http_api.py file header comment and endpoints dict.

The model __call__ returns shape (batch, frames). With batch_size=1
and a single frame, out[0] is shape (1,) — a 1D array. float() on a
1D array raises 'only 0-dimensional arrays can be converted to Python
scalars', so voice_confidence() was silently returning 0.0 on every
frame. VAD was effectively disabled despite reporting available=true.

Fix: float(np.squeeze(new_confidence).item()), which works for any output
shape that holds a single value (.item() still raises if the array has
more than one element).

Root cause found via direct venv test after the HTTP endpoint showed
0.0 confidence on sine waves and noise. The error was swallowed by the
bare except in voice_confidence().
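The conversion described above can be sketched in isolation as follows. The helper name `to_scalar_confidence` is hypothetical and is not the code in vad.py; it only illustrates why squeezing before `.item()` is robust to the (batch, frames) output shape.

```python
import numpy as np


def to_scalar_confidence(model_output):
    """Collapse a single-element model output of any shape to a Python float."""
    arr = np.squeeze(np.asarray(model_output))
    if arr.size != 1:
        # More than one value means the caller passed a multi-frame batch.
        raise ValueError(f"expected a single confidence value, got shape {arr.shape}")
    return float(arr.item())


# Shape (1, 1), as produced with batch_size=1 and a single frame.
out = np.array([[0.25]], dtype=np.float32)
confidence = to_scalar_confidence(out)
```

Raising on multi-element input, instead of returning 0.0, makes a shape mismatch visible rather than letting it look like silence.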