Streaming Voice Activity Detection over WebSocket using FireRedVAD ONNX models. Includes optional Audio Event Detection (AED) to classify speech segments as speech, music, or noise.
- Python 3.10+
- ONNX model files in `onnx_models/`:
  - `fireredvad_stream_vad_with_cache.onnx` — streaming VAD model
  - `cmvn.ark` — CMVN normalization stats
  - `fireredvad_aed.onnx` — audio event detection model (optional)
```
uv sync
```

The server accepts streaming 16kHz 16-bit mono PCM audio over WebSocket, runs VAD to detect speech segments, and optionally classifies each segment using the AED model.
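Binary audio must arrive as raw little-endian int16 samples. As an illustration (the helper name is hypothetical, not part of this repo), packing float samples into that wire format looks like:

```python
import struct

def float_to_pcm16(samples):
    """Pack float samples in [-1.0, 1.0] into 16-bit little-endian PCM bytes,
    the wire format the server expects for binary messages."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

# Example: one second of 16 kHz silence is 32000 bytes on the wire
payload = float_to_pcm16([0.0] * 16000)
```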
```
python server.py
```

| Flag | Default | Description |
|---|---|---|
| `--host` | `0.0.0.0` | Bind address |
| `--port` | `8765` | WebSocket port |
| `--model` | `onnx_models/fireredvad_stream_vad_with_cache.onnx` | VAD model path |
| `--cmvn` | `onnx_models/cmvn.ark` | CMVN stats path |
| `--aed-model` | `onnx_models/fireredvad_aed.onnx` | AED model path (skipped if not found) |
| `--output-dir` | `vad_output` | Directory for saved audio segments |
Client sends:
- Binary messages: raw int16 little-endian PCM audio at 16kHz
- JSON messages: `{"action": "reset"}` to reset VAD state
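The sending side can be sketched as follows (helper names and the 100 ms chunk size are assumptions, not mandated by the server): slice a 16 kHz mono WAV into fixed-size PCM frames for the binary messages, with the reset action as a JSON text message.

```python
import io
import json
import wave

CHUNK_SAMPLES = 1600  # 100 ms at 16 kHz; an assumed frame size, not required by the protocol

def pcm_chunks(wav_bytes, chunk_samples=CHUNK_SAMPLES):
    """Yield raw int16 little-endian PCM chunks from an in-memory WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if (wf.getframerate(), wf.getsampwidth(), wf.getnchannels()) != (16000, 2, 1):
            raise ValueError("expected 16 kHz 16-bit mono WAV")
        while True:
            frames = wf.readframes(chunk_samples)
            if not frames:
                break
            yield frames  # each chunk goes out as one binary WebSocket message

# The reset control message is sent as a JSON text message:
RESET_MESSAGE = json.dumps({"action": "reset"})
```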
Server sends:
Speech start:

```json
{"event": "speech_start", "time": 1.234}
```

Speech end (with AED when enabled):

```json
{
  "event": "speech_end",
  "start": 1.234,
  "end": 3.456,
  "file": "vad_output/session_.../segment_0001_1.23s_3.46s.wav",
  "aed_label": "speech",
  "aed_probs": {"speech": 0.95, "music": 0.03, "noise": 0.02}
}
```

The client streams audio to the server and prints VAD events.
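Handling these events on the receiving side can be sketched like this (the summary strings are illustrative; the real client may print differently):

```python
import json

def summarize_event(message):
    """Turn a JSON event message from the server into a one-line summary."""
    evt = json.loads(message)
    if evt["event"] == "speech_start":
        return f"speech started at {evt['time']:.2f}s"
    if evt["event"] == "speech_end":
        # aed_label is present only when the AED model is loaded
        label = evt.get("aed_label", "n/a")
        return f"speech {evt['start']:.2f}s-{evt['end']:.2f}s ({label})"
    return f"unhandled event: {evt.get('event')}"
```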
```
python client.py --file audio.wav
```

The file must be 16kHz 16-bit mono WAV. Convert with ffmpeg if needed:
```
ffmpeg -i input.wav -ar 16000 -ac 1 -acodec pcm_s16le audio.wav
```

```
python client.py --mic
```

Press Ctrl+C to stop.
| Flag | Default | Description |
|---|---|---|
| `--uri` | `ws://localhost:8765` | WebSocket server URI |
| `--file` | | WAV file to stream |
| `--mic` | | Stream from microphone |