
Commit 1e1761b

Streaming VAD and Speech Segmentation (#110)
### Why is this change needed?

Taking inspiration from the Silero reference implementation
(https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py),
this change updates our segmentation implementation and adds support for streaming VAD.

```bash
% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze] • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze] • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze] • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze] • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze] • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze] • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze] • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze] • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze] • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze] • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze] • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze] • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav -af silencedetect=noise=-30dB:d=0.5 -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
  ...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
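The sample indices and second timestamps in the offline log above are linked by the file's fixed 16 kHz sample rate (`seconds = sample / 16000`). As a quick sanity check, here is a minimal sketch of the conversion; the helper is illustrative and not part of the library:

```swift
// Sketch: map vad-analyze sample indices to seconds at a fixed 16 kHz rate.
let sampleRate = 16_000.0

func seconds(fromSample index: Int) -> Double {
    Double(index) / sampleRate
}

// Segment #1 from the offline run: samples 18880-42560
print(String(format: "%.2fs-%.2fs",
             seconds(fromSample: 18_880),
             seconds(fromSample: 42_560)))
// 1.18s-2.66s
```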
1 parent 86522ab commit 1e1761b

15 files changed

Lines changed: 1286 additions & 312 deletions

Documentation/CLI.md

Lines changed: 10 additions & 2 deletions
@@ -42,13 +42,22 @@ swift run fluidaudio diarization-benchmark --dataset ami-sdm \
 ## VAD
 
 ```bash
+# Offline segmentation with seconds output (default mode)
+swift run fluidaudio vad-analyze path/to/audio.wav
+
+# Streaming only with 128 ms chunks and a custom threshold (timestamps emitted in seconds)
+swift run fluidaudio vad-analyze path/to/audio.wav --streaming --threshold 0.65 --min-silence-ms 400
+
 # Run VAD benchmark (mini50 dataset by default)
 swift run fluidaudio vad-benchmark --num-files 50 --threshold 0.3
 
-# Save results and enable debug output
+# Save benchmark results and enable debug output
 swift run fluidaudio vad-benchmark --all-files --output vad_results.json --debug
 ```
 
+`swift run fluidaudio vad-analyze --help` lists every tuning option (padding,
+negative threshold overrides, max-duration splitting, etc.).
+
 ## Datasets
 
 ```bash
@@ -58,4 +67,3 @@ swift run fluidaudio download --dataset librispeech-test-other
 swift run fluidaudio download --dataset ami-sdm
 swift run fluidaudio download --dataset vad
 ```
-
Documentation/Guides/MCP.md

Lines changed: 0 additions & 24 deletions
This file was deleted.
Lines changed: 153 additions & 34 deletions
@@ -1,59 +1,178 @@
 # Voice Activity Detection (VAD)
 
-The current VAD APIs require careful tuning for your specific use case. If you need help integrating VAD, reach out in our Discord channel.
+Fluid Audio ships the Silero VAD converted for Core ML together with Silero-style
+timestamp extraction and streaming hysteresis. If you need help tuning the
+parameters for your use case, reach out on Discord.
 
-Our goal is to provide a streamlined API similar to Apple's upcoming SpeechDetector in [OS26](https://developer.apple.com/documentation/speech/speechdetector).
+## Quick Start
 
-## Quick Start (Code)
+Need chunk-level probabilities or state for custom pipelines? Call `process(_:)`
+to inspect every 256 ms hop:
+
+```swift
+let results = try await manager.process(samples)
+for (index, chunk) in results.enumerated() {
+    print(
+        String(
+            format: "Chunk %02d: prob=%.3f, inference=%.4fs",
+            index,
+            chunk.probability,
+            chunk.processingTime
+        )
+    )
+}
+```
+
+## Offline Segmentation (Code)
+
+`VadManager` can now emit ready-to-use speech intervals directly from PCM
+samples. The segmentation logic mirrors the Silero reference implementation,
+including minimum speech duration, silence padding, and max-duration splitting.
 
 ```swift
 import FluidAudio
 
-// Programmatic VAD over an audio file
 Task {
-    // 1) Initialize VAD (async loads Silero model)
-    let vad = try await VadManager(
-        config: VadConfig(threshold: 0.85, debugMode: false)
+    let manager = try await VadManager(
+        config: VadConfig(threshold: 0.75)
+    )
+
+    // Convert any supported file to 16 kHz mono Float32
+    let audioURL = URL(fileURLWithPath: "path/to/audio.wav")
+    let samples = try AudioConverter().resampleAudioFile(audioURL)
+
+    // Tune segmentation behavior with VadSegmentationConfig
+    var segmentation = VadSegmentationConfig.default
+    segmentation.minSpeechDuration = 0.25
+    segmentation.minSilenceDuration = 0.4
+    segmentation.speechPadding = 0.12
+
+    let segments = try await manager.segmentSpeech(samples, config: segmentation)
+    for (index, segment) in segments.enumerated() {
+        print(String(
+            format: "Segment %02d: %.2f–%.2fs",
+            index + 1,
+            segment.startTime,
+            segment.endTime
+        ))
+    }
+
+    // Need audio chunks instead of timestamps?
+    let clips = try await manager.segmentSpeechAudio(samples, config: segmentation)
+    print("Extracted \(clips.count) buffered segments ready for ASR")
+}
+```
+
+Need chunk-level probabilities for each 256 ms hop? Use `process(_:)` and inspect
+`VadResult` directly:
+
+```swift
+let results = try await manager.process(samples)
+for (index, chunk) in results.enumerated() {
+    print(
+        String(
+            format: "Chunk %02d: prob=%.3f, inference=%.4fs",
+            index,
+            chunk.probability,
+            chunk.processingTime
+        )
     )
+}
+```
+
+Key knobs in `VadSegmentationConfig`:
+- `minSpeechDuration`: discard very short bursts.
+- `minSilenceDuration`: silence length required to close a segment.
+- `maxSpeechDuration`: automatically split long spans using the last detected silence (default 14 s).
+- `speechPadding`: context added on both sides of each returned segment.
+- `negativeThreshold`/`negativeThresholdOffset`: control hysteresis the same way as Silero's `threshold`/`neg_threshold`.
+
+### Measuring Offline RTF
 
-    // 2) Process any supported file; conversion to 16 kHz mono is automatic
-    let url = URL(fileURLWithPath: "path/to/audio.wav")
-    let results = try await vad.process(url)
-
-    // 3) Convert per-frame decisions into segments (512-sample frames @ 16 kHz)
-    let sampleRate = 16000.0
-    let frame = 512.0
-
-    var startIndex: Int? = nil
-    for (i, r) in results.enumerated() {
-        if r.isVoiceActive {
-            startIndex = startIndex ?? i
-        } else if let s = startIndex {
-            let startSec = (Double(s) * frame) / sampleRate
-            let endSec = (Double(i + 1) * frame) / sampleRate
-            print(String(format: "Speech: %.2f–%.2fs", startSec, endSec))
-            startIndex = nil
+If you prefer to keep the per-chunk `VadResult` output, you can measure the
+real-time factor (RTFx) of non-streaming runs by comparing total inference time
+with the audio duration:
+
+```swift
+let results = try await manager.process(samples)
+let totalInference = results.reduce(0.0) { $0 + $1.processingTime }
+let audioSeconds = Double(samples.count) / Double(VadManager.sampleRate)
+let rtf = audioSeconds / totalInference
+print(String(format: "VAD RTFx: %.1f", rtf))
+```
+
+`VadResult.processingTime` is reported per 4096-sample chunk, so summing across
+the array yields the full pass latency.
+
+## Streaming API
+
+For streaming workloads you control the chunk size and maintain a
+`VadStreamState`. Each call emits at most one `VadStreamEvent` describing a
+speech start or end boundary.
+
+```swift
+import FluidAudio
+
+Task {
+    let manager = try await VadManager()
+    var state = await manager.makeStreamState()
+
+    for chunk in microphoneChunks { // chunk length ~256 ms at 16 kHz
+        let result = try await manager.processStreamingChunk(
+            chunk,
+            state: state,
+            config: .default,
+            returnSeconds: true,
+            timeResolution: 2
+        )
+
+        state = result.state
+        if let event = result.event {
+            switch event.kind {
+            case .speechStart:
+                print("Speech began at \(event.time ?? 0) s")
+            case .speechEnd:
+                print("Speech ended at \(event.time ?? 0) s")
+            }
         }
     }
 }
 ```
 
 Notes:
-- You can also call `process(_ buffer: AVAudioPCMBuffer)` or `process(_ samples: [Float])`.
-- Frame size is `512` samples (32 ms at 16 kHz). Threshold defaults to `0.85`.
+- Stream chunks do not need to be exactly 4096 samples; choose what matches your input cadence.
+- Call `makeStreamState()` whenever you reset your audio stream (equivalent to Silero's `reset_states`).
+- When requesting seconds (`returnSeconds: true`), timestamps are rounded using `timeResolution` decimal places.
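The notes above leave the streaming chunk size up to the caller. As a sketch of how a `microphoneChunks`-style sequence could be produced (the helper below is illustrative, not a library API), a sample buffer can be sliced into fixed-size pieces, where 4096 samples correspond to 256 ms at 16 kHz:

```swift
// Sketch: slice a 16 kHz Float buffer into fixed-size streaming chunks.
// 4096 samples ≈ 256 ms at 16 kHz; the final chunk may be shorter.
func makeChunks(_ samples: [Float], chunkSize: Int = 4096) -> [[Float]] {
    stride(from: 0, to: samples.count, by: chunkSize).map { start in
        Array(samples[start..<min(start + chunkSize, samples.count)])
    }
}

let chunks = makeChunks([Float](repeating: 0, count: 10_000))
print(chunks.map(\.count))
// [4096, 4096, 1808]
```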
 
 ## CLI
 
+Start with the general-purpose `process` command, which runs the diarization
+pipeline (and therefore VAD) end-to-end on a single file:
+
 ```bash
-# Run VAD benchmark (mini50 dataset by default)
-swift run fluidaudio vad-benchmark --num-files 50 --threshold 0.3
+swift run fluidaudio process path/to/audio.wav
+```
 
-# Save results and enable debug output
-swift run fluidaudio vad-benchmark --all-files --output vad_results.json --debug
+Once you need to experiment with the VAD-specific heuristics directly, use the
+CLI commands below:
 
-# VOiCES subset mixed-condition benchmark (high-precision setting)
-swift run fluidaudio vad-benchmark --dataset voices-subset --all-files --threshold 0.85
+```bash
+# Inspect offline segments (default mode is offline only)
+swift run fluidaudio vad-analyze path/to/audio.wav
 
-# Download VAD dataset if needed
-swift run fluidaudio download --dataset vad
+# Streaming only, 128 ms chunks, tighter silence rules (timestamps are emitted in seconds)
+swift run fluidaudio vad-analyze path/to/audio.wav --streaming --min-silence-ms 300
+
+# Run both offline + streaming in one pass
+swift run fluidaudio vad-analyze path/to/audio.wav --mode both
+
+# Classic benchmark tooling remains available
+swift run fluidaudio vad-benchmark --num-files 50 --threshold 0.3
 ```
+
+`swift run fluidaudio vad-analyze --help` prints the full list of tuning
+options, including negative-threshold overrides and max-duration splitting.
+Offline runs emit an RTFx summary calculated from per-chunk inference time. Use
+`--mode both` if you also want to see streaming start/end events in the same run.
+
+Datasets for benchmarking can be fetched with `swift run fluidaudio download --dataset vad`.

Documentation/VAD/Segmentation.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+# Configuration fields
+
+Configuration for turning raw VAD probabilities into stable speech segments.
+
+This struct applies rules for minimum durations, thresholds, and hysteresis to avoid jittery cuts and to produce clean, ASR-ready segments.
+
+```swift
+public struct VadSegmentationConfig: Sendable {
+    /// Minimum length of detected speech to keep as a segment (default: 0.15s).
+    /// Prevents clicks or coughs from being treated as speech.
+    public var minSpeechDuration: TimeInterval
+
+    /// Minimum silence required to end a segment (default: 0.75s).
+    /// Prevents early cut-offs when a speaker pauses briefly.
+    public var minSilenceDuration: TimeInterval
+
+    /// Maximum length of a single speech segment (default: 14s).
+    /// Segments longer than this will be forcibly split to match ASR model limits.
+    public var maxSpeechDuration: TimeInterval
+
+    /// Extra padding added before and after each detected speech segment (default: 0.1s).
+    /// Keeps context around words so they aren’t clipped.
+    public var speechPadding: TimeInterval
+
+    /// Probability threshold below which audio is treated as silence (default: 0.3).
+    /// Lower = stricter silence detection, higher = more tolerant.
+    public var silenceThresholdForSplit: Float
+
+    /// Explicit override for the *exit* hysteresis threshold (default: nil).
+    /// If not set, the system computes it automatically from the base threshold minus `negativeThresholdOffset`.
+    public var negativeThreshold: Float?
+
+    /// How far below the base threshold the *exit* threshold should be (default: 0.15).
+    /// Example: if entry = 0.5, exit = 0.35. Prevents rapid flipping on noisy inputs.
+    public var negativeThresholdOffset: Float
+
+    /// Minimum silence enforced when splitting a max-length segment (default: 0.098s).
+    /// Ensures forced splits don’t land mid-phoneme.
+    public var minSilenceAtMaxSpeech: TimeInterval
+
+    /// If true, try to split at the longest silence near the max duration cutoff.
+    /// Produces cleaner segment boundaries compared to a hard cut.
+    public var useMaxPossibleSilenceAtMaxSpeech: Bool
+
+    public static let `default` = VadSegmentationConfig()
+
+    public init(
+        minSpeechDuration: TimeInterval = 0.15,
+        minSilenceDuration: TimeInterval = 0.75,
+        maxSpeechDuration: TimeInterval = 14.0,
+        speechPadding: TimeInterval = 0.1,
+        silenceThresholdForSplit: Float = 0.3,
+        negativeThreshold: Float? = nil,
+        negativeThresholdOffset: Float = 0.15,
+        minSilenceAtMaxSpeech: TimeInterval = 0.098,
+        useMaxPossibleSilenceAtMaxSpeech: Bool = true
+    ) {
+        self.minSpeechDuration = minSpeechDuration
+        self.minSilenceDuration = minSilenceDuration
+        self.maxSpeechDuration = maxSpeechDuration
+        self.speechPadding = speechPadding
+        self.silenceThresholdForSplit = silenceThresholdForSplit
+        self.negativeThreshold = negativeThreshold
+        self.negativeThresholdOffset = negativeThresholdOffset
+        self.minSilenceAtMaxSpeech = minSilenceAtMaxSpeech
+        self.useMaxPossibleSilenceAtMaxSpeech = useMaxPossibleSilenceAtMaxSpeech
+    }
+
+    /// Computes the working negative threshold for hysteresis:
+    /// - If `negativeThreshold` is set, that value is used.
+    /// - Otherwise, it is computed as (baseThreshold – negativeThresholdOffset).
+    /// - This creates a "sticky zone" between thresholds:
+    ///   - Enter speech when prob > baseThreshold
+    ///   - Exit speech when prob < negativeThreshold
+    ///   - Stay in current state in between
+    public func effectiveNegativeThreshold(baseThreshold: Float) -> Float {
+        if let override = negativeThreshold {
+            return override
+        }
+        return max(baseThreshold - negativeThresholdOffset, 0.01)
+    }
+}
+```
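To make the "sticky zone" concrete, here is a small self-contained sketch that mirrors (rather than calls) the hysteresis rule documented above; the `State` enum and `step` function are illustrative only, not part of the library:

```swift
// Sketch of the hysteresis rule: enter speech above the base threshold,
// exit below the negative threshold, and hold the current state in between.
func negativeThreshold(base: Float, explicitOverride: Float? = nil, offset: Float = 0.15) -> Float {
    explicitOverride ?? max(base - offset, 0.01)
}

enum State { case silence, speech }

func step(_ state: State, prob: Float, base: Float, exit: Float) -> State {
    switch state {
    case .silence: return prob > base ? .speech : .silence
    case .speech: return prob < exit ? .silence : .speech
    }
}

let base: Float = 0.5
let exit = negativeThreshold(base: base)  // ~0.35

var state = State.silence
state = step(state, prob: 0.60, base: base, exit: exit)  // enters speech
state = step(state, prob: 0.40, base: base, exit: exit)  // holds: 0.40 is inside the sticky zone
state = step(state, prob: 0.30, base: base, exit: exit)  // exits: 0.30 < 0.35
```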
