opus-sm is a Rust CLI for speech-vs-music analysis on Parquet datasets with embedded audio.
It uses the built-in speech/music discriminator from the open source Opus codec to compute
framewise music probabilities, emit speech/music segmentation, strip detected music regions,
and split Parquet rows into speech and music outputs.
- reads Parquet files containing embedded audio rows
- supports embedded audio stored as:
  - WAV bytes
  - Ogg Opus bytes
  - MP3 bytes
- runs the Opus speech/music discriminator frame by frame
- prints framewise music probabilities
- supports smoothed probabilities before segmentation
- supports hysteresis segmentation with separate high/low thresholds
- supports minimum run lengths for speech/music segments
- can remove music regions from each row and write updated Parquet output
- can split rows into separate speech and music Parquet files
- parallelizes row-level work within each Parquet batch using Rayon
The tool follows the same Arrow/Parquet access pattern as opusify and expects an audio
struct column containing at least:
- audio.bytes
- audio.path
Other fields in the schema are preserved when writing output.
For strip-music, the output Parquet keeps the same schema and rewrites only:
- audio.bytes
- audio.path
For separate-sm, the speech and music output files keep the same schema and row structure as
the input, with rows routed by the configured row-decision rule.
Requirements:
- Rust toolchain
- C toolchain for building the small Opus analysis shim
Build:
cargo build --release
Top-level help:
cargo run -- --help
Print framewise probabilities for every audio row.
cargo run -- analyze \
  --input data/input.parquet
Directory input:
cargo run -- analyze \
  --input data/parquet-dir \
  --smooth-window 5
Output format:
FRAME <file> <row> <frame_index> <start_seconds> <end_seconds> <music_probability> <activity_probability>
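The effect of `--smooth-window` can be pictured with a small moving-average sketch. This is an illustration only: the function name `smooth` is hypothetical, and it assumes a centered window that is truncated at the edges, which may differ from how the tool actually positions its window.

```rust
/// Smooth per-frame music probabilities with a moving average.
/// Assumption: the window is centered on each frame and truncated at the
/// start and end of the row; the real tool may window differently.
/// With this centering, an even `window` covers `window + 1` frames.
fn smooth(probs: &[f32], window: usize) -> Vec<f32> {
    // A window of 0 or 1 leaves the probabilities unchanged.
    if window <= 1 || probs.is_empty() {
        return probs.to_vec();
    }
    let half = window / 2;
    (0..probs.len())
        .map(|i| {
            let start = i.saturating_sub(half);
            let end = (i + half + 1).min(probs.len());
            let slice = &probs[start..end];
            slice.iter().sum::<f32>() / slice.len() as f32
        })
        .collect()
}
```

With a window of 3, an isolated spike such as `[0.0, 1.0, 0.0]` is flattened to roughly `[0.5, 0.33, 0.5]`, which is why smoothing tends to reduce single-frame flips before segmentation.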
Print framewise probabilities and thresholded speech/music segments.
cargo run -- segment \
  --input data/parquet-dir \
  --threshold 0.6 \
  --smooth-window 5 \
  --low-threshold 0.45 \
  --high-threshold 0.65 \
  --min-music-frames 3 \
  --min-speech-frames 3
Additional output format:
SEGMENT <speech|music> <start_frame> <end_frame> <start_seconds> <end_seconds>
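Hysteresis with separate low and high thresholds can be sketched as a two-state machine. The names `Kind`, `hysteresis_labels`, and `runs` below are hypothetical, the initial state is assumed to be speech, and run ends are assumed to be exclusive; none of this is taken from the tool's source.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Speech,
    Music,
}

/// Label each frame with two thresholds: switch to music only when the
/// probability reaches `high`, and back to speech only when it drops to `low`.
/// Assumption: every row starts in the speech state.
fn hysteresis_labels(probs: &[f32], low: f32, high: f32) -> Vec<Kind> {
    let mut state = Kind::Speech;
    let mut labels = Vec::with_capacity(probs.len());
    for &p in probs {
        state = match state {
            Kind::Speech if p >= high => Kind::Music,
            Kind::Music if p <= low => Kind::Speech,
            unchanged => unchanged,
        };
        labels.push(state);
    }
    labels
}

/// Collapse frame labels into (kind, start_frame, end_frame) runs.
/// Assumption: `end_frame` is exclusive.
fn runs(labels: &[Kind]) -> Vec<(Kind, usize, usize)> {
    let mut out = Vec::new();
    let mut start = 0;
    for i in 1..=labels.len() {
        if i == labels.len() || labels[i] != labels[start] {
            out.push((labels[start], start, i));
            start = i;
        }
    }
    out
}
```

Values between the two thresholds keep the current state, so a probability hovering around 0.5 does not toggle the label on every frame the way a single threshold would.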
Remove music segments from each row and write a new Parquet file or tree with the same schema.
cargo run -- strip-music \
  --input data/parquet-dir \
  --output data/speech-only \
  --threshold 0.6 \
  --low-threshold 0.45 \
  --high-threshold 0.65 \
  --smooth-window 5 \
  --fade-ms 15
Behavior:
- rows are preserved
- only detected music spans are removed from the audio payload
- optional fades can be applied at strip boundaries to reduce clicks
- output audio is written back in the same container family as the input, except that MP3 input is rewritten as WAV:
  - WAV in, WAV out
  - Ogg Opus in, Ogg Opus out
  - MP3 in, WAV out
- WAV output preserves the original WAV sample format and bit depth when supported
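The stripping behavior above can be illustrated on a mono sample buffer. This is a sketch under stated assumptions: `strip_ranges`, `push_piece`, and `fade_samples` are hypothetical names, the fades are linear, and the cut ranges are assumed to be sorted, non-empty, and non-overlapping. The tool's actual fade shape and boundary handling are not documented here.

```rust
/// Convert a fade duration in milliseconds to a sample count.
fn fade_samples(fade_ms: f32, sample_rate: u32) -> usize {
    (fade_ms * sample_rate as f32 / 1000.0).round() as usize
}

/// Remove `[start, end)` sample ranges from a mono buffer and fade the kept
/// audio out before each cut and back in after it.
/// Assumption: `cuts` is sorted, non-empty ranges, non-overlapping.
fn strip_ranges(samples: &[f32], cuts: &[(usize, usize)], fade_len: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(samples.len());
    let mut pos = 0;
    for &(start, end) in cuts {
        let (start, end) = (start.min(samples.len()), end.min(samples.len()));
        if start > pos {
            // Kept audio before this cut: fade in if it follows an earlier cut.
            push_piece(&mut out, &samples[pos..start], pos > 0, true, fade_len);
        }
        pos = pos.max(end);
    }
    if pos < samples.len() {
        // Trailing kept audio: no fade-out at the natural end of the row.
        push_piece(&mut out, &samples[pos..], pos > 0, false, fade_len);
    }
    out
}

/// Append one kept piece, applying linear fades where requested.
fn push_piece(out: &mut Vec<f32>, piece: &[f32], fade_in: bool, fade_out: bool, fade_len: usize) {
    let n = piece.len();
    // Cap the fade at half the piece so fade-in and fade-out never overlap.
    let fade = fade_len.min(n / 2);
    for (i, &s) in piece.iter().enumerate() {
        let mut gain = 1.0f32;
        if fade_in && i < fade {
            gain *= i as f32 / fade as f32;
        }
        if fade_out && i >= n - fade {
            gain *= (n - 1 - i) as f32 / fade as f32;
        }
        out.push(s * gain);
    }
}
```

At 48 kHz, `--fade-ms 15` corresponds to 720 samples on each side of a cut in this sketch.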
Segment each row into speech-only chunks and emit one output row per chunk.
cargo run -- vad \
  --input data/parquet-dir \
  --output data/vad-chunks \
  --threshold 0.8 \
  --chunk-format ogg-opus \
  --min-speech-seconds 0.5 \
  --max-speech-seconds 30.0
Behavior:
- one output row is written for each detected speech chunk
- audio.bytes is replaced with the chunk audio payload
- audio.path is rewritten as ..._chunkN.ext
- chunk duration is recomputed when a top-level duration column exists
- top-level transcription is replaced with "-" when that column exists
- --chunk-format auto preserves the current default behavior:
  - WAV input chunks stay WAV
  - Ogg Opus input chunks stay Ogg Opus
  - MP3 input chunks are written as WAV
- --chunk-format wav forces WAV chunk output
- --chunk-format ogg-opus forces Ogg Opus chunk output
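How `--min-speech-seconds` and `--max-speech-seconds` might interact can be sketched as a planning step over detected speech spans. The function `plan_chunks` is hypothetical. It assumes short spans are dropped before splitting and that long spans are cut into equal-length pieces; the tool may instead cut at other points, or apply the two limits in a different order.

```rust
/// Turn detected speech spans (start, end) in seconds into output chunks.
/// Assumptions: spans shorter than `min_s` are dropped first, and spans
/// longer than `max_s` are cut into equal pieces no longer than `max_s`.
/// `max_s` must be greater than zero.
fn plan_chunks(spans: &[(f64, f64)], min_s: f64, max_s: f64) -> Vec<(f64, f64)> {
    let mut out = Vec::new();
    for &(start, end) in spans {
        let dur = end - start;
        if dur < min_s {
            continue;
        }
        // Smallest number of equal pieces that keeps each within `max_s`.
        let pieces = (dur / max_s).ceil().max(1.0) as usize;
        let step = dur / pieces as f64;
        for k in 0..pieces {
            out.push((start + k as f64 * step, start + (k + 1) as f64 * step));
        }
    }
    out
}
```

Under these assumptions, a 0.3 s span is discarded with `--min-speech-seconds 0.5`, and a 60 s span becomes two 30 s chunks with `--max-speech-seconds 30.0`, each of which would be written as its own output row.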
Split rows into two Parquet outputs based on a configurable row-level decision rule.
Max-score routing:
cargo run -- separate-sm \
  --input data/parquet-dir \
  --speech-output data/speech \
  --music-output data/music \
  --threshold 0.6 \
  --row-decision max
Fraction-based routing:
cargo run -- separate-sm \
  --input data/parquet-dir \
  --speech-output data/speech \
  --music-output data/music \
  --threshold 0.6 \
  --row-decision fraction \
  --row-fraction 0.25
Supported row-decision modes:
- max
- mean
- median
- fraction
For fraction, the row is classified as music when the fraction of frames with
music_probability >= threshold is at least row-fraction.
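The row-level rule can be sketched as follows. Only the `fraction` rule is spelled out in the text above; treating `max`, `mean`, and `median` as "compute that statistic over the frame probabilities and compare it to the threshold" is an assumption, as is classifying an empty row as speech. The enum `RowDecision` and function `row_is_music` are hypothetical names, not the library's API.

```rust
#[derive(Clone, Copy, Debug)]
enum RowDecision {
    Max,
    Mean,
    Median,
    Fraction,
}

/// Decide whether a whole row should be routed to the music output.
fn row_is_music(probs: &[f32], mode: RowDecision, threshold: f32, row_fraction: f32) -> bool {
    // Assumption: a row with no frames is treated as speech.
    if probs.is_empty() {
        return false;
    }
    let n = probs.len() as f32;
    match mode {
        // Assumption: the statistic is compared against `threshold`.
        RowDecision::Max => probs.iter().cloned().fold(f32::MIN, f32::max) >= threshold,
        RowDecision::Mean => probs.iter().sum::<f32>() / n >= threshold,
        RowDecision::Median => {
            let mut sorted = probs.to_vec();
            sorted.sort_by(|a, b| a.total_cmp(b));
            let mid = sorted.len() / 2;
            let median = if sorted.len() % 2 == 0 {
                (sorted[mid - 1] + sorted[mid]) / 2.0
            } else {
                sorted[mid]
            };
            median >= threshold
        }
        // As described above: share of frames at or over the threshold.
        RowDecision::Fraction => {
            let hits = probs.iter().filter(|&&p| p >= threshold).count() as f32;
            hits / n >= row_fraction
        }
    }
}
```

For frame probabilities `[0.1, 0.2, 0.7, 0.9]` with a threshold of 0.6, half of the frames qualify, so `fraction` with `--row-fraction 0.25` routes the row to music, while a cutoff of 0.75 would route it to speech.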
opus-sm can also be used as a Rust library.
Main exported convenience functions:
- decode
- analyze_bytes
- analyze_decoded
- segment_bytes
- classify_bytes
- music_score_bytes
- strip_music_bytes
- vad_chunks_bytes
Main exported analysis types:
- AnalysisOptions
- DecisionOptions
- FrameProbability
- Segment
- SegmentKind
- RowDecisionMode
- DecodedAudio
Minimal example:
use anyhow::Result;
use opus_sm::{AnalysisOptions, DecisionOptions, RowDecisionMode, analyze_bytes, segment_bytes};

fn run(audio_bytes: &[u8]) -> Result<()> {
    // Framewise probabilities, smoothed over a 5-frame window.
    let analysis = analyze_bytes(audio_bytes, AnalysisOptions { smooth_window: 5 })?;
    // Same analysis, followed by hysteresis segmentation.
    let segmented = segment_bytes(
        audio_bytes,
        AnalysisOptions { smooth_window: 5 },
        DecisionOptions {
            threshold: 0.6,
            low_threshold: 0.45,
            high_threshold: 0.65,
            min_music_frames: 3,
            min_speech_frames: 3,
            row_decision: RowDecisionMode::Max,
            row_fraction: 0.5,
        },
    )?;
    println!("frames: {}", analysis.probabilities.len());
    println!("segments: {}", segmented.segments.len());
    Ok(())
}
Common options:
- --input <PATH>: input Parquet file or directory
- --batch-size <N>: Arrow batch size for the Parquet reader, default 256
Analysis options:
- --smooth-window <N>: moving-average smoothing window over frame probabilities, default 1
Decision options:
- --threshold <FLOAT>: base probability threshold in [0.0, 1.0]
- --high-threshold <FLOAT>: hysteresis switch-to-music threshold
- --low-threshold <FLOAT>: hysteresis switch-to-speech threshold
- --min-music-frames <N>: suppress music runs shorter than N frames
- --min-speech-frames <N>: suppress speech runs shorter than N frames
- --row-decision <MODE>: row-level classifier for separate-sm, one of max, mean, median, fraction
- --row-fraction <FLOAT>: fraction cutoff for row-decision=fraction, default 0.5
strip-music options:
- --output <PATH>: output file or root directory
- --fade-ms <FLOAT>: fade-in/fade-out duration applied at strip boundaries
separate-sm options:
- --speech-output <PATH>: destination file or root directory for speech rows
- --music-output <PATH>: destination file or root directory for music rows
vad options:
- --output <PATH>: output file or root directory for chunk rows
- --fade-ms <FLOAT>: fade-in/fade-out duration applied at chunk boundaries
- --min-speech-seconds <FLOAT>: drop detected speech chunks shorter than this
- --max-speech-seconds <FLOAT>: split long speech chunks to at most this duration
- --chunk-format <FORMAT>: chunk output container, one of auto, wav, ogg-opus
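The run-length options `--min-music-frames` and `--min-speech-frames` described above can be pictured as a pass over per-frame labels. In this sketch the names `Kind` and `suppress_short_runs` are hypothetical, and it assumes a too-short run is absorbed into the run before it, with the first run in a row left as-is. The tool's actual merge policy may differ.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Speech,
    Music,
}

/// Relabel runs that are shorter than the minimum for their kind.
/// Assumption: a short run takes the label of the preceding run; the first
/// run is never relabelled because it has no predecessor.
fn suppress_short_runs(labels: &mut [Kind], min_music: usize, min_speech: usize) {
    let mut start = 0;
    while start < labels.len() {
        let kind = labels[start];
        // Find the end (exclusive) of the run that begins at `start`.
        let mut end = start;
        while end < labels.len() && labels[end] == kind {
            end += 1;
        }
        let min = match kind {
            Kind::Music => min_music,
            Kind::Speech => min_speech,
        };
        if end - start < min && start > 0 {
            let prev = labels[start - 1];
            labels[start..end].fill(prev);
        }
        start = end;
    }
}
```

With both minimums set to 3, a single music frame inside a long speech stretch is relabelled as speech, so it never becomes its own segment.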
- frame analysis uses Opus’s internal music_prob analysis path via a small C shim
- analysis runs on 20 ms frames at 48 kHz
- non-48 kHz input is resampled before analysis
- multichannel inputs above stereo are downmixed to mono for analysis
- row-level decode/analyze/transform work is parallelized per batch using Rayon
- output ordering remains deterministic and follows original row order
- Ogg Opus decode uses pre_skip and also trims the decoded tail using the final granule position
- Ogg Opus output writes final-page granule positions based on the true output sample count
- MP3 decode uses symphonia
- MP3 input is analyzed directly, but rewritten audio is emitted as WAV or Ogg Opus because the tool does not encode MP3 output
- WAV output preserves the original WAV container format where supported:
  - 32-bit float
  - 8-bit PCM
  - 16-bit PCM
  - 24-bit PCM
  - 32-bit PCM
- Ogg Opus decode/encode currently supports mono and stereo only
- audio above stereo is analyzed by mono downmix rather than per-channel classification
- MP3 input is supported for decoding only; rewritten MP3 rows are emitted as WAV
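The frame timing and mono downmix mentioned in the notes above follow from simple arithmetic: a 20 ms frame at 48 kHz is 960 samples. The sketch below uses hypothetical helper names and assumes an equal-weight average across interleaved channels for the downmix; the tool's actual channel weighting is not stated. The downmix helper is written for any channel count, although the tool only applies it to inputs above stereo.

```rust
const SAMPLE_RATE: usize = 48_000;
const FRAME_MS: usize = 20;
/// Samples per analysis frame: 48_000 * 20 / 1000 = 960.
const FRAME_SAMPLES: usize = SAMPLE_RATE * FRAME_MS / 1000;

/// Start and end time in seconds of the frame at `frame_index`.
fn frame_bounds_seconds(frame_index: usize) -> (f64, f64) {
    let start = (frame_index * FRAME_SAMPLES) as f64 / SAMPLE_RATE as f64;
    let end = ((frame_index + 1) * FRAME_SAMPLES) as f64 / SAMPLE_RATE as f64;
    (start, end)
}

/// Average interleaved channels into a single mono channel.
/// Assumption: every channel gets equal weight. Trailing samples that do
/// not form a complete interleaved frame are ignored.
fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be positive");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}
```

Under these definitions, frame 50 starts exactly 1.0 s into the row, which is how a frame index can be mapped to start and end times in seconds.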
Checked with:
cargo check
cargo run -- --help
cargo run -- segment --help
cargo run -- strip-music --help
cargo test vad_accepts_mp3_rows_from_radio_free_dataset -- --nocapture