yoavram-lab/voxspace

VoxSpace

Evaluating whether self-supervised speech model embeddings capture perceptual categories in toddler vocalizations and adult speech.

Goal

Self-supervised speech models (HuBERT, WavLM, AVES-bio) learn acoustic representations without phonetic labels. We ask whether these representations distinguish perceptual categories in prelinguistic vocalizations. We extract embeddings from every layer of each model, measure category separability with silhouette scores and leave-one-out (LOO) k-NN classification, and compare the results across layers, models, datasets (toddler vs. adult), and label granularity (class vs. communicative context).

The key prediction is that layer-wise improvement in category separability should be strong for adult speech but weak or absent for toddler vocalizations, supporting the claim that semantic interpretation of toddler vocalizations depends on context beyond the acoustic signal alone.

Methods

  • Models: HuBERT (facebook/hubert-base-ls960), WavLM (microsoft/wavlm-base-plus), and AVES-bio (aves-base-bio, trained on animal vocalizations), all pretrained base models with no finetuning
  • Embeddings: Mean-pooled hidden states from all 13 layers (CNN + 12 transformer), 768D per layer
  • Silhouette score: Cosine distance on raw and PCA-50D embeddings, with bootstrap 95% CI
  • Classification: Leave-one-out k-NN (k=1,3,5) with cosine distance, bootstrap CI, Spearman trend test; both class-level (7 categories) and context-level (4 contexts) for toddler data
  • Visualization: UMAP 2D projections (hyperparameter sweep), colored by communicative context with class markers
  • Baseline: Mel spectrogram with time-shift invariant distance (Sainburg et al. 2020)
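As a rough illustration of the classification step, the sketch below computes leave-one-out k-NN accuracy with cosine distance using NumPy only. It is not taken from classify.py; the function names and the toy 3D data (standing in for one layer's 768D embeddings) are assumptions made for this example.

```python
import numpy as np
from collections import Counter

def cosine_distance_matrix(X):
    """Pairwise cosine distances (1 - cosine similarity) between rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    return 1.0 - Xn @ Xn.T

def loo_knn_accuracy(X, labels, k=1):
    """Classify each sample by majority vote of its k nearest neighbours
    among all OTHER samples, and return the fraction classified correctly."""
    D = cosine_distance_matrix(np.asarray(X, dtype=float))
    np.fill_diagonal(D, np.inf)  # a sample may not vote for itself
    correct = 0
    for i in range(len(labels)):
        nearest = np.argsort(D[i], kind="stable")[:k]
        votes = Counter(labels[j] for j in nearest)
        predicted = votes.most_common(1)[0][0]
        correct += int(predicted == labels[i])
    return correct / len(labels)

# Toy data (illustrative only): two well-separated categories.
rng = np.random.default_rng(0)
centers = {"ABA": np.array([1.0, 0.0, 0.0]), "IMA": np.array([0.0, 1.0, 0.0])}
labels = ["ABA"] * 10 + ["IMA"] * 10
X = np.stack([centers[c] + 0.05 * rng.normal(size=3) for c in labels])

for k in (1, 3, 5):
    print(k, loo_knn_accuracy(X, labels, k=k))
```

Setting the diagonal of the distance matrix to infinity is what makes the procedure leave-one-out: each sample is held out of its own neighbour search.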

Setup

Requires pixi:

pixi install

Project structure

lib.py          Shared utilities: model loading, audio loading, embedding extraction
extract.py      Extract embeddings, save distance matrices and clustermaps
sweep.py        Sweep UMAP/PCA/raw silhouette scores across layers and hyperparameters
classify.py     LOO k-NN classification with confusion matrices and Spearman test
spectro.py      Spectrogram baseline: mel spectrogram + time-shift distance + evaluation
compare.py      Cross-dataset comparison plots (silhouette and accuracy vs layer)

Usage

1. Extract embeddings

pixi run python extract.py \
  --model hubert \
  --input-dir data/syllables \
  --categories-csv data/syllables/categories.csv \
  --output-dir results/toddler

Repeat with --model wavlm and --model aves. For a second dataset, use a different --output-dir (e.g. results/adult).

2. Silhouette sweep

pixi run python sweep.py --model hubert --output-dir results/toddler --categories ABA AYA Eeah IMA

Produces a summary CSV and a best/worst UMAP scatter plot for each model.
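For readers unfamiliar with the metric, here is a minimal NumPy sketch of a mean silhouette score under cosine distance. It only illustrates the definition; sweep.py may compute it differently (for example with a library routine), and the function name and toy data are assumptions.

```python
import numpy as np

def cosine_silhouette(X, labels):
    """Mean silhouette score with cosine distance.

    For each sample: a = mean distance to the other members of its class,
    b = smallest mean distance to any other class, s = (b - a) / max(a, b).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    Xn = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
    D = 1.0 - Xn @ Xn.T
    classes = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        n_others = same.sum() - 1
        if n_others == 0:
            continue  # singleton class: score stays 0 by convention
        a = D[i, same].sum() / n_others  # D[i, i] is ~0, so it adds nothing
        b = min(D[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

# Toy data (illustrative only): two tight, well-separated clusters.
rng = np.random.default_rng(0)
labels = np.array(["ABA"] * 10 + ["IMA"] * 10)
centers = {"ABA": np.array([1.0, 0.0, 0.0]), "IMA": np.array([0.0, 1.0, 0.0])}
X = np.stack([centers[c] + 0.05 * rng.normal(size=3) for c in labels])

score = cosine_silhouette(X, labels)
print(score)
```

Scores near 1 indicate tight, well-separated categories; scores near 0 indicate overlapping ones.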

3. Classification

pixi run python classify.py --model hubert --output-dir results/toddler --categories ABA AYA Eeah IMA

Produces an accuracy CSV (layer × k × method), a confusion matrix for the best configuration, and a Spearman correlation test of accuracy against layer index. For toddler data, it automatically runs both class-level (7 categories) and context-level (4 communicative contexts) classification.
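The Spearman trend test asks whether accuracy rises or falls monotonically with layer index. The sketch below computes only the correlation coefficient, as the Pearson correlation of ranks; it is an illustration, not the code in classify.py, and the accuracy values are synthetic placeholders rather than results. A p-value, which the script's test presumably also reports, can be obtained from `scipy.stats.spearmanr`.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

layers = np.arange(13)  # layer 0 (CNN) through layer 12

# Synthetic placeholder accuracies, NOT measured results.
rising = 0.30 + 0.04 * layers
falling = rising[::-1]

print(spearman_rho(layers, rising))
print(spearman_rho(layers, falling))
```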

4. Spectrogram baseline

pixi run python spectro.py \
  --input-dir data/syllables \
  --categories-csv data/syllables/categories.csv \
  --output-dir results/toddler \
  --categories ABA AYA Eeah IMA

Sweeps mel spectrogram parameters (n_mels, fmax, hop_length, n_fft), computes pairwise time-shift invariant distances, and evaluates silhouette score and LOO k-NN accuracy. Results appear as horizontal baseline lines on comparison plots.
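This README does not spell out how the time-shift invariant distance is defined. One simple reading, sketched below as an assumption rather than as the method in spectro.py or in Sainburg et al. 2020, slides the shorter spectrogram along the longer one and keeps the smallest mean squared difference. The function name and the random arrays standing in for mel spectrograms are hypothetical.

```python
import numpy as np

def time_shift_distance(spec_a, spec_b):
    """Smallest mean squared difference between two (n_mels, n_frames)
    spectrograms over every placement of the shorter inside the longer."""
    short, long_ = sorted((spec_a, spec_b), key=lambda s: s.shape[1])
    width = short.shape[1]
    best = np.inf
    for offset in range(long_.shape[1] - width + 1):
        window = long_[:, offset:offset + width]
        best = min(best, float(np.mean((window - short) ** 2)))
    return best

# Random arrays standing in for mel spectrograms (illustrative only).
rng = np.random.default_rng(0)
full = rng.random((8, 20))
excerpt = full[:, 5:15]          # same sound, shifted and cropped
other = rng.random((8, 10))      # unrelated sound

print(time_shift_distance(full, excerpt))
print(time_shift_distance(full, other))
```

Because the minimum is taken over all offsets, a syllable that starts a few frames later in one recording is not penalised for the shift itself.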

5. Cross-dataset comparison

pixi run python compare.py \
  --dataset-dirs Toddler:results/toddler Adult:results/adult \
  --categories ABA AYA Eeah IMA

Produces PDFs per (model, metric, label type):

  • {model}_sil_vs_layer.pdf / {model}_sil_vs_layer_context.pdf — silhouette score vs layer
  • {model}_acc_vs_layer.pdf / {model}_acc_vs_layer_context.pdf — LOO k-NN accuracy vs layer

Each plot shows solid lines (raw) and dashed lines (+PCA) per dataset, with bootstrap CI on the raw lines. Context-level plots use communicative context labels for toddler data; for adult data, the class labels serve as the context labels.

Categories CSV format

filename,category
ABA_01.wav,ABA
IMA_02.wav,IMA
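A file in this format can be read with the standard library alone. The sketch below is only an illustration of the format; the helper name `read_categories` is hypothetical and not part of lib.py.

```python
import csv
import io
from collections import Counter

def read_categories(csv_file):
    """Return {filename: category} from a file object with a header row."""
    reader = csv.DictReader(csv_file)
    return {row["filename"]: row["category"] for row in reader}

# In-memory stand-in for data/syllables/categories.csv
example = io.StringIO("filename,category\nABA_01.wav,ABA\nIMA_02.wav,IMA\n")
mapping = read_categories(example)
counts = Counter(mapping.values())  # number of files per category

print(mapping)
print(counts)
```

To read a real file, pass an open handle instead, e.g. `open(path, newline="")`.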

Models

Name     ID                           Architecture           Training data
hubert   facebook/hubert-base-ls960   12-layer transformer   LibriSpeech 960h
wavlm    microsoft/wavlm-base-plus    12-layer transformer   94k hours mixed speech
aves     aves-base-bio                12-layer transformer   ~360h animal vocalizations

All are pretrained base models (no finetuning, no classification head). AVES-bio is loaded via torchaudio's wav2vec2_model (not the esp-aves package).

Output structure

results/
  toddler/                          # or adult/, etc.
    hubert/
      embeddings_layer{0-12}.npy    # 768D embeddings per layer
      categories.npy                # category labels
      hubert_sweep_summary_*.csv    # silhouette sweep results
      hubert_best_worst_*.pdf       # best/worst UMAP scatter plots
      hubert_knn_accuracy_*.csv     # k-NN accuracy table
      hubert_knn_spearman_*.csv     # Spearman trend test
      hubert_knn_best_confusion_*.pdf
    wavlm/
      ...
    aves/
      ...
    spectro/
      spectro_sweep_*.csv             # spectrogram parameter sweep results
      spectro_best_clustermap_*.pdf   # clustermap for best parameters
  {model}_sil_vs_layer.pdf           # cross-dataset comparison plots (class-level)
  {model}_sil_vs_layer_context.pdf   # cross-dataset comparison plots (context-level)
  {model}_acc_vs_layer.pdf
  {model}_acc_vs_layer_context.pdf
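The saved arrays can be loaded for further analysis. The sketch below assumes that `embeddings_layer{0-12}.npy` expands to files named `embeddings_layer0.npy` through `embeddings_layer12.npy`, and that each array has one row per audio file, aligned with `categories.npy`. Both points are assumptions about the layout, and `load_layer_embeddings` is a hypothetical helper.

```python
from pathlib import Path
import numpy as np

def load_layer_embeddings(model_dir, n_layers=13):
    """Load per-layer embeddings and category labels from one model directory.

    Assumes files embeddings_layer0.npy ... embeddings_layer{n_layers-1}.npy
    and categories.npy, as suggested by the output structure above.
    """
    model_dir = Path(model_dir)
    # allow_pickle covers the case where labels were saved as an object array
    labels = np.load(model_dir / "categories.npy", allow_pickle=True)
    layers = [
        np.load(model_dir / f"embeddings_layer{i}.npy") for i in range(n_layers)
    ]
    for i, emb in enumerate(layers):
        if emb.shape[0] != len(labels):
            raise ValueError(f"layer {i}: {emb.shape[0]} rows, {len(labels)} labels")
    return layers, labels

if __name__ == "__main__":
    model_dir = Path("results/toddler/hubert")
    if model_dir.exists():
        layers, labels = load_layer_embeddings(model_dir)
        print(len(layers), layers[0].shape, len(labels))
```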

License

MIT
