VoxSpace

Evaluating whether self-supervised speech model embeddings capture perceptual categories in toddler vocalizations and adult speech.

Goal

Self-supervised speech models (HuBERT, WavLM, AVES-bio) learn acoustic representations without phonetic labels. We ask: do these representations distinguish perceptual categories in prelinguistic vocalizations? We extract embeddings from every layer (the CNN output plus 12 transformer layers), measure category separability with silhouette scores and leave-one-out (LOO) k-NN classification, and compare results across layers, models, datasets (toddler vs. adult), and label granularity (class vs. communicative context).

The key prediction is that category separability should improve markedly from layer to layer for adult speech but only weakly, or not at all, for toddler vocalizations. Such a pattern would support the claim that semantic interpretation of toddler vocalizations depends on context beyond the acoustic signal.

Methods

Models: HuBERT (facebook/hubert-base-ls960), WavLM (microsoft/wavlm-base-plus), and AVES-bio (aves-base-bio, trained on animal vocalizations), all pretrained base models with no finetuning
Embeddings: Mean-pooled hidden states from all 13 layers (CNN + 12 transformer), 768D per layer
Silhouette score: Cosine distance on raw and PCA-50D embeddings, with bootstrap 95% CI
Classification: Leave-one-out k-NN (k=1,3,5) with cosine distance, bootstrap CI, Spearman trend test; both class-level (7 categories) and context-level (4 contexts) for toddler data
Visualization: UMAP 2D projections (hyperparameter sweep), colored by communicative context with class markers
Baseline: Mel spectrogram with time-shift invariant distance (Sainburg et al. 2020)
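
As a rough illustration of the embedding and distance steps listed above, the sketch below mean-pools a variable-length sequence of frame vectors into one vector per clip and compares two clips with cosine distance. It uses only the standard library and 3-D toy vectors in place of the 768-D hidden states; the function names are hypothetical and are not taken from lib.py.

```python
import math


def mean_pool(frames):
    """Average a (T x D) list of frame vectors over time into one D-dim vector."""
    n_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy "hidden states" for two clips with different numbers of frames.
clip_a = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
clip_b = [[0.0, 2.0, 0.0], [0.0, 4.0, 0.0], [0.0, 6.0, 0.0]]

emb_a = mean_pool(clip_a)
emb_b = mean_pool(clip_b)
print(emb_a, emb_b, cosine_distance(emb_a, emb_b))
```

Pooling over time gives every clip a fixed-size vector regardless of duration, which is what makes pairwise distances between clips well defined.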

Setup

Requires pixi:

pixi install

Project structure

lib.py          Shared utilities: model loading, audio loading, embedding extraction
extract.py      Extract embeddings, save distance matrices and clustermaps
sweep.py        Sweep UMAP/PCA/raw silhouette scores across layers and hyperparameters
classify.py     LOO k-NN classification with confusion matrices and Spearman test
spectro.py      Spectrogram baseline: mel spectrogram + time-shift distance + evaluation
compare.py      Cross-dataset comparison plots (silhouette and accuracy vs layer)

Usage

1. Extract embeddings

pixi run python extract.py \
  --model hubert \
  --input-dir data/syllables \
  --categories-csv data/syllables/categories.csv \
  --output-dir results/toddler

Repeat with --model wavlm and --model aves. For a second dataset, use a different --output-dir (e.g. results/adult).

2. Silhouette sweep

pixi run python sweep.py --model hubert --output-dir results/toddler --categories ABA AYA Eeah IMA

Produces a summary CSV and a best/worst UMAP scatter plot for the selected model.
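
For readers unfamiliar with the metric, here is a minimal sketch of a silhouette score computed with cosine distance. It is a standard-library illustration of the textbook definition on 2-D toy vectors, not the code in sweep.py; the function names and data are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def silhouette_score(vectors, labels):
    """Mean silhouette over all samples; needs at least two distinct labels."""
    n = len(vectors)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    scores = []
    for i in range(n):
        same = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            # Convention: a sample alone in its category scores 0.
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance within own category
        b = min(  # mean distance to the nearest other category
            sum(dist[i][j] for j in range(n) if labels[j] == other) / labels.count(other)
            for other in set(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
good = silhouette_score(vectors, ["ABA", "ABA", "IMA", "IMA"])
bad = silhouette_score(vectors, ["ABA", "IMA", "ABA", "IMA"])
print(good, bad)
```

Scores near 1 mean that samples sit much closer to their own category than to any other; scores near or below 0 mean the labels do not match the geometry of the embedding space.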

3. Classification

pixi run python classify.py --model hubert --output-dir results/toddler --categories ABA AYA Eeah IMA

Produces an accuracy CSV (layer x k x method), a confusion matrix for the best configuration, and a Spearman correlation test. For toddler data, the script automatically runs both class-level (7 categories) and context-level (4 communicative contexts) classification.
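
The two computations behind this step can be sketched as follows: leave-one-out k-NN accuracy with cosine distance, and a Spearman rank correlation between layer index and accuracy. This is a standard-library illustration with hypothetical function names and made-up toy numbers, not the code in classify.py, and it computes only the correlation coefficient, not a p-value.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def loo_knn_accuracy(vectors, labels, k=1):
    """Classify each sample from its k nearest neighbours among all other samples."""
    n = len(vectors)
    correct = 0
    for i in range(n):
        neighbours = sorted(
            (cosine_distance(vectors[i], vectors[j]), j) for j in range(n) if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbours)
        # On a tied vote, the label of the closest tied neighbour wins.
        correct += votes.most_common(1)[0][0] == labels[i]
    return correct / n


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for idx in order[pos:end + 1]:
            ranks[idx] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks; assumes neither input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))


vectors = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2], [0.0, 1.0], [0.1, 1.0], [0.2, 1.0]]
labels = ["ABA", "ABA", "ABA", "IMA", "IMA", "IMA"]
acc_k1 = loo_knn_accuracy(vectors, labels, k=1)
acc_k3 = loo_knn_accuracy(vectors, labels, k=3)

# Made-up accuracies for four layers, only to exercise the trend statistic.
rho = spearman_rho([0, 1, 2, 3], [0.5, 0.6, 0.7, 0.9])
print(acc_k1, acc_k3, rho)
```

A rho close to +1 indicates that accuracy rises monotonically with layer depth, which is the trend the key prediction expects for adult speech.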

4. Spectrogram baseline

pixi run python spectro.py \
  --input-dir data/syllables \
  --categories-csv data/syllables/categories.csv \
  --output-dir results/toddler \
  --categories ABA AYA Eeah IMA

Sweeps mel spectrogram parameters (n_mels, fmax, hop_length, n_fft), computes pairwise time-shift invariant distances, and evaluates silhouette score and LOO k-NN accuracy. Results appear as horizontal baseline lines on comparison plots.
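
This README does not spell out how the time-shift invariant distance is computed, so the sketch below shows one plausible reading and should be treated as an assumption: slide one spectrogram against the other along the time axis, zero-pad the non-overlapping frames, and keep the smallest Euclidean distance over all shifts. The function names and the max_shift parameter are hypothetical and may not match spectro.py.

```python
import math


def shifted_distance(spec_a, spec_b, shift):
    """Euclidean distance with spec_b delayed by `shift` frames (zero-padded)."""
    n_mels = len(spec_a[0])
    zero = [0.0] * n_mels
    start = min(0, shift)
    stop = max(len(spec_a), len(spec_b) + shift)
    total = 0.0
    for t in range(start, stop):
        frame_a = spec_a[t] if 0 <= t < len(spec_a) else zero
        frame_b = spec_b[t - shift] if 0 <= t - shift < len(spec_b) else zero
        total += sum((x - y) ** 2 for x, y in zip(frame_a, frame_b))
    return math.sqrt(total)


def time_shift_invariant_distance(spec_a, spec_b, max_shift):
    """Smallest shifted distance over shifts in [-max_shift, max_shift]."""
    return min(
        shifted_distance(spec_a, spec_b, s) for s in range(-max_shift, max_shift + 1)
    )


# Toy spectrograms (time x mel bins) holding the same event one frame apart.
spec_a = [[0.0], [1.0], [0.0], [0.0]]
spec_b = [[0.0], [0.0], [1.0], [0.0]]

unaligned = shifted_distance(spec_a, spec_b, 0)
aligned = time_shift_invariant_distance(spec_a, spec_b, max_shift=1)
print(unaligned, aligned)
```

The point of the minimum over shifts is that two recordings of the same sound with slightly different onsets are not penalized for the misalignment.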

5. Cross-dataset comparison

pixi run python compare.py \
  --dataset-dirs Toddler:results/toddler Adult:results/adult \
  --categories ABA AYA Eeah IMA

Produces PDFs per (model, metric, label type):

{model}_sil_vs_layer.pdf / {model}_sil_vs_layer_context.pdf — silhouette score vs layer
{model}_acc_vs_layer.pdf / {model}_acc_vs_layer_context.pdf — LOO k-NN accuracy vs layer

Each plot shows solid lines (raw) and dashed lines (+PCA) per dataset, with a bootstrap CI on the raw lines. Context-level plots use communicative context labels for toddler data; for adult data the class and context labels are identical.
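
The bootstrap CI shown on these plots can be approximated with a percentile bootstrap. The sketch below assumes that individual samples are resampled with replacement and that the interval is taken from the percentiles of the resampled statistic; the README does not state either detail, and the function name and defaults are hypothetical.

```python
import random


def bootstrap_ci(values, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` over `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    stats = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def mean(xs):
    return sum(xs) / len(xs)


# Toy per-sample outcomes from a LOO run: 1 = correct, 0 = wrong.
outcomes = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
low, high = bootstrap_ci(outcomes, mean)
print(mean(outcomes), low, high)
```

With small datasets the interval is wide, which is worth keeping in mind when comparing layers whose accuracies differ by only a few points.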

Categories CSV format

filename,category
ABA_01.wav,ABA
IMA_02.wav,IMA
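
A file in this format can be read with Python's csv module. The sketch below parses the two example rows from an in-memory string and counts clips per category; read_categories is a hypothetical helper, not a function from lib.py.

```python
import csv
import io
from collections import Counter


def read_categories(handle):
    """Map each filename to its category from a CSV with a header row."""
    reader = csv.DictReader(handle)
    return {row["filename"]: row["category"] for row in reader}


# In-memory stand-in for data/syllables/categories.csv.
example = io.StringIO("filename,category\nABA_01.wav,ABA\nIMA_02.wav,IMA\n")
mapping = read_categories(example)
counts = Counter(mapping.values())
print(mapping, counts)
```

To read the real file, pass an open file handle instead, for example open(path, newline="").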

Models

Name	ID	Architecture	Training data
`hubert`	`facebook/hubert-base-ls960`	12-layer transformer	LibriSpeech 960h
`wavlm`	`microsoft/wavlm-base-plus`	12-layer transformer	94k hours mixed speech
`aves`	`aves-base-bio`	12-layer transformer	~360h animal vocalizations

All are pretrained base models (no finetuning, no classification head). AVES-bio is loaded via torchaudio's wav2vec2_model (not the esp-aves package).

Output structure

results/
  toddler/                          # or adult/, etc.
    hubert/
      embeddings_layer{0-12}.npy    # 768D embeddings per layer
      categories.npy                # category labels
      hubert_sweep_summary_*.csv    # silhouette sweep results
      hubert_best_worst_*.pdf       # best/worst UMAP scatter plots
      hubert_knn_accuracy_*.csv     # k-NN accuracy table
      hubert_knn_spearman_*.csv     # Spearman trend test
      hubert_knn_best_confusion_*.pdf
    wavlm/
      ...
    aves/
      ...
    spectro/
      spectro_sweep_*.csv             # spectrogram parameter sweep results
      spectro_best_clustermap_*.pdf   # clustermap for best parameters
  {model}_sil_vs_layer.pdf           # cross-dataset comparison plots (class-level)
  {model}_sil_vs_layer_context.pdf   # cross-dataset comparison plots (context-level)
  {model}_acc_vs_layer.pdf
  {model}_acc_vs_layer_context.pdf

License

MIT
