Evaluating whether self-supervised speech model embeddings capture perceptual categories in toddler vocalizations and adult speech.
Self-supervised speech models (HuBERT, WavLM, AVES-bio) learn acoustic representations without phonetic labels. We ask: do these representations distinguish perceptual categories in prelinguistic vocalizations? We extract embeddings from every layer (the CNN encoder output plus 12 transformer layers), measure category separability via silhouette scores and leave-one-out (LOO) k-NN classification, and compare performance across layers, models, datasets (toddler vs. adult), and label granularity (class vs. communicative context).
The key prediction is that layer-wise improvement in category separability should be strong for adult speech but weak or absent for toddler vocalizations, supporting the claim that semantic interpretation of toddler vocalizations depends on context beyond the acoustic signal alone.
- Models: HuBERT (facebook/hubert-base-ls960), WavLM (microsoft/wavlm-base-plus), and AVES-bio (aves-base-bio, trained on animal vocalizations), all pretrained base models with no finetuning
- Embeddings: Mean-pooled hidden states from all 13 layers (CNN + 12 transformer), 768D per layer
- Silhouette score: Cosine distance on raw and PCA-50D embeddings, with bootstrap 95% CI
- Classification: Leave-one-out k-NN (k=1,3,5) with cosine distance, bootstrap CI, Spearman trend test; both class-level (7 categories) and context-level (4 contexts) for toddler data
- Visualization: UMAP 2D projections (hyperparameter sweep), colored by communicative context with class markers
- Baseline: Mel spectrogram with time-shift invariant distance (Sainburg et al. 2020)
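The leave-one-out k-NN step listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in classify.py: the function names `cosine_distance_matrix` and `loo_knn_accuracy` are hypothetical, and the tie-breaking rule (the closer neighbor wins) is an assumption.

```python
from collections import Counter

import numpy as np


def cosine_distance_matrix(X):
    """Pairwise cosine distance (1 - cosine similarity) between rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return 1.0 - Xn @ Xn.T


def loo_knn_accuracy(X, labels, k=1):
    """Leave-one-out k-NN accuracy: each sample is classified by its
    k nearest neighbors among all *other* samples."""
    D = cosine_distance_matrix(X)
    np.fill_diagonal(D, np.inf)  # a sample may not vote for itself
    labels = np.asarray(labels)
    correct = 0
    for i in range(len(labels)):
        nearest = np.argsort(D[i], kind="stable")[:k]
        # Counter keeps insertion order, so on a tied vote the label of
        # the closer neighbor wins (an assumed tie-breaking rule).
        votes = Counter(labels[nearest].tolist())
        predicted = votes.most_common(1)[0][0]
        correct += int(predicted == labels[i])
    return correct / len(labels)
```

Calling `loo_knn_accuracy(embeddings, categories, k=3)` once per layer gives the accuracy-vs-layer curve that the later steps compare across datasets.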
Requires pixi:
```bash
pixi install
```

```
lib.py       Shared utilities: model loading, audio loading, embedding extraction
extract.py   Extract embeddings, save distance matrices and clustermaps
sweep.py     Sweep UMAP/PCA/raw silhouette scores across layers and hyperparameters
classify.py  LOO k-NN classification with confusion matrices and Spearman test
spectro.py   Spectrogram baseline: mel spectrogram + time-shift distance + evaluation
compare.py   Cross-dataset comparison plots (silhouette and accuracy vs layer)
```
```bash
pixi run python extract.py \
  --model hubert \
  --input-dir data/syllables \
  --categories-csv data/syllables/categories.csv \
  --output-dir results/toddler
```

Repeat with --model wavlm and --model aves. For a second dataset, use a different --output-dir (e.g. results/adult).
```bash
pixi run python sweep.py --model hubert --output-dir results/toddler --categories ABA AYA Eeah IMA
```

Produces a summary CSV and best/worst UMAP scatter plot per model.
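For reference, the raw-embedding silhouette score with cosine distance can be computed as below. This is a sketch under assumptions: `silhouette_cosine` is a hypothetical name, the PCA and UMAP variants of the sweep are not shown, and sweep.py may well use a library routine instead (scikit-learn's `silhouette_score` with `metric="cosine"` computes the same quantity).

```python
import numpy as np


def silhouette_cosine(X, labels):
    """Mean silhouette score over all samples, using cosine distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    D = 1.0 - Xn @ Xn.T  # pairwise cosine distance

    classes = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the sample's own class
        # (D[i, i] is ~0, so it drops out of the sum).
        a = D[i, same].sum() / (n_same - 1)
        # b: mean distance to the nearest other class.
        b = min(D[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())
```

Scores near 1 indicate tight, well-separated categories; scores near 0 or below indicate overlapping categories.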
```bash
pixi run python classify.py --model hubert --output-dir results/toddler --categories ABA AYA Eeah IMA
```

Produces accuracy CSV (layer x k x method), confusion matrix for the best configuration, and Spearman correlation test. For toddler data, automatically runs both class-level (7 categories) and context-level (4 communicative contexts) classification.
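The bootstrap confidence interval and the Spearman trend statistic can be illustrated as follows. This is a hedged sketch, not the classify.py implementation: the names `bootstrap_ci`, `average_ranks`, and `spearman_rho` are hypothetical, the percentile bootstrap is an assumed choice, and only the correlation coefficient is computed (no p-value).

```python
import numpy as np


def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy, given a 0/1 vector marking
    which leave-one-out predictions were correct."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    means = correct[idx].mean(axis=1)  # accuracy of each resample
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])
```

With `x` set to the layer indices 0 to 12 and `y` to the per-layer accuracies, a rho near 1 indicates accuracy rising steadily with depth, which is the pattern the key prediction expects for adult speech but not for toddler vocalizations.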
```bash
pixi run python spectro.py \
  --input-dir data/syllables \
  --categories-csv data/syllables/categories.csv \
  --output-dir results/toddler \
  --categories ABA AYA Eeah IMA
```

Sweeps mel spectrogram parameters (n_mels, fmax, hop_length, n_fft), computes pairwise time-shift invariant distances, and evaluates silhouette score and LOO k-NN accuracy. Results appear as horizontal baseline lines on comparison plots.
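One simple way to make a spectrogram distance insensitive to time shifts is to slide one spectrogram past the other and keep the smallest distance. The sketch below illustrates that idea only; it is an assumption, not a reproduction of spectro.py or of the exact procedure in Sainburg et al. 2020. The name `time_shift_distance`, the `max_shift` parameter, zero padding, and the Euclidean norm are all illustrative choices.

```python
import numpy as np


def time_shift_distance(a, b, max_shift=5):
    """Minimum Euclidean distance between two (n_mels, frames)
    spectrograms over all relative time offsets.

    Both inputs are zero-padded onto a common canvas; the shorter one is
    slid across the longer one, with up to `max_shift` frames of overhang
    on either side.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape[1] < b.shape[1]:
        a, b = b, a  # make `a` the longer spectrogram

    width = a.shape[1] + 2 * max_shift
    canvas_a = np.zeros((a.shape[0], width))
    canvas_a[:, max_shift:max_shift + a.shape[1]] = a

    best = np.inf
    for offset in range(width - b.shape[1] + 1):
        canvas_b = np.zeros_like(canvas_a)
        canvas_b[:, offset:offset + b.shape[1]] = b
        best = min(best, float(np.linalg.norm(canvas_a - canvas_b)))
    return best
```

Zero padding assumes that silence maps to 0 in the chosen spectrogram scale; with dB-scaled spectrograms the padding value would need to match the noise floor instead.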
```bash
pixi run python compare.py \
  --dataset-dirs Toddler:results/toddler Adult:results/adult \
  --categories ABA AYA Eeah IMA
```

Produces PDFs per (model, metric, label type):
- {model}_sil_vs_layer.pdf / {model}_sil_vs_layer_context.pdf: silhouette score vs layer
- {model}_acc_vs_layer.pdf / {model}_acc_vs_layer_context.pdf: LOO k-NN accuracy vs layer

Each plot shows solid lines (raw) and dashed lines (+PCA) per dataset, with bootstrap CI on the raw lines. Context-level plots use communicative context labels for toddler data (class = context for adult data).
```
filename,category
ABA_01.wav,ABA
IMA_02.wav,IMA
```

| Name | ID | Architecture | Training data |
|---|---|---|---|
| hubert | facebook/hubert-base-ls960 | 12-layer transformer | LibriSpeech 960h |
| wavlm | microsoft/wavlm-base-plus | 12-layer transformer | 94k hours mixed speech |
| aves | aves-base-bio | 12-layer transformer | ~360h animal vocalizations |

All are pretrained base models (no finetuning, no classification head). AVES-bio is loaded via torchaudio's wav2vec2_model (not the esp-aves package).
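As an illustration of the mean-pooled, per-layer embeddings, the sketch below pools 13 hidden-state arrays into a (13, 768) matrix. The HuBERT loader uses the Hugging Face transformers API, which is an assumption: the source only states that AVES-bio is loaded through torchaudio and does not say how lib.py loads the other models. Both function names are hypothetical.

```python
import numpy as np


def mean_pool(hidden_states):
    """Average each layer's (frames, dim) array over time.

    Returns an array of shape (n_layers, dim).
    """
    return np.stack([np.asarray(h).mean(axis=0) for h in hidden_states])


def hubert_layer_embeddings(waveform, sampling_rate=16000,
                            model_id="facebook/hubert-base-ls960"):
    """Per-layer mean-pooled embeddings for one mono waveform.

    Assumes the Hugging Face `transformers` and `torch` packages; they are
    imported lazily so that `mean_pool` works without them.
    """
    import torch
    from transformers import HubertModel, Wav2Vec2FeatureExtractor

    extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
    model = HubertModel.from_pretrained(model_id).eval()
    inputs = extractor(waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states is a tuple of 13 tensors shaped (1, frames, 768):
    # the projected CNN features followed by the 12 transformer layers.
    return mean_pool([h[0].numpy() for h in out.hidden_states])
```

Mean pooling discards temporal order, so two vocalizations with the same frame-level content in a different sequence would map to nearly the same embedding.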
```
results/
  toddler/                              # or adult/, etc.
    hubert/
      embeddings_layer{0-12}.npy        # 768D embeddings per layer
      categories.npy                    # category labels
      hubert_sweep_summary_*.csv        # silhouette sweep results
      hubert_best_worst_*.pdf           # best/worst UMAP scatter plots
      hubert_knn_accuracy_*.csv         # k-NN accuracy table
      hubert_knn_spearman_*.csv         # Spearman trend test
      hubert_knn_best_confusion_*.pdf
    wavlm/
      ...
    aves/
      ...
    spectro/
      spectro_sweep_*.csv               # spectrogram parameter sweep results
      spectro_best_clustermap_*.pdf     # clustermap for best parameters
  {model}_sil_vs_layer.pdf              # cross-dataset comparison plots (class-level)
  {model}_sil_vs_layer_context.pdf      # cross-dataset comparison plots (context-level)
  {model}_acc_vs_layer.pdf
  {model}_acc_vs_layer_context.pdf
```