- Author: Yashish Maduwantha
- Email: yashish@terpmail.umd.edu
The two saved checkpoints in this repo have been removed. Please reach out to Yashish (yashish@terpmail.umd.edu) if you plan to use this repo for research purposes.
The SSL-SI-tool implements a pipeline that estimates articulatory features (6 TVs, or 9 TVs + source features) directly from speech utterances (.wav files).
This repository holds two Acoustic-to-Articulatory Speech Inversion (SI) systems, trained on the Wisconsin XRMB dataset and the HPRC dataset respectively. The model architecture and training are based on the papers "Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables", "Audio Data Augmentation for Acoustic-to-articulatory Speech Inversion", and "Acoustic-to-articulatory Speech Inversion with Multi-task Learning". Unlike the 13 MFCCs used in those papers, the pretrained SI systems in this repository were trained with self-supervised features (HuBERT and wavLM) as acoustic inputs. Refer to the papers above for more information on the types of TVs estimated by each model.
- Model trained on XRMB dataset: estimates 6 TVs
- Model trained on HPRC dataset: trained with an MTL (multi-task learning) framework; estimates 9 TVs + source features (Aperiodicity, Periodicity, and Pitch)
Follow the steps in run_instructions.txt to get started quickly!
The SI systems were trained in a conda environment with Python 3.8.13 and tensorflow==2.10.0. The HuBERT pretrained models used to extract acoustic features have been trained in PyTorch.
- Installation method 1:
First install TensorFlow; we recommend doing so in a conda environment, following the steps here.
We also use a number of off-the-shelf libraries, which are listed in requirements.txt. Follow the steps below to install them.
$ pip install speechbrain
$ pip install librosa
$ pip install transformers
- Installation method 2: install the libraries from the requirements.txt file.
$ pip install -r requirements.txt
We recommend method 1, since it will automatically pull in compatible versions of dependencies in case new releases of the respective libraries have come out.
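After installing, a quick sanity check (a minimal sketch, not part of the pipeline) confirms that the main dependencies import; the version prints are just informational:

```python
# Sanity check that the main dependencies are importable.
# The expected TensorFlow version comes from the training environment noted above.
import tensorflow as tf
import speechbrain
import librosa
import transformers

print("tensorflow  ", tf.__version__)      # SI systems were trained with 2.10.0
print("speechbrain ", speechbrain.__version__)
print("librosa     ", librosa.__version__)
print("transformers", transformers.__version__)
```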
Note: If you run the SI system on GPUs to extract TVs (recommended for larger datasets), make sure the cuDNN version used by PyTorch (installed by speechbrain) and the one installed with TensorFlow are compatible.
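One way to compare the two cuDNN builds from Python is sketched below; note that tf.sysconfig.get_build_info() reports a cuDNN version only for GPU builds of TensorFlow:

```python
# Compare the cuDNN version TensorFlow was built against with the one
# PyTorch loads at runtime; large mismatches can cause GPU-side failures.
import tensorflow as tf
import torch

print("TensorFlow cuDNN:", tf.sysconfig.get_build_info().get("cudnn_version"))
print("PyTorch cuDNN:   ", torch.backends.cudnn.version())
```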
Execute the run_SSL_SI_pipeline.py script to run the SI pipeline, which performs the following steps (a code sketch follows the list):
- Run the feature_extract.py script to segment the audio and extract the specified SSL features using the speechbrain library
- Load the pre-trained SSL-SI model and evaluate on the extracted SSL feature data generated in step 1
- Save the predicted Tract Variables (TVs)
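For orientation, here is a minimal sketch of what the three steps amount to. It assumes the speechbrain 0.5.x HuggingFaceWav2Vec2 wrapper (which also loads HuBERT checkpoints) and a hypothetical checkpoint path; run_SSL_SI_pipeline.py remains the supported entry point:

```python
# A hedged sketch of the three pipeline steps; the checkpoint path
# "model_checkpoints/hubert_xrmb.h5" is hypothetical.
import numpy as np
import tensorflow as tf
import torch
import torchaudio
from speechbrain.lobes.models.huggingface_wav2vec import HuggingFaceWav2Vec2

# Step 1: extract SSL features (HuBERT-large via the speechbrain wrapper).
ssl = HuggingFaceWav2Vec2("facebook/hubert-large-ll60k", save_path="pretrained/")
wav, sr = torchaudio.load("test_audio/sample.wav")  # HuBERT expects 16 kHz mono
with torch.no_grad():
    feats = ssl(wav)  # shape: (1, frames, 1024)

# Step 2: load the pre-trained SSL-SI model and predict on the features.
si_model = tf.keras.models.load_model("model_checkpoints/hubert_xrmb.h5")
tvs = si_model.predict(feats.numpy())

# Step 3: save the predicted Tract Variables.
np.save("output/sample_tvs.npy", tvs)
```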
The tract variables can be saved as either numpy (.npy) or MATLAB (.mat) files for convenience. The TVs and source features are saved in the following order in the output files (a read-back example follows the list):
- 6 TVs with XRMB: LA, LP, TBCL, TBCD, TTCL, TTCD
- 12 outputs with HPRC (9 TVs + 3 source features): LA, LP, TBCL, TBCD, TTCL, TTCD, JA, TMCL, TMCD, Periodicity, Aperiodicity, Pitch (normalized to 0 to 1 range)
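As an example, the saved outputs can be read back with numpy or scipy; the file names, the orientation (frames along the first axis), and the variable names inside the .mat file are assumptions here:

```python
# Read back pipeline outputs; channel order follows the HPRC list above.
import numpy as np
from scipy.io import loadmat

HPRC_ORDER = ["LA", "LP", "TBCL", "TBCD", "TTCL", "TTCD",
              "JA", "TMCL", "TMCD", "Periodicity", "Aperiodicity", "Pitch"]

tvs = np.load("output/sample_tvs.npy")       # assumed shape: (frames, 12)
pitch = tvs[:, HPRC_ORDER.index("Pitch")]    # normalized to the 0 to 1 range

mat = loadmat("output/sample_tvs.mat")       # keys depend on how the file was written
```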
usage: run_SSL_SI_pipeline.py [-h] [-m MODEL] [-f FEATS] [-i PATH]
[-o OUT_FORMAT]
Run the SI pipeline
optional arguments:
-h, --help show this help message and exit
-m MODEL, --model MODEL
set which SI system to run, xrmb trained (xrmb) or
hprc trained (hprc)
-f FEATS, --feats FEATS
set which SSL pretrained model to be used to extract
features, hubert to use HuBERT-large and wavlm to use
wavLM-large pretrained models
-i PATH, --path PATH path to directory with audio files
-o OUT_FORMAT, --out_format OUT_FORMAT
output TV file format (mat or npy)
- Run the pipeline from end to end (executes all 3 steps)
python run_SSL_SI_pipeline.py -m xrmb -f hubert -i test_audio/ -o 'mat'
The SI systems trained with wavLM features will be added in the future; for now, set the -f parameter only to 'hubert' to run the models.
This project is licensed under CC BY-NC-ND 4.0 - see the LICENSE file for details.