SynthAVSR is a research framework designed to explore the potential of synthetic audiovisual data for improving audiovisual speech recognition (AVSR) in low-resource languages, specifically Spanish and Catalan. It implements a full pipeline for generating realistic synthetic lip videos from audio and static images, and fine-tunes AV-HuBERT to evaluate this data across multiple conditions.
Checkpoints and models adapted for our project are available in the table below:
| Modality | MixSpeeches | RealSpeeches | SynthSpeeches | SynthSpeechcat |
|---|---|---|---|---|
| AudioVisual | Download | Download | Download | Download |
| Audio-Only | Download | Download | Download | Download |
| Visual-Only | Download | Download | Download | Download |

The tables below report word error rate (WER, %) for each model on Spanish and Catalan benchmarks:

| Model | LIP-RTVE | CMU-MOSEAS-ES | MuAViC-ES |
|---|---|---|---|
| MixSpeeches | 8.1% | 12.9% | 15.7% |
| RealSpeeches | 9.3% | 15.4% | 16.6% |
| SynthSpeeches | 21.1% | 35.2% | 39.6% |

| Model | AVCAT-Benchmark |
|---|---|
| SynthSpeechcat | 19.6% |
To get started with SynthAVSR, set up a Conda environment using the SynthAVSR.yml file provided:
- Clone the repository:

```sh
git clone https://github.com/Pol-Buitrago/SynthAVSR.git
cd SynthAVSR
git submodule init
git submodule update
```

- Create and activate the environment:

```sh
conda env create -f SynthAVSR.yml
conda activate synth_avsr
```
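To confirm the setup worked, a quick sanity check (assuming fairseq was installed into the environment, as AV-HuBERT requires) is:

```sh
# Verify that the key dependency is importable from the new environment
python -c "import fairseq; print(fairseq.__version__)"
```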
Follow the steps in `preparation` to pre-process the LRS3 and VoxCeleb2 datasets. For any other dataset, follow an analogous procedure.
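After preparation, each split should have a file-list manifest. A rough sketch of the expected layout (column semantics follow AV-HuBERT's data loader; the paths below are illustrative, not values from this repo):

```sh
# Inspect the first lines of a prepared manifest
head -n 3 /path/to/data/train.tsv
# Expected shape (illustrative): a root directory on the first line,
# then tab-separated fields per utterance:
#   <id>  <video path>  <audio path>  <num video frames>  <num audio frames>
```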
Follow the steps in `clustering` (for pre-training only) to create the {train,valid}.km frame-aligned pseudo-label files. The `label_rate` is the same as the feature frame rate used for clustering, which is 100 Hz for MFCC features and 25 Hz for AV-HuBERT features by default.
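As a concrete consequence of the label rate, a 10-second utterance should yield roughly 1000 labels per line with MFCC features and 250 with AV-HuBERT features. A quick check, assuming each line of the .km file holds one space-separated cluster ID per feature frame (the standard HuBERT label format):

```sh
# Count the number of pseudo labels on the first line of train.km
awk 'NR==1 {print NF; exit}' /path/to/label/train.km
```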
To train a model, run the following command, adjusting paths as necessary:
```sh
$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
  common.user_dir=`pwd`
```
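Hydra overrides can also adapt the run to your hardware. For example, to preserve the effective batch size while training on fewer GPUs (these are standard fairseq parameters, but the values below are illustrative):

```sh
# Train on 4 GPUs, accumulating gradients over 8 steps per update
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
  distributed_training.distributed_world_size=4 \
  optimization.update_freq='[8]' common.user_dir=`pwd`
```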
To fine-tune a pre-trained HuBERT model at `/path/to/checkpoint`, run:

```sh
$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  task.tokenizer_bpe_model=/path/to/tokenizer model.w2v_path=/path/to/checkpoint \
  hydra.run.dir=/path/to/experiment/finetune/ common.user_dir=`pwd`
```
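`task.tokenizer_bpe_model` expects a trained SentencePiece model. If you need to build one, a minimal sketch using the standard SentencePiece CLI (the transcript file name and vocabulary size here are assumptions, not values from this repo):

```sh
# Train a unigram SentencePiece model on the training transcripts,
# assuming train.wrd holds one transcript per line
$ spm_train --input=/path/to/label/train.wrd \
    --model_prefix=spm1000 --vocab_size=1000 \
    --model_type=unigram --character_coverage=1.0
# Then point fine-tuning at the result:
#   task.tokenizer_bpe_model=/path/to/label/spm1000.model
```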
To decode a fine-tuned model, run:

```sh
$ cd avhubert
$ python -B infer_s2s.py --config-dir ./conf/ --config-name conf-name \
  dataset.gen_subset=test common_eval.path=/path/to/checkpoint \
  common_eval.results_path=/path/to/experiment/decode/s2s/test \
  override.modalities=['audio','video'] common.user_dir=`pwd`
```

Parameters such as `generation.beam` and `generation.lenpen` can be adjusted to tune the decoding process.
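The same script covers the audio-only and visual-only conditions from the tables above; swapping the `override.modalities` value selects the input streams (assuming the checkpoint was trained for the chosen modality):

```sh
# Visual-only (lip-reading) decoding; use ['audio'] for audio-only
$ python -B infer_s2s.py --config-dir ./conf/ --config-name conf-name \
  dataset.gen_subset=test common_eval.path=/path/to/checkpoint \
  common_eval.results_path=/path/to/experiment/decode/s2s/test \
  override.modalities=['video'] common.user_dir=`pwd`
```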
This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to share, copy, modify, and distribute this work, including for commercial purposes, as long as proper attribution is given to the original author.