SynthAVSR is a research framework designed to explore the potential of synthetic audiovisual data for improving audiovisual speech recognition (AVSR) in low-resource languages, specifically Spanish and Catalan. It implements a full pipeline for generating realistic synthetic lip videos from audio and static images, and fine-tunes AV-HuBERT to evaluate this data across multiple conditions.
Checkpoints and models adapted for our project are available in the table below:
| Modality | MixSpeeches | RealSpeeches | SynthSpeeches | SynthSpeechcat |
|---|---|---|---|---|
| AudioVisual | Download | Download | Download | Download |
| Audio-Only | Download | Download | Download | Download |
| Visual-Only | Download | Download | Download | Download |

The tables below report word error rate (WER, %) for each model on Spanish and Catalan benchmarks:

| Model | LIP-RTVE | CMU-MOSEAS-ES | MuAViC-ES |
|---|---|---|---|
| MixSpeeches | 8.1% | 12.9% | 15.7% |
| RealSpeeches | 9.3% | 15.4% | 16.6% |
| SynthSpeeches | 21.1% | 35.2% | 39.6% |

| Model | AVCAT-Benchmark |
|---|---|
| SynthSpeechcat | 19.6% |
To get started with SynthAVSR, set up a Conda environment using the SynthAVSR.yml file provided:
- Clone the repository:

```sh
git clone https://github.com/Pol-Buitrago/SynthAVSR.git
cd SynthAVSR
git submodule init
git submodule update
```

- Create and activate the environment:

```sh
conda env create -f SynthAVSR.yml
conda activate synth_avsr
```
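To confirm the setup worked, a quick sanity check (assuming fairseq was installed into the environment, as AV-HuBERT requires) is:

```sh
# Verify that the key dependency is importable from the new environment
python -c "import fairseq; print(fairseq.__version__)"
```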
Follow the steps in `preparation` to pre-process the LRS3 and VoxCeleb2 datasets. For any other dataset, follow an analogous procedure.
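After preparation, each split should have a file-list manifest. A rough sketch of the expected layout (column semantics follow AV-HuBERT's data loader; the paths below are illustrative, not values from this repo):

```sh
# Inspect the first lines of a prepared manifest
head -n 3 /path/to/data/train.tsv
# Expected shape (illustrative): a root directory on the first line,
# then tab-separated fields per utterance:
#   <id>  <video path>  <audio path>  <num video frames>  <num audio frames>
```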
Follow the steps in `clustering` (for pre-training only) to create the {train,valid}.km frame-aligned pseudo-label files. The `label_rate` is the same as the feature frame rate used for clustering, which is 100 Hz for MFCC features and 25 Hz for AV-HuBERT features by default.
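As a concrete consequence of the label rate, a 10-second utterance should yield roughly 1000 labels per line with MFCC features and 250 with AV-HuBERT features. A quick check, assuming each line of the .km file holds one space-separated cluster ID per feature frame (the standard HuBERT label format):

```sh
# Count the number of pseudo labels on the first line of train.km
awk 'NR==1 {print NF; exit}' /path/to/label/train.km
```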
To train a model, run the following command, adjusting paths as necessary:
```sh
$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
  common.user_dir=`pwd`
```
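Hydra overrides can also adapt the run to your hardware. For example, to preserve the effective batch size while training on fewer GPUs (these are standard fairseq parameters, but the values below are illustrative):

```sh
# Train on 4 GPUs, accumulating gradients over 8 steps per update
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
  distributed_training.distributed_world_size=4 \
  optimization.update_freq='[8]' common.user_dir=`pwd`
```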
To fine-tune a pre-trained HuBERT model at `/path/to/checkpoint`, run:

```sh
$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  task.tokenizer_bpe_model=/path/to/tokenizer model.w2v_path=/path/to/checkpoint \
  hydra.run.dir=/path/to/experiment/finetune/ common.user_dir=`pwd`
```
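`task.tokenizer_bpe_model` expects a trained SentencePiece model. If you need to build one, a minimal sketch using the standard SentencePiece CLI (the transcript file name and vocabulary size here are assumptions, not values from this repo):

```sh
# Train a unigram SentencePiece model on the training transcripts,
# assuming train.wrd holds one transcript per line
$ spm_train --input=/path/to/label/train.wrd \
    --model_prefix=spm1000 --vocab_size=1000 \
    --model_type=unigram --character_coverage=1.0
# Then point fine-tuning at the result:
#   task.tokenizer_bpe_model=/path/to/label/spm1000.model
```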
To decode a fine-tuned model, run:

```sh
$ cd avhubert
$ python -B infer_s2s.py --config-dir ./conf/ --config-name conf-name \
  dataset.gen_subset=test common_eval.path=/path/to/checkpoint \
  common_eval.results_path=/path/to/experiment/decode/s2s/test \
  override.modalities=['audio','video'] common.user_dir=`pwd`
```

Parameters such as `generation.beam` and `generation.lenpen` can be adjusted to tune the decoding process.
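The same script covers the audio-only and visual-only conditions from the tables above; swapping the `override.modalities` value selects the input streams (assuming the checkpoint was trained for the chosen modality):

```sh
# Visual-only (lip-reading) decoding; use ['audio'] for audio-only
$ python -B infer_s2s.py --config-dir ./conf/ --config-name conf-name \
  dataset.gen_subset=test common_eval.path=/path/to/checkpoint \
  common_eval.results_path=/path/to/experiment/decode/s2s/test \
  override.modalities=['video'] common.user_dir=`pwd`
```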
This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to share, copy, modify, and distribute this work, including for commercial purposes, as long as proper attribution is given to the original author.