SynthAVSR (Improving Audiovisual Speech Recognition with Visual Synthetic Data) 🎤🤖

Introduction

SynthAVSR is a research framework designed to explore the potential of synthetic audiovisual data for improving AVSR in low-resource languages, specifically Spanish 🇪🇸 and Catalan. It implements a full pipeline for generating realistic synthetic lip videos from audio and static images, and fine-tunes AV-HuBERT models to evaluate this data across multiple conditions.


Fine-tuned Models 🧩

Checkpoints and models adapted for our project are available in the table below:

| Modality    | MixSpeech_es | RealSpeech_es | SynthSpeech_es | SynthSpeech_cat |
|-------------|--------------|---------------|----------------|-----------------|
| AudioVisual | Download     | Download      | Download       | Download        |
| Audio-Only  | Download     | Download      | Download       | Download        |
| Visual-Only | Download     | Download      | Download       | Download        |

Model Performance (WER) 🎯

AVSR Model Results

| Model          | LIP-RTVE | CMU-MOSEAS_es | MuAViC_es |
|----------------|----------|---------------|-----------|
| MixSpeech_es   | 8.1%     | 12.9%         | 15.7%     |
| RealSpeech_es  | 9.3%     | 15.4%         | 16.6%     |
| SynthSpeech_es | 21.1%    | 35.2%         | 39.6%     |

| Model           | AVCAT-Benchmark |
|-----------------|-----------------|
| SynthSpeech_cat | 19.6%           |
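For reference, WER is the word-level edit distance between the hypothesis and the reference transcript, divided by the reference length. A minimal sketch of the metric (illustrative only, not part of this repository):

```python
# Word error rate (WER): Levenshtein distance over words divided by
# the reference length. Illustrative sketch, not from this codebase.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("hola que tal", "hola k tal"))  # one substitution / 3 words -> 0.333...
```

The published numbers above are percentages, i.e. this ratio multiplied by 100.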

Installation ⚙️

To get started with SynthAVSR, set up a Conda environment using the SynthAVSR.yml file provided:

  1. Clone the repository (the SynthAVSR.yml file lives inside it):

    git clone https://github.com/Pol-Buitrago/SynthAVSR.git
    cd SynthAVSR
    git submodule init
    git submodule update
  2. Create and activate the environment:

    conda env create -f SynthAVSR.yml
    conda activate synth_avsr

Data Preparation 📊

Follow the steps in preparation to pre-process:

  • LRS3 and VoxCeleb2 datasets. For any other dataset, follow an analogous procedure.

Follow the steps in clustering (for pre-training only) to create:

  • {train, valid}.km frame-aligned pseudo-label files.
    The label_rate equals the feature frame rate used for clustering: by default, 100 Hz for MFCC features and 25 Hz for AV-HuBERT features.
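As a sanity check on the generated label files, the expected number of pseudo-label frames per utterance follows directly from the label rate. An illustrative helper (assuming durations in seconds; not part of the codebase):

```python
# Expected number of frame-aligned pseudo labels for one utterance,
# given its duration and the clustering feature rate.
# Illustrative helper, not part of this repository.
def num_label_frames(duration_s: float, label_rate_hz: int) -> int:
    return int(duration_s * label_rate_hz)

# A 4-second clip yields 400 labels at the MFCC rate (100 Hz)
# but only 100 labels at the AV-HuBERT feature rate (25 Hz).
print(num_label_frames(4.0, 100))  # 400
print(num_label_frames(4.0, 25))   # 100
```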

Training and Fine-tuning AV-HuBERT Models

Pre-train an AV-HuBERT model

To pre-train a model, run the following command, adjusting paths as necessary:

$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
  common.user_dir=`pwd`

Fine-tune an AV-HuBERT model with Seq2Seq

To fine-tune a pre-trained AV-HuBERT model at /path/to/checkpoint, run:

$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  task.tokenizer_bpe_model=/path/to/tokenizer model.w2v_path=/path/to/checkpoint \
  hydra.run.dir=/path/to/experiment/finetune/ common.user_dir=`pwd`

Decode an AV-HuBERT model

To decode a fine-tuned model, run:

$ cd avhubert
$ python -B infer_s2s.py --config-dir ./conf/ --config-name conf-name \
  dataset.gen_subset=test common_eval.path=/path/to/checkpoint \
  common_eval.results_path=/path/to/experiment/decode/s2s/test \
  override.modalities=['audio','video'] common.user_dir=`pwd`

Parameters like generation.beam and generation.lenpen can be adjusted to fine-tune the decoding process.
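For example, a decode run with a wider beam and a length penalty might look like the following (the values 20 and 1.0 are illustrative, not recommended settings; paths are placeholders as above):

```shell
$ cd avhubert
$ python -B infer_s2s.py --config-dir ./conf/ --config-name conf-name \
  dataset.gen_subset=test common_eval.path=/path/to/checkpoint \
  common_eval.results_path=/path/to/experiment/decode/s2s/test \
  generation.beam=20 generation.lenpen=1.0 \
  override.modalities=['audio','video'] common.user_dir=`pwd`
```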


License 📜

This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to share, copy, modify, and distribute this work, including for commercial purposes, as long as proper attribution is given to the original author.


