Skip to content

RobinWitch/MECo

Repository files navigation

MECO: Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models

Environment Setup

Create a conda environment with Python 3.13:

conda create -n meco python=3.13
conda activate meco

Install torchtune as a submodule:

git submodule update --init --recursive
cd torchtune
pip install -e .
cd ../

Install required dependencies:

pip install -r requirement_list.txt

Model Download

Download Base Model

mkdir ckpt
cd ckpt
git clone https://huggingface.co/robinwitch/MECo_BEAT2_2_RVQVAE
git clone https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
git clone https://huggingface.co/robinwitch/MECo_BEAT2_2_qwen2.5_0.5b_stage1
git clone https://huggingface.co/robinwitch/MECo_BEAT2_2_qwen2.5_0.5b_stage2
git clone https://huggingface.co/robinwitch/MECo_BEAT2_2_qwen2.5_0.5b_stage3
cd ../

Download Dataset

Download origin BEAT2 dataset:

huggingface-cli download H-Liu1997/BEAT2 \
  --repo-type dataset \
  --local-dir ./dataset/BEAT2 \
  --include "beat_english_v2.0.0/*"

Download additional dataset preprocessing model from Google Drive:

gdown https://drive.google.com/drive/folders/1MCks7CMNBtAzU2XihYezNmiGT_6pWex8?usp=drive_link -O ./dataset/hub --folder

Download tokenized BEAT2 dataset:

git clone https://huggingface.co/datasets/robinwitch/meco_mhubert1000_beat2_2 dataset/meco_mhubert1000_beat2_2

Download mHuBERT Model

mkdir ckpt/mhubert_base_1000
wget -P ckpt/mhubert_base_1000 https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3.pt
wget -P ckpt/mhubert_base_1000 https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin

Clone fairseq repository:

mkdir refer
git clone https://github.com/facebookresearch/fairseq.git refer/fairseq

Python 3.13 Compatibility Fix

When using fairseq with Python 3.13, you may encounter a ValueError related to mutable defaults. To fix this issue:

  1. Locate your Python dataclasses file: /path/to/your/env/lib/python3.13/dataclasses.py
  2. Comment out lines 859-861:
# if f._field_type is _FIELD and f.default.__class__.__hash__ is None:
#     raise ValueError(f'mutable default {type(f.default)} for field '
#                      f'{f.name} is not allowed: use default_factory')

Training

Follow these steps to train the model:

Step 1: Calculate Mean and Standard Deviation

python scripts_vqvae/0_get_mean_std.py

Step 2: Train RVQVAE

python scripts_vqvae/1_train_vqvae.py

Step 3: Build LLM with Expanded Vocabulary

python scripts/0_build_qwen2.5_0.5b.py

Step 4: Disable Tied Embedding

python scripts/1_detied_embedding.py

Step 5: Build Training Dataset

python scripts/2_build_dataset_mhubert1000.py

Step 6: Three-Stage LLM Training

Execute the three training commands in scripts/3_train_llm.md sequentially to complete the three-stage training of the LLM.

Evaluation

Audio-Only Speech-to-Gesture Evaluation

python scripts/4_eval_gpt_decode_mhubert1000.py

Speech-to-Gesture Evaluation with Motion Prompts

python scripts/5_eval_gpt_decode_mhubert1000_withprompt.py

Visualization

Following EMAGE, you can download SMPLX blender addon, and install it in your blender 3.x or 4.x. Click the button Add Animation to visualize the generated smplx file (like xxx.npz).

Performance Benchmarks

Due to the inherent diversity of GPT model inference:

  • Stage 2 (Audio-only generation): FID scores typically range from 0.32 to 0.36
  • Stage 3 (With motion example control): FID scores typically range from 0.27 to 0.29

Citation

If you find this work useful, please consider citing:

@inproceedings{chen2025meco,
  author = {Bohong Chen and Yumeng Li and Youyi Zheng and Yao-Xiang Ding and Kun Zhou},
  title = {Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models},
  year = {2025},
  isbn = {9798400715402},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3721238.3730611},
  doi = {10.1145/3721238.3730611},
  booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
  series = {SIGGRAPH Conference Papers '25}
}

Acknowledgments

Thanks to EMAGE, torchtune, ichigo, T2M-GPT, MoMask, SynTalker, our code is partially borrowing from them. Please check these useful repos.

About

Official Implementation of the Paper:Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models (Siggraph 2025)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages