The official PyTorch implementation of the paper "MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model".
Please visit our webpage for more details.
🎉 5/Jun/25 - Our work has been published in Pattern Recognition.
📢 17/Jun/24 - First release: pretrained models, training and testing code.
conda create -n mmofusion python=3.7
conda activate mmofusion
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
pip install -r requirements.txt
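Optionally, verify that the environment sees the GPU before moving on (a quick sanity check, not part of the original setup):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"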
Download the BEAT dataset and choose the English data, v0.2.1.
We preprocess the data following DiffuseStyleGesture; thanks to the authors for their great work!
Download the audio preprocessing model WavLM-Large and the text preprocessing model crawl-300d-2M.
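For reference, a minimal sketch of loading the WavLM checkpoint and extracting speech features, following the usage shown in the official WavLM repository (the preprocessing script below handles this internally, so this is illustration only):
import torch
from WavLM import WavLM, WavLMConfig  # module from the official WavLM repo

# rebuild the model from the downloaded checkpoint
checkpoint = torch.load('/your/weights/WavLM-Large.pt')
cfg = WavLMConfig(checkpoint['cfg'])
model = WavLM(cfg)
model.load_state_dict(checkpoint['model'])
model.eval()

# extract frame-level features from 16 kHz audio (dummy input here)
wav = torch.randn(1, 16000)
with torch.no_grad():
    features = model.extract_features(wav)[0]
print(features.shape)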
cd ./process/
python process_BEAT_bvh.py /your/BEAT/path/ /path/to/BEAT/processed/ None None "v0" "step1" "cuda:0"
python process_BEAT_bvh.py /your/BEAT/path/ /path/to/BEAT/processed/ "/your/weights/WavLM-Large.pt" "/your/weights/crawl-300d-2M.vec" "v0" "step3" "cuda:0"
The processed data will be saved in /path/to/BEAT/processed/. Before converting it into an H5 file, you can split the data into train/val/test following our setting with the script data_split_30.ipynb. After that, generate the H5 file BEAT_v0_train.h5 by running:
python process_BEAT_bvh.py /your/BEAT/path/ /path/to/BEAT/processed/ None None "v0" "step4" "cuda:0"
and get the mean and std in ./process/ by running:
python calculate_gesture_statistics.py --dataset BEAT --version "v0"
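You can sanity-check the outputs of the two steps above; a minimal sketch, assuming h5py is installed (the statistics file names below are hypothetical, so check ./process/ for the names the script actually writes):
import h5py
import numpy as np

# list every group/dataset stored in the processed training file
with h5py.File('/path/to/BEAT/processed/BEAT_v0_train.h5', 'r') as f:
    f.visit(print)

# hypothetical file names for the saved statistics
mean = np.load('./process/gesture_BEAT_mean_v0.npy')
std = np.load('./process/gesture_BEAT_std_v0.npy')
print(mean.shape, std.shape)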
Download our pretrained models for motion generation with the upper body and the whole body. You can also find the pretrained autoencoder model last_600000.bin, which we trained on data from 30 speakers.
Edit model_path and e_path in the config to load the pretrained models for testing, and tst_path to load the processed test data.
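A hypothetical excerpt of those fields in the .yml config (field names come from the note above, the paths are placeholders; check the shipped config for the exact schema):
model_path: /your/weights/mmofusion_upper.pt  # pretrained diffusion model
e_path: /your/weights/last_600000.bin         # pretrained autoencoder
tst_path: /path/to/BEAT/processed/            # processed test data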
cd ./mydiffusion/
# for upper body
python sample_linear.py --config=./configs/mmofusion.yml --gpu 0
# for whole body
python sample_linear.py --config=./configs/mmofusion_whole.yml --gpu 0
You can also adjust the guidance weight guidance_param, since we use classifier-free guidance during training.
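For intuition, classifier-free guidance blends the conditional and unconditional model outputs at sampling time; a minimal sketch (the function name is hypothetical, not the repo's API):
import torch

def apply_guidance(out_cond: torch.Tensor, out_uncond: torch.Tensor, guidance_param: float) -> torch.Tensor:
    # guidance_param = 1.0 recovers the purely conditional prediction;
    # larger values push the motion harder toward the speech/style condition
    return out_uncond + guidance_param * (out_cond - out_uncond)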
Edit h5file in the config to point to the H5 file BEAT_v0_train.h5.
cd ./mydiffusion/
# for upper body
python train.py --config=./configs/mmofusion.yml --gpu 0
# for whole body
python train.py --config=./configs/mmofusion_whole.yml --gpu 0
To train the autoencoder:
cd ./mydiffusion/
# Edit your h5file path in the code
python train_ae.py --config=./configs/autoencoder.yml --gpu 0
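Illustrative only: the general shape of a motion autoencoder objective, as a toy stand-in rather than the repo's actual architecture (see train_ae.py for the real model and loss):
import torch
import torch.nn as nn

class ToyMotionAE(nn.Module):
    # hypothetical stand-in: compress a flattened pose vector to a latent
    # code and reconstruct it; the dimensions are made up for illustration
    def __init__(self, pose_dim=192, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(pose_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, pose_dim))

    def forward(self, x):
        return self.dec(self.enc(x))

model = ToyMotionAE()
poses = torch.randn(8, 192)  # a batch of flattened poses
loss = nn.functional.mse_loss(model(poses), poses)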
# data preprocess
cd ./process/
python process_custom_data.py /your/data/path/ /your/save/path/ /your/weights/WavLM-Large.pt /your/weights/crawl-300d-2M.vec "cuda:0"
# motion generation
cd ./mydiffusion/
# Edit the id or emo in the config and your h5file path in the code
python custom2motion.py --config=./configs/edit_style.yml --gpu 0
If you want to use your own speech data, you first need the corresponding transcript (text). Then use the Montreal Forced Aligner to generate a TextGrid file, which contains the alignment between the audio and the transcript.
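For example, with MFA installed and its pretrained English models downloaded, the alignment step looks like this (arguments depend on your language and setup; see the MFA documentation):
mfa align /your/corpus/ english_us_arpa english_us_arpa /your/textgrids/
Once you have a TextGrid, a minimal sketch of reading the word alignments with the textgrid package ("words" is the word-level tier MFA writes by default):
import textgrid

tg = textgrid.TextGrid.fromFile('/your/textgrids/sample.TextGrid')
for interval in tg.getFirst('words'):  # word-level tier produced by MFA
    print(interval.minTime, interval.maxTime, interval.mark)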
If you find this repo useful for your research, please consider citing our paper:
@article{WANG2026111774,
title = {MMoFusion: Multi-modal co-speech motion generation with diffusion model},
journal = {Pattern Recognition},
volume = {169},
pages = {111774},
year = {2026},
issn = {0031-3203},
doi = {10.1016/j.patcog.2025.111774},
author = {Sen Wang and Jiangning Zhang and Xin Tan and Zhifeng Xie and Chengjie Wang and Lizhuang Ma},
}
