The official PyTorch implementation of the paper "MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model".
Please visit our webpage for more details.
🎉 5/Jun/25 - Our work has been published in Pattern Recognition.
📢 17/Jun/24 - First release: pretrained models, training and testing code.
conda create -n mmofusion python=3.7
conda activate mmofusion
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
pip install -r requirements.txt
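Optionally, verify that the environment sees the GPU before moving on (a quick sanity check, not part of the original setup):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"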
Download the BEAT dataset and choose the English data, v0.2.1.
We preprocess the data following DiffuseStyleGesture; thanks to the authors for their great work!
Download the audio preprocessing model WavLM-Large and the text preprocessing model crawl-300d-2M.
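For reference, a minimal sketch of loading the WavLM checkpoint and extracting speech features, following the usage shown in the official WavLM repository (the preprocessing script below handles this internally, so this is illustration only):
import torch
from WavLM import WavLM, WavLMConfig  # module from the official WavLM repo

# rebuild the model from the downloaded checkpoint
checkpoint = torch.load('/your/weights/WavLM-Large.pt')
cfg = WavLMConfig(checkpoint['cfg'])
model = WavLM(cfg)
model.load_state_dict(checkpoint['model'])
model.eval()

# extract frame-level features from 16 kHz audio (dummy input here)
wav = torch.randn(1, 16000)
with torch.no_grad():
    features = model.extract_features(wav)[0]
print(features.shape)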
cd ./process/
python process_BEAT_bvh.py /your/BEAT/path/ /path/to/BEAT/processed/ None None "v0" "step1" "cuda:0"
python process_BEAT_bvh.py /your/BEAT/path/ /path/to/BEAT/processed/ "/your/weights/WavLM-Large.pt" "/your/weights/crawl-300d-2M.vec" "v0" "step3" "cuda:0"
The processed data will be saved in /path/to/BEAT/processed/. Before converting it into an H5 file, you can split the data into train/val/test following our setting with the script data_split_30.ipynb. After that, generate the H5 file BEAT_v0_train.h5 by running:
python process_BEAT_bvh.py /your/BEAT/path/ /path/to/BEAT/processed/ None None "v0" "step4" "cuda:0"
and get the mean and std in ./process/ by running:
python calculate_gesture_statistics.py --dataset BEAT --version "v0"
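You can sanity-check the outputs of the two steps above; a minimal sketch, assuming h5py is installed (the statistics file names below are hypothetical, so check ./process/ for the names the script actually writes):
import h5py
import numpy as np

# list every group/dataset stored in the processed training file
with h5py.File('/path/to/BEAT/processed/BEAT_v0_train.h5', 'r') as f:
    f.visit(print)

# hypothetical file names for the saved statistics
mean = np.load('./process/gesture_BEAT_mean_v0.npy')
std = np.load('./process/gesture_BEAT_std_v0.npy')
print(mean.shape, std.shape)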
Download our pretrained models for motion generation with the upper body and the whole body. You can also find the pretrained autoencoder model last_600000.bin, which we trained on data from 30 speakers.
Edit model_path and e_path in the config to load the pretrained models for testing, and tst_path to load the processed test data.
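A hypothetical excerpt of those fields in the .yml config (field names come from the note above, the paths are placeholders; check the shipped config for the exact schema):
model_path: /your/weights/mmofusion_upper.pt  # pretrained diffusion model
e_path: /your/weights/last_600000.bin         # pretrained autoencoder
tst_path: /path/to/BEAT/processed/            # processed test data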
cd ./mydiffusion/
# for upper body
python sample_linear.py --config=./configs/mmofusion.yml --gpu 0
# for whole body
python sample_linear.py --config=./configs/mmofusion_whole.yml --gpu 0
You can also adjust the guidance weight guidance_param, since we use classifier-free guidance during training.
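For intuition, classifier-free guidance blends the conditional and unconditional model outputs at sampling time; a minimal sketch (the function name is hypothetical, not the repo's API):
import torch

def apply_guidance(out_cond: torch.Tensor, out_uncond: torch.Tensor, guidance_param: float) -> torch.Tensor:
    # guidance_param = 1.0 recovers the purely conditional prediction;
    # larger values push the motion harder toward the speech/style condition
    return out_uncond + guidance_param * (out_cond - out_uncond)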
Edit h5file in the config to point to the H5 file BEAT_v0_train.h5.
cd ./mydiffusion/
# for upper body
python train.py --config=./configs/mmofusion.yml --gpu 0
# for whole body
python train.py --config=./configs/mmofusion_whole.yml --gpu 0
To train the autoencoder:
cd ./mydiffusion/
# Edit your h5file path in the code
python train_ae.py --config=./configs/autoencoder.yml --gpu 0
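Illustrative only: the general shape of a motion autoencoder objective, as a toy stand-in rather than the repo's actual architecture (see train_ae.py for the real model and loss):
import torch
import torch.nn as nn

class ToyMotionAE(nn.Module):
    # hypothetical stand-in: compress a flattened pose vector to a latent
    # code and reconstruct it; the dimensions are made up for illustration
    def __init__(self, pose_dim=192, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(pose_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, pose_dim))

    def forward(self, x):
        return self.dec(self.enc(x))

model = ToyMotionAE()
poses = torch.randn(8, 192)  # a batch of flattened poses
loss = nn.functional.mse_loss(model(poses), poses)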
# data preprocess
cd ./process/
python process_custom_data.py /your/data/path/ /your/save/path/ /your/weights/WavLM-Large.pt /your/weights/crawl-300d-2M.vec "cuda:0"
# motion generation
cd ./mydiffusion/
# Edit the id or emo in the config and your h5file path in the code
python custom2motion.py --config=./configs/edit_style.yml --gpu 0
If you want to use your own speech data, you first need the corresponding transcript (text). Then use the Montreal Forced Aligner to generate a TextGrid file, which contains the alignment between the audio and the transcript.
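For example, with MFA installed and its pretrained English models downloaded, the alignment step looks like this (arguments depend on your language and setup; see the MFA documentation):
mfa align /your/corpus/ english_us_arpa english_us_arpa /your/textgrids/
Once you have a TextGrid, a minimal sketch of reading the word alignments with the textgrid package ("words" is the word-level tier MFA writes by default):
import textgrid

tg = textgrid.TextGrid.fromFile('/your/textgrids/sample.TextGrid')
for interval in tg.getFirst('words'):  # word-level tier produced by MFA
    print(interval.minTime, interval.maxTime, interval.mark)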
If you find this repo useful for your research, please consider citing our paper:
@article{WANG2026111774,
title = {MMoFusion: Multi-modal co-speech motion generation with diffusion model},
journal = {Pattern Recognition},
volume = {169},
pages = {111774},
year = {2026},
issn = {0031-3203},
doi = {10.1016/j.patcog.2025.111774},
author = {Sen Wang and Jiangning Zhang and Xin Tan and Zhifeng Xie and Chengjie Wang and Lizhuang Ma},
}
