EchoAvatar: Real-time Generative Avatar Animation from Audio Streams

Project Page • Arxiv Paper • Demo Video • Citation

Release Plans

Real-time deployment code.
Code for evaluating BEATv2 benchmark results.
Training code.

Environment Setup

We recommend using Conda with Python 3.13:

conda create -n echoavatar python=3.13
conda activate echoavatar
pip install -r requirements.txt

If you run the real-time deployment across two machines, install this environment on the Ubuntu inference server. On the local Windows machine, only the following packages are required:

pip install sounddevice keyboard huggingface_hub

Checkpoints

Download the checkpoints into the repository root before running inference:

hf download robinwitch/EchoAvatar --local-dir . --include "ckpts/**"
git clone https://huggingface.co/robinwitch/hf_transformer_mhubert_base_vp_en_es_fr_it3 ./ckpts/hf_transformer_mhubert_base_vp_en_es_fr_it3

Real-time deployment

EchoAvatar is a generic real-time audio-to-avatar-motion service. It receives audio, generates body motion and ARKit-compatible face coefficients, and streams the generated data to configurable output services.

audio source
  -> tools/pushwav2server.py
  -> audio2motion.py
  -> stream2vmc.py       -> body VMC receiver
  -> stream2livelink.py  -> facial LiveLink receiver

Recommended setup:

Audio capture machine: captures browser, system, music, or microphone output.
Inference server: runs EchoAvatar on an NVIDIA GPU. For simultaneous face and body generation, two RTX 3090 GPUs or better are recommended. If you generate only face motion or only body motion, one GPU is enough.
Output receivers: body motion is sent through VMC only; facial ARKit coefficients are sent through LiveLink only.

1. Configure Runtime

Edit config/echoavatar.toml. This file controls the inference model, audio ports, body VMC target, facial LiveLink target, and the two internal TCP ports used by the separated stream services.

Default local wiring:

audio2motion.py
  -> body_vmc      127.0.0.1:12346
  -> face_livelink 127.0.0.1:12348

stream2vmc.py
  listens on 12346
  sends body VMC UDP to 127.0.0.1:39539

stream2livelink.py
  listens on 12348
  sends facial LiveLink UDP to 127.0.0.1:11111

For multi-machine deployment, change the host fields under [[motion_receivers]] to the machines running stream2vmc.py and stream2livelink.py, and change each streamer's target_host to the final receiver machine.

Remote audio2motion hosting:

Run audio2motion.py on the GPU machine that has the EchoAvatar environment, checkpoints, CUDA, and vLLM installed.
Run stream2vmc.py and stream2livelink.py on the machine closest to the InZOI receiver, usually the local Windows or middleware machine.
In [audio_sender], set server_host to the GPU machine IP or DNS name.
In [[motion_receivers]], set host to the machine running each streamer as reachable from the GPU machine.
In [stream2vmc], set target_host and target_port for the final VMC UDP receiver.
In [stream2livelink], set target_host and target_port for the final LiveLink UDP receiver.
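
Before starting audio2motion.py on a remote GPU machine, it can help to confirm that the streamer TCP ports are reachable from that machine. The following is a minimal sketch and not part of the repository: tcp_reachable is a hypothetical helper, and the host/port pairs are the local defaults listed above. It opens and immediately closes a TCP connection; how the streamers react to such a probe is not covered here, and the final UDP targets (VMC, LiveLink) cannot be checked this way.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Replace with the `host` values from [[motion_receivers]].
    receivers = {
        "body_vmc": ("127.0.0.1", 12346),
        "face_livelink": ("127.0.0.1", 12348),
    }
    for name, (host, port) in receivers.items():
        status = "reachable" if tcp_reachable(host, port) else "NOT reachable"
        print(f"{name} {host}:{port} {status}")
```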

If the machines are not on the same LAN, use a VPN/tunnel such as Tailscale, WireGuard, or ZeroTier. Do not rely on public open ports unless you also add firewall allowlists and authentication at the network layer.

2. Start Body VMC Streamer

Start the body-only VMC service before audio2motion.py:

python stream2vmc.py

This service only reads pose and trans from the EchoAvatar packet and only sends VMC body bone messages. It does not send ARKit face data.

For direct InZOI body streaming, use coordinate_mode = "echoavatar_raw" in config/echoavatar.toml. EchoAvatar already swaps root translation to Y-up on the model host, and its vertical channel is pelvis height in centimeters. The stream receiver grounds /VMC/Ext/Root/Pos and passes the model host's already-converted stream quaternions through.
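
For readers unfamiliar with the wire format, the sketch below shows how a single /VMC/Ext/Root/Pos message can be encoded as an OSC packet using only the standard library. It follows the public VMC protocol layout (a name string, three position floats, four quaternion floats) and is not taken from stream2vmc.py; the function names and the sample values are illustrative, and any unit conversion or grounding applied by the real streamer is not reproduced here.

```python
import struct


def _osc_string(value: str) -> bytes:
    """Encode an OSC string: null-terminated, padded to a 4-byte boundary."""
    data = value.encode("utf-8") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def build_root_pos_message(name, position, quaternion) -> bytes:
    """Build an OSC message for /VMC/Ext/Root/Pos.

    position is (x, y, z); quaternion is (x, y, z, w).
    """
    args = (*position, *quaternion)
    return (
        _osc_string("/VMC/Ext/Root/Pos")
        + _osc_string(",s" + "f" * len(args))   # type tags: 1 string, 7 floats
        + _osc_string(name)
        + struct.pack(f">{len(args)}f", *args)  # big-endian float32 values
    )


# Illustrative values only: identity rotation, root one unit above the origin.
packet = build_root_pos_message("root", (0.0, 1.0, 0.0), (0.0, 0.0, 0.0, 1.0))
print(len(packet), "bytes")
```

Such a packet can be sent with socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(packet, (target_host, target_port)), using the [stream2vmc] target values.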

3. Start Facial LiveLink Streamer

Start the face-only LiveLink service before audio2motion.py:

python stream2livelink.py

This service sends EchoAvatar's raw 52 ARKit facial coefficients over LiveLink. The extra LiveLink head/eye rotation slots are left neutral.
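
The idea of leaving the extra slots neutral can be sketched as follows. This assumes the common Live Link Face layout of 61 values per frame (52 ARKit coefficients followed by 9 head and eye rotation values); that layout, the slot names, and the helper to_livelink_frame are assumptions for illustration and are not taken from stream2livelink.py.

```python
ARKIT_COEFF_COUNT = 52

# Assumed order of the trailing rotation slots in a Live Link Face frame.
ROTATION_SLOTS = (
    "HeadYaw", "HeadPitch", "HeadRoll",
    "LeftEyeYaw", "LeftEyePitch", "LeftEyeRoll",
    "RightEyeYaw", "RightEyePitch", "RightEyeRoll",
)


def to_livelink_frame(coeffs):
    """Append neutral (0.0) rotation slots to 52 raw ARKit coefficients."""
    if len(coeffs) != ARKIT_COEFF_COUNT:
        raise ValueError(
            f"expected {ARKIT_COEFF_COUNT} coefficients, got {len(coeffs)}"
        )
    # The coefficients are passed through unchanged; only the tail is padded.
    return [float(c) for c in coeffs] + [0.0] * len(ROTATION_SLOTS)


frame = to_livelink_frame([0.25] * ARKIT_COEFF_COUNT)
print(len(frame))
```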

4. Start Audio2Motion

Run the audio-to-motion inference service on the GPU server:

python audio2motion.py

audio2motion.py reads config/echoavatar.toml, starts the original EchoAvatar inference script, and fans out the neutral generated packet to each enabled [[motion_receivers]] entry.

5. Audio Capture

Install a virtual audio device such as VB-CABLE if you need to capture browser or application output. Route the application output to the virtual cable, then list local audio devices:

python tools/get_device.py

Start the audio sender:

python tools/pushwav2server.py

The audio sender reads [audio_sender] from config/echoavatar.toml. You can still override values from the CLI, for example:

python tools/pushwav2server.py --server-ip <gpu-server-ip> --device <audio-device-index>

6. Audio Input Protocol

EchoAvatar has one input contract: audio. Route audio to the capture device used by tools/pushwav2server.py, or implement the same TCP protocol directly.

Audio sender protocol:

TCP -> [audio2motion].audio_port
4-byte big-endian payload length
pickle payload containing a numpy-compatible audio block
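
A minimal sketch of this framing, for anyone implementing a custom sender, is shown below. The function names are hypothetical, the length prefix is assumed to be unsigned, and a plain list of floats stands in for the numpy-compatible block so the example needs only the standard library. The sample rate, dtype, and block size expected by the server are not specified in this README, so check tools/pushwav2server.py before relying on it.

```python
import pickle
import socket
import struct


def send_audio_block(sock: socket.socket, block) -> None:
    """Send one frame: 4-byte big-endian length, then the pickled block."""
    payload = pickle.dumps(block, protocol=pickle.HIGHEST_PROTOCOL)
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def _recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes; TCP may deliver a frame in several pieces."""
    buffer = bytearray()
    while len(buffer) < size:
        chunk = sock.recv(size - len(buffer))
        if not chunk:
            raise ConnectionError("socket closed in the middle of a frame")
        buffer.extend(chunk)
    return bytes(buffer)


def recv_audio_block(sock: socket.socket):
    """Receive one frame written by send_audio_block."""
    (length,) = struct.unpack(">I", _recv_exact(sock, 4))
    return pickle.loads(_recv_exact(sock, length))


if __name__ == "__main__":
    # Local round trip over a connected socket pair; no server required.
    sender, receiver = socket.socketpair()
    send_audio_block(sender, [0.0, 0.5, -0.5])
    print(recv_audio_block(receiver))
    sender.close()
    receiver.close()
```

Because unpickling data from an untrusted peer can execute arbitrary code, the audio port should only be exposed on a trusted network, which is one more reason to use a VPN or tunnel for multi-machine setups.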

Citation

If you find our code or paper helpful, please consider citing:

@inproceedings{chen2026echo,
  author = {Bohong Chen and Yumeng Li and Yinglin Xu and Youyi Zheng and Yanlin Weng and Kun Zhou},
  title = {EchoAvatar: Real-time Generative Avatar Animation from Audio Streams},
  year = {2026},
  isbn = {9798400725548},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3799902.3811066},
  doi = {10.1145/3799902.3811066},
  booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
  series = {SIGGRAPH Conference Papers '26}
}

Acknowledgments

Thanks to EMAGE, ZeroEGGS, MotoricaDanceDataset, motorica-retarget, zeroeggs-retarget, torchtune, ichigo, T2M-GPT, MoMask, MECo, verl, vLLM, and encodec; our code partially borrows from them. Please check out these useful repos.
