- Real-time deployment code.
- Code for evaluating BEATv2 benchmark results.
- Training code.
We recommend using Conda with Python 3.13:
conda create -n echoavatar python=3.13
conda activate echoavatar
pip install -r requirements.txt
If you run the real-time deployment across two machines, install this environment on the Ubuntu inference server. On the local Windows machine, only the following packages are required:
pip install sounddevice keyboard huggingface_hub
Download the checkpoints into the repository root before running inference:
hf download robinwitch/EchoAvatar --local-dir . --include "ckpts/**"
git clone https://huggingface.co/robinwitch/hf_transformer_mhubert_base_vp_en_es_fr_it3 ./ckpts/hf_transformer_mhubert_base_vp_en_es_fr_it3
EchoAvatar is a generic real-time audio-to-avatar-motion service. It receives audio, generates body motion and ARKit-compatible face coefficients, and streams the generated data to configurable output services.
audio source
-> tools/pushwav2server.py
-> audio2motion.py
-> stream2vmc.py -> body VMC receiver
-> stream2livelink.py -> facial LiveLink receiver
Recommended setup:
- Audio capture machine: captures browser, system, music, or microphone output.
- Inference server: runs EchoAvatar with an NVIDIA GPU. For simultaneous face and body generation, two RTX 3090 GPUs or better are recommended. If you generate only face motion or only body motion, one GPU is enough.
- Output receivers: body motion is sent through VMC only; facial ARKit coefficients are sent through LiveLink only.
Edit config/echoavatar.toml. This file controls the inference model, audio ports, body VMC target, facial LiveLink target, and the two internal TCP ports used by the separated stream services.
Default local wiring:
audio2motion.py
-> body_vmc 127.0.0.1:12346
-> face_livelink 127.0.0.1:12348
stream2vmc.py
listens on 12346
sends body VMC UDP to 127.0.0.1:39539
stream2livelink.py
listens on 12348
sends facial LiveLink UDP to 127.0.0.1:11111
For multi-machine deployment, change the host fields under
[[motion_receivers]] to the machines running stream2vmc.py and
stream2livelink.py, and change each streamer's target_host to the final
receiver machine.
Remote audio2motion hosting:
- Run audio2motion.py on the GPU machine that has the EchoAvatar environment, checkpoints, CUDA, and vLLM installed.
- Run stream2vmc.py and stream2livelink.py on the machine closest to the InZOI receiver, usually the local Windows or middleware machine.
- In [audio_sender], set server_host to the GPU machine IP or DNS name.
- In [[motion_receivers]], set host to the machine running each streamer as reachable from the GPU machine.
- In [stream2vmc], set target_host and target_port for the final VMC UDP receiver.
- In [stream2livelink], set target_host and target_port for the final LiveLink UDP receiver.
If the machines are not on the same LAN, use a VPN/tunnel such as Tailscale, WireGuard, or ZeroTier. Do not rely on public open ports unless you also add firewall allowlists and authentication at the network layer.
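Before starting a multi-machine session, it can help to confirm that the TCP hops are reachable across the LAN or tunnel. The following is a minimal sketch using only the standard library. It is not part of the repository, and the endpoints listed are placeholders taken from the default local wiring. It only checks TCP listeners, so the final VMC and LiveLink UDP targets cannot be verified this way.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder endpoints; replace with the hosts and ports from
    # config/echoavatar.toml for your deployment.
    checks = {
        "body_vmc streamer": ("127.0.0.1", 12346),
        "face_livelink streamer": ("127.0.0.1", 12348),
    }
    for label, (host, port) in checks.items():
        state = "reachable" if tcp_port_open(host, port) else "NOT reachable"
        print(f"{label} at {host}:{port} is {state}")
```

Run it from the GPU machine after the stream services have started, since a listener that is not yet running looks the same as a blocked port.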
Start the body-only VMC service before audio2motion.py:
python stream2vmc.py
This service only reads pose and trans from the EchoAvatar packet and only
sends VMC body bone messages. It does not send ARKit face data.
For direct InZOI body streaming, use coordinate_mode = "echoavatar_raw" in
config/echoavatar.toml. EchoAvatar already swaps root translation to Y-up on
the model host, and its vertical channel is pelvis height in centimeters. The
stream receiver grounds /VMC/Ext/Root/Pos and passes the model host's
already-converted stream quaternions through.
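For readers who want to test a VMC receiver without running the model, the sketch below hand-encodes a single /VMC/Ext/Root/Pos message in OSC format (a name string followed by three position floats and four quaternion floats) and sends it over UDP. It is an independent illustration, not the code in stream2vmc.py. The name "root" and the numeric values are placeholders, and it does not reproduce the grounding or coordinate handling described above.

```python
import socket
import struct


def _osc_string(text: str) -> bytes:
    """Encode an OSC string: ASCII, NUL-terminated, padded to a multiple of 4."""
    raw = text.encode("ascii") + b"\x00"
    return raw + b"\x00" * (-len(raw) % 4)


def build_root_pos(position, quaternion, name: str = "root") -> bytes:
    """Build a /VMC/Ext/Root/Pos message: name, px py pz, qx qy qz qw."""
    floats = [*position, *quaternion]
    if len(floats) != 7:
        raise ValueError("expected 3 position values and 4 quaternion values")
    return (
        _osc_string("/VMC/Ext/Root/Pos")
        + _osc_string(",s" + "f" * 7)      # type tags: one string, seven floats
        + _osc_string(name)
        + struct.pack(">7f", *floats)      # OSC floats are big-endian float32
    )


if __name__ == "__main__":
    # Placeholder pose: origin position, identity rotation.
    message = build_root_pos((0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 1.0))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Default VMC target from the local wiring above.
        sock.sendto(message, ("127.0.0.1", 39539))
    print(f"sent {len(message)} bytes")
```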
Start the face-only LiveLink service before audio2motion.py:
python stream2livelink.py
This service sends EchoAvatar's raw 52 ARKit facial coefficients over LiveLink. The extra LiveLink head/eye rotation slots are left neutral.
Run the audio-to-motion inference service on the GPU server:
python audio2motion.py
audio2motion.py reads config/echoavatar.toml, starts the original
EchoAvatar inference script, and fans out the neutral generated packet to each
enabled [[motion_receivers]] entry.
Install a virtual audio device such as VB-CABLE if you need to capture browser or application output. Route the application output to the virtual cable, then list local audio devices:
python tools/get_device.py
Start the audio sender:
python tools/pushwav2server.py
The audio sender reads [audio_sender] from config/echoavatar.toml. You can
still override values from the CLI, for example:
python tools/pushwav2server.py --server-ip <gpu-server-ip> --device <audio-device-index>
EchoAvatar has one input contract: audio. Route audio to the capture device used
by tools/pushwav2server.py, or implement the same TCP protocol directly.
Audio sender protocol:
TCP -> [audio2motion].audio_port
4-byte big-endian payload length
pickle payload containing a numpy-compatible audio block
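The framing above can be sketched in a few lines of standard-library Python. This is an illustration of the length-prefixed pickle framing only, not the code in tools/pushwav2server.py. It assumes the length is an unsigned 32-bit integer and uses a plain list of floats as a stand-in for the numpy audio block. The sample rate, dtype, and block size the server expects are not stated here, so check the sender script before implementing a compatible client.

```python
import pickle
import socket
import struct


def send_block(sock: socket.socket, block) -> None:
    """Send one frame: 4-byte big-endian length, then the pickled block."""
    payload = pickle.dumps(block)
    # ">I" is an assumption: unsigned 32-bit, big-endian.
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since recv() may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed in the middle of a frame")
        buf.extend(chunk)
    return bytes(buf)


def recv_block(sock: socket.socket):
    """Receive one frame and return the unpickled block."""
    (length,) = struct.unpack(">I", _recv_exact(sock, 4))
    return pickle.loads(_recv_exact(sock, length))


if __name__ == "__main__":
    # Loopback demonstration with a connected socket pair instead of a server.
    sender, receiver = socket.socketpair()
    with sender, receiver:
        send_block(sender, [0.0, 0.25, -0.25])  # stand-in for an audio block
        print(recv_block(receiver))
```

Because unpickling data can execute arbitrary code, a pickle-based port like this should only be exposed to trusted senders, which is consistent with the VPN recommendation above.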
If you find our code or paper helpful, please consider citing:
@inproceedings{chen2026echo,
author = {Bohong Chen and Yumeng Li and Yinglin Xu and Youyi Zheng and Yanlin Weng and Kun Zhou},
title = {EchoAvatar: Real-time Generative Avatar Animation from Audio Streams},
year = {2026},
isbn = {9798400725548},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3799902.3811066},
doi = {10.1145/3799902.3811066},
booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
series = {SIGGRAPH Conference Papers '26}
}
Thanks to EMAGE, ZeroEGGS, MotoricaDanceDataset, motorica-retarget, zeroeggs-retarget, torchtune, ichigo, T2M-GPT, MoMask, MECo, verl, vLLM, and encodec; our code partially borrows from them. Please check out these useful repos.