- Real-time deployment code.
- Code for evaluating BEATv2 benchmark results.
- Training code.
We recommend using Conda with Python 3.13:
conda create -n echoavatar python=3.13
conda activate echoavatar
pip install -r requirements.txt
If you run the real-time deployment across two machines, install this environment on the Ubuntu inference server. On the local Windows machine, only the following packages are required:
pip install sounddevice keyboard huggingface_hub
Download the checkpoints into the repository root before running inference:
hf download robinwitch/EchoAvatar --local-dir . --include "ckpts/**"
git clone https://huggingface.co/robinwitch/hf_transformer_mhubert_base_vp_en_es_fr_it3 ./ckpts/hf_transformer_mhubert_base_vp_en_es_fr_it3
This section describes how to deploy the real-time inference pipeline. Once the pipeline is running, audio from a browser or another local application can drive the avatar in Unity. We recommend verifying the browser-audio workflow below first, then connecting a voice agent to the same virtual-audio path.
The deployment has two sides:
- Local Windows machine: runs Unity, receives audio from the browser or voice agent, and uses tools/pushwav2server.py to stream audio to the server.
- Ubuntu inference server: runs the audio-to-motion inference script, receives the audio stream, generates face/body motion, and sends motion data back to Unity.
The basic data flow is:
Browser / voice agent audio output
-> VB-CABLE virtual audio device
-> Unity calls tools/pushwav2server.py
-> Ubuntu audio-to-motion inference server
-> Unity receives motion data and drives the avatar
Recommended setup:
- Local side: Windows machine.
- Server side: Ubuntu server with NVIDIA GPU. For simultaneous face and body generation, we recommend dual RTX 3090 or better. If you only generate face motion or only body motion, one GPU is enough.
Install VB-CABLE. It is used to capture audio from the browser or another system application so that the streaming script can read it later.
Using Chrome as an example, open:
Settings -> Sound -> Volume mixer -> Apps
Find Google Chrome and set its output device to:
CABLE In 16ch (VB-Audio Virtual Cable)
If Chrome does not appear in the app list, open a webpage and play audio in Chrome first. Windows may only show Chrome in the mixer after it starts producing audio.
Copy the local repository tools directory to the Windows machine. Unity will call scripts from this directory, including tools/pushwav2server.py, to stream local audio to the Ubuntu inference server.
Then list the local audio devices:
python tools/get_device.py
Find the device whose name contains:
CABLE Output (VB-Audio Virtual Cable)
Record its device index and edit tools/pushwav2server.py:
- Set SERVER_IP (Line 6) to the Ubuntu inference server IP.
- Set input_device_index (Line 8) to the device index of CABLE Output. The default example value is 2, but it may be different on your machine.
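If you prefer to look up the index programmatically, the sketch below shows one way to do it. It is not part of the repository: find_input_device is a hypothetical helper, and it assumes the device entries are dictionaries with "name" and "max_input_channels" keys, as returned by sounddevice.query_devices(), and that tools/get_device.py reports indices in the same order.

```python
from typing import Iterable, Mapping, Optional


def find_input_device(
    devices: Iterable[Mapping], name_fragment: str = "CABLE Output"
) -> Optional[int]:
    """Return the index of the first input-capable device whose name
    contains name_fragment (case-insensitive), or None if none matches.

    Hypothetical helper, not part of tools/get_device.py.
    """
    wanted = name_fragment.lower()
    for index, device in enumerate(devices):
        name = str(device.get("name", "")).lower()
        # Only devices that can record (input channels > 0) are candidates.
        if wanted in name and device.get("max_input_channels", 0) > 0:
            return index
    return None


# Usage on the Windows machine (requires the sounddevice package):
#   import sounddevice as sd
#   print(find_input_device(sd.query_devices()))
```

On Windows the same physical device can be listed several times, once per host API, so the first match is not necessarily the entry you want. Compare the result with the output of tools/get_device.py before editing input_device_index.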
Download the Unity package to the Windows machine:
hf download robinwitch/EchoAvatar --local-dir . --include "echoavatar_unity.zip"
Unzip echoavatar_unity.zip, then open Unity Hub and choose:
Add project from disk
The project uses Unity Editor 6000.0.58f1 (LTS).
In the Unity panel, set:
- Python Exe Path: path to the local python.exe.
- Python Script Path: path to tools/pushwav2server.py.
Adjust the paths according to your local environment. When Unity enters the streaming workflow, it will call this script and send audio from VB-CABLE to the server.
Run the audio-to-motion inference script on the server:
scripts/5_streaming_vllm_unity_30fps_bp_attn4_encodec2_multirvq_nbc512_motionexample_withface_ik.py
Before launching it:
- Set MOTION_SERVER_HOST (Line 41) to the IP address of the machine running Unity. After generating motion, the inference script connects to this address and sends motion data back to Unity.
- Make sure port 12345 on the Ubuntu server is reachable from the Windows machine. This port receives audio from tools/pushwav2server.py.
- Make sure the required checkpoints exist, for example ./ckpts/body_g or ./ckpts/body_g_d.
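The checkpoint check in the list above can be scripted. The following is a minimal sketch, not part of the repository: missing_checkpoints is a hypothetical helper, and it only tests that each path exists under the repository root, not that its contents are complete.

```python
from pathlib import Path
from typing import Iterable, List


def missing_checkpoints(paths: Iterable[str], root: str = ".") -> List[str]:
    """Return the entries of paths that do not exist under root.

    Hypothetical helper; an empty list means every path was found.
    """
    base = Path(root)
    return [p for p in paths if not (base / p).exists()]


# Usage from the repository root on the Ubuntu server:
#   print(missing_checkpoints(["ckpts/body_g", "ckpts/body_g_d"]))
```

Pass only the checkpoint you plan to load with --model_name if you do not need both modes.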
MOTION_SERVER_PORT (Line 45) defaults to 12346 and usually does not need to be changed.
If you also want semantic action control, make sure port 12346 on the Ubuntu server is reachable. In the inference script, this is TEXT_SERVER_PORT (Line 726). You can ignore this when only testing browser-audio driving.
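To confirm that a port is reachable before starting a full session, a simple connection test can help. The sketch below is not part of the repository and rests on an assumption the text above does not confirm: that the server listens on TCP. If the audio or text channel uses UDP, this check does not apply. tcp_port_open is a hypothetical helper name.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper; assumes the target service accepts TCP connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Usage from the Windows machine while the inference script is running
# (replace the placeholder address with your Ubuntu server IP):
#   print(tcp_port_open("192.168.1.50", 12345))
```

A False result can mean a firewall rule, a wrong IP address, or simply that the inference script is not running yet, so run the check only after the server has started listening.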
Start the Unity application on the Windows machine first and enter Play mode. The Unity-side motion receiver should be ready before the server script tries to connect to it.
For speech-to-gesture only, run:
python scripts/5_streaming_vllm_unity_30fps_bp_attn4_encodec2_multirvq_nbc512_motionexample_withface_ik.py --model_name ./ckpts/body_g
This mode does not require a specific speech timbre.
For both speech-to-gesture and music-to-dance, run:
python scripts/5_streaming_vllm_unity_30fps_bp_attn4_encodec2_multirvq_nbc512_motionexample_withface_ik.py --model_name ./ckpts/body_g_d
Due to current dataset limitations, speech audio in this mode should be close to the training-data timbre. We recommend voice cloning with the female ZeroEGGS sample: tools/015_Happy_4_x_1_0.wav
Music audio has no specific timbre or genre requirement.
Only configure this step if you need semantic control. The script can run on the Windows machine or on any other machine that can access port 12346 on the Ubuntu inference server.
Before running it, edit tools/action_send.py and set ACTION_SERVER_HOST (Line 10) to the Ubuntu inference server IP.
Then send predefined semantic action signals with:
python tools/action_send.py
After the browser-audio workflow is verified, you can connect a voice agent. Whether you use a cloud API or a local deployment, the key requirement is the same: the voice agent's final audio output must be routed to VB-CABLE.
Options:
- Cloud API: ElevenLabs is recommended. It supports voice cloning and does not require extra local GPUs, but usually requires a paid subscription. See ElevenLabs voice agent setup.
- Local deployment: Pipecat can be used, but it usually requires additional GPU resources. See Pipecat setup.
Timbre requirements depend on the model:
- If you only need speech-to-gesture, the TTS voice does not need a specific timbre.
- If you need both speech-to-gesture and music-to-dance in the streaming process, we recommend cloning a female ZeroEGGS voice for speech TTS. The recommended reference audio is tools/015_Happy_4_x_1_0.wav.
If you find our code or paper helpful, please consider citing:
@inproceedings{chen2026echo,
author = {Bohong Chen and Yumeng Li and Yinglin Xu and Youyi Zheng and Yanlin Weng and Kun Zhou},
title = {EchoAvatar: Real-time Generative Avatar Animation from Audio Streams},
year = {2026},
isbn = {9798400725548},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3799902.3811066},
doi = {10.1145/3799902.3811066},
booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
series = {SIGGRAPH Conference Papers '26}
}
Thanks to EMAGE, ZeroEGGS, MotoricaDanceDataset, motorica-retarget, zeroeggs-retarget, torchtune, ichigo, T2M-GPT, MoMask, MECo, verl, vLLM, and encodec; our code partially borrows from them. Please check out these useful repos.