EchoAvatar: Real-time Generative Avatar Animation from Audio Streams

Project Page • Arxiv Paper • Demo Video • Citation

Release Plans

Real-time deployment code.
Code for evaluating BEATv2 benchmark results.
Training code.

Environment Setup

We recommend using Conda with Python 3.13:

conda create -n echoavatar python=3.13
conda activate echoavatar
pip install -r requirements.txt

If you run the real-time deployment across two machines, install this environment on the Ubuntu inference server. On the local Windows machine, only the following packages are required:

pip install sounddevice keyboard huggingface_hub

Checkpoints

Download the checkpoints into the repository root before running inference:

hf download robinwitch/EchoAvatar --local-dir . --include "ckpts/**"
git clone https://huggingface.co/robinwitch/hf_transformer_mhubert_base_vp_en_es_fr_it3 ./ckpts/hf_transformer_mhubert_base_vp_en_es_fr_it3
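
Before launching inference, it can help to confirm that the downloads landed where the scripts expect them. The following is a minimal sketch and not part of the repository: missing_checkpoints is a hypothetical helper, and the paths are the ones mentioned elsewhere in this README. Which of them a given run needs depends on the mode you launch.

```python
from pathlib import Path

# Paths as they appear in this README; adjust if your layout differs.
EXPECTED_PATHS = (
    "ckpts/body_g",
    "ckpts/body_g_d",
    "ckpts/hf_transformer_mhubert_base_vp_en_es_fr_it3",
)


def missing_checkpoints(repo_root=".", expected=EXPECTED_PATHS):
    """Return the expected checkpoint paths that are absent or are empty directories."""
    root = Path(repo_root)
    missing = []
    for rel in expected:
        path = root / rel
        if not path.exists():
            missing.append(rel)
        elif path.is_dir() and not any(path.iterdir()):
            # An existing directory with no entries often means an interrupted download.
            missing.append(rel)
    return missing


if __name__ == "__main__":
    absent = missing_checkpoints()
    if absent:
        print("Missing or empty:", ", ".join(absent))
    else:
        print("All expected checkpoint paths are present.")
```

Run it from the repository root. An empty result only shows that the paths exist, not that the files inside them are complete.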

Real-time deployment

This section describes how to deploy the real-time inference pipeline. After the pipeline is running, audio from a browser or another local application can drive the avatar in Unity. We recommend first verifying the browser-audio workflow below, then connecting a voice agent to the same virtual-audio path.

1. Overview

The deployment has two sides:

Local Windows machine: runs Unity, receives audio from the browser or voice agent, and uses tools/pushwav2server.py to stream audio to the server.
Ubuntu inference server: runs the audio-to-motion inference script, receives the audio stream, generates face/body motion, and sends motion data back to Unity.

The basic data flow is:

Browser / voice agent audio output
  -> VB-CABLE virtual audio device
  -> Unity calls tools/pushwav2server.py
  -> Ubuntu audio-to-motion inference server
  -> Unity receives motion data and drives the avatar

Recommended setup:

Local side: Windows machine.
Server side: Ubuntu server with an NVIDIA GPU. For simultaneous face and body generation, we recommend two RTX 3090 GPUs or better. If you generate only face motion or only body motion, one GPU is enough.

2. Windows setup

2.1 Install a virtual audio device

Install VB-CABLE. It is used to capture audio from the browser or another system application so that the streaming script can read it later.

2.2 Route browser audio to VB-CABLE

Using Chrome as an example, open:

Settings -> Sound -> Volume mixer -> Apps

Find Google Chrome and set its output device to:

CABLE In 16ch (VB-Audio Virtual Cable)

If Chrome does not appear in the app list, open a webpage and play audio in Chrome first. Windows may only show Chrome in the mixer after it starts producing audio.

2.3 Prepare the audio streaming tools

Copy the repository's tools directory to the Windows machine. Unity will call scripts from this directory, including tools/pushwav2server.py, to stream local audio to the Ubuntu inference server.

Then list the local audio devices:

python tools/get_device.py

Find the device whose name contains:

CABLE Output (VB-Audio Virtual Cable)

Record its device index and edit tools/pushwav2server.py:

Set SERVER_IP (Line 6) to the Ubuntu inference server IP.
Set input_device_index (Line 8) to the device index of CABLE Output. The default example value is 2, but it may be different on your machine.
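
The device lookup above can also be done by name instead of scanning the list by eye. This is a minimal sketch, assuming that the index expected by tools/pushwav2server.py follows sounddevice's device numbering (sounddevice is one of the packages installed on the Windows machine). find_input_devices is a hypothetical helper and is not part of the repository.

```python
def find_input_devices(devices, name_fragment="CABLE Output"):
    """Return (index, name) pairs for input-capable devices whose name contains name_fragment.

    `devices` is any sequence of dicts with "name" and "max_input_channels" keys,
    which is the shape returned by sounddevice.query_devices().
    """
    fragment = name_fragment.lower()
    matches = []
    for index, dev in enumerate(devices):
        # Skip playback-only entries: they cannot be opened for recording.
        if dev["max_input_channels"] > 0 and fragment in dev["name"].lower():
            matches.append((index, dev["name"]))
    return matches
```

With sounddevice installed, calling find_input_devices(sounddevice.query_devices()) returns the candidate indices. On Windows the same device may be listed more than once, once per host API, so several matches are normal. If the first one does not capture audio, try the others.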

2.4 Configure the Unity project

Download the Unity package to the Windows machine:

hf download robinwitch/EchoAvatar --local-dir . --include "echoavatar_unity.zip"

Unzip echoavatar_unity.zip, then open Unity Hub and choose: Add project from disk

The project uses Unity Editor 6000.0.58f1 (LTS).

In the Unity panel, set:

Python Exe Path: path to the local python.exe.
Python Script Path: path to tools/pushwav2server.py.

Adjust the paths according to your local environment. When Unity enters the streaming workflow, it will call this script and send audio from VB-CABLE to the server.

3. Ubuntu server setup

Run the audio-to-motion inference script on the server:

scripts/5_streaming_vllm_unity_30fps_bp_attn4_encodec2_multirvq_nbc512_motionexample_withface_ik.py

Before launching it:

Set MOTION_SERVER_HOST (Line 41) to the IP address of the machine running Unity. After generating motion, the inference script connects to this address and sends motion data back to Unity.
Make sure port 12345 on the Ubuntu server is reachable from the Windows machine. This port receives audio from tools/pushwav2server.py.
Make sure the required checkpoints exist, for example ./ckpts/body_g or ./ckpts/body_g_d.

MOTION_SERVER_PORT (Line 45) is the port on the Unity machine that receives the motion data. It defaults to 12346 and usually does not need to be changed.

If you also want semantic action control, make sure port 12346 on the Ubuntu server is reachable. In the inference script, this is TEXT_SERVER_PORT (Line 726). You can ignore this when only testing browser-audio driving.
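
Firewall rules are a common reason the two machines fail to connect. The sketch below probes a port from the other machine. It assumes the streams use TCP, which this README does not state, and port_reachable is a hypothetical helper rather than a repository tool.

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, port_reachable("<server-ip>", 12345) run on the Windows machine checks the audio port. A port only answers while something is listening on it, so a False result before the inference script has started is expected. The probe opens and closes a real connection, which the listening script may treat as a client, so restart the script afterwards if it behaves oddly.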

4. Launch order

4.1 Start Unity

Start the Unity application on the Windows machine first and enter Play mode. The Unity-side motion receiver should be ready before the server script tries to connect to it.

4.2 Start the server inference script

For speech-to-gesture only, run:

python scripts/5_streaming_vllm_unity_30fps_bp_attn4_encodec2_multirvq_nbc512_motionexample_withface_ik.py --model_name ./ckpts/body_g

This mode does not require a specific speech timbre.

For both speech-to-gesture and music-to-dance, run:

python scripts/5_streaming_vllm_unity_30fps_bp_attn4_encodec2_multirvq_nbc512_motionexample_withface_ik.py --model_name ./ckpts/body_g_d

Due to current dataset limitations, speech audio in this mode should be close to the training-data timbre. We recommend voice cloning with the female ZeroEGGS sample: tools/015_Happy_4_x_1_0.wav

Music audio has no specific timbre or genre requirement.
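
The two launch commands above differ only in the checkpoint passed to --model_name. If you switch modes often, a small wrapper can keep that choice in one place. This is a sketch under stated assumptions: the mode labels "gesture" and "gesture_dance" are hypothetical names introduced here, while the script path and checkpoint paths are copied from the commands above.

```python
import shlex
import sys

SCRIPT = (
    "scripts/5_streaming_vllm_unity_30fps_bp_attn4_encodec2_"
    "multirvq_nbc512_motionexample_withface_ik.py"
)

# Hypothetical mode labels mapped to the checkpoints named in this README.
MODEL_BY_MODE = {
    "gesture": "./ckpts/body_g",          # speech-to-gesture only
    "gesture_dance": "./ckpts/body_g_d",  # speech-to-gesture and music-to-dance
}


def build_command(mode, python_exe=sys.executable):
    """Return the argument list that launches the inference script for `mode`."""
    if mode not in MODEL_BY_MODE:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODEL_BY_MODE)}")
    return [python_exe, SCRIPT, "--model_name", MODEL_BY_MODE[mode]]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_command("gesture")))
```

The returned list can be passed directly to subprocess.run from the repository root once Unity is in Play mode.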

5. Optional: semantic action control

Only configure this step if you need semantic control. The script can run on the Windows machine or on any other machine that can access port 12346 on the Ubuntu inference server.

Before running it, edit tools/action_send.py and set ACTION_SERVER_HOST (Line 10) to the Ubuntu inference server IP.

Then send predefined semantic action signals with:

python tools/action_send.py

6. Voice agent integration

After the browser-audio workflow is verified, you can connect a voice agent. Whether you use a cloud API or a local deployment, the key requirement is the same: the voice agent's final audio output must be routed to VB-CABLE.

Options:

Cloud API: ElevenLabs is recommended. It supports voice cloning and does not require extra local GPUs, but usually requires a paid subscription. See ElevenLabs voice agent setup.
Local deployment: Pipecat can be used, but it usually requires additional GPU resources. See Pipecat setup.

Timbre requirements depend on the model:

If you only need speech-to-gesture, the TTS voice does not need a specific timbre.
If you need both speech-to-gesture and music-to-dance in the streaming process, we recommend cloning a female ZeroEGGS voice for speech TTS. The recommended reference audio is tools/015_Happy_4_x_1_0.wav.

Citation

If you find our code or paper helpful, please consider citing:

@inproceedings{chen2026echo,
  author = {Bohong Chen and Yumeng Li and Yinglin Xu and Youyi Zheng and Yanlin Weng and Kun Zhou},
  title = {EchoAvatar: Real-time Generative Avatar Animation from Audio Streams},
  year = {2026},
  isbn = {9798400725548},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3799902.3811066},
  doi = {10.1145/3799902.3811066},
  booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
  series = {SIGGRAPH Conference Papers '26}
}

Acknowledgments

Thanks to EMAGE, ZeroEGGS, MotoricaDanceDataset, motorica-retarget, zeroeggs-retarget, torchtune, ichigo, T2M-GPT, MoMask, MECo, verl, vLLM, and encodec. Our code partially borrows from them. Please check out these useful repos.
