An open-source implementation of Microsoft's VALL-E X zero-shot voice synthesis model.
Pre-trained models are now publicly available for research and application use.

VALL-E X is a powerful and innovative multilingual text-to-speech (TTS) model originally proposed by Microsoft. Microsoft described the approach in a research paper but did not release any code or pre-trained models. Recognizing the potential and value of this technology, we reproduced the method and trained an open-source VALL-E X model. We're excited to share our pre-trained model with the community so everyone can experience next-generation TTS. 🎧
For more details, please check the model card.
2023.09.10
- Support for AR decoder batch decoding for more stable generation results
2023.08.30
- Replaced EnCodec decoder with Vocos decoder for improved audio quality (thanks @v0xie)
2023.08.23
- Added long text generation functionality
2023.08.20
- Added Chinese README
2023.08.14
- Pre-trained model weights released, download from here
git clone https://github.com/naidu1212/SayOnce.git
cd SayOnce
pip install -r requirements.txt
Note: If you need to create prompts, you must install ffmpeg and add its folder to the PATH environment variable
The model files are NOT included in this repository due to their large size (~1.5 GB total). They will be automatically downloaded when you first run the application.
What gets downloaded:
- VALL-E X Checkpoint (~300 MB) - Main voice cloning model
  - Downloads to: `./checkpoints/vallex-checkpoint.pt`
  - Source: https://huggingface.co/Plachta/VALL-E-X
- Whisper Model (~1.45 GB) - Speech recognition for transcription
  - Downloads to: `./whisper/medium.pt`
  - Source: OpenAI Whisper
The first run takes a few minutes while these models download. Subsequent runs reuse the downloaded files and skip this step.
If automatic download fails, you can manually download the models:
(Please note the case sensitivity of directories and folders)
- Check if a `checkpoints` folder exists in the installation directory. If not, manually create a `checkpoints` folder in the installation directory (`./checkpoints/`).
- Check if there is a `vallex-checkpoint.pt` file in the `checkpoints` folder. If not, please manually download the `vallex-checkpoint.pt` file from here and place it in the `checkpoints` folder.
- Check if a `whisper` folder exists in the installation directory. If not, manually create a `whisper` folder in the installation directory (`./whisper/`).
- Check if there is a `medium.pt` file in the `whisper` folder. If not, please manually download the `medium.pt` file from here and place it in the `whisper` folder.
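Before downloading anything by hand, it can help to see which files are actually missing. The following is a minimal sketch and not part of the repository: the helper name `missing_model_files` is hypothetical, and the paths are the ones listed above, taken relative to the installation directory.

```python
from pathlib import Path

# Expected model locations, taken from the list above (paths are case-sensitive).
EXPECTED_FILES = (
    Path("checkpoints") / "vallex-checkpoint.pt",
    Path("whisper") / "medium.pt",
)


def missing_model_files(install_dir="."):
    """Return the expected model files that are not present under install_dir.

    Hypothetical helper for illustration; it only inspects the filesystem.
    """
    root = Path(install_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_model_files():
        print(f"Missing: {rel} (create the folder if needed, then download the file)")
```

Run it from the installation directory; an empty result means both files are in place.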
If you don't want to install locally, you can experience VALL-E X functionality online by clicking any of the links below.
VALL-E X is equipped with a series of cutting-edge features:
- Multilingual TTS: Natural, expressive speech synthesis in three languages - English, Chinese, and Japanese.
- Zero-shot Voice Cloning: With just a 3-10 second recording of any speaker, VALL-E X can generate personalized, high-quality speech that closely replicates their voice.
- Speech Emotion Control: VALL-E X can synthesize speech with the same emotion as a given speaker recording, adding more expressiveness to the audio.
- Zero-shot Cross-lingual Speech Synthesis: VALL-E X can synthesize speech in a different language than the speaker's native language, while maintaining the speaker's timbre and emotion without affecting accent and fluency. Here's an example using a Japanese native speaker for English and Chinese synthesis: 🇯🇵 🗣
- Accent Control: VALL-E X allows you to control the accent of synthesized audio, such as speaking Chinese with an English accent or vice versa. 🇨🇳 💬
- Acoustic Environment Preservation: When a speaker's recording is made in different acoustic environments, VALL-E X can preserve that acoustic environment, making the synthesized speech sound more natural.
You can visit our demo page to browse more examples!
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio
# download and load all models
preload_models()
# generate audio from text
text_prompt = """
Hello, my name is Nose. And uh, and I like hamburger. Hahaha... But I also have other interests such as playing tactic toast.
"""
audio_array = generate_audio(text_prompt)
# save audio to disk
write_wav("vallex_generation.wav", SAMPLE_RATE, audio_array)
# play audio in notebook
Audio(audio_array, rate=SAMPLE_RATE)
This VALL-E X implementation supports three languages: English, Chinese, and Japanese. You can specify the language by setting the `language` parameter. By default, the model will automatically detect the language.
text_prompt = """
チュソクは私のお気に入りの祭りです。 私は数日間休んで、友人や家族との時間を過ごすことができます。
"""
audio_array = generate_audio(text_prompt)
Note: VALL-E X can control accents even when multiple languages are mixed in a single prompt, but you need to manually tag the language of each sentence (for example `[EN]...[EN]`) and pass `language='mix'` so our G2P tool can recognize them.
text_prompt = """
[EN]The Thirty Years' War was a devastating conflict that had a profound impact on Europe.[EN]
[ZH]这是历史的开始。 如果您想听更多,请继续。[ZH]
"""
audio_array = generate_audio(text_prompt, language='mix')
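Tagging by hand gets tedious for longer mixed-language prompts. Here is a minimal sketch of a helper that wraps each sentence in the tag format shown above. `tag_sentences` is a hypothetical name and not part of the repository; the `[EN]` and `[ZH]` tags come from the example above, while `[JA]` is assumed by analogy.

```python
# Tags in the form used above; "[JA]" is an assumption by analogy with "[EN]" and "[ZH]".
LANGUAGE_TAGS = {"en": "[EN]", "zh": "[ZH]", "ja": "[JA]"}


def tag_sentences(sentences):
    """Wrap (language, sentence) pairs in matching open/close language tags.

    Hypothetical helper: returns one tagged sentence per line.
    """
    lines = []
    for language, sentence in sentences:
        tag = LANGUAGE_TAGS[language.lower()]  # raises KeyError for unknown languages
        lines.append(f"{tag}{sentence.strip()}{tag}")
    return "\n".join(lines)


text_prompt = tag_sentences([
    ("en", "The Thirty Years' War was a devastating conflict."),
    ("zh", "这是历史的开始。"),
])
# The result could then be passed as generate_audio(text_prompt, language='mix').
```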
We provide dozens of speaker voices ready to use with VALL-E X! Browse all available voices here.
VALL-E X attempts to match the tone, pitch, emotion, and prosody of the given preset voice. The model also attempts to preserve music, ambient noise, etc.
text_prompt = """
I am an innocent boy with a smoky voice. It is a great honor for me to speak at the United Nations today.
"""
audio_array = generate_audio(text_prompt, prompt="dingzhen")
VALL-E X supports voice cloning! You can use anyone's voice, a character, or even your own voice to create an audio prompt. When you use that audio prompt, VALL-E X will synthesize text using a similar voice.
You need to provide a 3-10 second speech clip and the corresponding text for that speech to create an audio prompt. You can also leave the text blank and let the Whisper model generate the text for you.
VALL-E X attempts to match the tone, pitch, emotion, and prosody of the given audio prompt. The model also attempts to preserve music, ambient noise, etc.
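Since the prompt clip is expected to be 3-10 seconds long, a quick length check before creating a prompt can save a failed attempt. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files; other formats would need converting first. The helper names are hypothetical and not part of the repository.

```python
import wave

MIN_PROMPT_SECONDS = 3.0   # lower bound stated above
MAX_PROMPT_SECONDS = 10.0  # upper bound stated above


def wav_duration_seconds(path):
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_prompt_length(path):
    """Return True if the clip length falls within the 3-10 second range."""
    return MIN_PROMPT_SECONDS <= wav_duration_seconds(path) <= MAX_PROMPT_SECONDS
```

For example, `is_valid_prompt_length("paimon_prompt.wav")` could be checked before the prompt is created.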
from utils.prompt_making import make_prompt
# Use the given transcript
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav",
            transcript="Just, what was that? Paimon thought we were gonna get eaten.")
# Alternatively, let Whisper generate the transcript
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav")
Let's try the audio prompt we just made!
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
# download and load all models
preload_models()
text_prompt = """
Hey, Traveler, Listen to this, This machine has taken my voice, and now it can talk just like me!
"""
audio_array = generate_audio(text_prompt, prompt="paimon")
write_wav("paimon_cloned.wav", SAMPLE_RATE, audio_array)
If you're not comfortable with code, we've also created a user-friendly graphical interface for VALL-E X. It allows you to easily interact with the model, making voice cloning and multilingual speech synthesis a breeze.
Use the following command to launch the user interface:
python -X utf8 launch-ui.py
VALL-E X can run on CPU or GPU (PyTorch 2.0+, CUDA 11.7 to CUDA 12.0).
If running on GPU, you need at least 6GB of VRAM.
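VALL-E X is similar to Bark, VALL-E and AudioLM: it uses GPT-style models to predict quantized audio tokens autoregressively, and the tokens are then decoded to audio (originally by the EnCodec decoder, replaced by the Vocos decoder in the 2023.08.30 update).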
Compared to Bark:
- ✔ Lightweight: 3️⃣ ✖ smaller,
- ✔ Fast: 4️⃣ ✖ faster,
- ✔ Higher quality for Chinese & Japanese
- ✔ No foreign accent in cross-lingual synthesis
- ✔ Open and easy-to-use voice cloning
- ❌ Fewer supported languages
- ❌ No tokens for synthesizing music and special sound effects
| Language | Status |
|---|---|
| English (en) | ✅ |
| Japanese (ja) | ✅ |
| Chinese (zh) | ✅ |
- When you run the program for the first time, we use `wget` to download the model to the `./checkpoints/` directory.
- If the download fails on the first run, please manually download the model from here and place the file in `./checkpoints/`.
- 6GB VRAM (GPU VRAM) - almost all NVIDIA GPUs meet the requirements.
Yes! The model supports long text generation through automatic sentence chunking.
How it works:
- Long text is automatically split into sentences
- Each sentence is generated separately (keeping each under ~22 seconds)
- The audio clips are seamlessly concatenated together
Technical limitation:
- The Transformer model has a 22-second processing limit per chunk due to computational complexity
- Your voice prompt (3-10 seconds) + each generated sentence should stay under 22 seconds
- For very long paragraphs, the system handles this automatically
In practice:
- ✅ You can generate paragraphs, articles, or long scripts
- ✅ The UI automatically handles the chunking for you
- ✅ Just paste your text and click generate!
Example: A 5-minute speech will be split into ~15-20 sentences and generated sequentially.
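The chunking idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's actual implementation: the function names are hypothetical, the sentence splitter is a naive punctuation rule, and `synthesize_sentence` stands in for whatever call produces audio samples for one sentence.

```python
import re

MAX_CHUNK_SECONDS = 22.0  # per-chunk limit stated above


def generation_budget_seconds(prompt_seconds):
    """Seconds left for generated speech once the voice prompt is counted."""
    return max(0.0, MAX_CHUNK_SECONDS - prompt_seconds)


def split_into_sentences(text):
    """Split after Latin punctuation followed by whitespace, or after CJK punctuation."""
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。!?])", text.strip())
    return [part.strip() for part in parts if part.strip()]


def synthesize_long_text(text, synthesize_sentence):
    """Generate each sentence separately and concatenate the samples in order.

    synthesize_sentence is any callable returning a list of samples for one sentence.
    """
    samples = []
    for sentence in split_into_sentences(text):
        samples.extend(synthesize_sentence(sentence))
    return samples
```

With a 10-second voice prompt, `generation_budget_seconds(10)` leaves 12 seconds per generated sentence, which is why shorter prompts give each sentence more room.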
- Add Chinese README
- Long text generation
- Replace Encodec decoder with Vocos decoder
- Fine-tuning for better voice adaptation
- `.bat` scripts for non-Python users
- More...
- VALL-E X paper for the brilliant idea
- lifeiteng's vall-e for related training code
- bark for the amazing pioneering work in neural codec TTS models
If you find VALL-E X interesting and useful, please give us a star on GitHub! ⭐️ It encourages us to continuously improve the model and add exciting features.
VALL-E X uses the MIT License.
Have questions or need help? Feel free to open an issue or join our Discord.
Happy voice cloning! 🎤