VALL-E X: Multilingual Text-to-Speech Synthesis and Voice Cloning 🔊

An open-source implementation of Microsoft's VALL-E X zero-shot voice synthesis model.
Pre-trained models are now publicly available for research and application use.

VALL-E X is a powerful and innovative multilingual text-to-speech (TTS) model originally proposed by Microsoft. Microsoft described the approach in a research paper but did not release any code or pre-trained models. Recognizing the potential and value of this technology, we reproduced the method and trained an open-source VALL-E X model. We're excited to share our pre-trained model with the community so everyone can experience next-generation TTS. 🎧
For more details, please check the model card.

🚀 Updates

2023.09.10

Support for AR decoder batch decoding for more stable generation results

2023.08.30

Replaced EnCodec decoder with Vocos decoder for improved audio quality (thanks @v0xie)

2023.08.23

Added long text generation functionality

2023.08.20

Added Chinese README

2023.08.14

Pre-trained model weights released, download from here

💻 Installation

Install with pip. Requires Python 3.10, CUDA 11.7 ~ 12.0, and PyTorch 2.0+.

git clone https://github.com/naidu1212/SayOnce.git
cd SayOnce
pip install -r requirements.txt

Note: To create prompts, you must also install ffmpeg and add its folder to the PATH environment variable.
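To confirm that ffmpeg is reachable before creating prompts, you can look it up on PATH from Python. This is a minimal sketch using only the standard library; the helper name `find_tool` is illustrative and not part of this repository.

```python
import shutil


def find_tool(name="ffmpeg"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool("ffmpeg")
    if path is None:
        print("ffmpeg was not found on PATH; install it and add its folder to PATH.")
    else:
        print(f"ffmpeg found at: {path}")
```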

📦 Model Files (Auto-Downloaded)

The model files are NOT included in this repository due to their large size (~1.5 GB total). They will be automatically downloaded when you first run the application.

What gets downloaded:

VALL-E X Checkpoint (~300 MB) - Main voice cloning model
- Downloads to: ./checkpoints/vallex-checkpoint.pt
- Source: https://huggingface.co/Plachta/VALL-E-X
Whisper Model (~1.45 GB) - Speech recognition for transcription
- Downloads to: ./whisper/medium.pt
- Source: OpenAI Whisper

The first run takes a few minutes while these models download. Subsequent runs reuse the downloaded files and skip this step.

Manual Download (Optional)

If automatic download fails, you can manually download the models:

(Please note the case sensitivity of directories and folders)

1. Check whether a checkpoints folder exists in the installation directory (./checkpoints/). If not, create it.
2. Check whether the checkpoints folder contains vallex-checkpoint.pt. If not, manually download vallex-checkpoint.pt from here and place it in the checkpoints folder.
3. Check whether a whisper folder exists in the installation directory (./whisper/). If not, create it.
4. Check whether the whisper folder contains medium.pt. If not, manually download medium.pt from here and place it in the whisper folder.
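The checks above can also be scripted. The sketch below creates the two folders if they are absent and reports which model files still need to be downloaded by hand. The paths are the ones listed in this README; the function name `missing_model_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Paths as listed in this README, relative to the installation directory.
EXPECTED_FILES = [
    Path("checkpoints") / "vallex-checkpoint.pt",
    Path("whisper") / "medium.pt",
]


def missing_model_files(install_dir="."):
    """Create the model folders if needed and return the files still missing."""
    root = Path(install_dir)
    missing = []
    for rel_path in EXPECTED_FILES:
        target = root / rel_path
        # Create ./checkpoints/ or ./whisper/ when it does not exist yet.
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.is_file():
            missing.append(target)
    return missing
```

Calling `missing_model_files(".")` from the installation directory returns an empty list once both files are in place.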

🎧 Online Demo

If you don't want to install locally, you can experience VALL-E X functionality online by clicking any of the links below.

📢 Features

VALL-E X is equipped with a series of cutting-edge features:

Multilingual TTS: Natural, expressive speech synthesis in three languages - English, Chinese, and Japanese.
Zero-shot Voice Cloning: With just a 3-10 second recording of any speaker, VALL-E X can generate personalized, high-quality speech that closely replicates their voice.

View Example

prompt.webm

output.webm

Speech Emotion Control: VALL-E X can synthesize speech with the same emotion as a given speaker recording, adding more expressiveness to the audio.

View Example

sleepy-prompt.mp4

sleepy-output.mp4

Zero-shot Cross-lingual Speech Synthesis: VALL-E X can synthesize speech in a different language than the speaker's native language, while maintaining the speaker's timbre and emotion without affecting accent and fluency. Here's an example using a Japanese native speaker for English and Chinese synthesis: 🇯🇵 🗣

View Example

jp-prompt.webm

en-output.webm

zh-output.webm

Accent Control: VALL-E X allows you to control the accent of synthesized audio, such as speaking Chinese with an English accent or vice versa. 🇨🇳 💬

View Example

en-prompt.webm

zh-accent-output.webm

en-accent-output.webm

Acoustic Environment Preservation: When a speaker's recording is made in different acoustic environments, VALL-E X can preserve that acoustic environment, making the synthesized speech sound more natural.

View Example

noise-prompt.webm

noise-output.webm

You can visit our demo page to browse more examples!

💻 Usage in Python

🪑 Basic Usage

from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
text_prompt = """
Hello, my name is Nose. And uh, and I like hamburger. Hahaha... But I also have other interests such as playing tactic toast.
"""
audio_array = generate_audio(text_prompt)

# save audio to disk
write_wav("vallex_generation.wav", SAMPLE_RATE, audio_array)

# play audio in notebook
Audio(audio_array, rate=SAMPLE_RATE)

hamburger.webm

🌎 Multilingual

This VALL-E X implementation supports three languages: English, Chinese, and Japanese. You can specify the language by setting the `language` parameter. By default, the model will automatically detect the language.

text_prompt = """
    チュソクは私のお気に入りの祭りです。 私は数日間休んで、友人や家族との時間を過ごすことができます。
"""
audio_array = generate_audio(text_prompt)

vallex_japanese.webm

Note: VALL-E X can control accents even when multiple languages are mixed in a single sentence, but you must manually tag the language of each sentence so our G2P tool can recognize it.

text_prompt = """
    [EN]The Thirty Years' War was a devastating conflict that had a profound impact on Europe.[EN]
    [ZH]这是历史的开始。 如果您想听更多,请继续。[ZH]
"""
audio_array = generate_audio(text_prompt, language='mix')

vallex_codeswitch.webm
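When the mixed-language text is assembled programmatically, the tags can be added by a small helper. The sketch below wraps each sentence in the `[EN]...[EN]` / `[ZH]...[ZH]` form shown above. The helper `tag_mixed_text` is hypothetical, and the `[JA]` tag is assumed by analogy with the two tags that appear in the example.

```python
def tag_mixed_text(segments):
    """Wrap each (language_code, sentence) pair in matching language tags."""
    # [EN] and [ZH] appear in the README example; [JA] is an assumption.
    supported = {"en": "EN", "zh": "ZH", "ja": "JA"}
    lines = []
    for lang, sentence in segments:
        if lang not in supported:
            raise ValueError(f"unsupported language code: {lang!r}")
        tag = f"[{supported[lang]}]"
        lines.append(f"{tag}{sentence.strip()}{tag}")
    return "\n".join(lines)
```

The returned string can then be passed as `text_prompt` to `generate_audio(text_prompt, language='mix')`.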

📼 Preset Voices

We provide dozens of speaker voices ready to use with VALL-E X! Browse all available voices here.

VALL-E X attempts to match the tone, pitch, emotion, and prosody of the given preset voice. The model also attempts to preserve music, ambient noise, etc.

text_prompt = """
I am an innocent boy with a smoky voice. It is a great honor for me to speak at the United Nations today.
"""
audio_array = generate_audio(text_prompt, prompt="dingzhen")

smoky.webm

🎙 Voice Cloning

VALL-E X supports voice cloning! You can use anyone's voice, a character, or even your own voice to create an audio prompt. When you use that audio prompt, VALL-E X will synthesize text using a similar voice.
You need to provide a 3-10 second speech clip and the corresponding text for that speech to create an audio prompt. You can also leave the text blank and let the Whisper model generate the text for you.

VALL-E X attempts to match the tone, pitch, emotion, and prosody of the given audio prompt. The model also attempts to preserve music, ambient noise, etc.

from utils.prompt_making import make_prompt

# Use given transcript
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav",
            transcript="Just, what was that? Paimon thought we were gonna get eaten.")

# Alternatively, omit the transcript and let Whisper transcribe the clip
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav")
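Because the prompt clip should be 3-10 seconds long, it can help to check its duration before calling `make_prompt`. This is a minimal sketch using the standard-library `wave` module, which only reads uncompressed PCM WAV files. Both helper names are illustrative and not part of the repository.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_prompt_length(path, min_s=3.0, max_s=10.0):
    """Check that a clip falls in the 3-10 second range suggested above."""
    return min_s <= wav_duration_seconds(path) <= max_s
```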

Let's try the audio prompt we just made!

from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# download and load all models
preload_models()

text_prompt = """
Hey, Traveler, Listen to this, This machine has taken my voice, and now it can talk just like me!
"""
audio_array = generate_audio(text_prompt, prompt="paimon")

write_wav("paimon_cloned.wav", SAMPLE_RATE, audio_array)

paimon_prompt.webm

paimon_cloned.webm

🎢 User Interface

If you're not comfortable with code, we've also created a user-friendly graphical interface for VALL-E X. It allows you to easily interact with the model, making voice cloning and multilingual speech synthesis a breeze.
Use the following command to launch the user interface:

python -X utf8 launch-ui.py

🛠️ Hardware Requirements and Inference Speed

VALL-E X can run on CPU or GPU (PyTorch 2.0+, CUDA 11.7 ~ CUDA 12.0).

If running on GPU, you need at least 6GB of VRAM.
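To see which device is available and how much VRAM it reports, you can query PyTorch directly. The sketch below uses `torch.cuda.is_available()` and `torch.cuda.get_device_properties()`; the function name `describe_device` is illustrative and not part of the repository. It falls back to CPU when PyTorch is not installed or no CUDA device is visible.

```python
def describe_device():
    """Return ("cuda", vram_in_GiB) for the first GPU, or ("cpu", None)."""
    try:
        import torch
    except ImportError:
        return "cpu", None  # PyTorch is not installed, so nothing to query
    if not torch.cuda.is_available():
        return "cpu", None
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return "cuda", total_bytes / 1024**3


if __name__ == "__main__":
    device, vram_gib = describe_device()
    if device == "cuda":
        print(f"GPU detected with about {vram_gib:.1f} GiB of VRAM")
    else:
        print("No CUDA GPU detected; inference would run on CPU")
```

The reported figure may be slightly below the card's nominal size, so treat the comparison with the 6GB guideline as approximate.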

⚙️ Details

VALL-E X is similar to Bark, VALL-E, and AudioLM: it uses GPT-style models to predict quantized audio tokens autoregressively, and the tokens are then decoded into a waveform (originally by EnCodec; since the 2023.08.30 update, by Vocos).
Compared to Bark:

✔ Lightweight: 3️⃣ ✖ smaller,
✔ Fast: 4️⃣ ✖ faster,
✔ Higher quality for Chinese & Japanese
✔ No foreign accent in cross-lingual synthesis
✔ Open and easy-to-use voice cloning
❌ Fewer supported languages
❌ No tokens for synthesizing music and special sound effects

Supported Languages

| Language | Status |
| --- | --- |
| English (en) | ✅ |
| Japanese (ja) | ✅ |
| Chinese (zh) | ✅ |

❓ FAQ

Where can I download the checkpoint?

When you run the program for the first time, we use wget to download the model to the ./checkpoints/ directory.
If the download fails on the first run, please manually download the model from here and place the file in ./checkpoints/.

How much VRAM is needed?

At least 6GB of GPU VRAM, which most recent NVIDIA GPUs provide.

Can the model generate long text?

Yes! The model supports long text generation through automatic sentence chunking.

How it works:

Long text is automatically split into sentences
Each sentence is generated separately (keeping each under ~22 seconds)
The audio clips are seamlessly concatenated together
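The three steps above can be sketched as follows. This is a simplified illustration, not the repository's implementation: the regex splitter is a naive stand-in for a real sentence tokenizer, and `synthesize` is a hypothetical callable standing in for a per-sentence call such as `generate_audio`.

```python
import re

# Split after sentence-ending punctuation (Latin or CJK), but not inside a
# run of punctuation such as "...".
_SENTENCE_END = re.compile(r"(?<=[.!?。!?])(?![.!?。!?])\s*")


def split_sentences(text):
    """Split text into sentences; a naive stand-in for a real tokenizer."""
    return [part for part in _SENTENCE_END.split(text.strip()) if part]


def generate_long(text, synthesize):
    """Synthesize each sentence separately and join the samples in order."""
    samples = []
    for sentence in split_sentences(text):
        samples.extend(synthesize(sentence))  # one short clip per sentence
    return samples
```

With NumPy arrays, the clips would typically be collected in a list and joined with `np.concatenate` instead of `list.extend`.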

Technical limitation:

The Transformer model has a 22-second processing limit per chunk due to computational complexity
Your voice prompt (3-10 seconds) + each generated sentence should stay under 22 seconds
For very long paragraphs, the system handles this automatically
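The budget described above is simple arithmetic: the prompt length is subtracted from the 22-second limit, and the remainder is what each generated sentence can use. The sketch below works this out under that reading of the limit; the function names are illustrative and not part of the repository.

```python
import math

CHUNK_LIMIT_S = 22.0  # per-chunk limit stated above


def sentence_budget(prompt_s, limit_s=CHUNK_LIMIT_S):
    """Seconds left for generated speech once the prompt is counted."""
    if not 0 <= prompt_s < limit_s:
        raise ValueError("prompt must be shorter than the chunk limit")
    return limit_s - prompt_s


def min_chunks(total_speech_s, prompt_s, limit_s=CHUNK_LIMIT_S):
    """Smallest number of chunks that could cover the target duration."""
    return math.ceil(total_speech_s / sentence_budget(prompt_s, limit_s))
```

For example, with a 3-second prompt each sentence has up to 19 seconds, so a 300-second speech needs at least 16 chunks.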

In practice:

✅ You can generate paragraphs, articles, or long scripts
✅ The UI automatically handles the chunking for you
✅ Just paste your text and click generate!

Example: A 5-minute speech will be split into ~15-20 sentences and generated sequentially.

🧠 TODO

🙏 Acknowledgments

VALL-E X paper for the brilliant idea
lifeiteng's vall-e for related training code
bark for the amazing pioneering work on neural codec TTS models

⭐️ Show Your Support

If you find VALL-E X interesting and useful, please give us a star on GitHub! ⭐️ It encourages us to continuously improve the model and add exciting features.

📜 License

VALL-E X uses the MIT License.

Have questions or need help? Feel free to open an issue or join our Discord.

Happy voice cloning! 🎤
