A deep learning system for real-time voice cloning. This project provides an implementation of a neural voice cloning system that allows you to clone a voice from a few seconds of audio and generate speech in that voice.
This voice cloning system consists of three independent components:
- Encoder - A speaker encoder that generates a fixed-length embedding vector from a few seconds of speech
- Synthesizer - A sequence-to-sequence model that converts text to a mel spectrogram, conditioned on the speaker embedding
- Vocoder - A neural vocoder that converts mel spectrograms to waveforms
The project is based on the SV2TTS (Speaker Verification to Text-to-Speech) architecture, which enables few-shot voice cloning.
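The three stages above form a simple data-flow pipeline: waveform → speaker embedding → mel spectrogram → waveform. The sketch below illustrates that flow with stub functions; all names, the 256-dim embedding size, and the 80-bin mel count are illustrative assumptions, not this project's actual API.

```python
import numpy as np

EMBED_DIM = 256   # typical GE2E speaker-embedding size (assumed)
N_MELS = 80       # typical mel-spectrogram bin count (assumed)

def encode_speaker(wav: np.ndarray) -> np.ndarray:
    """Stub encoder: variable-length waveform -> fixed-length, L2-normalized embedding."""
    rng = np.random.default_rng(len(wav))      # deterministic stand-in for the real model
    e = rng.standard_normal(EMBED_DIM)
    return e / np.linalg.norm(e)

def synthesize_mel(text: str, embed: np.ndarray) -> np.ndarray:
    """Stub synthesizer: text + speaker embedding -> mel spectrogram (N_MELS x T)."""
    T = max(1, len(text)) * 5                  # ~5 frames per character, illustrative
    return np.zeros((N_MELS, T))

def vocode(mel: np.ndarray, hop_length: int = 200) -> np.ndarray:
    """Stub vocoder: mel spectrogram -> waveform, hop_length samples per frame."""
    return np.zeros(mel.shape[1] * hop_length)

# Full cloning pipeline: reference audio -> embedding -> mel -> waveform
reference = np.zeros(16000 * 5)                # 5 seconds of 16 kHz audio
embedding = encode_speaker(reference)
mel = synthesize_mel("Hello, this is a test.", embedding)
wav = vocode(mel)
```

The key property shown here is that the embedding has a fixed size regardless of the input length, which is what lets the synthesizer and vocoder be conditioned on any speaker.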
- Clone a voice from as little as 5 seconds of audio
- Synthesize speech in real-time
- User-friendly toolbox with GUI for quick experimentation
- Command-line interface for batch processing
- Pretrained models included
- Full training pipeline for custom models
- Python 3.6 or higher
- PyTorch 1.0 or higher
- CUDA (optional for GPU support)
Clone the repository:

```
git clone https://github.com/yourusername/Voice-Cloning.git
cd Voice-Cloning
```
Create a virtual environment (optional but recommended):

```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Install the required packages:

```
pip install -r requirements.txt
```
Download Pretrained Models

The pretrained models are too large for GitHub and must be downloaded separately. Please download them from the Google Drive link below:

After downloading, place the model files in the following directory structure:

```
saved_models/
└── default/
    ├── encoder.pt      (~16 MB)
    ├── synthesizer.pt  (~353 MB)
    └── vocoder.pt      (~51 MB)
```
The toolbox provides a graphical interface that allows you to:
- Record or load utterances to clone a voice
- Synthesize speech from text with the cloned voice
- Visualize speaker embeddings and spectrograms
To launch the toolbox:
```
python demo_toolbox.py
```

Optional arguments:
- `--cpu`: Use CPU for inference (default uses GPU if available)
- `--seed`: Set a random seed for deterministic results
- `--models_dir`: Path to the directory containing models (default: `saved_models`)
- `--datasets_root`: Path to datasets directory (default: `dataset`)
For batch processing or scripted usage, use the CLI:
```
python demo_cli.py --text "Hello, this is a test." --weights_path saved_models/default/
```
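For larger batches, the CLI can be driven from a short script. The snippet below only assembles the invocations, mirroring the flags in the command above; check `demo_cli.py --help` to confirm the exact flags it accepts before running.

```python
import shlex
import subprocess

LINES = [
    "Hello, this is a test.",
    "Voice cloning from a few seconds of audio.",
]

def build_command(text, weights="saved_models/default/"):
    """Assemble one demo_cli.py invocation (mirrors the example above)."""
    return ["python", "demo_cli.py", "--text", text, "--weights_path", weights]

commands = [build_command(t) for t in LINES]
# Uncomment to actually run each job from the repository root:
# for cmd in commands:
#     subprocess.run(cmd, check=True)
print("\n".join(shlex.join(cmd) for cmd in commands))
```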
Prepare datasets for the encoder:

```
python encoder_preprocess.py --datasets_root=<datasets_root> --datasets=<dataset1,dataset2,...>
```
Prepare datasets for the synthesizer:

```
python synthesizer_preprocess_audio.py --datasets_root=<datasets_root> --datasets=<dataset1,dataset2,...>
python synthesizer_preprocess_embeds.py --synthesizer_root=<synthesizer_output_dir> --encoder_model_fpath=<encoder_model.pt>
```
Prepare datasets for the vocoder:

```
python vocoder_preprocess.py --datasets_root=<datasets_root>
```
Train the encoder:

```
python encoder_train.py --run_id=<run_name> --clean_data_root=<encoder_dataset_root>
```
Train the synthesizer:

```
python synthesizer_train.py <run_id> <synthesizer_dataset_root> --models_dir=<models_dir>
```
Train the vocoder:

```
python vocoder_train.py <run_id> <vocoder_dataset_root> --models_dir=<models_dir>
```
The encoder is based on the GE2E (Generalized End-to-End) loss model, which maps variable-length speech utterances to fixed-length embeddings that capture speaker characteristics.
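The practical upshot of GE2E training is that embeddings of utterances from the same speaker land close together under cosine similarity, while different speakers stay far apart. A minimal numpy illustration of that comparison (the 256-dim size and the perturbation model are illustrative assumptions):

```python
import numpy as np

def normalize(v):
    """L2-normalize an embedding so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
speaker_a = normalize(rng.standard_normal(256))

# Same-speaker utterance: a small perturbation of the speaker's embedding
perturb = 0.3 * normalize(rng.standard_normal(256))
same_speaker = normalize(speaker_a + perturb)
# Different speaker: an independent random embedding
other_speaker = normalize(rng.standard_normal(256))

sim_same = float(speaker_a @ same_speaker)
sim_other = float(speaker_a @ other_speaker)
print(f"same-speaker similarity:  {sim_same:.3f}")
print(f"cross-speaker similarity: {sim_other:.3f}")
```

In high dimensions, independent unit vectors are nearly orthogonal, so the cross-speaker similarity stays near zero while the same-speaker similarity stays near one.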
The synthesizer is based on Tacotron 2, a sequence-to-sequence model with attention that generates mel spectrograms from text, conditioned on speaker embeddings.
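A common way to condition a sequence-to-sequence synthesizer on the speaker, as described in SV2TTS, is to concatenate the fixed-length embedding to every encoder output frame so the decoder sees speaker identity at each step. In numpy terms, with illustrative sizes (the 512/256 dimensions are assumptions, not this repo's configuration):

```python
import numpy as np

T = 40            # encoder timesteps, one per input token (illustrative)
ENC_DIM = 512     # Tacotron-style encoder output size (illustrative)
EMBED_DIM = 256   # speaker-embedding size (illustrative)

encoder_outputs = np.zeros((T, ENC_DIM))   # text encoding, one row per timestep
speaker_embed = np.ones(EMBED_DIM)         # single fixed-length speaker embedding

# Tile the one embedding across all timesteps and concatenate feature-wise,
# so attention and the decoder can use speaker identity at every step.
tiled = np.repeat(speaker_embed[None, :], T, axis=0)      # (T, EMBED_DIM)
conditioned = np.concatenate([encoder_outputs, tiled], axis=1)
print(conditioned.shape)   # each timestep now carries text + speaker features
```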
The vocoder is based on WaveRNN, which generates high-quality waveforms from mel spectrograms in real-time.
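WaveRNN generates audio one quantized sample at a time, and mu-law companding is a standard way such vocoders discretize waveforms into a small number of classes. A self-contained sketch of that quantization step (the 9-bit setting follows the original WaveRNN paper, not necessarily this repo's configuration):

```python
import numpy as np

def mulaw_encode(x, bits=9):
    """Compand [-1, 1] audio and quantize it to 2**bits discrete classes."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand to [-1, 1]
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)           # round to [0, mu]

def mulaw_decode(q, bits=9):
    """Invert the quantization back to a [-1, 1] waveform."""
    mu = 2 ** bits - 1
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

t = np.linspace(0, 1, 16000)
wav = 0.5 * np.sin(2 * np.pi * 440 * t)    # 440 Hz test tone, 1 s at 16 kHz
codes = mulaw_encode(wav)
recon = mulaw_decode(codes)
print("max reconstruction error:", float(np.max(np.abs(recon - wav))))
```

The companding step allocates more quantization levels to quiet samples, which is why 9 bits suffice where linear quantization would audibly hiss.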
Due to GitHub file size limitations, the pretrained models are hosted separately rather than included in this repository. Please see the Installation section for download instructions.
Model specifications:
- Encoder trained on LibriSpeech and VoxCeleb datasets
- Synthesizer trained on the LibriSpeech dataset
- WaveRNN vocoder trained on the LibriSpeech dataset
The audio files in the samples folder are provided for toolbox testing and benchmarking purposes. These are the same reference utterances used by the SV2TTS authors to generate the audio samples.
The p240_00000.mp3 and p260_00000.mp3 files are compressed versions of audio from the VCTK corpus.
The 1320_00000.mp3, 3575_00000.mp3, 6829_00000.mp3 and 8230_00000.mp3 files are compressed versions of audio from the LibriSpeech dataset.
This project is licensed under the MIT License - see the LICENSE file for details.
- This project is based on the SV2TTS framework
- The implementation is inspired by research on speaker verification (GE2E), Tacotron 2, and WaveRNN