A deep learning system for real-time voice cloning. This project provides an implementation of a neural voice cloning system that allows you to clone a voice from a few seconds of audio and generate speech in that voice.
This voice cloning system consists of three independent components:
- Encoder - A speaker encoder that generates a fixed-length embedding vector from a few seconds of speech
- Synthesizer - A sequence-to-sequence model that converts text to a mel spectrogram, conditioned on the speaker embedding
- Vocoder - A neural vocoder that converts mel spectrograms to waveforms
The project is based on the SV2TTS (Speaker Verification to Text-to-Speech) architecture, which enables few-shot voice cloning.
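The three stages above form a simple data-flow pipeline: waveform → speaker embedding → mel spectrogram → waveform. The sketch below illustrates that flow with stub functions; all names, the 256-dim embedding size, and the 80-bin mel count are illustrative assumptions, not this project's actual API.

```python
import numpy as np

EMBED_DIM = 256   # typical GE2E speaker-embedding size (assumed)
N_MELS = 80       # typical mel-spectrogram bin count (assumed)

def encode_speaker(wav: np.ndarray) -> np.ndarray:
    """Stub encoder: variable-length waveform -> fixed-length, L2-normalized embedding."""
    rng = np.random.default_rng(len(wav))      # deterministic stand-in for the real model
    e = rng.standard_normal(EMBED_DIM)
    return e / np.linalg.norm(e)

def synthesize_mel(text: str, embed: np.ndarray) -> np.ndarray:
    """Stub synthesizer: text + speaker embedding -> mel spectrogram (N_MELS x T)."""
    T = max(1, len(text)) * 5                  # ~5 frames per character, illustrative
    return np.zeros((N_MELS, T))

def vocode(mel: np.ndarray, hop_length: int = 200) -> np.ndarray:
    """Stub vocoder: mel spectrogram -> waveform, hop_length samples per frame."""
    return np.zeros(mel.shape[1] * hop_length)

# Full cloning pipeline: reference audio -> embedding -> mel -> waveform
reference = np.zeros(16000 * 5)                # 5 seconds of 16 kHz audio
embedding = encode_speaker(reference)
mel = synthesize_mel("Hello, this is a test.", embedding)
wav = vocode(mel)
```

The key property shown here is that the embedding has a fixed size regardless of the input length, which is what lets the synthesizer and vocoder be conditioned on any speaker.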
- Clone a voice from as little as 5 seconds of audio
- Synthesize speech in real-time
- User-friendly toolbox with GUI for quick experimentation
- Command-line interface for batch processing
- Pretrained models included
- Full training pipeline for custom models
- Python 3.6 or higher
- PyTorch 1.0 or higher
- CUDA (optional for GPU support)
Clone the repository:

```
git clone https://github.com/yourusername/Voice-Cloning.git
cd Voice-Cloning
```
Create a virtual environment (optional but recommended):

```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Install the required packages:

```
pip install -r requirements.txt
```
Download Pretrained Models

The pretrained models are too large for GitHub and must be downloaded separately. Please download them from the Google Drive link below:

After downloading, place the model files in the following directory structure:

```
saved_models/
└── default/
    ├── encoder.pt      (~16 MB)
    ├── synthesizer.pt  (~353 MB)
    └── vocoder.pt      (~51 MB)
```
The toolbox provides a graphical interface that allows you to:
- Record or load utterances to clone a voice
- Synthesize speech from text with the cloned voice
- Visualize speaker embeddings and spectrograms
To launch the toolbox:
```
python demo_toolbox.py
```

Optional arguments:
- `--cpu`: Use CPU for inference (default uses GPU if available)
- `--seed`: Set a random seed for deterministic results
- `--models_dir`: Path to the directory containing models (default: `saved_models`)
- `--datasets_root`: Path to datasets directory (default: `dataset`)
For batch processing or scripted usage, use the CLI:
```
python demo_cli.py --text "Hello, this is a test." --weights_path saved_models/default/
```
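For larger batches, the CLI can be driven from a short script. The snippet below only assembles the invocations, mirroring the flags in the command above; check `demo_cli.py --help` to confirm the exact flags it accepts before running.

```python
import shlex
import subprocess

LINES = [
    "Hello, this is a test.",
    "Voice cloning from a few seconds of audio.",
]

def build_command(text, weights="saved_models/default/"):
    """Assemble one demo_cli.py invocation (mirrors the example above)."""
    return ["python", "demo_cli.py", "--text", text, "--weights_path", weights]

commands = [build_command(t) for t in LINES]
# Uncomment to actually run each job from the repository root:
# for cmd in commands:
#     subprocess.run(cmd, check=True)
print("\n".join(shlex.join(cmd) for cmd in commands))
```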
Prepare datasets for the encoder:

```
python encoder_preprocess.py --datasets_root=<datasets_root> --datasets=<dataset1,dataset2,...>
```
Prepare datasets for the synthesizer:

```
python synthesizer_preprocess_audio.py --datasets_root=<datasets_root> --datasets=<dataset1,dataset2,...>
python synthesizer_preprocess_embeds.py --synthesizer_root=<synthesizer_output_dir> --encoder_model_fpath=<encoder_model.pt>
```
Prepare datasets for the vocoder:

```
python vocoder_preprocess.py --datasets_root=<datasets_root>
```
Train the encoder:

```
python encoder_train.py --run_id=<run_name> --clean_data_root=<encoder_dataset_root>
```
Train the synthesizer:

```
python synthesizer_train.py <run_id> <synthesizer_dataset_root> --models_dir=<models_dir>
```
Train the vocoder:

```
python vocoder_train.py <run_id> <vocoder_dataset_root> --models_dir=<models_dir>
```
The encoder is based on the GE2E (Generalized End-to-End) loss model, which maps variable-length speech utterances to fixed-length embeddings that capture speaker characteristics.
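The practical upshot of GE2E training is that embeddings of utterances from the same speaker land close together under cosine similarity, while different speakers stay far apart. A minimal numpy illustration of that comparison (the 256-dim size and the perturbation model are illustrative assumptions):

```python
import numpy as np

def normalize(v):
    """L2-normalize an embedding so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
speaker_a = normalize(rng.standard_normal(256))

# Same-speaker utterance: a small perturbation of the speaker's embedding
perturb = 0.3 * normalize(rng.standard_normal(256))
same_speaker = normalize(speaker_a + perturb)
# Different speaker: an independent random embedding
other_speaker = normalize(rng.standard_normal(256))

sim_same = float(speaker_a @ same_speaker)
sim_other = float(speaker_a @ other_speaker)
print(f"same-speaker similarity:  {sim_same:.3f}")
print(f"cross-speaker similarity: {sim_other:.3f}")
```

In high dimensions, independent unit vectors are nearly orthogonal, so the cross-speaker similarity stays near zero while the same-speaker similarity stays near one.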
The synthesizer is based on Tacotron 2, a sequence-to-sequence model with attention that generates mel spectrograms from text, conditioned on speaker embeddings.
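A common way to condition a sequence-to-sequence synthesizer on the speaker, as described in SV2TTS, is to concatenate the fixed-length embedding to every encoder output frame so the decoder sees speaker identity at each step. In numpy terms, with illustrative sizes (the 512/256 dimensions are assumptions, not this repo's configuration):

```python
import numpy as np

T = 40            # encoder timesteps, one per input token (illustrative)
ENC_DIM = 512     # Tacotron-style encoder output size (illustrative)
EMBED_DIM = 256   # speaker-embedding size (illustrative)

encoder_outputs = np.zeros((T, ENC_DIM))   # text encoding, one row per timestep
speaker_embed = np.ones(EMBED_DIM)         # single fixed-length speaker embedding

# Tile the one embedding across all timesteps and concatenate feature-wise,
# so attention and the decoder can use speaker identity at every step.
tiled = np.repeat(speaker_embed[None, :], T, axis=0)      # (T, EMBED_DIM)
conditioned = np.concatenate([encoder_outputs, tiled], axis=1)
print(conditioned.shape)   # each timestep now carries text + speaker features
```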
The vocoder is based on WaveRNN, which generates high-quality waveforms from mel spectrograms in real-time.
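WaveRNN generates audio one quantized sample at a time, and mu-law companding is a standard way such vocoders discretize waveforms into a small number of classes. A self-contained sketch of that quantization step (the 9-bit setting follows the original WaveRNN paper, not necessarily this repo's configuration):

```python
import numpy as np

def mulaw_encode(x, bits=9):
    """Compand [-1, 1] audio and quantize it to 2**bits discrete classes."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand to [-1, 1]
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)           # round to [0, mu]

def mulaw_decode(q, bits=9):
    """Invert the quantization back to a [-1, 1] waveform."""
    mu = 2 ** bits - 1
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

t = np.linspace(0, 1, 16000)
wav = 0.5 * np.sin(2 * np.pi * 440 * t)    # 440 Hz test tone, 1 s at 16 kHz
codes = mulaw_encode(wav)
recon = mulaw_decode(codes)
print("max reconstruction error:", float(np.max(np.abs(recon - wav))))
```

The companding step allocates more quantization levels to quiet samples, which is why 9 bits suffice where linear quantization would audibly hiss.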
Due to GitHub file size limitations, the pretrained models are hosted separately rather than included in this repository. Please see the Installation section for download instructions.
Model specifications:
- Encoder trained on LibriSpeech and VoxCeleb datasets
- Synthesizer trained on the LibriSpeech dataset
- WaveRNN vocoder trained on the LibriSpeech dataset
The audio files in the samples folder are provided for toolbox testing and benchmarking purposes. These are the same reference utterances used by the SV2TTS authors to generate the audio samples.
The p240_00000.mp3 and p260_00000.mp3 files are compressed versions of audio from the VCTK corpus.
The 1320_00000.mp3, 3575_00000.mp3, 6829_00000.mp3 and 8230_00000.mp3 files are compressed versions of audio from the LibriSpeech dataset.
This project is licensed under the MIT License - see the LICENSE file for details.
- This project is based on the SV2TTS framework
- The implementation is inspired by research on speaker verification (GE2E), Tacotron 2, and WaveRNN