Checks
Environment Details
Hi F5-TTS team and community,
I'm currently working on fine-tuning the F5-TTS model on a custom dataset (specifically, a Cantonese dialect dataset). I've followed the provided documentation and successfully completed the training process using the finetune_cli.py script.
My training setup is as follows:
- Base Model: F5TTS_v1_Base/model_1250000.safetensors
- Dataset: Cantonese dialect (I can add details on size, total hours, etc. if that would help)
- Training Parameters: --epochs 1, --dataset_name cantonese, --tokenizer_path ./data/cantonese/vocab.txt, --pretrain /content/drive/MyDrive/AIAA2205-assignment2-F5-TTS/ckpts/cantonese/pretrained_model_1250000.safetensors
- Training Environment: Google Colab with a GPU (NVIDIA A100-SXM4-80GB)
The training completed with a reported loss of around 0.544.
I'm now moving on to the inference and evaluation steps, and I would greatly appreciate it if anyone could share their experience with fine-tuning this model, especially for other languages or dialects.
Specifically, I'm interested in:
- Best practices for preparing custom datasets for fine-tuning.
- Expected loss values or metrics to look for during training to indicate successful fine-tuning.
- Tips for optimizing hyperparameters for better performance on a specific dialect.
- Any common issues encountered during fine-tuning and how to resolve them.
- Experiences with the objective evaluation metrics and how they correlate with perceived audio quality.
I'm eager to learn from the community's experience to improve my results and contribute back if possible.
Thank you for your time and support!
Best regards
Steps to Reproduce
- Cloned the F5-TTS repository.
- Mounted Google Drive to access the project directory.
- Changed the current directory to /content/drive/MyDrive/AIAA2205-assignment2-F5-TTS.
- Installed dependencies using pip install -r requirement.txt.
- Prepared the Cantonese dialect dataset using data_prepare_text.py with input directory ./data/dialect_data/ and output directory ./data/cantonese/.
- Downloaded the pretrained model F5TTS_v1_Base/model_1250000.safetensors using huggingface_hub.
- Ran the fine-tuning script with the following command:
!CUDA_VISIBLE_DEVICES=0 python ./src/f5_tts/train/finetune_cli.py --epochs 1 --dataset_name cantonese --tokenizer_path ./data/cantonese/vocab.txt --pretrain /content/drive/MyDrive/AIAA2205-assignment2-F5-TTS/ckpts/cantonese/pretrained_model_1250000.safetensors
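For reference, the dataset-preparation step above can be sketched as follows. This is a minimal sketch only: the "audio_file|text" metadata.csv layout is an assumption based on what F5-TTS fine-tuning tooling commonly expects, and write_metadata is a hypothetical helper. Check your own data_prepare_text.py for the exact schema it consumes.

```python
# Hypothetical sketch: write a metadata.csv pairing each audio clip with its
# transcript, one "audio_file|text" row per line. The pipe-delimited layout
# is an assumption; verify it against data_prepare_text.py before relying on it.
import csv
from pathlib import Path


def write_metadata(pairs, out_dir):
    """pairs: iterable of (wav_filename, transcript) tuples.

    Creates out_dir if needed and returns the path to the written metadata.csv.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "metadata.csv"
    with out_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        for wav, text in pairs:
            writer.writerow([wav, text.strip()])
    return out_path


# Example usage with placeholder paths:
write_metadata([("wavs/0001.wav", "你好")], "./data/cantonese_demo")
```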
✔️ Expected Behavior
Based on the training completion, I expect that the fine-tuned model should be able to generate speech in the Cantonese dialect using the provided reference audio and text. The generated speech should ideally sound natural and reflect the characteristics of the Cantonese voice from the reference audio.
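For the inference step I plan to try something along these lines. This is a sketch only: the flag names follow the upstream f5-tts_infer-cli described in the F5-TTS README, and every path below (checkpoint, vocab, reference audio) is a placeholder for this Colab setup, not a verified configuration.

```shell
# Sketch: synthesize Cantonese speech with the fine-tuned checkpoint.
# All paths are placeholders; the checkpoint name written by finetune_cli.py
# may differ in your run.
f5-tts_infer-cli \
  --model F5TTS_v1_Base \
  --ckpt_file ./ckpts/cantonese/model_last.safetensors \
  --vocab_file ./data/cantonese/vocab.txt \
  --ref_audio ./ref/cantonese_ref.wav \
  --ref_text "參考音頻的文字" \
  --gen_text "要合成的粵語文字" \
  -o ./output
```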
❌ Actual Behavior
No response