Checks
Environment Details
Hi F5-TTS team and community,
I'm currently working on fine-tuning the F5-TTS model on a custom dataset (specifically, a Cantonese dialect dataset). I've followed the provided documentation and successfully completed the training process using the finetune_cli.py script.
My training setup is as follows:
- Base Model: F5TTS_v1_Base/model_1250000.safetensors
- Dataset: Cantonese dialect (I can add details on size, total hours, etc. if that would help)
- Training Parameters: --epochs 1, --dataset_name cantonese, --tokenizer_path ./data/cantonese/vocab.txt, --pretrain /content/drive/MyDrive/AIAA2205-assignment2-F5-TTS/ckpts/cantonese/pretrained_model_1250000.safetensors
- Training Environment: Google Colab with a GPU (NVIDIA A100-SXM4-80GB)
The training completed with a reported loss of around 0.544.
I'm now moving on to the inference and evaluation steps, and I would greatly appreciate it if anyone could share their experience with fine-tuning this model, especially for other languages or dialects.
Specifically, I'm interested in:
- Best practices for preparing custom datasets for fine-tuning.
- Expected loss values or metrics to look for during training to indicate successful fine-tuning.
- Tips for optimizing hyperparameters for better performance on a specific dialect.
- Any common issues encountered during fine-tuning and how to resolve them.
- Experiences with the objective evaluation metrics and how they correlate with perceived audio quality.
I'm eager to learn from the community's experience to improve my results and contribute back if possible.
Thank you for your time and support!
Best regards
Steps to Reproduce
- Cloned the F5-TTS repository.
- Mounted Google Drive to access the project directory.
- Changed the current directory to /content/drive/MyDrive/AIAA2205-assignment2-F5-TTS.
- Installed dependencies using pip install -r requirement.txt.
- Prepared the Cantonese dialect dataset using data_prepare_text.py with input directory ./data/dialect_data/ and output directory ./data/cantonese/.
- Downloaded the pretrained model F5TTS_v1_Base/model_1250000.safetensors using huggingface_hub.
- Ran the fine-tuning script with the following command:
!CUDA_VISIBLE_DEVICES=0 python ./src/f5_tts/train/finetune_cli.py --epochs 1 --dataset_name cantonese --tokenizer_path ./data/cantonese/vocab.txt --pretrain /content/drive/MyDrive/AIAA2205-assignment2-F5-TTS/ckpts/cantonese/pretrained_model_1250000.safetensors
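For reference, the dataset-preparation step above can be sketched as follows. This is a minimal sketch only: the "audio_file|text" metadata.csv layout is an assumption based on what F5-TTS fine-tuning tooling commonly expects, and write_metadata is a hypothetical helper. Check your own data_prepare_text.py for the exact schema it consumes.

```python
# Hypothetical sketch: write a metadata.csv pairing each audio clip with its
# transcript, one "audio_file|text" row per line. The pipe-delimited layout
# is an assumption; verify it against data_prepare_text.py before relying on it.
import csv
from pathlib import Path


def write_metadata(pairs, out_dir):
    """pairs: iterable of (wav_filename, transcript) tuples.

    Creates out_dir if needed and returns the path to the written metadata.csv.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "metadata.csv"
    with out_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        for wav, text in pairs:
            writer.writerow([wav, text.strip()])
    return out_path


# Example usage with placeholder paths:
write_metadata([("wavs/0001.wav", "你好")], "./data/cantonese_demo")
```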
✔️ Expected Behavior
Based on the training completion, I expect that the fine-tuned model should be able to generate speech in the Cantonese dialect using the provided reference audio and text. The generated speech should ideally sound natural and reflect the characteristics of the Cantonese voice from the reference audio.
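For the inference step I plan to try something along these lines. This is a sketch only: the flag names follow the upstream f5-tts_infer-cli described in the F5-TTS README, and every path below (checkpoint, vocab, reference audio) is a placeholder for this Colab setup, not a verified configuration.

```shell
# Sketch: synthesize Cantonese speech with the fine-tuned checkpoint.
# All paths are placeholders; the checkpoint name written by finetune_cli.py
# may differ in your run.
f5-tts_infer-cli \
  --model F5TTS_v1_Base \
  --ckpt_file ./ckpts/cantonese/model_last.safetensors \
  --vocab_file ./data/cantonese/vocab.txt \
  --ref_audio ./ref/cantonese_ref.wav \
  --ref_text "參考音頻的文字" \
  --gen_text "要合成的粵語文字" \
  -o ./output
```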
❌ Actual Behavior
No response