Hello,
I have been training a FastSpeech2 model for the Malagasy language and am running into problems with output quality: the synthesized speech is unintelligible even though training completes without errors. Below is an outline of the steps I've taken and the model configuration.
Steps Taken:
- Created a corpus of Malagasy (~19 hours of audio).
- Aligned the data using the Montreal Forced Aligner (MFA).
- Used a custom text cleaner for the Malagasy language (a simplified sketch follows this list).
- Ran the prepare_align and preprocess steps successfully.
- Modified the pinyin.py and cmudict.py files to add the Malagasy phonemes (a symbol-coverage check is sketched after this list).
- Trained the model for 21,000 steps.
- Used HiFi-GAN as the vocoder with the universal speaker setting (a config sanity check is sketched after this list).
- Configured pitch and energy features at the phoneme level with normalization set to true.
- Pitch loss ranged from 1.1 to 5.17.
- Energy loss ranged from 0.55 to 0.9.
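
For reference, the custom cleaner mentioned above does roughly the following (a simplified sketch, not the verbatim code; the identifiers and the exact character set are illustrative):

```python
import re

# Simplified sketch of the Malagasy cleaner. Malagasy is written in a Latin
# alphabet, so the cleaner mainly lowercases, normalizes apostrophes
# (e.g. in "amin'ny"), strips other punctuation, and collapses whitespace.

_whitespace_re = re.compile(r"\s+")
# Keep basic Latin letters, the accented vowels that occur in my corpus,
# apostrophes, hyphens, and spaces; drop everything else.
_disallowed_re = re.compile(r"[^a-zàô'\- ]")


def malagasy_cleaners(text: str) -> str:
    text = text.lower()
    text = text.replace("\u2019", "'")            # unify typographic apostrophes
    text = _disallowed_re.sub(" ", text)          # remove stray punctuation/symbols
    text = _whitespace_re.sub(" ", text).strip()  # collapse whitespace
    return text


print(malagasy_cleaners("Manao ahoana!  Amin'ny  alina…"))
# -> "manao ahoana amin'ny alina"
```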
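
To make sure the edits to pinyin.py and cmudict.py actually cover the phone set MFA produces, this is the kind of coverage check I have in mind (a sketch: the lexicon path is a placeholder for my MFA pronunciation dictionary, and the "@" prefix handling is my reading of this repo's text/symbols.py):

```python
# Check that every phone in the MFA lexicon appears in the model's symbol
# inventory, so nothing silently falls back to an unknown symbol.
from text.symbols import symbols  # symbol list used by the model


def lexicon_phones(path: str) -> set:
    phones = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) > 1:
                phones.update(parts[1:])  # everything after the word is phones
    return phones


lexicon = lexicon_phones("lexicon/malagasy_lexicon.txt")  # placeholder path
model_phones = {s.lstrip("@") for s in symbols}  # phones are stored with an "@" prefix
missing = sorted(lexicon - model_phones)
print("phones in the lexicon but missing from symbols:", missing or "none")
```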
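
I also want to rule out a mismatch between my mel settings and the universal HiFi-GAN checkpoint. Below is a quick sanity check over my preprocess.yaml (the config path is mine, the key names follow the LJSpeech config I adapted, and the expected values are what I believe the bundled universal checkpoint was trained with):

```python
import yaml

# Compare my audio/mel settings against what I understand the universal
# HiFi-GAN checkpoint expects (22050 Hz, 80 mel bins, hop length 256).
with open("config/Malagasy/preprocess.yaml") as f:  # my dataset's config
    prep = yaml.safe_load(f)["preprocessing"]

expected = {"sampling_rate": 22050, "hop_length": 256, "n_mel_channels": 80}
actual = {
    "sampling_rate": prep["audio"]["sampling_rate"],
    "hop_length": prep["stft"]["hop_length"],
    "n_mel_channels": prep["mel"]["n_mel_channels"],
}
for key, want in expected.items():
    status = "OK" if actual[key] == want else "MISMATCH"
    print(f"{key}: {actual[key]} (expected {want}) -> {status}")
```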
Could the unintelligibility be caused by high pitch loss during training? If so, what would be the best way to address this in terms of configuration or data preparation?
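
In case it helps with diagnosis, this is the check I plan to run on the preprocessed pitch features (a minimal sketch; "Malagasy" is just my dataset name under preprocessed_data):

```python
import glob
import numpy as np

# Inspect the distribution of the preprocessed pitch features before blaming
# the pitch predictor.
pitch_files = glob.glob("preprocessed_data/Malagasy/pitch/*.npy")
values = np.concatenate([np.load(p) for p in pitch_files])

print("files            :", len(pitch_files))
print("mean / std       :", values.mean(), values.std())
print("min / max        :", values.min(), values.max())
# With normalization enabled the saved values should be roughly zero-mean and
# unit-variance; a long tail of extreme values or many exact zeros would
# suggest pitch extraction went wrong on parts of the audio.
print("share |v| > 4    :", np.mean(np.abs(values) > 4.0))
print("share exact zeros:", np.mean(values == 0.0))
```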