Hi.
Thanks for the great code. I wanted to comment a bit on the internals, though I'm not 100% certain about either point:
- If you use a cosine scheduler for flow matching, you probably need to apply it in both training and inference, see CosyVoice:
https://github.com/KdaiP/StableTTS/blob/main/models/flow_matching.py#L89
https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/flow/flow_matching.py#L67
- CFG only applies to the diffusion (flow matching) part. It is not quite correct to add it to the encoder: there it effectively just adds noise to the speaker conditioning, which degrades the loss a bit but doesn't really improve quality.
https://github.com/KdaiP/StableTTS/blob/main/models/model.py#L141
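
To make both points concrete, here is a minimal sketch of what I mean. It is not the actual StableTTS or CosyVoice code: the function names (`cosine_warp`, `sample_training_t`, `inference_t_span`, `cfg_velocity`) and the `cfg_rate` parameter are hypothetical, and the formulas are my reading of the CosyVoice file linked above, so please double-check against it. The idea is that the same warp `1 - cos(pi * t / 2)` is applied to the random `t` drawn during training and to the step grid used by the solver at inference, and that CFG is a combination of the conditional and unconditional velocity estimates inside the flow matching decoder only.

```python
import math
import random


def cosine_warp(t: float) -> float:
    """Map t in [0, 1] to 1 - cos(pi * t / 2), which also lies in [0, 1]."""
    return 1.0 - math.cos(0.5 * math.pi * t)


def sample_training_t(rng: random.Random) -> float:
    """Training: draw uniform t, then apply the same warp used at inference."""
    return cosine_warp(rng.random())


def inference_t_span(n_steps: int) -> list[float]:
    """Inference: warp the uniform solver grid with the same function."""
    return [cosine_warp(i / n_steps) for i in range(n_steps + 1)]


def cfg_velocity(v_cond: list[float], v_uncond: list[float], cfg_rate: float) -> list[float]:
    """CFG on the decoder output only (assumed form):
    v = (1 + cfg_rate) * v_cond - cfg_rate * v_uncond.
    The encoder output is left untouched."""
    return [(1.0 + cfg_rate) * c - cfg_rate * u for c, u in zip(v_cond, v_uncond)]


if __name__ == "__main__":
    rng = random.Random(0)
    print("training t sample:", sample_training_t(rng))
    print("inference grid:", inference_t_span(4))
    print("guided velocity:", cfg_velocity([1.0, 2.0], [0.5, 1.0], cfg_rate=0.5))
```

With `cfg_rate=0` the guided velocity reduces to the plain conditional estimate, so the encoder and the rest of the model do not need to know about CFG at all.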