unique_text_tokens.k2symbols for non-english languages #13

Hello everyone,
I've noticed that throughout the pipeline, [unknown tokens are removed](https://github.com/PolyAI-LDN/pheme/blob/main/data/semantic_dataset.py#L122), and that `unique_text_tokens.k2symbols` doesn't contain all the phonemes needed for non-English languages, such as those with accents and other diacritics.

I'm trying to train _pheme_ in Portuguese, and I was wondering what I should do so the model can understand the accents of my language. Any tips on how to do it?

P.S.: I've also changed the phonemizer backend, so it could generate phonemes in PT-BR. `espeak` is available in PT-BR, so it was a no-brainer.
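
To make the gap concrete, here is a minimal sketch of how the missing phonemes could be counted and appended to the symbol table. It assumes the `.k2symbols` file is a plain-text table with one `symbol id` pair per line (the usual k2 symbol-table layout); I haven't confirmed that against the repo. The helper names `parse_symbol_table`, `missing_tokens` and `extend_table` are hypothetical and not part of pheme, and the toy data stands in for real phonemizer output.

```python
from collections import Counter
from typing import Dict, Iterable, List


def parse_symbol_table(text: str) -> Dict[str, int]:
    """Parse 'symbol id' lines into a dict (assumed k2-style text format)."""
    table: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        symbol, idx = parts
        table[symbol] = int(idx)
    return table


def missing_tokens(table: Dict[str, int],
                   utterances: Iterable[List[str]]) -> Counter:
    """Count tokens that would be dropped because they are not in the table."""
    counts: Counter = Counter()
    for tokens in utterances:
        for tok in tokens:
            if tok not in table:
                counts[tok] += 1
    return counts


def extend_table(table: Dict[str, int],
                 new_symbols: Iterable[str]) -> Dict[str, int]:
    """Return a copy of the table with new symbols appended after the max id."""
    extended = dict(table)
    next_id = max(extended.values(), default=-1) + 1
    for sym in new_symbols:
        if sym not in extended:
            extended[sym] = next_id
            next_id += 1
    return extended


# Toy stand-ins: a tiny symbol table and two already-phonemized utterances.
toy_table = parse_symbol_table("<eps> 0\na 1\no 2\ns 3\n")
toy_utterances = [["s", "ɲ", "o"], ["a", "ʎ", "ɲ"]]

missing = missing_tokens(toy_table, toy_utterances)
extended = extend_table(toy_table, sorted(missing))
print(missing)   # tokens the current table would silently drop
print(extended)  # table with the new phonemes appended
```

Existing ids are left untouched and new symbols get fresh ids at the end. My understanding is that any text embedding sized to the old table would probably need to grow to match, so a checkpoint trained with the original symbols may not load without changes.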
