Brazilian Portuguese Quick APT: Quick Automatic Phonetic Transcription for Brazilian Portuguese

This repository contains the data and scripts developed for my undergraduate thesis, "Creating a Dataset for Automatic Phonetic Transcription in Brazilian Portuguese". We used the CORAA ASR corpus (Candido Junior, 2022) and FalaBrasil's G2P converter (Neto, 2011) to create a dataset of automatic phonetic transcriptions for training Automatic Phonetic Transcription (APT) models for Brazilian Portuguese (PT-BR). The phonetic transcriptions were standardized according to the phoneme charts presented by Ivo (2019a, 2019b), ensuring conformity with the PT-BR phonology literature.

We share the phonetic transcriptions for CORAA ASR's train, dev, and test sets, alongside three subsets of the train set (1h, 10h, and 60h of audio), one subset of the dev set (1h of audio), and one subset of the test set (1h of audio).

Datasets

The datasets are available at data_and_configs/wav2vec2_phoneme_*_test/input/.

Source data

The CORAA ASR corpus is available at nilc-nlp/CORAA.

wav2vec 2.0 models

We also fine-tuned three wav2vec 2.0 models, one per train subset, which achieved the following Phonetic Error Rates (PER):

Subset	Dev	Test
1h	0.8301	0.7963
10h	0.2197	0.1587
60h	0.2190	0.1600
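
For readers unfamiliar with the metric, the sketch below shows one common way to compute a PER: the total edit distance between reference and predicted phoneme sequences, divided by the total number of reference phonemes. It is an illustration only. The function names, the corpus-level aggregation, and the toy phoneme sequences are assumptions, and the evaluation code in this repository may compute the scores differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Corpus-level PER: total edits divided by total reference phonemes.

    Assumption: each transcript is already split into a list of phoneme symbols.
    """
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_phonemes = sum(len(r) for r in references)
    return total_edits / total_phonemes


# Toy example (made-up sequences, not taken from the dataset)
reference = [["k", "a", "z", "ɐ"]]
hypothesis = [["k", "a", "s", "ɐ"]]
per = phoneme_error_rate(reference, hypothesis)  # one substitution over four phonemes
```

Under this definition, a PER of 0.1587 would mean roughly 16 edits per 100 reference phonemes.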

The configs/ folder contains the configuration files used for fine-tuning. A brief description of the fields in the input files is given below:

file_path	g2p	g2p_ipa	transcript_ipa	transcript_encoded
File path	Raw G2P transcript	G2P transcript in IPA	Standardized G2P transcript in IPA	Encoded transcript used in the fine-tuning
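
As a rough illustration of how these fields might be consumed, the sketch below reads rows with the five columns listed above and checks that none are missing. The on-disk format is not documented here, so the tab delimiter, the `read_rows` helper, and the placeholder sample row are all assumptions rather than a description of the actual files.

```python
import csv
import io

# Field names as listed in the table above
EXPECTED_FIELDS = ["file_path", "g2p", "g2p_ipa", "transcript_ipa", "transcript_encoded"]


def read_rows(handle, delimiter="\t"):
    """Read rows from an open text file and verify the expected columns exist.

    Assumption: the input is a delimited text file with a header row.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = [f for f in EXPECTED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return list(reader)


# Placeholder row with made-up values, standing in for a real input file
sample = io.StringIO(
    "file_path\tg2p\tg2p_ipa\ttranscript_ipa\ttranscript_encoded\n"
    "clip_0001.wav\tRAW\tIPA\tSTD_IPA\tENCODED\n"
)
rows = read_rows(sample)
```

To try it on a real file, pass an open file handle instead of the in-memory sample and adjust `delimiter` to match the file.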

Additionally, we share the APT model fine-tuned on 10 hours of audio in the Hugging Face repository.
