Here you can find the data and scripts used in my undergraduate thesis, "Creating a Dataset for Automatic Phonetic Transcription in Brazilian Portuguese". We used the CORAA ASR corpus (Candido Junior, 2022) and FalaBrasil's G2P converter (Neto, 2011) to create a dataset of automatic phonetic transcriptions for training Automatic Phonetic Transcription (APT) models for Brazilian Portuguese (PT-BR). The phonetic transcriptions were standardized according to the phoneme charts presented by Ivo (2019a; 2019b), ensuring conformity with the PT-BR phonology literature.
We share the phonetic transcriptions for CORAA ASR's train, dev, and test sets, alongside three subsets of the train set (1h, 10h, and 60h of audio), one subset of the dev set (1h of audio), and one subset of the test set (1h of audio).
The datasets are available at data_and_configs/wav2vec2_phoneme_*_test/input/.
The CORAA ASR corpus is available at nilc-nlp/CORAA.
Furthermore, we fine-tuned three wav2vec 2.0 models, which achieved the following Phonetic Error Rates (PER):
| Subset | Dev | Test |
|---|---|---|
| 1h | 0.8301 | 0.7963 |
| 10h | 0.2197 | 0.1587 |
| 60h | 0.2190 | 0.1600 |
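
For readers unfamiliar with the metric, an error rate of this kind is conventionally the total edit distance (substitutions, insertions, and deletions) between predicted and reference phoneme sequences, divided by the total number of reference phonemes. The sketch below illustrates that conventional definition only; it is not the evaluation script used in the thesis, and the function names and the example phoneme sequences are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]],
                       hyps: List[Sequence[str]]) -> float:
    """Corpus-level rate: total edits divided by total reference phonemes."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_len = sum(len(r) for r in refs)
    if total_len == 0:
        raise ValueError("reference set contains no phonemes")
    return total_edits / total_len


# Hypothetical example sequences (not taken from the dataset).
refs = [["k", "a", "z", "a"], ["p", "a", "w"]]
hyps = [["k", "a", "s", "a"], ["p", "a"]]
per = phoneme_error_rate(refs, hyps)  # one substitution + one deletion over 7 phonemes
```

Under this definition, a value such as 0.1587 means roughly 16 edits per 100 reference phonemes.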
The configs/ folder contains the configuration files used for fine-tuning. The fields of the input files are briefly described below:
| file_path | g2p | g2p_ipa | transcript_ipa | transcript_encoded |
|---|---|---|---|---|
| File path | Raw G2P transcript | G2P transcript in IPA | Standardized G2P transcript in IPA | Encoded transcript used in the fine-tuning |
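
As an illustration of how these fields might be consumed, the sketch below reads rows and checks that all five fields are present. It assumes the input files are comma-separated with a header row, which is not stated here, so adjust the reader to the actual format. The function name `read_rows` and the sample values are placeholders, not real dataset content.

```python
import csv
import io
from typing import Dict, Iterator, TextIO

# Field names as listed in the table above.
EXPECTED_FIELDS = ["file_path", "g2p", "g2p_ipa",
                   "transcript_ipa", "transcript_encoded"]


def read_rows(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per row, keeping only the expected fields.

    Assumption: the file is CSV with a header row.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_FIELDS if name not in header]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for row in reader:
        yield {name: row[name] for name in EXPECTED_FIELDS}


# Placeholder content standing in for a real input file.
sample = io.StringIO(
    "file_path,g2p,g2p_ipa,transcript_ipa,transcript_encoded\n"
    "PLACEHOLDER.wav,RAW_G2P,G2P_IPA,STANDARDIZED_IPA,ENCODED\n"
)
rows = list(read_rows(sample))
```

With a real file, `open(path, newline="", encoding="utf-8")` can replace the in-memory `sample` handle.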
Additionally, we share the APT model fine-tuned on 10 hours of audio in the Hugging Face repository.