Add new functionalities to BPE lexicon creation #569
Co-authored-by: Albert Zeyer <zeyer@cs.rwth-aachen.de>
| :param skip_unk_lemmas: whether simply skip lemmas out of the BPE vocab
|     useful if you set vocab_blacklist

Suggested change:

| :param skip_unk_lemmas: Whether to simply skip lemmas that are not part of the BPE vocabulary.
|     Useful if you set vocab_blacklist.
| :param add_all_bpe_phonemes: If set to True, all BPE vocab will be added to lexicon phonemes,
|     otherwise, only phonemes appear in lexicon lemma will be added to the lexicon.

Suggested change:

| :param add_all_bpe_phonemes: If set to True, all BPE tokens will be added to the lexicon as phonemes,
|     otherwise, only tokens that appear in the base lexicon will be added to the output lexicon.
| :param additional_words: Aside from vocab specified in base_lexicon, we might want to convert some other words,
|     e.g. untranslatable words by a g2p model in case of g2p-augmented lexicon

Suggested change:

| :param additional_words: Aside from the vocabulary specified in base_lexicon, we might want to convert some other words,
|     e.g. words that a g2p model could not translate, in the case of a g2p-augmented lexicon.
| for orth in lemma.orth:
|     bpe_pron = " ".join([token if token in vocab else self.unk_label for token in w2b[orth].split()])
|     if self.skip_unk_lemmas and self.unk_label in bpe_pron.split():
|         logging.info(f"Lemma {orth} is skipped due to unknown BPE vocab.")

Suggested change:

| logging.info(f"Lemma {orth} is skipped due to use of the BPE token for <unknown>.")

self.unk_label in bpe_pron.split() means there is some string in the word that cannot be represented with the current BPE vocab (e.g. Greek letters with an all-Latin vocab), right?

Suggested change:

| logging.info(f"Lemma {orth} is skipped because it cannot be represented with the BPE vocab.")

Yeah, we normally filter out non-words (like [noise]) from the text corpora used to train BPE, and when we try to convert such non-word lemmata from the base phoneme lexicon, they are unknown to the BPE vocab due to the presence of "[" and "]".

Btw, this PR is now NOT used in the Apptek BPE pipeline and will only be a feature for i6 people. I will still try to get it merged, but don't worry that it would slow down our process at Apptek.
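The check discussed above can be sketched as a tiny standalone example. The vocab, the word-to-BPE mapping w2b, and unk_label below are made-up stand-ins for the job's attributes, not the actual i6_core implementation:

```python
# Hypothetical sketch of the skip_unk_lemmas logic; all data here is invented.
vocab = {"hel@@", "lo"}
w2b = {"hello": "hel@@ lo", "[noise]": "[@@ nois@@ e]"}
unk_label = "<unk>"


def bpe_pron(orth: str) -> str:
    # Any BPE token missing from the vocab is mapped to the unk label.
    return " ".join(tok if tok in vocab else unk_label for tok in w2b[orth].split())


def should_skip(orth: str, skip_unk_lemmas: bool = True) -> bool:
    # "[noise]" contains "[" / "]", which BPE training never saw, so its
    # pronunciation contains the unk label and the lemma gets skipped.
    return skip_unk_lemmas and unk_label in bpe_pron(orth).split()
```

With these stand-in data, "hello" maps cleanly to "hel@@ lo" and is kept, while "[noise]" is skipped.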
| additional_words_list = set()
| if self.additional_words is not None:
|     with util.uopen(self.additional_words.get_path(), "rt") as f:
|         for line in f:
|             line = line.strip()
|             additional_words_list.add(line)
| return sorted(additional_words_list)

Suggested change:

| if self.additional_words is not None:
|     with util.uopen(self.additional_words.get_path(), "rt") as f:
|         res = {line.strip() for line in f}
| else:
|     res = set()
| return sorted(res)

Suggested change:

| if self.additional_words is not None:
|     with util.uopen(self.additional_words.get_path(), "rt") as f:
|         return sorted({line.strip() for line in f})
| return []

even simpler
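The simpler variant can be checked in isolation. This sketch swaps util.uopen for a plain open with a throwaway temporary file, so the function name and file contents are stand-ins rather than the job's real code:

```python
import os
import tempfile


def read_additional_words(path):
    # Mirrors the suggested simplification: the set comprehension deduplicates,
    # sorted() fixes the order, and the no-file case returns an empty list.
    if path is not None:
        with open(path, "rt") as f:
            return sorted({line.strip() for line in f})
    return []


# Quick demonstration with a throwaway file containing a duplicate entry.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("zeta\nalpha\nzeta\n")
words = read_additional_words(tmp.name)
os.unlink(tmp.name)
```

The duplicate "zeta" collapses to one entry and the result comes back sorted.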
| keep_special_lemmas: bool = True,
| skip_unk_lemmas: bool = False,
| add_all_bpe_phonemes: bool = True,
| additional_words: Optional[tk.Path] = None,

We could do it in this job, but my feeling is that adding a bunch of words with empty pronunciation is a separate task, and I would rather have a CreateEmptyPronunciationLexiconJob or CreateOrthOnlyLexiconJob and then use the MergeLexiconJob to combine those.
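The split proposed here could look roughly like the following. The Lemma namedtuple is a stand-in for the i6_core lexicon class, and CreateOrthOnlyLexiconJob itself is only hypothetical at this point:

```python
from collections import namedtuple

# Stand-in for i6_core.lib.lexicon.Lemma; only orth/phon are modeled here.
Lemma = namedtuple("Lemma", ["orth", "phon"])


def make_orth_only_lexicon(words):
    # What a hypothetical CreateOrthOnlyLexiconJob might produce: one lemma
    # per word, each with an empty pronunciation list, ready to be combined
    # with a BPE lexicon by a merge step.
    return [Lemma(orth=[w], phon=[]) for w in sorted(set(words))]
```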
| vocab.add(symbol)
| lexicon.add_phoneme(symbol.replace(".", "_"))
| if self.add_all_bpe_phonemes:
|     lexicon.add_phoneme(symbol.replace(".", "_"))

symbol is already replaced at line 101 (symbol = symbol.replace(".", "_")), so you could just do lexicon.add_phoneme(symbol).
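A one-liner confirms the point: once symbol has been rewritten, a second replace is a no-op (the example string here is invented):

```python
# After the earlier `symbol = symbol.replace(".", "_")`, the string contains
# no "." anymore, so replacing again cannot change it.
symbol = "a.bpe.token"
symbol = symbol.replace(".", "_")          # done once, earlier in the job
assert symbol.replace(".", "_") == symbol  # the second replace is redundant
```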
| continue
| used_vocab.update(set(bpe_pron.split()))
| lexicon.add_lemma(Lemma([orth], [bpe_pron.replace(".", "_")], lemma.synt, lemma.eval))

Why "if not"? Shouldn't we add the phoneme when self.add_all_bpe_phonemes is true?

If self.add_all_bpe_phonemes is True, there is a separate block below that does the adding.

In order to use this job in more complicated scenarios, I added some functionalities to CreateBPELexiconJob. The original functionalities and hashes should stay unchanged.