This repository was archived by the owner on Mar 9, 2023. It is now read-only.

Description
If I understand correctly, Sudachi is a lattice-based tokenizer and uses the occurrence probabilities and left-right probabilities (costs) for finding the best token sequence.
We would like to know whether we could customize these cost values. I imagine that in a niche domain like biomedicine with many unknown bacteria/disease names, we need domain-specific values to have the best tokenizer.