Description
This ticket is a proposal for major enhancements to text scanning and tokenization.
These enhancements should:
- Improve flexibility in token splitting, using characters that are meaningful for individual languages;
- Allow for custom logic to carry out more sophisticated tokenization jobs;
- Remove the need to hard-code spaces in some transliteration tables, such as Chinese and Korean;
- More efficiently scan large lists of exceptions, possibly improving transliteration speed significantly for some tables (e.g. Chinese);
- Keep a section for exception tokens separate from the character-by-character transliteration rules in the configuration.
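As a rough illustration of the last two points, the sketch below keeps whole-token exceptions in their own mapping and checks them with a single dictionary lookup before falling back to character-by-character rules. The names `EXCEPTIONS`, `CHAR_MAP` and `transliterate_token`, as well as the sample entries, are hypothetical and not taken from any existing table or from the current code.

```python
# Hypothetical data: these names and entries are illustrative only.
EXCEPTIONS = {"北京": "Beijing", "上海": "Shanghai"}
CHAR_MAP = {"北": "bei", "京": "jing", "上": "shang", "海": "hai", "人": "ren"}


def transliterate_token(token):
    """Return the exception value for a whole token, else map it char by char."""
    # One hash lookup per token, instead of testing every exception
    # at every character position.
    if token in EXCEPTIONS:
        return EXCEPTIONS[token]
    # Fallback: per-character mapping; unmapped characters pass through unchanged.
    return "".join(CHAR_MAP.get(ch, ch) for ch in token)
```

Whether this is actually faster than the current scan depends on how tokens are delimited, which is what the tasks below are meant to address.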
Tasks:
[ ] Introduce a more flexible tokenization model that is configurable per table. If a table does not define one, tokenization defaults to splitting on spaces.
[ ] Introduce a new configuration item to specify a range of characters by which a certain language/script shall be tokenized.
[ ] Add an event hook that can take over the tokenization job.
[ ] Add a configuration option to list whole-token exceptions, separate from script_to_roman.map and roman_to_script.map.
[ ] Implement logic to transliterate whole tokens if they are found in the exception list.
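The first three tasks could fit together roughly as in the following sketch. It is only an assumption about one possible shape: the `token_delimiters` configuration key, the `tokenize` function and the hook signature are all hypothetical names introduced here for illustration.

```python
import re


def tokenize(text, config, hook=None):
    """Split text into tokens according to a per-table configuration.

    `config` is a plain dict; "token_delimiters" is a hypothetical key holding
    the characters to split on. `hook` is a hypothetical callable that may
    take over tokenization by returning a list of tokens.
    """
    # Let the hook take over the job; returning None means "not handled".
    if hook is not None:
        tokens = hook(text, config)
        if tokens is not None:
            return tokens

    delimiters = config.get("token_delimiters")
    if not delimiters:
        # Default behavior: split on whitespace.
        return text.split()

    # Build a character class from the configured delimiters and drop
    # the empty strings produced by leading or repeated delimiters.
    pattern = "[" + re.escape(delimiters) + "]+"
    return [t for t in re.split(pattern, text) if t]
```

With this shape, a table that defines nothing keeps the space-splitting default, a table with language-specific delimiters only needs one configuration entry, and more sophisticated cases can be delegated to the hook.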