Enhance tokenization and parsing #167

@scossu

Description

This ticket proposes major enhancements to text scanning and tokenization.

These enhancements should:

  • Improve flexibility in token splitting, using characters that are meaningful for individual languages;
  • Allow custom logic to carry out more sophisticated tokenization jobs;
  • Remove the need to hard-code spaces in some transliteration tables, such as Chinese and Korean;
  • Scan large lists of exceptions more efficiently, possibly improving transliteration speed significantly for some tables (e.g. Chinese);
  • Keep a section for exception tokens separated from character-by-character transliteration rules in the configuration.
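A minimal sketch of the first point: a per-table tokenizer whose split characters come from the table configuration, falling back to whitespace. The function and parameter names (`tokenize`, `token_chars`) are illustrative assumptions, not an actual ScriptShifter API.

```python
import re

def tokenize(text, token_chars=None):
    """Split text into tokens using table-specific delimiter characters.

    Falls back to whitespace splitting when no delimiters are configured,
    matching the proposed default behavior.
    """
    if not token_chars:
        return text.split()
    # Build a character class from the configured delimiters and split on
    # any run of them, discarding empty tokens.
    pattern = "[" + re.escape(token_chars) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]
```

With no configuration, `tokenize("han gul")` splits on spaces; a table could instead declare, say, a middle dot and space as delimiters and pass `token_chars="· "`.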

Tasks:

[ ] Introduce a more flexible tokenization model that is configurable per table. It defaults to splitting on spaces if not defined.
[ ] Introduce a new configuration item to specify the set of characters by which a given language/script shall be tokenized.
[ ] Add an event hook to take over the tokenization job.
[ ] Add a configuration option to list whole-token exceptions, separated from script_to_roman.map and roman_to_script.map.
[ ] Implement logic to transliterate whole tokens if they are found in the exception list.
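The last two tasks could work roughly as follows: tokens found in a separate exception map are transliterated in a single dictionary lookup, bypassing the character-by-character rules. The O(1) lookup is where the speedup over scanning a long rule list would come from. All names and sample entries here are hypothetical.

```python
def transliterate_token(token, exceptions, char_rules):
    """Transliterate one token, checking whole-token exceptions first."""
    # A whole-token exception wins outright: one dict lookup instead of
    # scanning character-by-character rules.
    if token in exceptions:
        return exceptions[token]
    # Otherwise fall back to the per-character mapping, passing through
    # any character without a rule.
    return "".join(char_rules.get(ch, ch) for ch in token)

# Illustrative data, not taken from any actual table:
exceptions = {"北京": "Beijing"}
char_rules = {"北": "bei", "京": "jing"}
```

Here `transliterate_token("北京", ...)` returns the exception value "Beijing" rather than the character-by-character "beijing", which is exactly the behavior the exception list is meant to provide.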
