Enhance tokenization and parsing #167

@scossu

Description

This ticket proposes major enhancements to text scanning and tokenization.

These enhancements should:

  • Improve flexibility in token splitting, using characters that are meaningful for individual languages;
  • Allow custom logic to carry out more sophisticated tokenization jobs;
  • Remove the need to hard-code spaces in some transliteration tables, such as Chinese and Korean;
  • Scan large lists of exceptions more efficiently, possibly improving transliteration speed significantly for some tables (e.g. Chinese);
  • Keep a section for exception tokens separated from character-by-character transliteration rules in the configuration.
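A minimal sketch of the first point: a per-table tokenizer whose split characters come from the table configuration, falling back to whitespace. The function and parameter names (`tokenize`, `token_chars`) are illustrative assumptions, not an actual ScriptShifter API.

```python
import re

def tokenize(text, token_chars=None):
    """Split text into tokens using table-specific delimiter characters.

    Falls back to whitespace splitting when no delimiters are configured,
    matching the proposed default behavior.
    """
    if not token_chars:
        return text.split()
    # Build a character class from the configured delimiters and split on
    # any run of them, discarding empty tokens.
    pattern = "[" + re.escape(token_chars) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]
```

With no configuration, `tokenize("han gul")` splits on spaces; a table could instead declare, say, a middle dot and space as delimiters and pass `token_chars="· "`.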

Tasks:

[ ] Introduce a more flexible tokenization model that is configurable per table. It defaults to splitting on spaces if not defined.
[ ] Introduce a new configuration item to specify the set of characters by which a given language/script shall be tokenized.
[ ] Add an event hook to take over the tokenization job.
[ ] Add a configuration option to list whole-token exceptions, separated from script_to_roman.map and roman_to_script.map.
[ ] Implement logic to transliterate whole tokens if they are found in the exception list.
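The last two tasks could work roughly as follows: tokens found in a separate exception map are transliterated in a single dictionary lookup, bypassing the character-by-character rules. The O(1) lookup is where the speedup over scanning a long rule list would come from. All names and sample entries here are hypothetical.

```python
def transliterate_token(token, exceptions, char_rules):
    """Transliterate one token, checking whole-token exceptions first."""
    # A whole-token exception wins outright: one dict lookup instead of
    # scanning character-by-character rules.
    if token in exceptions:
        return exceptions[token]
    # Otherwise fall back to the per-character mapping, passing through
    # any character without a rule.
    return "".join(char_rules.get(ch, ch) for ch in token)

# Illustrative data, not taken from any actual table:
exceptions = {"北京": "Beijing"}
char_rules = {"北": "bei", "京": "jing"}
```

Here `transliterate_token("北京", ...)` returns the exception value "Beijing" rather than the character-by-character "beijing", which is exactly the behavior the exception list is meant to provide.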
