Text normalization too aggressive? #31

@tfmorris

The text normalization in Utils.normalize() seems pretty heavy-handed for something that is irreversible and non-optional. It's also not computationally expensive, so downstream consumers can easily apply it themselves if they want that level of normalization.

On the flip side, if you were going to normalize that heavily, you'd probably also want to do Unicode normalization and output one of the canonical or compatibility forms, such as NFKC.
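
For illustration, here is roughly what NFKC normalization does. This is a minimal Python sketch using the standard library's `unicodedata` module, not the project's own code; the helper name `to_nfkc` is made up for the example.

```python
import unicodedata


def to_nfkc(text: str) -> str:
    """Return text in Unicode Normalization Form KC (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


# Compatibility characters are folded to their plain equivalents:
# the "fi" ligature (U+FB01) becomes the two letters "fi".
assert to_nfkc("\ufb01nance") == "finance"

# Canonically equivalent sequences are composed:
# "e" followed by a combining acute accent (U+0301) becomes U+00E9.
assert to_nfkc("cafe\u0301") == "caf\u00e9"
```

Note that NFKC is itself lossy (the ligature cannot be recovered afterwards), which is another reason to keep it out of the base corpus and leave it to consumers.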

Perhaps this could all be packaged up into a small set of utility methods which are made available, but not run on the base corpus.
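
As a rough sketch of what such opt-in utilities might look like (in Python, purely for illustration): the function names and the particular steps below are assumptions, since this issue does not spell out what Utils.normalize() actually does. The point is only that every step is off by default and the caller chooses what to apply.

```python
import re
import unicodedata

# Matches any run of whitespace (Unicode-aware for str patterns).
_WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space and trim the ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


def normalize_text(
    text: str,
    *,
    nfkc: bool = False,
    lowercase: bool = False,
    collapse_ws: bool = False,
) -> str:
    """Apply only the requested steps; with no flags the text is returned unchanged.

    All names and steps here are hypothetical, not the project's API.
    """
    if nfkc:
        text = unicodedata.normalize("NFKC", text)
    if lowercase:
        text = text.lower()
    if collapse_ws:
        text = collapse_whitespace(text)
    return text


raw = "  \uff28ello\u00a0 World "
untouched = normalize_text(raw)  # base corpus text stays as-is
cleaned = normalize_text(raw, nfkc=True, lowercase=True, collapse_ws=True)
```

With this shape the stored corpus keeps the original text, and a consumer who wants the heavier normalization asks for it explicitly.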
