The text normalization in Utils.normalize() seems pretty heavy-handed for a step that is both irreversible and non-optional. It's also not computationally expensive, so downstream consumers who want that level of normalization can easily apply it themselves.
On the flip side, if you were going to normalize that heavily, you'd probably also want to apply Unicode normalization and output one of the standard normalization forms, such as NFKC (a compatibility form).
Perhaps this could all be packaged into a small set of utility methods that are made available to consumers but not run on the base corpus.
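To make the suggestion concrete, here is a rough sketch of what such opt-in utilities might look like. It is written in Python purely for illustration; the function names are hypothetical, and the specific steps (whitespace collapsing, case folding) are assumptions standing in for whatever Utils.normalize() actually does. Only `unicodedata.normalize` is a real standard-library call.

```python
import re
import unicodedata

# Hypothetical opt-in helpers; none of these names come from the project.
_WHITESPACE = re.compile(r"\s+")


def normalize_unicode(text: str, form: str = "NFKC") -> str:
    """Apply a standard Unicode normalization form (NFC, NFD, NFKC or NFKD)."""
    return unicodedata.normalize(form, text)


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space and trim both ends."""
    return _WHITESPACE.sub(" ", text).strip()


def normalize_heavy(text: str) -> str:
    """Chain the steps above and case-fold the result.

    This is lossy and irreversible, so it is meant to be called by a
    consumer who wants it, not applied to the stored corpus.
    """
    return collapse_whitespace(normalize_unicode(text)).casefold()
```

A consumer would then call `normalize_heavy(raw_text)` (or just the individual steps) on text read from the corpus, while the corpus itself keeps the original form.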