Text normalization too aggressive? #31

The text normalization in [Utils.normalize()](https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/Utils.java#L117) seems pretty heavy-handed for a step that is irreversible and non-optional. It's also not computationally expensive, so downstream consumers can easily apply it themselves if they want that level of normalization.

On the flip side, if you were going to normalize that heavily, you'd probably also want to apply Unicode normalization and output one of the canonical or compatibility normalization forms, such as NFKC.

Perhaps this could all be packaged up into a small set of utility methods that are made available but not run on the base corpus.
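As a rough sketch of what such opt-in helpers might look like, the class below wraps the JDK's `java.text.Normalizer` to produce NFKC output. The class name `TextNormalization` and its methods are hypothetical, not part of dkpro-c4corpus, and the sketch makes no assumption about what `Utils.normalize()` currently does.

```java
import java.text.Normalizer;

/**
 * Hypothetical opt-in helpers (not part of dkpro-c4corpus).
 * Nothing here runs unless a downstream consumer calls it explicitly.
 */
final class TextNormalization {

    private TextNormalization() {
        // static utility class, no instances
    }

    /** Returns the text in Unicode NFKC (compatibility decomposition, then canonical composition). */
    static String toNfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    /** Reports whether the text is already in NFKC, so callers can skip needless work. */
    static boolean isNfkc(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFKC);
    }
}
```

A consumer who wants the heavier normalization would call `TextNormalization.toNfkc(line)` on each line of the corpus; anyone who needs the original text simply never calls it, which keeps the base corpus untouched.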

