Fix tokenising when using more than just a-zA-Z #37

Open
robotdana wants to merge 1 commit into myint:master from robotdana:diacritics

robotdana commented Nov 30, 2018

Previously: `Händler` would be tokenized as `ndler` or `ändler`, depending on the Python version, rather than the expected `händler`.

Solution: use `regex` rather than `re`.
This gives us the ability to use Unicode character classes such as `[[:upper:]]` and `[[:lower:]]`.
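
A rough sketch of the difference, not the actual scspell tokenizer: it assumes Python 3, and the function names `tokenize_ascii` and `tokenize_unicode` are made up for illustration. The third-party `regex` module is used when it is installed; otherwise the sketch falls back to a standard-library `re` pattern that is also Unicode-aware.

```python
import re

try:
    import regex  # third-party module (pip install regex); optional here
except ImportError:
    regex = None

# ASCII-only letter class: any other character ends the current token.
ASCII_WORD = re.compile(r"[a-zA-Z]+")


def tokenize_ascii(text):
    """Hypothetical ASCII-only tokenizer: splits words at diacritics."""
    return [token.lower() for token in ASCII_WORD.findall(text)]


def tokenize_unicode(text):
    """Hypothetical Unicode-aware tokenizer: keeps accented letters in the token."""
    if regex is not None:
        # POSIX-style classes such as [[:alpha:]] are a feature of regex, not re.
        return [token.lower() for token in regex.findall(r"[[:alpha:]]+", text)]
    # Standard-library fallback: word characters minus digits and underscore.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


word = "H\u00e4ndler"  # "Händler" with a precomposed 'ä'

print(tokenize_ascii(word))    # ['h', 'ndler']
print(tokenize_unicode(word))  # ['händler']
```

The fallback pattern is only there so the sketch runs without extra dependencies; it does not offer separate upper and lower classes the way `[[:upper:]]` and `[[:lower:]]` do.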

Fixes #35

I'm usually a Ruby developer, not a Python developer, so I don't know how to get the regex library working on 2.7, or how to compare the test strings in a Unicode-aware way (they're different on my Mac vs. on Travis; if one passes, the other fails).

But it mostly works
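
One possible way to compare strings in a Unicode-aware way, assuming the Mac vs. Travis mismatch comes from different Unicode normalization forms (that cause is a guess; nothing in this thread confirms it). The helper name `same_text` is hypothetical, and the sketch assumes Python 3.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same Unicode form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "H\u00e4ndler"     # 'ä' as a single code point (NFC)
decomposed = "Ha\u0308ndler"  # 'a' followed by a combining diaeresis (NFD)

print(composed == decomposed)           # False
print(same_text(composed, decomposed))  # True
```

If the test strings differ for another reason, such as file encoding, normalization alone would not help.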

robotdana force-pushed the diacritics branch 3 times, most recently from 57da098 to c8bd64d (November 30, 2018 02:52)

myint (Owner) commented Dec 23, 2018

Thanks! I haven't tried the regex module before. I'll take a look when I have more time.

robotdana (Author) commented Sep 22, 2019

If you're interested, I took the really long way round to fixing this by creating my own spell checker: https://github.com/robotdana/spellr

Successfully merging this pull request may close these issues.

scspell splits words tokens with diacritics inside words