Make cchardet optional #211
Merged
Cchardet is not compatible with Python 3.11+ (there is an alpha release that works on 3.11 and 3.12, but it is now old too), and we need to make this package easier to install, update, and secure, and keep it that way. Cchardet was very dependable for a long time, but its maintainer eventually took a long break. It is effectively frozen, so we cannot depend on it as a hard requirement.
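For readers unfamiliar with the pattern, here is a minimal sketch of treating cchardet as an optional dependency with chardet as the fallback. It is not necessarily how this PR implements it: `load_detector` and its `candidates` parameter are hypothetical names, and the only library behavior assumed is that both packages expose a module-level `detect(bytes)` function.

```python
import importlib


def load_detector(candidates=("cchardet", "chardet")):
    """Return (module_name, detect_function) for the first importable candidate.

    Hypothetical helper: tries each module in order and falls back to the
    next one if it is not installed. Returns (None, None) if none import.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Not installed (or not installable on this Python); try the next.
            continue
        return name, module.detect
    return None, None


if __name__ == "__main__":
    name, detect = load_detector()
    if detect is None:
        print("No character detection library is installed.")
    else:
        print(f"Using {name}:", detect("déjà vu".encode("latin-1")))
```

Ordering the candidates this way keeps the faster C implementation in use wherever it still installs, without making it a hard requirement.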
Worth noting on performance: there is another decent pure-Python option called charset-normalizer that is newer and pretty actively maintained. I did a lot of testing with it a year and a half ago (see #165), and it was a little more accurate than chardet for some uncommon-on-the-web encodings, but about the same when focusing on web/WHATWG encodings and when pre-checking encoding declarations (which we already do). It is slightly faster than chardet on large documents, but notably slower on small documents. We always truncate to 18 kB, so we will always hit this “small document” case. For our uses, chardet is slightly better. That said, I did that testing in 2024! Things may have changed, and I hope to do a fresher dive and writeup on this in the next couple of months.
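To illustrate why the detector only ever sees the “small document” case, here is a rough sketch of the flow described above: truncate, pre-check for a declared encoding, then fall back to detection. Everything named here is an assumption for illustration (`sniff_encoding`, `MAX_SNIFF_BYTES`, the simplified `<meta>` regex, and reading 18 kB as 18 × 1024 bytes); the package's actual logic may differ.

```python
import re

# Assumption: "18 kB" is taken to mean 18 * 1024 bytes.
MAX_SNIFF_BYTES = 18 * 1024

# Simplified, hypothetical check for a declared charset in a <meta> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([a-zA-Z0-9_\-]+)', re.IGNORECASE
)


def sniff_encoding(raw, detect=None, default="utf-8"):
    """Guess the encoding of `raw` bytes.

    `detect` is any chardet-style callable that takes bytes and returns a
    dict with an "encoding" key; it only ever receives the truncated sample.
    """
    sample = raw[:MAX_SNIFF_BYTES]

    # 1. Prefer an explicit declaration if the document has one.
    match = META_CHARSET.search(sample)
    if match:
        return match.group(1).decode("ascii").lower()

    # 2. Otherwise ask the detector, if one is available.
    if detect is not None:
        guess = detect(sample).get("encoding")
        if guess:
            return guess.lower()

    # 3. Nothing declared and nothing detected.
    return default
```

Because the sample is capped before detection, a library's speed on large inputs never comes into play; only its small-document performance matters.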
Partially covers #196.