Mr0grog commented Dec 18, 2025

Cchardet is not compatible with Python 3.11+ (there is an alpha release that works on 3.11 and 3.12, but it is now old too), and we need to make and keep this package easier to install, update, and secure. Cchardet seemed very dependable for a long time, but its maintainer eventually took a long break. It's pretty frozen, and we cannot depend on it as a hard requirement.

Worth noting on performance: there is another decent pure-Python option that is newer and pretty actively maintained called charset-normalizer. I did a lot of testing with it a year and a half ago (see #165) and it was a little more accurate than chardet for some uncommon-on-the-web encodings, but about the same when focusing on web/WHATWG encodings and when pre-checking encoding declarations (which we already do). It is slightly faster than chardet on large documents, but notably slower on small documents. We always truncate to 18kB, so we will always see this “small document” case. For our uses, chardet is slightly better. That said, I did that testing in 2024! Things may have changed, and I hope to do a fresher dive and write-up on this in the next couple of months.
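For illustration only, here is a minimal sketch of what treating cchardet as optional and truncating before detection could look like. It is not the code in this PR: `load_detector`, `sniff_encoding`, and `MAX_DETECTION_BYTES` are hypothetical names, and 18,000 is an assumed reading of "18kB". The only library calls assumed are `cchardet.detect()` and `chardet.detect()`, which both take bytes and return a dict with an `encoding` key.

```python
import importlib

# Assumed interpretation of the "18kB" truncation limit mentioned above.
MAX_DETECTION_BYTES = 18_000


def load_detector(candidates=('cchardet', 'chardet')):
    """Return the `detect` function of the first importable module, or None.

    cchardet is tried first because it is faster, but it is optional:
    if it cannot be imported, the pure-Python chardet is used instead.
    """
    for name in candidates:
        try:
            return importlib.import_module(name).detect
        except ImportError:
            continue
    return None


def sniff_encoding(raw, detect=None, default='utf-8'):
    """Guess the encoding of `raw` from its first MAX_DETECTION_BYTES bytes."""
    detect = detect or load_detector()
    if detect is None:
        # No detector is installed; fall back to the default.
        return default
    result = detect(raw[:MAX_DETECTION_BYTES])
    # Detectors report `None` when they cannot make a guess.
    return result.get('encoding') or default
```

Because the input is always cut to the same small size, whichever detector is loaded only ever sees the "small document" case described above.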

Partially covers #196.

It looks like we have Versioneer compatibility issues on 3.12 and newer, and the cchardet alpha does not behave *quite* similarly enough to pass tests on 3.11 (it does work, though).

Mr0grog commented Dec 18, 2025

Of course, we happen to have a detection test that uses iso-8859-2, which chardet does not support and which charset-normalizer does not detect correctly. 😩

chardet apparently does not support the Bulgarian encoding we were using before. Sadly, charset_normalizer does not get it right either. Use something that works across detectors for testing.

Mr0grog commented Dec 18, 2025

This also updates Python support to v3.11. 3.12 and later have a compatibility issue with Versioneer that I need to work on next.

Mr0grog merged commit 0abd10f into main Dec 18, 2025
14 checks passed
Mr0grog deleted the cchardet-is-still-speedy-but-is-not-always-limber-enough branch December 18, 2025 22:16
github-project-automation bot moved this from Inbox to Done in Web Monitoring Dec 18, 2025
Mr0grog mentioned this pull request Dec 18, 2025