Make cchardet optional #211

Mr0grog · 2025-12-18T21:21:03Z

Cchardet is not compatible with Python 3.11+ (there is an alpha release that works on 3.11 and 3.12, but it is now old too), and we need to make and keep this package easier to install, update, and secure. Cchardet seemed very dependable for a long time, but its maintainer eventually took a long break. It is effectively frozen now, so we cannot depend on it as a hard requirement.
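For illustration only, here is a minimal sketch of what treating cchardet as an optional dependency could look like. It is not the code in this PR: `load_detector` and `sniff_encoding` are hypothetical names, and the fallback order (cchardet first, then chardet) is an assumption. It relies only on the fact that both libraries expose a `detect(bytes)` function returning a dict with an `encoding` key.

```python
import importlib


def load_detector(candidates=("cchardet", "chardet")):
    """Return the `detect` function of the first importable module, or None.

    The candidate order is an assumption: prefer the fast C detector when it
    is installed, otherwise fall back to the pure-Python one.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # Not installed (or not built for this Python); try the next one.
        detect = getattr(module, "detect", None)
        if callable(detect):
            return detect
    return None


def sniff_encoding(raw, detect=None, default="utf-8"):
    """Guess the encoding of `raw` bytes, using `default` when no guess is available."""
    detect = detect or load_detector()
    if detect is None:
        return default  # No detector installed at all.
    result = detect(raw) or {}
    return result.get("encoding") or default
```

Passing `detect` explicitly keeps the function easy to test without either library installed.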

Worth noting on performance: there is another decent pure-Python option that is newer and pretty actively maintained called charset-normalizer. I did a lot of testing with it a year and a half ago (see #165) and it was a little more accurate than chardet for some uncommon-on-the-web encodings, but about the same when focusing on web/WHATWG encodings and when pre-checking encoding declarations (which we already do). It is slightly faster than chardet on large documents, but notably slower on small documents. We always truncate to 18kB, so we will always see this “small document” case. For our uses, chardet is slightly better. That said, I did that testing in 2024! Things may have changed, and I hope to do a fresher dive and writeup on this in the next couple of months.
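As a rough sketch of the flow described above (check for a declared encoding first, then run a detector on a truncated prefix), something like the following could work. The names `MAX_SNIFF_BYTES`, `declared_encoding`, and `pick_encoding` are hypothetical, the regex is a simplified stand-in for a real WHATWG-style prescan, and reading "18kB" as 18 * 1024 bytes is an assumption.

```python
import codecs
import re

# Assumption: "18kB" is read here as 18 * 1024 bytes.
MAX_SNIFF_BYTES = 18 * 1024

# Simplified stand-in for a real prescan: finds charset=... inside a <meta> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE
)


def declared_encoding(raw):
    """Return the normalized codec name declared in a <meta> tag, or None."""
    match = META_CHARSET.search(raw[:MAX_SNIFF_BYTES])
    if not match:
        return None
    name = match.group(1).decode("ascii")
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None  # Declared, but not a codec Python knows about.


def pick_encoding(raw, detect, default="utf-8"):
    """Prefer a declared encoding; otherwise run `detect` on a truncated prefix."""
    declared = declared_encoding(raw)
    if declared:
        return declared
    result = detect(raw[:MAX_SNIFF_BYTES]) or {}
    return result.get("encoding") or default
```

Because the detector only ever sees the truncated prefix, its small-document performance is what matters in this flow.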

Partially covers #196.

It looks like we have Versioneer compatibility issues on 3.12 and newer, and the cchardet alpha does not behave *quite* closely enough to the original to pass tests on 3.11 (it does work, though).

Mr0grog · 2025-12-18T21:57:06Z

Of course, we happen to have a detection test that uses iso-8859-2, which chardet does not support and which charset-normalizer does not detect correctly. 😩

chardet apparently does not support the Bulgarian encoding we were using before. Sadly, charset_normalizer does not get it right either. Use something that works across detectors for testing.
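One way to keep a detection test independent of which detector is installed is to compare normalized codec names rather than raw strings, since detectors may spell the same encoding differently. This is a sketch under assumptions, not the test in this PR: `same_encoding` and `check_detectors` are hypothetical helpers, and each detector is assumed to be a callable that takes bytes and returns a dict with an `encoding` key.

```python
import codecs


def same_encoding(a, b):
    """True if two encoding names resolve to the same Python codec."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        return False  # At least one name is not a known codec.


def check_detectors(sample_text, encoding, detectors):
    """Encode `sample_text` and return {detector_name: guess} for wrong guesses.

    `detectors` maps a label to a callable taking bytes and returning a dict
    with an "encoding" key.
    """
    raw = sample_text.encode(encoding)
    failures = {}
    for name, detect in detectors.items():
        guess = (detect(raw) or {}).get("encoding")
        if not guess or not same_encoding(guess, encoding):
            failures[name] = guess
    return failures
```

An empty result means every detector in the mapping agreed with the expected encoding, so a test can simply assert that the returned dict is empty.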

Mr0grog · 2025-12-18T22:14:35Z

This also updates Python support to v3.11. 3.12 and later have a compatibility issue with Versioneer that I need to work on next.

Mr0grog added this to Web Monitoring Dec 18, 2025

github-project-automation bot moved this to Inbox in Web Monitoring Dec 18, 2025

Mr0grog added 3 commits December 18, 2025 13:24

Test on newer Pythons

0dbb042

Cache correctly in CI

ce910d2

Drop Python 3.12+, use faust-cchardet on 3.11

7c8553d

Test detection with something that is supported

2d3012b

Mr0grog merged commit 0abd10f into main Dec 18, 2025
14 checks passed

Mr0grog deleted the cchardet-is-still-speedy-but-is-not-always-limber-enough branch December 18, 2025 22:16

github-project-automation bot moved this from Inbox to Done in Web Monitoring Dec 18, 2025

Mr0grog mentioned this pull request Dec 18, 2025

Support Python 3.12-3.14 #212

Closed
