Skip to content

Bump chardet from 5.2.0 to 6.0.0.post1#584

Merged
Bachmann1234 merged 1 commit intomainfrom
dependabot/pip/chardet-6.0.0.post1
Mar 1, 2026
Merged

Bump chardet from 5.2.0 to 6.0.0.post1#584
Bachmann1234 merged 1 commit intomainfrom
dependabot/pip/chardet-6.0.0.post1

Conversation

@dependabot
Copy link
Contributor

@dependabot dependabot bot commented on behalf of github Mar 1, 2026

Bumps chardet from 5.2.0 to 6.0.0.post1.

Release notes

Sourced from chardet's releases.

6.0.0.post1

  • Fixed version number in chardet/version.py still being set to 6.0.0dev0. Otherwise identical to 6.0.0.

6.0.0

Features

  • Unified single-byte charset detection: Instead of only having trained language models for a handful of languages (Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, Turkish) and relying on special-case Latin1Prober and MacRomanProber heuristics for Western encodings, chardet now treats all single-byte charsets the same way: every encoding gets proper language-specific bigram models trained on CulturaX corpus data. This means chardet can now accurately detect both the encoding and the language for all supported single-byte encodings.
  • 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, and Welsh. Existing models for Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, and Turkish were also retrained with the new pipeline.
  • EncodingEra filtering: New encoding_era parameter to detect allows filtering by an EncodingEra flag enum (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME, ALL) allows callers to restrict detection to encodings from a specific era. detect() and detect_all() default to MODERN_WEB. The new MODERN_WEB default should drastically improve accuracy for users who are not working with legacy data. The tiers are:
    • MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, CJK multi-byte (widely used on the web)
    • LEGACY_ISO: ISO-8859-x, KOI8-R/U (legacy but well-known standards)
    • LEGACY_MAC: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
    • LEGACY_REGIONAL: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
    • DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)
    • MAINFRAME: EBCDIC variants (CP037, CP500, etc.)
  • --encoding-era CLI flag: The chardetect CLI now accepts -e/--encoding-era to control which encoding eras are considered during detection.
  • max_bytes and chunk_size parameters: detect(), detect_all(), and UniversalDetector now accept max_bytes (default 200KB) and chunk_size (default 64KB) parameters for controlling how much data is examined. (#314, @​bysiber)
  • Encoding era preference tie-breaking: When multiple encodings have very close confidence scores, the detector now prefers more modern/Unicode encodings over legacy ones.
  • Charset metadata registry: New chardet.metadata.charsets module provides structured metadata about all supported encodings, including their era classification and language filter.
  • should_rename_legacy now defaults intelligently: When set to None (the new default), legacy renaming is automatically enabled when encoding_era is MODERN_WEB.
  • Direct GB18030 support: Replaced the redundant GB2312 prober with a proper GB18030 prober.
  • EBCDIC detection: Added CP037 and CP500 EBCDIC model registrations for mainframe encoding detection.
  • Binary file detection: Added basic binary file detection to abort analysis earlier on non-text files.
  • Python 3.12, 3.13, and 3.14 support (#283, @​hugovk; #311)
  • GitHub Codespace support (#312, @​oxygen-dioxide)

Fixes

  • Fix CP949 state machine: Corrected the state machine for Korean CP949 encoding detection. (#268, @​nenw)
  • Fix SJIS distribution analysis: Fixed SJISDistributionAnalysis discarding valid second-byte range >= 0x80. (#315, @​bysiber)
  • Fix UTF-16/32 detection for non-ASCII-heavy text: Improved detection of UTF-16/32 encoded CJK and other non-ASCII text by adding a MIN_RATIO threshold alongside the existing EXPECTED_RATIO.
  • Fix get_charset crash: Resolved a crash when looking up unknown charset names.
  • Fix GB18030 char_len_table: Corrected the character length table for GB18030 multi-byte sequences.
  • Fix UTF-8 state machine: Updated to be more spec-compliant.
  • Fix detect_all() returning inactive probers: Results from probers that determined "definitely not this encoding" are now excluded.
  • Fix early cutoff bug: Resolved an issue where detection could terminate prematurely.
  • Default UTF-8 fallback: If UTF-8 has not been ruled out and nothing else is above the minimum threshold, UTF-8 is now returned as the default.

Breaking changes

  • Dropped Python 3.7, 3.8, and 3.9 support: Now requires Python 3.10+. (#283, @​hugovk)
  • Removed Latin1Prober and MacRomanProber: These special-case probers have been replaced by the unified model-based approach described above. Latin-1, MacRoman, and all other single-byte encodings are now detected by SingleByteCharSetProber with trained language models, giving better accuracy and language identification.
  • Removed EUC-TW support: EUC-TW encoding detection has been removed as it is extremely rare in practice.
  • LanguageFilter.NONE removed: Use specific language filters or LanguageFilter.ALL instead.
  • Enum types changed: InputState, ProbingState, MachineState, SequenceLikelihood, and CharacterCategory are now IntEnum (previously plain classes or Enum). LanguageFilter values changed from hardcoded hex to auto().
  • detect() default behavior change: detect() now defaults to encoding_era=EncodingEra.MODERN_WEB and should_rename_legacy=None (auto-enabled for MODERN_WEB), whereas previously it defaulted to considering all encodings with no legacy renaming.

Misc changes

  • Switched from Poetry/setuptools to uv + hatchling: Build system modernized with hatch-vcs for version management.

... (truncated)

Commits
  • 2fa72d8 Update version to 6.0.0.post1
  • 8a4636b docs: modernize usage examples and reorganize table of contents
  • 20da71e docs: fix copyright start year and remove first-person reference
  • b45ae91 docs: update copyright to 2015-2026 chardet contributors
  • 3f9910d Add .readthedocs.yaml to fix RTD builds
  • 7ef7cd0 Fix pyright type errors in chardetect.py and test.py
  • 4025dfa Update documentation for 6.0.0 release
  • 1170829 Add LEGACY_REGIONAL encoding era and reclassify misplaced encodings
  • 19379ac Add --encoding-era CLI flag and improve heuristic selection
  • 61308e2 Pre-release fixes: bump to 6.0.0, fix get_charset crash, cleanup
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [chardet](https://github.com/chardet/chardet) from 5.2.0 to 6.0.0.post1.
- [Release notes](https://github.com/chardet/chardet/releases)
- [Commits](chardet/chardet@5.2.0...6.0.0.post1)

---
updated-dependencies:
- dependency-name: chardet
  dependency-version: 6.0.0.post1
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot added dependencies Pull requests that update a dependency file python Pull requests that update Python code labels Mar 1, 2026
@Bachmann1234 Bachmann1234 merged commit ba1ae85 into main Mar 1, 2026
26 checks passed
@Bachmann1234 Bachmann1234 deleted the dependabot/pip/chardet-6.0.0.post1 branch March 1, 2026 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file python Pull requests that update Python code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant