Open
Description
OS/Arch
system='Linux', node='jclvdell', release='6.8.0-40-generic', version='#40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2', machine='x86_64'
Python version
3.10.12
cChardet version
2.1.7
What is the problem?
A file (attached) containing the Euro sign is correctly identified as ISO-8859-15 by the xed editor, but cchardet detects it as ISO-8859-1.
Expected behavior
Corações Psicodélicos Nélida Piñón, § 2º, alínea 4ª, a 47° do eixo x. Custo: 50000¥ (ou €313,84)
Actual behavior
Corações Psicodélicos Nélida Piñón, § 2º, alínea 4ª, a 47° do eixo x. Custo: 50000¥ (ou ¤313,84)
(Euro symbol appears as "¤")
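For context, the difference between the two outputs comes down to one byte: 0xA4 is "€" in ISO-8859-15 and "¤" in ISO-8859-1. The sketch below is not part of the attached file; it uses only Python's standard codecs to show the mismatch and to list the byte values where the two encodings disagree. The variable names are illustrative.

```python
# Byte 0xA4 is the only byte in the example sentence that the two
# encodings interpret differently.
euro_byte = b"\xa4"

as_latin1 = euro_byte.decode("iso-8859-1")    # generic currency sign
as_latin9 = euro_byte.decode("iso-8859-15")   # euro sign

print(repr(as_latin1), repr(as_latin9))

# Collect every byte value that decodes to a different character
# under ISO-8859-1 and ISO-8859-15.
differing = [
    b for b in range(256)
    if bytes([b]).decode("iso-8859-1") != bytes([b]).decode("iso-8859-15")
]
print([hex(b) for b in differing])
```

Because every other byte decodes identically, a detector has very little evidence to separate the two encodings unless it treats one of these byte values as a signal.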
Steps to reproduce the behavior
- Get this file: pagininha2.html.gz
- Do this:
$ gunzip pagininha2.html.gz
$ python
>>> import cchardet as chardet
>>> with open("pagininha2.html", "rb") as f:
...     msg = f.read()
...     result = chardet.detect(msg)
...     print(result)
...
{'encoding': 'ISO-8859-1', 'confidence': 0.7640712261199951}
>>>
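Until detection handles this case, one possible workaround is to post-process the result. The sketch below is only a heuristic under a stated assumption: that in real-world text byte 0xA4 is far more likely to be a Euro sign than the generic currency sign. The function `prefer_latin9` is hypothetical and is not part of cchardet; it takes the raw bytes and the encoding name that a detector reported.

```python
from typing import Optional

EURO_BYTE = 0xA4  # "€" in ISO-8859-15, "¤" in ISO-8859-1


def prefer_latin9(data: bytes, detected: Optional[str]) -> Optional[str]:
    """Hypothetical helper: upgrade an ISO-8859-1 guess to ISO-8859-15
    when the data contains byte 0xA4 (assumed to be a Euro sign)."""
    if detected is None:
        return None
    if detected.upper() == "ISO-8859-1" and EURO_BYTE in data:
        return "ISO-8859-15"
    return detected


# Stand-in for the file contents, encoded the way the attached file is
# reported to be encoded.
sample = "Custo: 50000¥ (ou €313,84)".encode("iso-8859-15")

# "ISO-8859-1" stands in for the value reported by the detector.
encoding = prefer_latin9(sample, "ISO-8859-1")
print(encoding, sample.decode(encoding))
```

This trades one kind of misdetection for another: a genuine ISO-8859-1 file that uses "¤" would now be decoded with "€", so it is only suitable where the assumption above holds.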