-
-
Notifications
You must be signed in to change notification settings - Fork 34.4k
lib: correct windows-1252 decoding in TextDecoder #60737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
lib: correct windows-1252 decoding in TextDecoder #60737
Conversation
9226e25 to
d57a575
Compare
d57a575 to
ba71243
Compare
The TextDecoder was incorrectly using the Latin-1 fast path for windows-1252 encoding, which caused incorrect decoding of bytes in the 0x80-0x9F range. The issue occurs because windows-1252 differs from ISO-8859-1 (Latin-1) in this byte range. The simdutf library's convert_latin1_to_utf8 function directly maps bytes to Unicode codepoints (e.g., 0x92 → U+0092), which is correct for ISO-8859-1 but incorrect for windows-1252, where 0x92 should map to U+2019 (RIGHT SINGLE QUOTATION MARK '). This fix disables the Latin-1 fast path for windows-1252, forcing the decoder to use the ICU converter which correctly handles the windows-1252 specific character mappings according to the WHATWG Encoding Standard. The fix includes comprehensive tests for all 32 affected characters (bytes 0x80-0x9F) to prevent regression. Fixes: nodejs#56542 Refs: https://encoding.spec.whatwg.org/#windows-1252
ba71243 to
24a2207
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #60737 +/- ##
==========================================
+ Coverage 88.53% 88.56% +0.02%
==========================================
Files 703 703
Lines 208226 208254 +28
Branches 40145 40172 +27
==========================================
+ Hits 184352 184437 +85
+ Misses 15884 15822 -62
- Partials 7990 7995 +5
🚀 New features to boost your workflow:
|
The Latin-1 fast path was incorrectly enabled only for windows-1252 encoding, which differs from ISO-8859-1 (Latin-1) in the 0x80-0x9F range. Since windows-1252 cannot use the Latin-1 fast path (it requires different character mappings via ICU), and no other encoding uses it, the entire Latin-1 fast path mechanism has been removed. This simplifies the code while fixing the windows-1252 decoding issue. Windows-1252 now correctly uses the ICU decoder for all characters. Fixes: nodejs#56542
Remove the decodeLatin1 import from encoding_binding as it is no longer used after disabling the Latin-1 fast path for Windows-1252.
|
decodeLatin1 has no usage outside of this and should be just removed, it was added under a wrong assumption i noticed this PR just now, but I already filed #60889 Tests here are useful though! |
|
This can be closed, #61118 landed, which removed the broken codepath. |
Description
Fixes #56542
This PR corrects the Windows-1252 decoding in
TextDecoderby disabling the Latin-1 fast path for Windows-1252 encoding.Problem
The
TextDecoderwas incorrectly using the Latin-1 fast path (via simdutf'sconvert_latin1_to_utf8) for Windows-1252 encoding. This caused incorrect decoding of bytes in the 0x80-0x9F range.The root cause is that Windows-1252 differs from ISO-8859-1 (Latin-1) in this byte range:
Solution
Disable the Latin-1 fast path for Windows-1252 by setting
this[kLatin1FastPath] = false. This forces the decoder to use the ICU converter (getConverter()), which correctly handles Windows-1252 character mappings according to the WHATWG Encoding Standard.Changes
lib/internal/encoding.jsline 423 to disable Latin-1 fast pathTest Coverage
The new test file
test/parallel/test-whatwg-encoding-custom-windows-1252.jsverifies:References