Is there an existing issue for this?
Midnight Commander version and build configuration
Operating system
Is this issue reproducible using the latest version of Midnight Commander?
How to reproduce
mcedit, fully UTF-8 environment:
Try to enter a non-printable Unicode character such as U+FEFF, U+FFFE, U+FFFF, U+1FC00, U+10FFFE, U+10FFFF.
Typically you can enter these from the keyboard using Ctrl+Shift+U, then the hex code, then Enter or Space. This might depend on the OS, desktop, and terminal emulator.
Alternatively, copy-paste from somewhere (e.g. a graphical text editor, web browser).
Expected behavior
The codepoint should be inserted into the file, and shown in the UI as one replacement symbol.
Actual behavior
All but the last byte of the UTF-8 sequence are inserted; these in turn show up as multiple replacement symbols.
For example, U+FEFF "BOM" in UTF-8 is three bytes: ef bb bf. Instead, only the first two, ef bb, are inserted. This shows up as two replacement symbols (dot on black background, or similar).
Similarly, the highest valid Unicode character U+10FFFF is four bytes: f4 8f bf bf. Instead, only the first three bytes, f4 8f bf, are inserted into the file, showing up as three replacement symbols.
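To make the byte counts above easy to check against a hex dump of the saved file, here is a small Python sketch. It is not taken from the mc sources; the helper names `utf8_bytes` and `observed_bytes` are made up for illustration, and `observed_bytes` only models the symptom described here (last byte dropped), not whatever mcedit actually does internally.

```python
# Illustrative only: compares the full UTF-8 encoding of each code point
# from the reproduction steps with the truncated sequence described above.
CODEPOINTS = [0xFEFF, 0xFFFE, 0xFFFF, 0x1FC00, 0x10FFFE, 0x10FFFF]


def utf8_bytes(codepoint: int) -> bytes:
    """Return the UTF-8 encoding of a single code point."""
    return chr(codepoint).encode("utf-8")


def observed_bytes(codepoint: int) -> bytes:
    """Model the reported symptom: every byte except the last one."""
    return utf8_bytes(codepoint)[:-1]


if __name__ == "__main__":
    for cp in CODEPOINTS:
        expected = utf8_bytes(cp).hex(" ")
        observed = observed_bytes(cp).hex(" ")
        print(f"U+{cp:04X}  expected: {expected}  observed: {observed}")
```

Running it and comparing the "observed" column with `xxd` or `hexdump -C` output of a file saved from mcedit should show whether the truncation matches for every code point in the list.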
Additional context
slang and ncurses builds are both affected, so the problem is probably not in either of those libraries.