Use ascii() instead of repr() to escape non-ASCII characters #39

Open
assisted-by-ai wants to merge 1 commit into Kicksecure:master from assisted-by-ai:claude/unicode-bypass-bugs-BnpRK
@assisted-by-ai
Summary

Changes the character display logic in unicode_show.py to use Python's ascii() function instead of repr(), so that every non-ASCII character is escaped in the output and no suspicious character appears literally in the terminal.

Key Changes

  • Modified describe_char() function: Replaced repr(c) with ascii(c) when displaying characters that shouldn't be shown literally
  • Added comprehensive test coverage: New test_printable_non_ascii_chars_are_escaped() test that validates escaping of various character types:
    • Accented letters (é)
    • Cyrillic characters
    • Combining marks
    • CJK ideographs
    • Emoji
    • Currency symbols

Implementation Details

The change addresses a critical safety issue: Python's repr() escapes only non-printable characters, so printable non-ASCII characters (letters, homoglyphs, combining marks, CJK, emoji, symbols, etc.) pass through literally. Because unicode_show exists to safely display and identify suspicious Unicode characters, it now uses ascii(), which escapes every non-ASCII character to an ASCII-only representation and keeps such characters out of the terminal output.
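The difference between the two built-ins can be seen in a small standalone sketch. It is not taken from unicode_show.py; the `compare` helper and the sample list are illustrative assumptions, and only `repr()` and `ascii()` are the real built-ins discussed above.

```python
# Standalone illustration; not code from unicode_show.py.
# `compare` and `SAMPLES` are hypothetical names used only for this sketch.

def compare(c: str) -> tuple[str, str]:
    """Return (repr(c), ascii(c)) so the two renderings can be compared."""
    return repr(c), ascii(c)

SAMPLES = [
    "\u00e9",      # accented letter (printable)
    "\u0430",      # Cyrillic small a, a homoglyph of Latin 'a' (printable)
    "\u0301",      # combining acute accent (printable)
    "\u4e2d",      # CJK ideograph (printable)
    "\U0001f600",  # emoji (printable)
    "\u20ac",      # euro sign (printable)
    "\u200b",      # zero width space (non-printable, so repr() escapes it too)
]

for c in SAMPLES:
    via_repr, via_ascii = compare(c)
    # ascii() output is always pure ASCII; repr() output is ASCII only when
    # the character is non-printable.
    print(f"U+{ord(c):04X} repr_is_ascii={via_repr.isascii()} ascii={via_ascii}")
```

For the printable samples, `repr()` returns the character itself wrapped in quotes, while `ascii()` returns an escape such as `'\xe9'` or `'\u0430'`.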

The test suite verifies that:

  • All expected characters are properly escaped
  • The output is ASCII-only
  • Both stdin and file input modes work correctly
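The first two checks can be sketched as follows. This is not the PR's test_printable_non_ascii_chars_are_escaped(); `render_char` is a hypothetical stand-in for the part of describe_char() that renders a character, and the sketch does not exercise the stdin or file input modes.

```python
# Hedged sketch of the kind of checks described above; the real test in the
# PR may be structured differently. `render_char` is a hypothetical stand-in.

def render_char(c: str) -> str:
    """Assumed rendering step: escape the character with ascii()."""
    return ascii(c)

SUSPICIOUS = [
    "\u00e9",      # accented letter
    "\u0430",      # Cyrillic small a
    "\u0301",      # combining acute accent
    "\u4e2d",      # CJK ideograph
    "\U0001f600",  # emoji
    "\u20ac",      # currency symbol
]

def check_all_escaped(chars: list[str]) -> bool:
    """Assert that each character is rendered as ASCII-only text."""
    for c in chars:
        rendered = render_char(c)
        # The output must be pure ASCII ...
        assert rendered.isascii(), f"non-ASCII output for U+{ord(c):04X}"
        # ... and the raw character must not appear in it.
        assert c not in rendered, f"U+{ord(c):04X} appeared literally"
    return True

check_all_escaped(SUSPICIOUS)
```

Replacing `ascii(c)` with `repr(c)` in `render_char` makes both assertions fail for every character in the list, which is the regression the test is meant to catch.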

https://claude.ai/code/session_01JiGZC3R3SjVVdNbkUnXjES

describe_char used repr(c) to render suspicious characters in the
description line. In Python 3, repr() only escapes characters that are
not printable, so printable non-ASCII characters — letters (including
Cyrillic/Greek/etc. homoglyphs), CJK, emoji, symbols, and combining
marks — are passed through literally. This lets a suspicious character
slip into unicode-show's own terminal output, defeating the tool's core
purpose: a combining acute accent merges with the adjacent quote, a
Cyrillic 'а' still reads as Latin 'a', etc.

Use ascii(), which always returns an ASCII-only escaped representation,
and add a regression test covering letters, homoglyphs, combining marks,
CJK, emoji, and currency symbols.
2 participants