fix: non-ASCII / UTF-8 robustness (git filenames, gain truncation, proxy capture) #2155

Open
snwsnwsnw wants to merge 4 commits into rtk-ai:develop from snwsnwsnw:fix/non-ascii-utf8-robustness

@snwsnwsnw

Three small, independent fixes for handling non-ASCII (CJK and other multibyte UTF-8) text. Each is in its own commit.

1. git: non-ASCII filenames shown as octal escapes

Git escapes non-ASCII path bytes as octal \nnn by default (core.quotepath=true), and rtk passed that through unchanged.

# before
$ rtk git status
?? "\345\214\205\350\243\271\347\260\275\346\224\266\346\270\205\345\226\256.txt"
# after
$ rtk git status
?? 包裹簽收清單.txt

This affects git status, git log --name-only, git diff --stat, and any other subcommand that prints paths. The fix injects -c core.quotepath=false at the single git_cmd() chokepoint, so every git invocation picks it up. ASCII paths are unaffected.
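
For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of injecting the override at a single command-builder function. The name `git_cmd_sketch` and its signature are illustrative assumptions, not rtk's actual `git_cmd()`; only the `-c core.quotepath=false` override itself comes from this PR.

```rust
use std::process::Command;

/// Hypothetical stand-in for a central git command builder.
/// The real git_cmd() in rtk may take different arguments.
fn git_cmd_sketch(args: &[&str]) -> Command {
    let mut cmd = Command::new("git");
    // `-c key=value` must come before the subcommand so git treats it
    // as a one-shot config override for this invocation only.
    cmd.arg("-c").arg("core.quotepath=false");
    // Caller-supplied subcommand and flags follow the override.
    cmd.args(args);
    cmd
}
```

Because the override is passed per invocation, it does not touch the user's git config files.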

2. gain: byte-index slicing can panic on multibyte input

gain --history and the failure summary truncated command strings with &cmd[..47] / &rec.rtk_cmd[..22] / &rec.raw_command[..37]. A command containing multibyte UTF-8 (a non-ASCII search pattern, a non-ASCII commit message, etc.) can land the cut mid-codepoint and panic. Switched to the existing char-safe utils::truncate().
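
To illustrate why the byte slice is unsafe and what a char-safe replacement looks like, here is a small sketch. `truncate_chars` is a hypothetical helper written for this example; the signature and ellipsis behavior of rtk's `utils::truncate()` are not shown in this PR and may differ.

```rust
/// Hypothetical char-safe truncation: keeps at most `max_chars` Unicode
/// scalar values, replacing the tail with "..." when the input is longer.
/// Assumes max_chars >= 3; smaller limits still yield "...".
fn truncate_chars(s: &str, max_chars: usize) -> String {
    if s.chars().count() <= max_chars {
        return s.to_string();
    }
    // Reserve three characters for the ellipsis.
    let keep = max_chars.saturating_sub(3);
    // Iterating by char never splits a multibyte sequence.
    let mut out: String = s.chars().take(keep).collect();
    out.push_str("...");
    out
}

/// Returns true when `&s[..byte_idx]` would panic because the index is
/// out of range or falls inside a multibyte sequence.
fn byte_slice_would_panic(s: &str, byte_idx: usize) -> bool {
    !s.is_char_boundary(byte_idx)
}
```

Note that this counts Unicode scalar values, not display columns, so wide CJK glyphs may still occupy more terminal cells than the limit suggests.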

3. proxy: spurious replacement char at the 1 MiB capture cap

The proxy streaming path caps the captured copy at 1 MiB. When the cap lands inside a multibyte sequence, from_utf8_lossy appended a trailing U+FFFD to the tracked output. decode_captured() trims an incomplete trailing sequence while preserving lossy behavior for genuinely invalid mid-stream bytes. (User-facing stdout was already byte-exact; this only affected the captured copy used for tracking.)
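
The trimming logic can be sketched as follows. This is one possible way to implement the behavior described above, not necessarily the code in this PR; `decode_captured_sketch` is a hypothetical name, and the approach (scan back at most three bytes for a lead byte whose sequence is cut short) is an assumption.

```rust
/// Hypothetical sketch: drop an incomplete trailing UTF-8 sequence, then
/// decode lossily so invalid bytes elsewhere still become U+FFFD.
fn decode_captured_sketch(bytes: &[u8]) -> String {
    let len = bytes.len();
    let mut cut = len;
    // A UTF-8 sequence is at most 4 bytes, so an incomplete one at the
    // end has its lead byte within the last 3 bytes.
    for back in 1..=len.min(3) {
        let b = bytes[len - back];
        if b & 0b1100_0000 == 0b1000_0000 {
            continue; // continuation byte: keep looking for the lead byte
        }
        // Expected sequence length implied by the lead byte.
        let needed = match b {
            0xC2..=0xDF => 2,
            0xE0..=0xEF => 3,
            0xF0..=0xF4 => 4,
            _ => 1, // ASCII or invalid lead: leave for lossy decoding
        };
        if needed > back {
            cut = len - back; // sequence was truncated by the cap
        }
        break;
    }
    String::from_utf8_lossy(&bytes[..cut]).into_owned()
}
```

With this shape, a buffer cut after the first byte of a three-byte character loses only that partial character, while a stray invalid byte in the middle of the stream is still replaced with U+FFFD.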

Testing

cargo build --release is clean. Verified before/after on a repo with CJK filenames and commit messages for fix 1; gain --history no longer panics for fix 2.

@CLAassistant

CLAassistant commented May 29, 2026

CLA assistant check
All committers have signed the CLA.

snwsnwsnw added 3 commits May 30, 2026 00:09

Git escapes non-ASCII path bytes as octal \nnn by default
(core.quotepath=true). rtk passed this straight through, so
git status / log --name-only / diff --stat showed CJK and other
non-ASCII filenames as unreadable escapes. Inject
-c core.quotepath=false at the single git_cmd() chokepoint;
no effect on ASCII paths.

gain --history and the failure summary sliced command strings by
byte index (&cmd[..47] etc). A command containing multibyte UTF-8
(e.g. a non-ASCII search pattern or commit message) could be cut
mid-codepoint and panic. Use the existing char-safe utils::truncate().

The proxy streaming path caps the captured copy at 1 MiB. When the
cap lands inside a multibyte UTF-8 sequence, from_utf8_lossy emitted a
trailing replacement char into the tracked output. decode_captured()
trims an incomplete trailing sequence while keeping lossy behavior for
genuinely invalid mid-stream bytes.

snwsnwsnw force-pushed the fix/non-ascii-utf8-robustness branch from 2378586 to 69b2f92 May 29, 2026 16:11