fix(curl): passthrough binary downloads to prevent UTF-8 corruption (#1087)#2181
Open
pszymkowiak wants to merge 1 commit into
Open
fix(curl): passthrough binary downloads to prevent UTF-8 corruption (#1087)#2181pszymkowiak wants to merge 1 commit into
pszymkowiak wants to merge 1 commit into
Conversation
`rtk curl <binary-url>` silently corrupted any binary download (tarballs, zip, png, pdf, ELF, ...) because the response body was captured via `String::from_utf8_lossy` in `core::stream::exec_capture`. Every non-UTF-8 byte got replaced with U+FFFD (3 bytes: 0xEF 0xBF 0xBD), so gzip magic 1f 8b 08 00 arrived at downstream consumers as 1f ef bf bd 08 00 and `tar -xzf` complained "not in gzip format". Now `rtk curl` runs `Command::output()` directly to keep stdout as `Vec<u8>`, then checks `std::str::from_utf8(&bytes).is_err()`. If the response is not valid UTF-8 (i.e. the lossy conversion would corrupt it), raw bytes are written through to stdout via `write_all` and tracking is recorded as passthrough (0% savings — token counts over binary content have no meaning). The text/JSON code path is unchanged. Verified live on 18 real binary formats (rtk's own release artifacts, ripgrep, bat, fd, hyperfine, gh, tokei, kubectl ELF 51MB, rustup-init 20MB, W3C PDF, ISO 9899 PDF 4MB, GitHub avatar PNG, Wikimedia GIF, WebP sample, MP3/MP4/WAV samples) — all byte-identical to raw curl. 10 text regression tests (JSON/HTML/Markdown/Cargo.toml/RFC) confirm the text path keeps its existing behavior. Closes #1087
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
rtk curl <binary-url>silently corrupted binary downloads (#1087). Agzip tarball downloaded with
rtk curl ... | tar -xzf -failed withgzip: stdin: not in gzip format.Root cause
src/core/stream.rs:538captures stdout via:For binary content, every non-UTF-8 byte is replaced with U+FFFD
(
0xEF 0xBF 0xBD, 3 bytes). Gzip magic1f 8b 08 00 ...arrives as1f ef bf bd 08 00 .... Same problem for zip, PNG, PDF, ELF, etc.This was masked for a long time by the SIGPIPE panic (#1004) — once
that was fixed by #1048 (merged 2026-05-29), the underlying corruption
became visible to users.
Fix
Local to
src/cmds/cloud/curl_cmd.rs(does not touch the globalexec_captureAPI — that would impact 141 call sites across 6modules):
exec_capture(&mut cmd)withcmd.output()directly tokeep stdout as
Vec<u8>.is_binary(&[u8]) -> boolhelper — one line:from_utf8_lossywould corrupt the stream is when the bytes arenot valid UTF-8. Catches every binary format without needing a
magic-byte database to maintain.
stdout.write_all(&output.stdout)raw +timer.track_passthrough(...)(0% savings).from_utf8_lossy+filter_curl_outputflow as before.Verification
Live tests against 18 real binary formats from public CDNs — all
byte-identical to raw curl:
The 18 MB and 51 MB ELF binaries roundtrip byte-perfect, confirming the
fix scales.
Tests
4 new unit tests, all green:
test_is_binary_gzip_magic_is_not_utf8test_is_binary_valid_utf8_text_is_not_binary(JSON, HTML, ASCII, UTF-8 emoji)test_is_binary_empty_is_not_binarytest_is_binary_text_with_nul_is_not_binary(NUL is valid UTF-8)cargo fmt --all,cargo clippy --all-targets -- -D warnings,cargo test --all— 1986 passed, 0 failed.Out of scope
The existing text path applies
.trim()+println!("\n")to textresponses in pipe mode, so a byte-strict downstream parser may see
±1 byte of trailing whitespace. That's pre-existing behavior, not
introduced by this PR, and worth a separate issue if the agent
ecosystem needs byte-faithful text passthrough.
Closes #1087.