fix(curl): passthrough binary downloads to prevent UTF-8 corruption (#1087) by pszymkowiak · Pull Request #2181 · rtk-ai/rtk

pszymkowiak · 2026-05-31T09:06:56Z

Summary

rtk curl <binary-url> silently corrupted binary downloads (#1087). A
gzip tarball downloaded with rtk curl ... | tar -xzf - failed with
gzip: stdin: not in gzip format.

Root cause

src/core/stream.rs:538 captures stdout via:

stdout: String::from_utf8_lossy(&output.stdout).into_owned()

For binary content, every non-UTF-8 byte is replaced with U+FFFD
(0xEF 0xBF 0xBD, 3 bytes). Gzip magic 1f 8b 08 00 ... arrives as
1f ef bf bd 08 00 .... Same problem for zip, PNG, PDF, ELF, etc.

This was masked for a long time by the SIGPIPE panic (#1004) — once
that was fixed by #1048 (merged 2026-05-29), the underlying corruption
became visible to users.

Fix

Local to src/cmds/cloud/curl_cmd.rs (does not touch the global
exec_capture API — that would impact 141 call sites across 6
modules):

Replace exec_capture(&mut cmd) with cmd.output() directly to
keep stdout as Vec<u8>.
New is_binary(&[u8]) -> bool helper — one line:
```
std::str::from_utf8(bytes).is_err()
```
Correct by construction: the only condition under which
from_utf8_lossy would corrupt the stream is when the bytes are
not valid UTF-8. Catches every binary format without needing a
magic-byte database to maintain.
On binary detection: stdout.write_all(&output.stdout) raw +
timer.track_passthrough(...) (0% savings).
Text/JSON/HTML path unchanged — same from_utf8_lossy +
filter_curl_output flow as before.

Verification

Live tests against 18 real binary formats from public CDNs — all
byte-identical to raw curl:

Category	Samples	Result
Archives	rtk .tar.gz/.zip/.deb/.rpm, ripgrep, bat, fd	✅ 8/8
Executables	kubectl ELF 51 MB, rustup-init 20 MB, gh CLI, hyperfine, tokei	✅ 5/5
Images	GitHub avatar PNG, Wikimedia GIF, WebP, ICO, octocat PNG	✅ 5/5
Documents	W3C PDF, ISO 9899 PDF 4 MB	✅ 2/2
Audio/Video	MP3, MP4, WAV	✅ 3/3
Text regression	JSON, HTML, Markdown, Cargo.toml, plain text	✅ no path-text behavior change

The 18 MB and 51 MB ELF binaries roundtrip byte-perfect, confirming the
fix scales.

Tests

4 new unit tests, all green:

test_is_binary_gzip_magic_is_not_utf8
test_is_binary_valid_utf8_text_is_not_binary (JSON, HTML, ASCII, UTF-8 emoji)
test_is_binary_empty_is_not_binary
test_is_binary_text_with_nul_is_not_binary (NUL is valid UTF-8)

cargo fmt --all, cargo clippy --all-targets -- -D warnings,
cargo test --all — 1986 passed, 0 failed.

Out of scope

The existing text path applies .trim() + println!("\n") to text
responses in pipe mode, so a byte-strict downstream parser may see
±1 byte of trailing whitespace. That's pre-existing behavior, not
introduced by this PR, and worth a separate issue if the agent
ecosystem needs byte-faithful text passthrough.

Closes #1087.

`rtk curl <binary-url>` silently corrupted any binary download (tarballs, zip, png, pdf, ELF, ...) because the response body was captured via `String::from_utf8_lossy` in `core::stream::exec_capture`. Every non-UTF-8 byte got replaced with U+FFFD (3 bytes: 0xEF 0xBF 0xBD), so gzip magic 1f 8b 08 00 arrived at downstream consumers as 1f ef bf bd 08 00 and `tar -xzf` complained "not in gzip format". Now `rtk curl` runs `Command::output()` directly to keep stdout as `Vec<u8>`, then checks `std::str::from_utf8(&bytes).is_err()`. If the response is not valid UTF-8 (i.e. the lossy conversion would corrupt it), raw bytes are written through to stdout via `write_all` and tracking is recorded as passthrough (0% savings — token counts over binary content have no meaning). The text/JSON code path is unchanged. Verified live on 18 real binary formats (rtk's own release artifacts, ripgrep, bat, fd, hyperfine, gh, tokei, kubectl ELF 51MB, rustup-init 20MB, W3C PDF, ISO 9899 PDF 4MB, GitHub avatar PNG, Wikimedia GIF, WebP sample, MP3/MP4/WAV samples) — all byte-identical to raw curl. 10 text regression tests (JSON/HTML/Markdown/Cargo.toml/RFC) confirm the text path keeps its existing behavior. Closes #1087

KuSh

LGTM!

pszymkowiak requested review from KuSh and aeppling May 31, 2026 09:07

KuSh approved these changes May 31, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(curl): passthrough binary downloads to prevent UTF-8 corruption (#1087)#2181

fix(curl): passthrough binary downloads to prevent UTF-8 corruption (#1087)#2181
pszymkowiak wants to merge 1 commit into
developfrom
fix/curl-binary-passthrough-1087

pszymkowiak commented May 31, 2026

Uh oh!

KuSh left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pszymkowiak commented May 31, 2026

Summary

Root cause

Fix

Verification

Tests

Out of scope

Uh oh!

KuSh left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants