Skip to content

fix(curl): passthrough binary downloads to prevent UTF-8 corruption (#1087)#2181

Open
pszymkowiak wants to merge 1 commit into
developfrom
fix/curl-binary-passthrough-1087
Open

fix(curl): passthrough binary downloads to prevent UTF-8 corruption (#1087)#2181
pszymkowiak wants to merge 1 commit into
developfrom
fix/curl-binary-passthrough-1087

Conversation

@pszymkowiak
Copy link
Copy Markdown
Collaborator

Summary

rtk curl <binary-url> silently corrupted binary downloads (#1087). A
gzip tarball downloaded with rtk curl ... | tar -xzf - failed with
gzip: stdin: not in gzip format.

Root cause

src/core/stream.rs:538 captures stdout via:

stdout: String::from_utf8_lossy(&output.stdout).into_owned()

For binary content, every non-UTF-8 byte is replaced with U+FFFD
(0xEF 0xBF 0xBD, 3 bytes). Gzip magic 1f 8b 08 00 ... arrives as
1f ef bf bd 08 00 .... Same problem for zip, PNG, PDF, ELF, etc.

This was masked for a long time by the SIGPIPE panic (#1004) — once
that was fixed by #1048 (merged 2026-05-29), the underlying corruption
became visible to users.

Fix

Local to src/cmds/cloud/curl_cmd.rs (does not touch the global
exec_capture API — that would impact 141 call sites across 6
modules):

  1. Replace exec_capture(&mut cmd) with cmd.output() directly to
    keep stdout as Vec<u8>.
  2. New is_binary(&[u8]) -> bool helper — one line:
    std::str::from_utf8(bytes).is_err()
    Correct by construction: the only condition under which
    from_utf8_lossy would corrupt the stream is when the bytes are
    not valid UTF-8. Catches every binary format without needing a
    magic-byte database to maintain.
  3. On binary detection: stdout.write_all(&output.stdout) raw +
    timer.track_passthrough(...) (0% savings).
  4. Text/JSON/HTML path unchanged — same from_utf8_lossy +
    filter_curl_output flow as before.

Verification

Live tests against 18 real binary formats from public CDNs — all
byte-identical to raw curl
:

Category Samples Result
Archives rtk .tar.gz/.zip/.deb/.rpm, ripgrep, bat, fd ✅ 8/8
Executables kubectl ELF 51 MB, rustup-init 20 MB, gh CLI, hyperfine, tokei ✅ 5/5
Images GitHub avatar PNG, Wikimedia GIF, WebP, ICO, octocat PNG ✅ 5/5
Documents W3C PDF, ISO 9899 PDF 4 MB ✅ 2/2
Audio/Video MP3, MP4, WAV ✅ 3/3
Text regression JSON, HTML, Markdown, Cargo.toml, plain text ✅ no path-text behavior change

The 18 MB and 51 MB ELF binaries roundtrip byte-perfect, confirming the
fix scales.

Tests

4 new unit tests, all green:

  • test_is_binary_gzip_magic_is_not_utf8
  • test_is_binary_valid_utf8_text_is_not_binary (JSON, HTML, ASCII, UTF-8 emoji)
  • test_is_binary_empty_is_not_binary
  • test_is_binary_text_with_nul_is_not_binary (NUL is valid UTF-8)

cargo fmt --all, cargo clippy --all-targets -- -D warnings,
cargo test --all — 1986 passed, 0 failed.

Out of scope

The existing text path applies .trim() + println!("\n") to text
responses in pipe mode, so a byte-strict downstream parser may see
±1 byte of trailing whitespace. That's pre-existing behavior, not
introduced by this PR, and worth a separate issue if the agent
ecosystem needs byte-faithful text passthrough.

Closes #1087.

`rtk curl <binary-url>` silently corrupted any binary download
(tarballs, zip, png, pdf, ELF, ...) because the response body was
captured via `String::from_utf8_lossy` in `core::stream::exec_capture`.
Every non-UTF-8 byte got replaced with U+FFFD (3 bytes: 0xEF 0xBF 0xBD),
so gzip magic 1f 8b 08 00 arrived at downstream consumers as
1f ef bf bd 08 00 and `tar -xzf` complained "not in gzip format".

Now `rtk curl` runs `Command::output()` directly to keep stdout as
`Vec<u8>`, then checks `std::str::from_utf8(&bytes).is_err()`. If the
response is not valid UTF-8 (i.e. the lossy conversion would corrupt
it), raw bytes are written through to stdout via `write_all` and
tracking is recorded as passthrough (0% savings — token counts over
binary content have no meaning). The text/JSON code path is unchanged.

Verified live on 18 real binary formats (rtk's own release artifacts,
ripgrep, bat, fd, hyperfine, gh, tokei, kubectl ELF 51MB, rustup-init
20MB, W3C PDF, ISO 9899 PDF 4MB, GitHub avatar PNG, Wikimedia GIF,
WebP sample, MP3/MP4/WAV samples) — all byte-identical to raw curl.
10 text regression tests (JSON/HTML/Markdown/Cargo.toml/RFC) confirm
the text path keeps its existing behavior.

Closes #1087
@pszymkowiak pszymkowiak requested review from KuSh and aeppling May 31, 2026 09:07
Copy link
Copy Markdown
Collaborator

@KuSh KuSh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rtk curl http://....tar.gz | tar -xzf fails due rtk truncation, eg "(153 more lines, 94292 bytes total)"

2 participants