Deep-dive follow-up to the framing-desync trio (#1805 reconnect, #1806 per-type cap). The specific bug that started this investigation:
[runtime-agent] Socket read error: frame too large: 1819243560 bytes (max 104857600)
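0x6C6F6761 = ASCII "loga": a 4-byte length prefix happened to land on text bytes starting with "loga" (probably from "Logan"; the user's son's name was in a voice-daemon transcript that may have routed through a socket near the runtimed one). Note that 0x6C6F6761 is 1819240289 in decimal, while the 1819243560 in the log line above is 0x6C6F7428, which reads as ASCII "lot(" in big-endian order, so the hex value should be re-checked against the raw log.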
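The decoding step can be reproduced with a few lines of Rust. This is a minimal sketch that assumes the length prefix is a big-endian u32 (the byte order implied by reading 0x6C6F6761 as "loga"); `prefix_bytes` and `prefix_as_ascii` are hypothetical helper names, not functions from the runtimed crate.

```rust
/// Turn a decoded u32 length back into its four wire bytes.
/// Assumption: the prefix is big-endian.
fn prefix_bytes(len: u32) -> [u8; 4] {
    len.to_be_bytes()
}

/// Return the prefix as text if all four bytes are printable ASCII.
fn prefix_as_ascii(len: u32) -> Option<String> {
    let bytes = prefix_bytes(len);
    if bytes.iter().all(|b| b.is_ascii_graphic() || *b == b' ') {
        Some(bytes.iter().map(|&b| b as char).collect())
    } else {
        None
    }
}

fn main() {
    // The hex value quoted above.
    println!("{:?}", prefix_as_ascii(0x6C6F_6761)); // Some("loga")
    // A plausible real length (4 KiB) contains NUL bytes, so it is not text.
    println!("{:?}", prefix_as_ascii(4096)); // None
}
```

A genuine length below the 100 MiB maximum always has a first byte of 0x06 or less, so a prefix that decodes to four printable characters is a strong sign the reader is looking at payload bytes.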
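Mitigation status
- Presence frames cap at 1 MiB, so a desync on that channel is rejected before the allocator tries to honor a bogus length.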
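A per-type cap of the kind #1806 describes can be sketched as a check on the decoded length that runs before any buffer is allocated. Only the 100 MiB maximum (from the log line) and the 1 MiB Presence cap come from this issue; the `FrameType` enum, its `Other` variant, and `check_len` are hypothetical names used for illustration.

```rust
/// Hypothetical frame kinds; the real protocol's type set is not shown here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FrameType {
    Presence,
    Other,
}

const MAX_FRAME: u32 = 104_857_600; // 100 MiB, from the log line
const MAX_PRESENCE: u32 = 1024 * 1024; // 1 MiB

/// Largest length accepted for a given frame type.
fn max_len(ty: FrameType) -> u32 {
    match ty {
        FrameType::Presence => MAX_PRESENCE,
        FrameType::Other => MAX_FRAME,
    }
}

/// Validate a decoded length before allocating a buffer for the payload.
fn check_len(ty: FrameType, len: u32) -> Result<usize, String> {
    let max = max_len(ty);
    if len > max {
        return Err(format!("frame too large: {} bytes (max {})", len, max));
    }
    Ok(len as usize)
}

fn main() {
    // The length from the log line exceeds every cap and is rejected.
    println!("{:?}", check_len(FrameType::Other, 1_819_243_560));
    // 2 MiB passes the generic cap but not the Presence cap.
    println!("{:?}", check_len(FrameType::Other, 2 * 1024 * 1024));
    println!("{:?}", check_len(FrameType::Presence, 2 * 1024 * 1024));
}
```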
What's still unknown
Whether bytes from another process leaked into our socket (unlikely via Unix permissions but worth confirming), or whether our own send path has a cancel-unsafe write_all that can be interrupted mid-frame. The latter is the more concerning mechanism because it would be internal and reproducible under load.
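To make the second mechanism concrete, the sketch below simulates a sender whose first frame stops partway through and is followed by a complete frame. It assumes a simplified wire format of a 4-byte big-endian length plus payload (the real frames may carry more header fields), and `encode`, `read_frame`, and `interrupted_stream` are hypothetical helpers rather than runtimed code.

```rust
use std::io::{Cursor, Read};

const MAX_FRAME: u32 = 104_857_600;

/// Encode one frame: 4-byte big-endian length, then the payload.
fn encode(payload: &[u8]) -> Vec<u8> {
    let mut out = (payload.len() as u32).to_be_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}

/// Read one frame, rejecting lengths above MAX_FRAME.
fn read_frame(r: &mut impl Read) -> Result<Vec<u8>, String> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf).map_err(|e| e.to_string())?;
    let len = u32::from_be_bytes(len_buf);
    if len > MAX_FRAME {
        return Err(format!("frame too large: {} bytes (max {})", len, MAX_FRAME));
    }
    let mut payload = vec![0u8; len as usize];
    r.read_exact(&mut payload).map_err(|e| e.to_string())?;
    Ok(payload)
}

/// Bytes on the wire when the first frame's write stops after `written`
/// bytes and a complete second frame follows.
fn interrupted_stream(first: &[u8], written: usize, second: &[u8]) -> Vec<u8> {
    let mut wire = encode(first)[..written].to_vec();
    wire.extend_from_slice(&encode(second));
    wire
}

fn main() {
    // The first frame promises 8 payload bytes but only 4 are written.
    let wire = interrupted_stream(b"12345678", 8, b"logan says hi");
    let mut r = Cursor::new(wire);
    // The reader fills the missing 4 bytes from the second frame's header.
    println!("{:?}", read_frame(&mut r));
    // It then treats "loga" as the next length prefix.
    println!("{:?}", read_frame(&mut r)); // Err("frame too large: 1819240289 bytes (max 104857600)")
}
```

The first read succeeds with a corrupted payload, so the error only surfaces one frame later, which fits a failure that looks unrelated to the write that caused it.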
If it recurs
- Capture a full diagnostics tarball with runt-nightly diagnostics, specifically the runtimed.log lines around the error.
- Hex-dump the 4 bytes claimed as the length prefix. If they're printable ASCII from our own protocol (e.g. beginning of an Automerge sync message), it's an internal desync. If they're arbitrary text, it's more likely a rogue process.
- Look at tokio::select! branches in crates/runtimed/src/runtime_agent.rs that call send_typed_frame, and audit whether any sibling future completing could drop the send's future mid-write_all.
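For the select! audit, the hazard to look for is a write future that is polled, makes partial progress, and is then dropped because another branch finished first. The sketch below reproduces that with the standard library only, so it does not use tokio or any runtimed code: `ChunkedWrite` is a hypothetical stand-in for a socket write that accepts a limited number of bytes per poll, and `poll_once_then_drop` mimics what happens to the losing branch of a select.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough to poll a future by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Stand-in for a socket write that accepts at most `chunk` bytes per poll.
struct ChunkedWrite {
    sink: Arc<Mutex<Vec<u8>>>,
    frame: Vec<u8>,
    sent: usize,
    chunk: usize,
}

impl Future for ChunkedWrite {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let this = self.get_mut();
        let end = (this.sent + this.chunk).min(this.frame.len());
        this.sink.lock().unwrap().extend_from_slice(&this.frame[this.sent..end]);
        this.sent = end;
        if this.sent == this.frame.len() {
            Poll::Ready(())
        } else {
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Poll the write once, then drop it, as a select does to a branch that
/// loses the race. Returns the bytes that reached the sink.
fn poll_once_then_drop(frame: Vec<u8>, chunk: usize) -> Vec<u8> {
    let sink = Arc::new(Mutex::new(Vec::new()));
    let mut write = Box::pin(ChunkedWrite { sink: sink.clone(), frame, sent: 0, chunk });
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let _ = write.as_mut().poll(&mut cx);
    drop(write); // cancellation: the rest of the frame is never written
    let bytes = sink.lock().unwrap().clone();
    bytes
}

fn main() {
    // 4-byte big-endian length (8) followed by an 8-byte payload.
    let mut frame = 8u32.to_be_bytes().to_vec();
    frame.extend_from_slice(b"12345678");
    let on_wire = poll_once_then_drop(frame, 8);
    println!("{} of 12 bytes reached the socket", on_wire.len());
}
```

In the audit, any select! arm whose future ultimately awaits write_all on the shared socket is a candidate. Moving the write out of the select (for example, onto a task that owns the socket and receives whole frames over a channel) is one common way to rule this out, though whether that fits runtime_agent.rs is not something this sketch can establish.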
Related