Investigate: root-cause the 'loga' framing desync if it recurs #1819

@rgbkrk

Deep-dive follow-up to the framing-desync trio (#1805 reconnect, #1806 per-type cap). The specific bug that started this investigation:

[runtime-agent] Socket read error: frame too large: 1819243560 bytes (max 104857600)

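0x6C6F6761 = ASCII "loga": a 4-byte length prefix happened to land on text bytes starting with "loga" (probably from "Logan"; the user's son's name was in a voice-daemon transcript that MAY have routed through a socket near the runtimed one). Caveat: 0x6C6F6761 is 1819240289 in decimal, while the 1819243560 in the log line above is 0x6C6F7428, which reads as "lot(" when decoded big-endian. The log line and the hex value disagree, so re-derive the four bytes from the captured log before relying on the "Logan" theory.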
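To sanity-check a decoding like this one, here is a minimal Rust sketch. It assumes the length prefix is a big-endian u32. The names `prefix_as_text`, `describe_prefix`, and `MAX_FRAME_LEN` are illustrative and not taken from the runtimed code.

```rust
/// Frame-size cap quoted in the error message (100 MiB).
const MAX_FRAME_LEN: u32 = 104_857_600;

/// Render a length prefix as the four bytes it was decoded from.
/// Assumes a big-endian u32 prefix; non-printable bytes become '.'.
fn prefix_as_text(len: u32) -> String {
    len.to_be_bytes()
        .iter()
        .map(|&b| if b.is_ascii_graphic() || b == b' ' { b as char } else { '.' })
        .collect()
}

/// One-line summary: decimal, hex, text, and whether the cap is exceeded.
fn describe_prefix(len: u32) -> String {
    format!(
        "{} = 0x{:08X} = {:?} (over cap: {})",
        len,
        len,
        prefix_as_text(len),
        len > MAX_FRAME_LEN
    )
}
```

`describe_prefix(0x6C6F_6761)` returns `1819240289 = 0x6C6F6761 = "loga" (over cap: true)`. Feeding it the decimal value from a log line shows which four bytes were actually read.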

Mitigation status

What's still unknown

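It is still unknown whether bytes from another process leaked into our socket (unlikely given Unix permissions, but worth confirming) or whether our own send path has a cancel-unsafe write_all that can be interrupted mid-frame. The latter is the more concerning mechanism because it would be internal and reproducible under load: if a write_all future is dropped after only part of a frame has been written, the reader consumes the start of the next frame as the rest of the truncated payload and then interprets whatever bytes follow as a length prefix.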
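The internal mechanism can be modelled without any async code. The sketch below is a synchronous simulation, not the runtimed send path. It assumes a frame is a 4-byte big-endian length followed by the payload (the real send_typed_frame layout may differ), and all function names are hypothetical.

```rust
/// Encode one frame as a 4-byte big-endian length followed by the payload.
/// (Assumed layout, for illustration only.)
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut out = (payload.len() as u32).to_be_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}

/// Collect the length prefixes a receiver would read while walking `stream`.
/// Stops after the first prefix that claims more bytes than are available.
fn read_prefixes(stream: &[u8]) -> Vec<u32> {
    let mut prefixes = Vec::new();
    let mut pos = 0;
    while pos + 4 <= stream.len() {
        let len = u32::from_be_bytes([
            stream[pos],
            stream[pos + 1],
            stream[pos + 2],
            stream[pos + 3],
        ]);
        prefixes.push(len);
        let end = pos + 4 + len as usize;
        if end > stream.len() {
            break;
        }
        pos = end;
    }
    prefixes
}

/// Model a send that was abandoned after `written` bytes of the first frame,
/// followed by a complete second frame on the same stream.
fn stream_after_cancelled_write(first: &[u8], written: usize, second: &[u8]) -> Vec<u8> {
    let frame1 = encode_frame(first);
    let mut stream = frame1[..written.min(frame1.len())].to_vec();
    stream.extend_from_slice(&encode_frame(second));
    stream
}
```

With the 12-byte payload `b"hello world!"` fully written (16 bytes) and `b"use logarithms"` as the second payload, `read_prefixes` sees `[12, 14]`. If only 8 bytes of the first frame reach the wire, the reader sees `[12, 0x6C6F6761]`: the second prefix is the text "loga" taken from inside the second payload.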

If it recurs

  1. Capture a full diagnostics tarball with runt-nightly diagnostics — specifically the runtimed.log lines around the error.
  2. Hex-dump the 4 bytes claimed as the length prefix. If they're printable ASCII from our own protocol (e.g. beginning of an Automerge sync message), it's an internal desync. If they're arbitrary text, it's more likely a rogue process.
  3. Look at tokio::select! branches in crates/runtimed/src/runtime_agent.rs that call send_typed_frame — audit whether any sibling future completing could drop the send's future mid-write_all.
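If the audit in step 3 does find a branch that can drop a send mid-write, one common remedy is to give the socket to a single writer and have every other code path only enqueue complete frames. The sketch below models that idea with a std thread and channel rather than tokio. The 4-byte big-endian prefix and the name `spawn_frame_writer` are assumptions for illustration, not the project's API.

```rust
use std::io::Write;
use std::sync::mpsc;
use std::thread;

/// Spawn a writer that owns `sink` and writes each queued frame in full.
/// Senders only enqueue whole payloads, so abandoning a send can never
/// leave a partial frame on the wire. The join handle yields the sink
/// back once every sender has been dropped.
fn spawn_frame_writer<W: Write + Send + 'static>(
    mut sink: W,
) -> (mpsc::Sender<Vec<u8>>, thread::JoinHandle<std::io::Result<W>>) {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let handle = thread::spawn(move || {
        for payload in rx {
            // Length prefix and payload are written back to back by one owner.
            sink.write_all(&(payload.len() as u32).to_be_bytes())?;
            sink.write_all(&payload)?;
        }
        sink.flush()?;
        Ok(sink)
    });
    (tx, handle)
}
```

In an async codebase the same shape would use an async channel and a dedicated task. The point is that cancellation can then only interrupt the enqueue, which is all-or-nothing, and never the write itself.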

Related
