Skip to content

fix(meshcore): serialize 0x8C binary-response consumers to prevent reply theft (#3667)#3722

Merged
Yeraze merged 2 commits into
mainfrom
fix/meshcore-0x8c-response-serialization
Jun 24, 2026
Merged

fix(meshcore): serialize 0x8C binary-response consumers to prevent reply theft (#3667)#3722
Yeraze merged 2 commits into
mainfrom
fix/meshcore-0x8c-response-serialization

Conversation

@Yeraze

@Yeraze Yeraze commented Jun 24, 2026

Copy link
Copy Markdown
Owner

Problem

request_regions and request_telemetry both await the shared PUSH_CODE_BINARY_RESPONSE (0x8C) push, which carries only a 4-byte correlation tag and no responder pubkey. request_regions sends via the raw sendToRadioFrame path, whose Sent ack has no expectedAckCrc, so it can't obtain a tag and falls back to accepting the first 0x8C reply it sees.

If a request_telemetry binary request (same 0x8C push — e.g. the periodic remote-telemetry scheduler) overlaps it on the same connection, the two race for that first reply and mis-attribute each other's payload: a telemetry CayenneLPP body gets parsed as a regions list, or vice versa. Worst case is garbage/empty region chips on a manual, user-triggered action.

This is a latent issue that survives the already-merged timeout fix (#3702 / 3a035711) — that fix made the null-tag fallback work, but the fallback is tag-blind and the 0x8C event is shared.

Fix

  • runExclusiveBinaryResponse — chains every 0x8C-awaiting op on a single per-instance promise so exactly one is listening at a time. Wraps both request_regions and request_telemetry.
  • Source-aware by construction: the lock is an instance field, and each source owns its own MeshCoreNativeBackend + physical connection. Same-connection 0x8C ops serialize; concurrent sources never block each other. A module-level global would have wrongly coupled sources — this doesn't.
  • The lock is held only until each op's listeners tear down (bounded by its internal timeout), so it always releases.
  • Defense-in-depth: the regions parser now drops any token that isn't printable ASCII, so a stray non-regions payload can't render garbage region chips.

Test

Adds a regression test firing request_regions and request_telemetry concurrently on one connection, asserting (a) each resolves with its own payload (regions = ['saxony'], not the telemetry bytes) and (b) the call log shows no overlap — each :send is immediately followed by its own :reply.

Verified: 481 MeshCore backend/manager tests pass (success=true); tsc clean for the changed file.

🤖 Generated with Claude Code

…ply theft (#3667)

request_regions and request_telemetry both await the shared
PUSH_CODE_BINARY_RESPONSE (0x8C) push. That push carries only a 4-byte
correlation tag and no responder pubkey. request_regions sends via the raw
sendToRadioFrame path, whose Sent ack has no expectedAckCrc, so it can't
obtain a tag and falls back to accepting the first 0x8C reply it sees. If a
request_telemetry binary request (same 0x8C push, e.g. the periodic remote-
telemetry scheduler) overlaps it on the same connection, the two race for
that first reply and mis-attribute each other's payload — a telemetry
CayenneLPP body gets parsed as a regions list, or vice versa.

Fix: chain every 0x8C-awaiting op on a single per-instance promise
(runExclusiveBinaryResponse) so exactly one is listening at a time. The lock
is an instance field, so it is inherently source-aware: each source owns its
own MeshCoreNativeBackend and physical connection, so same-connection 0x8C
ops serialize while concurrent sources never block each other. The lock is
held only until each op's listeners tear down (bounded by its internal
timeout), so it always releases.

Defense-in-depth: the regions parser now drops any token that isn't printable
ASCII, so a stray non-regions payload can't render garbage region chips.

Adds a regression test firing request_regions and request_telemetry
concurrently and asserting each resolves with its own payload with no overlap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review: fix(meshcore): serialize 0x8C binary-response consumers

  • Gather context
  • Read changed files
  • Analyze implementation
  • Review test coverage
  • Post final review

Summary

This PR fixes a real race condition where request_regions and request_telemetry both listen for PUSH_CODE_BINARY_RESPONSE (0x8C) on the same connection without coordination. The fix uses a promise-chain serializer (runExclusiveBinaryResponse) to ensure only one 0x8C consumer is active at a time. The approach is correct and the implementation is clean.


Code Quality & Best Practices

runExclusiveBinaryResponse (lines 850–858): The implementation is idiomatic and correct. Using a "chain of promises" (mutex-via-promise) pattern is a well-known Node.js concurrency idiom and avoids needing external locking primitives. The key properties are sound:

private runExclusiveBinaryResponse<T>(fn: () => Promise<T>): Promise<T> {
    const run = this.binaryResponseChain.then(fn, fn);
    this.binaryResponseChain = run.then(() => {}, () => {});
    return run;
}
  • Using fn for both the onFulfilled and onRejected handlers of .then(fn, fn) is intentional — it ensures a prior failure in the chain doesn't block subsequent operations. This is correct.
  • The chain tail is swallowed (run.then(() => {}, () => {})) to prevent unhandled rejections propagating on binaryResponseChain. The caller receives run directly (and its rejection). Correct.
  • The lock is per-instance, so concurrent sources never block each other. This matches the stated design intent.

One subtle issue worth flagging: runExclusiveBinaryResponse serializes at the dispatch level — it wraps the entire inner Promise<Uint8Array> for both callers. However, for request_telemetry, it wraps c.sendBinaryRequest(...) which is a library call. If the library itself already performs its own BinaryResponse tag-matching internally, the outer serializer adds latency but is still correct as a defense-in-depth measure. The comment in request_telemetry (line 1393–1395) explains this rationale clearly.

Defense-in-depth filter for request_regions (line 1059):

.filter((s) => s.length > 0 && /^[\x20-\x7e]+$/.test(s));

The regex is correct (printable ASCII). The \x20 lower bound correctly includes space. This is a solid secondary guard.


Potential Issues

1. onErr listener scope in request_regions (line 1028):

const onErr = () => { cleanup(); reject(new Error('Device rejected regions request')); };

onErr is registered with c.once(K.ResponseCodes.Err, onErr). However, the Err response code is a general-purpose error channel shared with all command failures on this connection. If another command (e.g., a concurrent send_message) triggers an Err response between the regions sendToRadioFrame and the BinaryResponse arrival, it will incorrectly reject the request_regions promise. This is a pre-existing bug that this PR does not introduce, but it's worth noting: request_regions does not own the Err listener exclusively, so false rejections are possible. The same pattern exists in discover_nodes and discover_path.

2. sendBinaryRequest return type for request_telemetry (lines 1395–1397):

const responseData: Uint8Array = await this.runExclusiveBinaryResponse(
    () => c.sendBinaryRequest(publicKey, [reqType]),
);

c.sendBinaryRequest is any-typed (from AnyConnection). If the library's sendBinaryRequest returns the full binary response object (e.g. { tag, responseData }) rather than the raw Uint8Array, this would pass responseData as an object to CayenneLpp.parse(). However, since the tests pass, either the mock correctly models the library's behavior or existing production usage has validated this. Not a regression introduced here.

3. Test mock sendBinaryRequest resolves with telemetryBytes directly (test line 356–357):

(conn as any).sendBinaryRequest = (_pubkey: Uint8Array, _req: number[]) =>
    new Promise<Uint8Array>((resolve) => {
        ...
        resolve(telemetryBytes);
    });

The mock resolves with the raw Uint8Array, which confirms the production code treats the return of sendBinaryRequest as raw bytes. This is consistent.


Performance Considerations

  • The serialization adds latency: if request_regions is waiting for a 15-second timeout (its default), request_telemetry will block for up to 15 seconds. However, the PR description notes the lock is "bounded by each op's internal timeout" — this is correct. The tradeoff (correctness vs. latency) is appropriate given the alternative is data corruption.
  • No unbounded queue growth: the chain advances on every settlement (both fulfillment and rejection), so there's no accumulation of pending ops beyond the queue of in-flight calls.

Security Considerations

No security concerns. The ASCII filter on regions output is a good guard against injecting control characters into the UI.


Test Coverage

The new test (lines 318–381) is well-structured:

  • Isolation: It correctly mocks both sendToRadioFrame (for regions) and sendBinaryRequest (for telemetry) with their own call log entries.
  • Assertion 1 (no cross-contamination): regionsRes.data.regions must equal ['saxony'], not []. If the regions listener stole the telemetry reply, the 4-byte telemetry payload would parse as clock=0xcafef00d with no names.
  • Assertion 2 (no overlap): The callLog pairing check (op(callLog[0]) === op(callLog[1])) is a clean way to verify serialization without asserting absolute ordering.

Minor observation: The test name mentions #3667 but the PR is #3722. The issue reference in the test string (#3667) refers to the upstream feature issue, which is correct context. No change needed.

Missing edge case: There's no test for what happens when the first op (e.g., request_regions) times out — does the chain advance and allow request_telemetry to proceed? Based on the implementation, the timeout fires reject(new Error(...))cleanup() removes listeners → the runExclusiveBinaryResponse promise rejects → chain advances via the swallowed .then(() => {}, () => {}). This should work correctly, but a timeout-recovery test would add confidence.


Verdict

Approve. The fix correctly addresses the stated race condition using a well-understood concurrency primitive. Code is clean, well-commented, and the test adequately covers the key scenario. The one gap (no timeout-recovery test for chain advancement) is minor and the logic is straightforwardly correct from the implementation.

Addresses the Claude review's one in-scope suggestion on #3722: add a
timeout-recovery test for runExclusiveBinaryResponse. Verifies that when
request_regions times out (no BinaryResponse ever arrives), the queued
request_telemetry still proceeds — proving the serializer releases on
rejection, not just resolution, so a dead repeater can't wedge the
connection's 0x8C queue.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
@Yeraze

Yeraze commented Jun 24, 2026

Copy link
Copy Markdown
Owner Author

Thanks for the thorough review. Triage of the points raised:

  • Timeout-recovery test (the one in-scope gap): added in 4160290 — fires request_regions with a short timeout that never receives a BinaryResponse, and asserts the queued request_telemetry still runs (callLog order regions:send → telemetry:send → telemetry:reply). Confirms the chain releases on rejection, not just resolution.
  • Shared onErr listener (concern "Claude PR Assistant workflow" #1): agreed this is a real pre-existing issue (a concurrent send_message/discover_* Err can falsely reject request_regions), but it predates this PR and isn't part of the 0x8C reply-theft race. Leaving it out of this PR's scope to keep the diff minimal; flagging as a follow-up.
  • sendBinaryRequest any return type (concern feat: traceroute highlighting and UI improvements #2) and test-mock consistency (concern "Update Claude PR Assistant workflow" #3): no change — as noted, production usage and the mock both treat the return as raw Uint8Array, and the tests confirm it.

CI is green; the added test brings the file to 24 passing.

@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review: fix(meshcore): serialize 0x8C binary-response consumers

  • Gather context
  • Read changed files
  • Analyze new commit (timeout-recovery test)
  • Post final review

Follow-up Review (latest commit: 4160290)

This review covers the newly added timeout-recovery test, addressing the one open gap from the previous review.


New Test: advances the 0x8C chain when an op times out so the next op still runs (#3667) (line 383)

The test is well-constructed and directly validates the chain-advancement-on-rejection property.

Structural correctness:

  • request_regions is issued with timeout_ms: 30 — short enough to keep the test fast, and the sendToRadioFrame mock intentionally never emits Sent or BinaryResponse, so the inner timeout fires after ~30ms and rejects.
  • request_telemetry is queued concurrently. Because both ops call await this.resolvePublicKey(...) before acquiring the chain slot, the FIFO microtask ordering comment at line 411–412 is accurate: regions wins the lock first because it's the first Promise.all element.
  • The final assertion (callLog equals ['regions:send', 'telemetry:send', 'telemetry:reply']) is a deterministic ordered-equality check, which is stronger than the pairing check in the previous test. This is correct here because the serializer guarantees telemetry cannot start until regions times out.

One observation on test timing: timeout_ms: 30 works against the real setTimeout. In a CI environment under load, 30ms is usually sufficient, but if Node.js event-loop latency spikes (e.g. GC pause), the Promise.all could theoretically time out before telemetry resolves. The test doesn't have its own outer timeout, so it would hang rather than fail fast in that edge case — not a correctness issue, just a CI robustness note. If this becomes flaky, timeout_ms: 100 would give more headroom without meaningfully slowing the suite.

What this test proves:

  1. A request_regions rejection (via timeout) releases the chain lock — confirmed by telemetry:send appearing in the log.
  2. telemetry:reply follows telemetry:send — telemetry ran to completion, not just started.
  3. regionsRes.success === false and regionsRes.error matches /timed out/i — the rejection is surfaced correctly to the caller, not swallowed.
  4. telemetryRes.success === true — the queued op ran cleanly despite the predecessor's failure.

This is exactly the right set of assertions for this scenario.


Overall Assessment

The PR is complete. Both originally identified requirements are now tested:

Scenario Test
Concurrent 0x8C ops don't steal each other's reply serializes overlapping 0x8C consumers... (line 318)
Chain advances after timeout/rejection advances the 0x8C chain when an op times out... (line 383)

The implementation is correct, the test suite is comprehensive, and all previously raised concerns have been addressed or explicitly scoped out. The one minor nit (30ms timeout may be tight under heavy CI load) is not a blocker.

Approve.

@Yeraze Yeraze merged commit 2d38dd5 into main Jun 24, 2026
20 checks passed
@Yeraze Yeraze deleted the fix/meshcore-0x8c-response-serialization branch June 24, 2026 23:14
Yeraze added a commit that referenced this pull request Jun 25, 2026
…e-reject each other (#3725) (#3728)

* fix(meshcore): unified radio-op lock so raw-frame commands can't false-reject each other (#3725)

Several native-backend handlers send a raw frame and then once()-race the
device's shared, untagged Ok/Sent/Err ack: discover_path, discover_nodes,
request_regions, and set_device_time. Because the Err channel carries no
correlation tag, a concurrent command's Err can fire the wrong handler's
once(Err) and false-reject an unrelated command. This is the command-ack
counterpart to the 0x8C reply-theft race fixed in #3722.

Generalize the per-instance 0x8C serializer into a single unified radio-op
lock (runExclusiveRadioOp, issue #3725 option 2) that covers BOTH the
command-ack window (Ok/Sent/Err) and the 0x8C reply window, and wrap all five
fragile handlers in it. One chain, one mental model; the only long holder is
request_regions' BinaryResponse wait, which it must hold anyway to keep
request_telemetry from stealing its reply. The lock stays an instance field,
so it remains per-source — concurrent sources never block each other.

Because library commands (e.g. send_message via sendTextMessage) are not
locked and can still emit Err on the shared channel, request_regions now
ignores Err once its Sent ack has arrived, so its multi-second reply wait is
not exposed to a foreign Err. The discover_*/set_device_time ack windows are
sub-millisecond. discover_nodes' pendingDiscoverTag is now set inside the lock,
just before send, so a queued discovery can't clobber a running one's tag.

Tests: request_regions ignores a post-Sent foreign Err; a command-ack op
(discover_path) serializes against request_regions under the unified lock.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4

* test(meshcore): add set_device_time serialization test under the unified lock

Addresses the Claude review's one completeness gap on #3728: set_device_time
was wrapped in runExclusiveRadioOp but lacked a serialization test. Mirrors the
discover_path test (concurrent request_regions, assert no ack-channel overlap),
using a non-overlap pairing assertion since set_device_time has no await before
the lock so acquisition order is microtask-dependent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant