Skip to content

fix(meshcore): unified radio-op lock so raw-frame commands can't false-reject each other (#3725)#3728

Merged
Yeraze merged 2 commits into
mainfrom
fix/meshcore-unified-radio-op-lock-3725
Jun 25, 2026
Merged

fix(meshcore): unified radio-op lock so raw-frame commands can't false-reject each other (#3725)#3728
Yeraze merged 2 commits into
mainfrom
fix/meshcore-unified-radio-op-lock-3725

Conversation

@Yeraze

@Yeraze Yeraze commented Jun 25, 2026

Copy link
Copy Markdown
Owner

Closes #3725.

Problem

Four native-backend handlers send a raw frame via sendToRadioFrame and then once()-race the device's shared, untagged Ok / Sent / Err ack:

Handler Ack awaited
discover_path Sent vs Err
discover_nodes Ok vs Err
request_regions Sent (then BinaryResponse) vs Err
set_device_time Ok vs Err

Because the Err channel carries no correlation tag, a concurrent command's Err can fire the wrong handler's once(Err) and false-reject an unrelated command. This is the command-ack counterpart to the 0x8C reply-theft race fixed in #3722.

Fix — unified lock (issue #3725, option 2)

Generalize the per-instance 0x8C serializer (runExclusiveBinaryResponse) into a single runExclusiveRadioOp that covers both the command-ack window (Ok/Sent/Err) and the 0x8C reply window, and wrap all five fragile handlers (the four above + request_telemetry) in it. One chain, one mental model.

  • Per-source by construction: the lock is an instance field, and each source owns its own backend + connection, so concurrent sources never block each other — only same-connection radio ops serialize.
  • Latency: the only long holder is request_regions' BinaryResponse wait, which it must hold anyway to keep request_telemetry from stealing its reply. The discover_*/set_device_time windows are sub-millisecond.
  • Residual (documented): library commands (e.g. send_message via sendTextMessage) aren't locked and can still emit Err on the shared channel. So request_regions now ignores Err once its Sent ack has arrived — its multi-second reply wait is no longer exposed to a foreign Err (this was the bot's exact fix(meshcore): serialize 0x8C binary-response consumers to prevent reply theft (#3667) #3722-review scenario). The brief discover_*/set_device_time ack windows are left as-is.
  • Bonus: discover_nodes' pendingDiscoverTag is now set inside the lock, just before send, so a queued discovery can't clobber a running one's tag while its 0x8E responses are still arriving.

Tests

  • request_regions ignores a post-Sent foreign Err and still resolves with its region list (fails without the gate).
  • A command-ack op (discover_path) serializes against request_regions under the unified lock — asserted via call-log ordering (no overlap on the ack channel).

486 MeshCore backend/manager tests pass (success=true); tsc clean for the changed file.

🤖 Generated with Claude Code

…e-reject each other (#3725)

Several native-backend handlers send a raw frame and then once()-race the
device's shared, untagged Ok/Sent/Err ack: discover_path, discover_nodes,
request_regions, and set_device_time. Because the Err channel carries no
correlation tag, a concurrent command's Err can fire the wrong handler's
once(Err) and false-reject an unrelated command. This is the command-ack
counterpart to the 0x8C reply-theft race fixed in #3722.

Generalize the per-instance 0x8C serializer into a single unified radio-op
lock (runExclusiveRadioOp, issue #3725 option 2) that covers BOTH the
command-ack window (Ok/Sent/Err) and the 0x8C reply window, and wrap all five
fragile handlers in it. One chain, one mental model; the only long holder is
request_regions' BinaryResponse wait, which it must hold anyway to keep
request_telemetry from stealing its reply. The lock stays an instance field,
so it remains per-source — concurrent sources never block each other.

Because library commands (e.g. send_message via sendTextMessage) are not
locked and can still emit Err on the shared channel, request_regions now
ignores Err once its Sent ack has arrived, so its multi-second reply wait is
not exposed to a foreign Err. The discover_*/set_device_time ack windows are
sub-millisecond. discover_nodes' pendingDiscoverTag is now set inside the lock,
just before send, so a queued discovery can't clobber a running one's tag.

Tests: request_regions ignores a post-Sent foreign Err; a command-ack op
(discover_path) serializes against request_regions under the unified lock.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review — fix(meshcore): unified radio-op lock (#3725)

  • Gather context and understand the request
  • Read meshcoreNativeBackend.ts changes
  • Read test file changes
  • Analyze implementation for correctness and edge cases
  • Post comprehensive review feedback

Summary

This PR introduces a unified per-instance mutex (runExclusiveRadioOp) to serialize five "uncorrelated" native-backend handlers that race on the shared, tag-less Ok/Sent/Err ack channel. The design is sound and well-documented. A few observations below — mostly minor, with one I'd flag for discussion.


Code Quality & Design

runExclusiveRadioOp — the lock itself (meshcoreNativeBackend.ts:868–877)

The implementation is idiomatic and correct. The chain-continuation pattern:

const run = this.radioOpChain.then(fn, fn);
this.radioOpChain = run.then(() => {}, () => {});
return run;

correctly (a) passes fn to both .then branches so a prior rejection still releases the next waiter, (b) swallows the chain's settlement to prevent unhandled rejections propagating through the hidden tail, and (c) returns the raw run promise so the caller still sees the real rejection. This is a clean, well-understood "async mutex" pattern.

Per-source isolation is inherent — because radioOpChain is an instance field and each source gets its own MeshCoreNativeBackend, cross-source contention is impossible. Good call not adding a separate guard for it.


Potential Issue: discover_nodes tag race with the lock (minor / good fix)

The PR specifically calls out that pendingDiscoverTag is now set inside the lock (line 987):

await this.runExclusiveRadioOp(() => new Promise<void>((resolve, reject) => {
  this.pendingDiscoverTag = tag;   // ← inside the lock
  ...
  c.sendToRadioFrame(frame);
}));

This is correct. Before this change, a second discover_nodes call queued behind the first could write its tag before the first call had finished sending, causing the 0x8E push handler to incorrectly filter responses. The fix ensures the tag is only set immediately before the frame is transmitted.


Potential Issue: discover_path resolvePublicKey outside lock (discussion point)

In discover_path, resolvePublicKey (which calls c.getContacts() — an async I/O call) happens before runExclusiveRadioOp is called (lines 928–956):

case 'discover_path': {
  const publicKey = await this.resolvePublicKey(params.public_key as string);  // ← outside lock
  ...
  await this.runExclusiveRadioOp(() => new Promise<void>(...));

This is intentional and correct: the lock only needs to cover the Sent/Err ack window (the radio op itself), not the contact-list lookup. The same pattern applies to request_regions (resolvePublicKey before lock, frame send inside lock). No issue — just worth noting for future maintainers.


Residual: unlocked send_message can still emit Err during lock windows

The PR and the JSDoc comment at line 859 both document this explicitly:

library commands (e.g. send_message via sendTextMessage) aren't locked and can still emit Err on the shared channel

And the fix for request_regions — ignoring Err once sentReceived is true (line 1059) — correctly handles the long BinaryResponse wait window. For discover_nodes/discover_path/set_device_time, the windows are "sub-millisecond" per the PR description. This feels reasonable in practice, though worth flagging that it leaves a theoretical false-rejection path for those three short-window commands if send_message fires at exactly the wrong moment. The PR accurately documents this as the known residual risk.


request_regionsc.once(K.ResponseCodes.Err, onErr) (line 1076)

Using once for Err is the right call for the pre-sentReceived window (we only want to catch our send's failure). But after onSent fires and sentReceived = true, onErr is still registered. The handler immediately returns when sentReceived is true (line 1059), but the once listener is consumed the first time any Err fires, so if a second foreign Err fires later, there's no listener to accidentally consume — this is safe. The test at line 428 confirms the case where a foreign Err arrives after Sent.

One subtle note: onErr uses once, so if two foreign Errs fire after Sent, only the first would hit the (no-op) guard. This is fine since once removes the listener after the first call anyway.


Test Coverage — Solid

The two new tests directly validate the PR's stated goals:

  1. request_regions ignores a foreign Err after Sent (line 428) — tests the !sentReceived gate. Would have failed on the pre-fix code, as stated. ✅
  2. serializes discover_path against request_regions (line 458) — uses call-log ordering to assert no ack-channel overlap between the two ops. The assertion at line 504 (['regions:send', 'regions:reply', 'discover_path:send', 'discover_path:ack']) is tight and would catch interleaving. ✅

Existing tests (serialization of 0x8C consumers, timeout chain advancement) remain valid and still verify the prior fix's behavior.

One gap worth noting: There's no test for set_device_time being serialized under the lock alongside another op. Given it uses the same pattern as discover_nodes, this is low-risk, but a parallel test matching the discover_path serialization test would round out coverage for completeness.


Minor Nit: void c.sendToRadioFrame(frame) in request_regions (line 1077)

sendToRadioFrame returns void (it's a fire-and-forget frame send per the connection API), so void here is technically redundant but harmless — just suppresses any linting about a floating promise. No action needed.


Verdict

Approve. The core fix is correct, well-motivated, and the implementation is clean. The residual (unlocked library commands) is documented and mitigated where it matters most (the long request_regions wait). Test coverage directly validates the two reported failure scenarios. No blocking concerns.

…ied lock

Addresses the Claude review's one completeness gap on #3728: set_device_time
was wrapped in runExclusiveRadioOp but lacked a serialization test. Mirrors the
discover_path test (concurrent request_regions, assert no ack-channel overlap),
using a non-overlap pairing assertion since set_device_time has no await before
the lock so acquisition order is microtask-dependent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
@Yeraze

Yeraze commented Jun 25, 2026

Copy link
Copy Markdown
Owner Author

Thanks for the review. The one completeness gap — no serialization test for set_device_time — is addressed in 7b8311b: a parallel test fires set_device_time concurrently with request_regions and asserts no ack-channel overlap (non-overlap pairing assertion, since set_device_time has no await before the lock so acquisition order is microtask-dependent).

The other points were all self-resolved in your review (resolvePublicKey-outside-lock is intentional; the unlocked-send_message residual is documented and mitigated for the only long window via the post-Sent Err gate; the once(Err) and void notes are safe/harmless) — no changes needed there.

@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review — fix(meshcore): unified radio-op lock (#3725)

  • Gather context and understand the request
  • Read meshcoreNativeBackend.ts changes
  • Read test file changes
  • Analyze implementation for correctness and edge cases
  • Post comprehensive review feedback

Summary

This PR introduces runExclusiveRadioOp, a per-instance async mutex that serializes five native-backend handlers sharing the tag-less Ok/Sent/Err ack channel and the 0x8C BinaryResponse push. This is a follow-on to #3722. The commit 7b8311b (pushed after the first review) adds a set_device_time serialization test, completing the coverage gap noted earlier.

The implementation is correct and the design is sound. Notes below in descending order of significance.


Code Quality & Design

runExclusiveRadioOp pattern (meshcoreNativeBackend.ts:868–877)

The chain-continuation pattern is a well-known async mutex idiom and is implemented correctly:

  • Passing fn to both .then branches ensures a prior rejection still releases the next waiter
  • The hidden run.then(() => {}, () => {}) tail swallows the chain's settlement to prevent unhandled rejections propagating
  • The raw run promise is returned to the caller, so rejections still surface where they belong

No issues here.


Potential Issue: set_device_timec.sendCommandSetDeviceTime after once listeners (minor)

In the set_device_time handler (meshcoreNativeBackend.ts:1512–1526):

await this.runExclusiveRadioOp(() => new Promise<void>((resolve, reject) => {
  const onOk = () => { ... resolve(); };
  const onErr = () => { ... reject(...); };
  c.once(this.constants!.ResponseCodes.Ok, onOk);
  c.once(this.constants!.ResponseCodes.Err, onErr);
  c.sendCommandSetDeviceTime(epoch).catch(reject);
}));

c.sendCommandSetDeviceTime itself internally uses once(Ok) / once(Err) in the meshcore.js library (it drives the same ack channel). This means both the library's listener and this wrapper's once(Ok, onOk) are registered concurrently. Whichever fires first consumes the event — under normal conditions this will be fine (both resolve on Ok), but the library's once(Ok) could be consumed by our onOk leaving the library's internal state dangling, or vice versa.

This is the same pattern already used for discover_nodes (OK, not Sent), and appears intentional — the lock prevents concurrent collisions, and the double-listener race is within the same serial segment. Worth noting for future maintainers that sendCommandSetDeviceTime shouldn't be replaced with a raw-frame approach without understanding this interaction.

Net verdict: Not a bug in the current serialized context, but a subtle footgun if sendCommandSetDeviceTime is ever called outside the lock.


Potential Issue: discover_pathresolvePublicKey not guarded against null result from 64-byte key path

In discover_path (meshcoreNativeBackend.ts:928–930):

const publicKey = await this.resolvePublicKey(params.public_key as string);
if (!publicKey) throw new Error('Discover-path target not found');
if (publicKey.length !== 32) {
  throw new Error(`Expected 32-byte public key, got ${publicKey.length}`);
}

resolvePublicKey returns a Uint8Array from hexToBytes when given a 64-char hex string. hexToBytes returns number[] which is then passed to Uint8Array.from(), so the 32-byte check is correct. No bug here — just noting the flow is sound.


request_regionsonErr once listener leftover after sentReceived (pre-existing, documented)

After onSent fires, the once(Err, onErr) listener is already registered. When a foreign Err fires, onErr correctly no-ops (if (sentReceived) return), and the once removes the listener. If no Err fires after Sent, the listener is removed by cleanup() via c.off(K.ResponseCodes.Err, onErr). This is correct and the test at line 428 validates it.


Test Coverage — Comprehensive

The new commit (7b8311b) closes the gap from the prior review:

  1. request_regions ignores a foreign Err after Sent (line 428) — validates the !sentReceived gate. ✅
  2. serializes discover_path against request_regions (line 458) — strict call-log ordering, would catch interleaving. ✅
  3. serializes set_device_time against request_regions (line 507) — uses pairing assertion (not absolute order) since set_device_time has no await before acquiring the lock, so acquisition order is microtask-dependent. ✅

The pairing-vs-ordering distinction in the set_device_time test comment is well-documented and the assertions correctly reflect the nondeterminism.

One minor nit: the mock for sendCommandSetDeviceTime in the test (conn.sendCommandSetDeviceTime) is set up on the connection object directly, which works because the set_device_time case calls c.sendCommandSetDeviceTime(epoch) where c = this.connection. Clean.


Performance

The only long lock-holder is request_regions' BinaryResponse wait (up to 15s by default). All others (discover_nodes, discover_path, set_device_time) are sub-millisecond ack windows. This is acceptable since:

  • These are infrequent admin/discovery operations, not message-path
  • request_regions was already the serialized gatekeeper for the 0x8C channel
  • Per-source isolation means unrelated connections are unaffected

Security

No security concerns. The changes are purely concurrency control within an already-authenticated connection handler.


Verdict

Approve. The implementation is correct and the previous review gap (no set_device_time serialization test) has been addressed by 7b8311b. The sendCommandSetDeviceTime double-listener observation is worth knowing but is not a defect in the serialized context. All stated goals are met.

View job

@Yeraze Yeraze merged commit ced8de6 into main Jun 25, 2026
20 checks passed
@Yeraze Yeraze deleted the fix/meshcore-unified-radio-op-lock-3725 branch June 25, 2026 02:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(meshcore): raw-frame commands share the uncorrelated Err ack channel and can false-reject each other

1 participant