[Darwin] mDNS resolve cancel/restart drops in-flight results, causing ~1s NodeID resolve retries#72273
[Darwin] mDNS resolve cancel/restart drops in-flight results, causing ~1s NodeID resolve retries#72273woody-apple wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements a deferred teardown mechanism for ResolveContext on Darwin platforms to prevent dropping in-flight resolve results when the consumer count drops to zero, and adds corresponding unit tests. The reviewer identified critical security issues where reusing a deferred-teardown context without updating its callback, context, or delegate pointers could lead to Use-After-Free (UAF) vulnerabilities. Additionally, a potential UAF was flagged in the test helper RunOnMatterEventLoopSync during timeouts, with a suggestion to use thread-safe reference counting.
| // Detect (and log) the case where a different upper-layer caller is | ||
| // re-binding to the same instance name with a different callback or | ||
| // context. We intentionally do not rebind, because doing so could lose | ||
| // an in-flight result delivery to the original caller. The upper layer | ||
| // (Discovery) routes results back through the same callback path keyed | ||
| // off instance name, so a multiplexed dispatcher is expected; but if | ||
| // the dispatch tuple differs, surface that as an error log so any | ||
| // breakage of that assumption is observable. | ||
| if (!reused && (existingCtx->callback != callback || existingCtx->context != context)) | ||
| { | ||
| ChipLogError(Discovery, | ||
| "Mdns: Reusing deferred-teardown ResolveContext for %s with mismatched callback/context; " | ||
| "new caller will not receive results until the next Resolve", | ||
| StringOrNullMarker(name)); | ||
| } |
There was a problem hiding this comment.
Reusing a deferred-teardown ResolveContext without updating its callback and context pointers is a critical correctness and security issue.
Since the previous consumer's counter dropped to 0, that consumer has cancelled and may have been destroyed. If we reuse the context but keep the old callback and context pointers:
- The new caller will never receive the resolve results.
- When the resolve completes, the old callback will be invoked with the old context pointer, which is highly likely to be a dangling pointer, leading to a Use-After-Free (UAF) vulnerability or crash.
Since the consumer counter is 0, there are no other active consumers, so it is completely safe to overwrite the callback and context with the new caller's values.
if (existingCtx->callback != callback || existingCtx->context != context)
{
ChipLogDetail(Discovery, "Mdns: Reusing deferred-teardown ResolveContext for %s with new callback/context", StringOrNullMarker(name));
existingCtx->callback = callback;
existingCtx->context = context;
}| if (!reused && existingCtx->context != delegate) | ||
| { | ||
| ChipLogError(Discovery, | ||
| "Mdns: Reusing deferred-teardown ResolveContext for %s with a different delegate; " | ||
| "new delegate will not receive results until the next Resolve", | ||
| StringOrNullMarker(name)); | ||
| } |
There was a problem hiding this comment.
Similar to the callback overload above, reusing the deferred-teardown ResolveContext without updating its context (which holds the DiscoverNodeDelegate pointer) can lead to a Use-After-Free (UAF) or crash if the old delegate has been destroyed.
We should update the context pointer to the new delegate when reusing the context.
if (existingCtx->context != delegate)
{
ChipLogDetail(Discovery, "Mdns: Reusing deferred-teardown ResolveContext for %s with a new delegate", StringOrNullMarker(name));
existingCtx->context = delegate;
}| void RunOnMatterEventLoopSync(std::function<void(void)> fn) | ||
| { | ||
| std::mutex mutex; | ||
| std::condition_variable cv; | ||
| bool done = false; | ||
|
|
||
| struct Args | ||
| { | ||
| std::function<void(void)> * fn; | ||
| std::mutex * mutex; | ||
| std::condition_variable * cv; | ||
| bool * done; | ||
| }; | ||
| auto * args = new Args{ &fn, &mutex, &cv, &done }; | ||
|
|
||
| chip::DeviceLayer::PlatformMgr().ScheduleWork( | ||
| [](intptr_t arg) { | ||
| auto * a = reinterpret_cast<Args *>(arg); | ||
| (*a->fn)(); | ||
| { | ||
| std::lock_guard<std::mutex> lk(*a->mutex); | ||
| *a->done = true; | ||
| } | ||
| a->cv->notify_all(); | ||
| }, | ||
| reinterpret_cast<intptr_t>(args)); | ||
|
|
||
| std::unique_lock<std::mutex> lk(mutex); | ||
| bool finished = cv.wait_for(lk, std::chrono::seconds(2), [&done] { return done; }); | ||
| EXPECT_TRUE(finished); | ||
| delete args; | ||
| } |
There was a problem hiding this comment.
In RunOnMatterEventLoopSync, if cv.wait_for times out (e.g., due to CI slowness or deadlock), delete args is called and the function returns.
However, the scheduled work on the Matter event loop might still execute later. When it does, it will dereference args (which has been deleted) and fn (which was a reference to a local variable on the stack that has gone out of scope). This results in a Use-After-Free (UAF) and will crash the test runner.
We can solve this robustly by using a simple thread-safe reference count inside Args and storing fn by value, so that whichever thread finishes last safely deletes the heap-allocated Args structure.
void RunOnMatterEventLoopSync(std::function<void(void)> fn)
{
struct Args
{
std::atomic<int> refCount{ 2 };
std::function<void(void)> fn;
std::mutex mutex;
std::condition_variable cv;
bool done = false;
};
auto * args = new Args{ 2, std::move(fn), {}, {}, false };
chip::DeviceLayer::PlatformMgr().ScheduleWork(
[](intptr_t arg) {
auto * a = reinterpret_cast<Args *>(arg);
a->fn();
{
std::lock_guard<std::mutex> lk(a->mutex);
a->done = true;
}
a->cv.notify_all();
if (a->refCount.fetch_sub(1) == 1)
{
delete a;
}
},
reinterpret_cast<intptr_t>(args));
std::unique_lock<std::mutex> lk(args->mutex);
bool finished = args->cv.wait_for(lk, std::chrono::seconds(2), [args] { return args->done; });
EXPECT_TRUE(finished);
if (args->refCount.fetch_sub(1) == 1)
{
delete args;
}
}References
- When a stack-allocated object is passed by reference to a function that stores the reference, ensure the object's lifetime is static to prevent dangling references and undefined behavior.
There was a problem hiding this comment.
Pull request overview
This PR changes the Darwin DNS-SD implementation to defer mDNS resolve teardown briefly after the consumer count drops to zero, avoiding dropped in-flight resolver results and enabling immediate re-resolve calls to reuse existing contexts.
Changes:
- Adds deferred-teardown state and timer APIs for Darwin
ResolveContext. - Reuses deferred resolve contexts in both callback and delegate resolve paths.
- Adds Darwin-only tests covering coalescing and delayed cancellation behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/platform/Darwin/dnssd/DnssdImpl.h |
Adds deferred teardown state and timer helper declarations. |
src/platform/Darwin/dnssd/DnssdImpl.cpp |
Implements deferred teardown scheduling, cancellation, and resolve-context reuse. |
src/platform/Darwin/dnssd/DnssdContexts.cpp |
Cancels pending teardown timers during resolve success/failure dispatch. |
src/platform/tests/TestDnssd.cpp |
Adds Darwin tests for resolve teardown coalescing and cancellation propagation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| chip::DeviceLayer::SystemLayer().CancelTimer(OnResolveDeferredTeardown, existingCtx); | ||
| existingCtx->deferredTeardownScheduled = false; | ||
| (*existingCtx->consumerCounter)++; |
| } | ||
| chip::DeviceLayer::SystemLayer().CancelTimer(OnResolveDeferredTeardown, existingCtx); | ||
| existingCtx->deferredTeardownScheduled = false; | ||
| (*existingCtx->consumerCounter)++; |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #72273 +/- ##
=======================================
Coverage 55.52% 55.52%
=======================================
Files 1630 1630
Lines 111127 111127
Branches 13418 13418
=======================================
+ Hits 61706 61707 +1
+ Misses 49421 49420 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
PR #72273: Size comparison from adba29b to e7af012 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
e7af012 to
da6e50a
Compare
da6e50a to
1d9c0cf
Compare
| if (!existingCtx->deferredTeardownScheduled || *existingCtx->consumerCounter != 0) | ||
| { | ||
| continue; | ||
| } | ||
| // Detect (and log) the case where a different upper-layer caller is | ||
| // re-binding to the same instance name with a different callback or | ||
| // context. We intentionally do not rebind, because doing so could lose | ||
| // an in-flight result delivery to the original caller. The upper layer | ||
| // (Discovery) routes results back through the same callback path keyed | ||
| // off instance name, so a multiplexed dispatcher is expected; but if | ||
| // the dispatch tuple differs, surface that as an error log so any | ||
| // breakage of that assumption is observable. | ||
| if (!reused && (existingCtx->callback != callback || existingCtx->context != context)) | ||
| { | ||
| ChipLogError(Discovery, | ||
| "Mdns: Reusing deferred-teardown ResolveContext for %s with mismatched callback/context; " | ||
| "new caller will not receive results until the next Resolve", | ||
| StringOrNullMarker(name)); | ||
| } | ||
| chip::DeviceLayer::SystemLayer().CancelTimer(OnResolveDeferredTeardown, existingCtx); | ||
| existingCtx->deferredTeardownScheduled = false; | ||
| (*existingCtx->consumerCounter)++; |
| if (!existingCtx->deferredTeardownScheduled || *existingCtx->consumerCounter != 0) | ||
| { | ||
| continue; | ||
| } | ||
| if (!reused && existingCtx->context != delegate) | ||
| { | ||
| ChipLogError(Discovery, | ||
| "Mdns: Reusing deferred-teardown ResolveContext for %s with a different delegate; " | ||
| "new delegate will not receive results until the next Resolve", | ||
| StringOrNullMarker(name)); | ||
| } | ||
| chip::DeviceLayer::SystemLayer().CancelTimer(OnResolveDeferredTeardown, existingCtx); | ||
| existingCtx->deferredTeardownScheduled = false; | ||
| (*existingCtx->consumerCounter)++; |
| std::vector<GenericContext *> existingResolves; | ||
| MdnsContexts::GetInstance().FindAllMatchingPredicate( | ||
| [name](GenericContext * item) { | ||
| return item->type == ContextType::Resolve && static_cast<ResolveContext *>(item)->Matches(name); | ||
| }, |
There was a problem hiding this comment.
(FYI: This is Claude) False positive. The lambda at L466-468 only gathers name-matching candidates; the for-loop body filters on the full request shape before rescuing. See L493-508: browseThatCausedResolve != nullptr, requestedInterfaceId != interfaceId, protocol != newProtocol, and requestedDomain != newDomain all continue, falling through to fresh-context allocation at L557-562. callback/context are rebound at L531-537 before the counter bump. type (service type) is fixed by the protocol family in DNS-SD-on-Darwin and is implied by the ResolveContext kind.
| std::vector<GenericContext *> existingResolves; | ||
| MdnsContexts::GetInstance().FindAllMatchingPredicate( | ||
| [name](GenericContext * item) { | ||
| return item->type == ContextType::Resolve && static_cast<ResolveContext *>(item)->Matches(name); | ||
| }, |
There was a problem hiding this comment.
(FYI: This is Claude) False positive. Delegate overload's loop at L581-626 enforces the same scope filters: kind (L589 — only callback==nullptr siblings), browseThatCausedResolve (L594), requestedInterfaceId (L598), protocol (L602), and non-empty requestedDomain (L612). A different delegate, interface, or address family will fail one of these filters and fall through to fresh allocation at L644-648. Delegate is rebound at L630-635 before the counter bump.
3fb25d8 to
e4ccef3
Compare
f9cd4ed to
036f9bf
Compare
| { | ||
| ChipLogDetail(Discovery, "Mdns: Reusing deferred-teardown ResolveContext for %s with a new delegate", | ||
| StringOrNullMarker(name)); | ||
| primaryCtx->context = delegate; | ||
| } | ||
| (*primaryCtx->consumerCounter)++; |
There was a problem hiding this comment.
(FYI: This is Claude) False positive. The delegate overload's kind-mismatch guard at L588-592 (if (existingCtx->callback != nullptr) continue;) ensures we only ever rebind onto a context that was already constructed via the delegate ctor — and the delegate ctor leaves callback as nullptr. So when we hit L630 the existing callback is already null; there is nothing to clear. A callback-overload context can never be rescued by this path.
| [name](GenericContext * item) { | ||
| return item->type == ContextType::Resolve && static_cast<ResolveContext *>(item)->Matches(name); |
There was a problem hiding this comment.
(FYI: This is Claude) False positive. The lambda at L466-468 is a name-only gather, but the loop body at L474-523 enforces the full scope before rescuing: kind (L482), browseThatCausedResolve (L493), requestedInterfaceId (L497), protocol from addressType (L501), requestedDomain (L505). Mismatch on any of these falls through to a fresh Platform::New<ResolveContext> at L558. The IPv4-then-IPv6 cancel/restart case you describe correctly skips the rescue.
| return item->type == ContextType::Resolve && static_cast<ResolveContext *>(item)->Matches(name); | ||
| }, |
There was a problem hiding this comment.
(FYI: This is Claude) Duplicate of the comment above on the same line — same false positive. The loop at L581-626 filters by kind, browse origin, interface, protocol, and domain; mismatches fall through to fresh allocation at L644-648.
9130cd2 to
98f9f11
Compare
98f9f11 to
0957a4f
Compare
8854af0 to
d2bf7c2
Compare
941f38c to
43a6933
Compare
43a6933 to
eee4b77
Compare
c9ff5c1 to
a4a4ba6
Compare
a4a4ba6 to
69b5133
Compare
|
PR #72273: Size comparison from 8a162c6 to 3d06e81 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
2867bd3 to
b1cd0bb
Compare
9bf56a9 to
9b317d8
Compare
|
PR #72273: Size comparison from 8a162c6 to aca125d Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
42c1b92 to
d2399e1
Compare
|
PR #72273: Size comparison from 8a162c6 to d2399e1 Full report (31 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
d2399e1 to
42a21f2
Compare
|
PR #72273: Size comparison from 8a162c6 to 42a21f2 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
87cf9ee to
0078837
Compare
…en't dropped User-visible failure: on Darwin, every reconnect to a Matter node after ChipDnssdResolveNoLongerNeeded shows ~1s of extra NodeID-resolve latency. Inbound mDNS resolve answers already queued on the dnssd socket get discarded, so the next resolve has to start from scratch. Root cause: when the consumer counter drops to zero we immediately call Finalize -> DNSServiceRefDeallocate. Per the dnssd contract, DNSServiceRefDeallocate discards any events queued on that connection but not yet read. A second observation from the mDNS owner is that "starting and stopping queries doesn't query harder" -- a tight cancel-then-restart for the same instance name is strictly worse than letting the existing query run. Fix: introduce a per-ResolveContext deferred-teardown window (default 500ms) before the actual DNSServiceRefDeallocate. Inside the window: a queued read indicator dispatches the result through DispatchSuccess (which cancels the timer); a new ChipDnssdResolve for the same instance name reuses the existing context and bumps the counter back to 1, skipping DNSServiceCreateConnection / DNSServiceResolve entirely; otherwise the timer fires OnResolveDeferredTeardown -> Finalize(CHIP_ERROR_CANCELLED), preserving the existing failure-path contract upper layers rely on. Carve-out: delegate-based ResolveContexts (callback == nullptr, used by MTRCommissionableBrowser) are NOT subject to deferred teardown. The browser churns OnBrowseAdd/OnBrowseRemove for the same instance name on the order of microseconds while a device is being discovered; holding the underlying DNSServiceRef alive across that churn starves DNSServiceGetAddrInfo of a chance to deliver before the next remove arrives, which manifested as MTRCommissionableBrowserTests/test005 timing out under TSAN. The NodeID reconnect bug this PR fixes is on the callback-based path. Blast radius is confined to the Darwin dnssd platform layer. Tests in src/platform/tests/TestDnssd.cpp pin: - ReusesContextWithinDeferredWindow (callback-based coalescing) - DelegateBasedResolveIsNotDeferred (delegate-based synchronous teardown) - CancelStillPropagatesIfNoInFlightResult (timer fires once if no follow-up) - Multi-sibling rescue, scope mismatch refusal, mismatched-callback rebind, shared-counter invariants, repeated toggle within window, etc. rdar://176263876
0078837 to
6508f97
Compare
Summary
ChipDnssdResolveNoLongerNeededshows ~1s of extra NodeID-resolve latency. Inbound mDNS resolve answers that were already on the dnssd socket get discarded, so the next resolve has to start from scratch.Finalize->DNSServiceRefDeallocate. Per the dnssd contract,DNSServiceRefDeallocatediscards any events queued on that connection but not yet read off the socket. A second observation from the mDNS owner is that "starting and stopping queries doesn't query harder" — a tight cancel-then-restart for the same instance name is strictly worse than letting the existing query run.ResolveContextdeferred-teardown window (default 500ms) before the actualDNSServiceRefDeallocate. Inside the window: a queued read indicator dispatches the result throughDispatchSuccess(which cancels the timer); a newChipDnssdResolvefor the same instance name reuses the existing context and bumps the counter back to 1, skippingDNSServiceCreateConnection/DNSServiceResolveentirely; otherwise the timer firesOnResolveDeferredTeardown->Finalize(CHIP_ERROR_CANCELLED), preserving the existing failure-path contract upper layers rely on. BothDispatchSuccessandDispatchFailurecancel any pending teardown timer so it can never fire against a freed context.Blast radius is confined to the Darwin dnssd platform layer; Linux, ESP32, OpenThread, and Android paths are untouched. The change is behaviorally inert for callers that never re-
ResolveafterResolveNoLongerNeeded(the cancel still propagates, just after the window). Rollback is a single-commit revert.Testing
New
TestDarwinDnssdResolveCoalescetests insrc/platform/tests/TestDnssd.cppexercise the state machine in test time viaSetResolveDeferredTeardownDelay:ReusesContextWithinDeferredWindow—Resolve->ResolveNoLongerNeeded->Resolvewithin the window reuses the sameResolveContext(counter 1 -> 0 -> 1).ReusesContextWithinDeferredWindowDelegate— same as above through the delegate-style API surface.CancelStillPropagatesIfNoInFlightResult— wait > teardown delay with no follow-upResolveactually fires the deferred teardown exactly once, dispatchingCHIP_ERROR_CANCELLED.Note: a fourth planned test that injects a synthetic
OnResolvecallback inside the deferred window is deferred —OnResolvelives in an anonymous namespace and is not addressable from the test target. The equivalent behavior is observable via theDispatchSuccesstimer-cancellation paths exercised by the existing tests plus integration coverage in the discovery layer.