Darwin: fail commissioning fast when external NOC chain issuer never responds#72271
Darwin: fail commissioning fast when external NOC chain issuer never responds#72271woody-apple wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a watchdog timer (mExternalNOCChainWatchdog) to bound the wait time for external operational-certificate issuers to ~20 seconds, preventing the commissioner from hanging for the full ~67-second MRP exchange timeout if the issuer crashes or hangs. It also adds a corresponding integration test to verify this timeout behavior. The review feedback highlights a critical concurrency issue where a race condition during teardown could lead to a Use-After-Free (UAF) crash when the watchdog timer fires, and suggests using the Matter SDK's System::Layer timer or a weak lifetime token to safely manage the delegate's lifecycle.
| dispatch_queue_t watchdogQueue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0); | ||
| dispatch_source_t watchdog = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, watchdogQueue); | ||
| dispatch_source_set_timer(watchdog, | ||
| dispatch_time(DISPATCH_TIME_NOW, static_cast<int64_t>(kExternalNOCChainIssuerWatchdogTimeoutSeconds * NSEC_PER_SEC)), | ||
| DISPATCH_TIME_FOREVER, | ||
| static_cast<uint64_t>(0.5 * NSEC_PER_SEC)); | ||
| dispatch_source_set_event_handler(watchdog, ^{ | ||
| dispatch_source_cancel(watchdog); | ||
| // Always log that the watchdog fired, including the controller-gone | ||
| // case, so we can post-hoc tell a hung issuer apart from a healthy | ||
| // teardown. ExternalNOCChainGenerated is safe to invoke from any | ||
| // queue: it hops to the Matter queue via asyncGetCommissionerOnMatterQueue | ||
| // before touching members, and the asyncGet... path no-ops when the | ||
| // controller is shut down. | ||
| MTR_LOG_ERROR("MTROperationalCredentialsDelegate: external NOC chain issuer did not complete within %.0fs; failing SendNOC", | ||
| kExternalNOCChainIssuerWatchdogTimeoutSeconds); | ||
| MTRDeviceController_Concrete * strongController = weakController; | ||
| if (strongController == nil || !strongController.isRunning) { | ||
| MTR_LOG_ERROR("MTROperationalCredentialsDelegate: external NOC chain issuer watchdog fired but controller shut down; abandoning completion"); | ||
| return; | ||
| } | ||
| NSError * timeoutError = [MTRError errorForCHIPErrorCode:CHIP_ERROR_TIMEOUT]; | ||
| this->ExternalNOCChainGenerated(nil, timeoutError); | ||
| }); | ||
| mExternalNOCChainWatchdog = watchdog; | ||
| dispatch_resume(watchdog); |
There was a problem hiding this comment.
🚨 Critical Use-After-Free (UAF) & Retain Cycle in Watchdog Timer
There is a severe race condition that can lead to a Use-After-Free (UAF) crash when the controller is shut down or deallocated while the watchdog timer is firing or about to fire.
1. The Use-After-Free Race Condition
MTROperationalCredentialsDelegateis a C++ object owned byMTRDeviceController_Concrete.- During controller deallocation,
~MTROperationalCredentialsDelegate()is called, which asynchronously cancels the watchdog viadispatch_source_cancel. - However, because
dispatch_source_cancelis asynchronous, if the watchdog timer fires concurrently onwatchdogQueue(a global concurrent queue), the event handler block starts executing. - The block captures
weakController(a weak reference to the controller) andthis(a raw C++ pointer to the delegate). - In Objective-C ARC, C++ destructors (via
.cxx_destruct) run before weak references are cleared during deallocation. Therefore,weakControlleris still non-nil when~MTROperationalCredentialsDelegate()is executing or has just finished. - The block successfully resolves
strongControllerand callsthis->ExternalNOCChainGenerated(...). - Since
this(the delegate) has already been destroyed or is in the process of being destroyed, callingthis->ExternalNOCChainGeneratedresults in a Use-After-Free (UAF) crash.
2. Retain Cycle
Capturing watchdog directly inside the block passed to dispatch_source_set_event_handler(watchdog, ...) creates a strong retain cycle. While dispatch_source_cancel eventually releases the block, it is much safer to use a weak reference to avoid extended lifespans or leaks if cancellation is delayed.
Recommended Solutions
Option A: Use chip::System::Layer Timer (Highly Recommended & Idiomatic)
Instead of using dispatch_source_t and dealing with multi-threading, races, and retain cycles, use the Matter SDK's built-in System::Layer timer. Since the timer runs entirely on the Matter thread/queue, it is completely synchronized with the rest of the delegate's lifecycle.
In MTROperationalCredentialsDelegate.h:
static void OnWatchdogTimeout(chip::System::Layer * systemLayer, void * state);In MTROperationalCredentialsDelegate.mm:
// To start the timer:
DeviceLayer::SystemLayer().StartTimer(
System::Clock::Seconds32(kExternalNOCChainIssuerWatchdogTimeoutSeconds),
OnWatchdogTimeout,
this
);
// To cancel the timer (in destructor and on completion):
DeviceLayer::SystemLayer().CancelTimer(OnWatchdogTimeout, this);
// Implementation:
void MTROperationalCredentialsDelegate::OnWatchdogTimeout(chip::System::Layer * systemLayer, void * state)
{
auto * self = static_cast<MTROperationalCredentialsDelegate *>(state);
MTR_LOG_ERROR("MTROperationalCredentialsDelegate: external NOC chain issuer did not complete within %.0fs; failing SendNOC",
kExternalNOCChainIssuerWatchdogTimeoutSeconds);
NSError * timeoutError = [MTRError errorForCHIPErrorCode:CHIP_ERROR_TIMEOUT];
self->ExternalNOCChainGenerated(nil, timeoutError);
}Option B: Use a Weak Lifetime Token (If keeping dispatch_source_t)
If you must keep dispatch_source_t, introduce a shared lifetime token to safely detect when the C++ delegate has been destroyed. The code suggestion below implements this option (requires adding std::shared_ptr<bool> mIsAlive = std::make_shared<bool>(true); to the header and setting *mIsAlive = false; in the destructor).
dispatch_queue_t watchdogQueue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);
dispatch_source_t watchdog = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, watchdogQueue);
dispatch_source_set_timer(watchdog,
dispatch_time(DISPATCH_TIME_NOW, static_cast<int64_t>(kExternalNOCChainIssuerWatchdogTimeoutSeconds * NSEC_PER_SEC)),
DISPATCH_TIME_FOREVER,
static_cast<uint64_t>(0.5 * NSEC_PER_SEC));
std::weak_ptr<bool> weakIsAlive = mIsAlive;
__weak dispatch_source_t weakWatchdog = watchdog;
dispatch_source_set_event_handler(watchdog, ^{
dispatch_source_t strongWatchdog = weakWatchdog;
if (strongWatchdog) {
dispatch_source_cancel(strongWatchdog);
}
auto pin = weakIsAlive.lock();
if (!pin || !*pin) {
return;
}
MTR_LOG_ERROR("MTROperationalCredentialsDelegate: external NOC chain issuer did not complete within %.0fs; failing SendNOC",
kExternalNOCChainIssuerWatchdogTimeoutSeconds);
MTRDeviceController_Concrete * strongController = weakController;
if (strongController == nil || !strongController.isRunning) {
MTR_LOG_ERROR("MTROperationalCredentialsDelegate: external NOC chain issuer watchdog fired but controller shut down; abandoning completion");
return;
}
NSError * timeoutError = [MTRError errorForCHIPErrorCode:CHIP_ERROR_TIMEOUT];
this->ExternalNOCChainGenerated(nil, timeoutError);
});
mExternalNOCChainWatchdog = watchdog;
dispatch_resume(watchdog);
There was a problem hiding this comment.
Pull request overview
This PR improves Darwin commissioning behavior when an external MTROperationalCertificateIssuer hangs by adding a watchdog timeout for NOC chain generation, and adds an integration test to ensure commissioning fails quickly (and safely ignores late issuer completions).
Changes:
- Add a ~20s dispatch-source watchdog around the external NOC chain issuer path, surfacing
CHIP_ERROR_TIMEOUTinstead of waiting for the longer MRP exchange timeout. - Add teardown-safe watchdog cancellation (including delegate destruction) and additional logging for duplicate/late completions.
- Add an integration test that simulates an issuer that never calls its completion and asserts fast timeout + late-completion no-op behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/darwin/Framework/CHIPTests/MTROperationalCertificateIssuerTests.m | Adds a new integration test exercising the “issuer never completes” watchdog timeout and late-completion safety. |
| src/darwin/Framework/CHIP/MTROperationalCredentialsDelegate.mm | Implements the external issuer watchdog timer and related cancellation/logging logic. |
| src/darwin/Framework/CHIP/MTROperationalCredentialsDelegate.h | Adds a watchdog dispatch source member and a non-trivial destructor. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // CHIP_ERROR_TIMEOUT through the same completion path. Cancelled (and | ||
| // released) by ExternalNOCChainGenerated when the real completion arrives | ||
| // first. | ||
| dispatch_source_t _Nullable mExternalNOCChainWatchdog; |
There was a problem hiding this comment.
(FYI: This is Claude) Strictly speaking this is a false positive — mExternalNOCChainWatchdog is a strong ObjC dispatch_source_t ivar (_Nullable), and ObjC ivars of object type are zero-initialized to nil by the runtime, so the destructor's if (mExternalNOCChainWatchdog != nil) guard is well-defined even before any arm. That said, I agree the explicit initializer is clearer and matches the style of mOnNOCCompletionCallback = nullptr next to it. I've added = nil for both mExternalNOCChainWatchdog and mWatchdogQueue in the latest amend.
| if (mOnNOCCompletionCallback == nullptr) { | ||
| // Either the watchdog already drove a timeout completion, or | ||
| // we are seeing a second invocation from a misbehaving issuer. | ||
| // Both are no-ops here; log so post-hoc we can tell whether a | ||
| // hung issuer eventually responded after the watchdog fired. | ||
| MTR_LOG_ERROR("MTROperationalCredentialsDelegate: ExternalNOCChainGenerated callback already consumed; ignoring (error=%@)", error); | ||
| return; | ||
| } | ||
|
|
||
| // Cancel the issuer watchdog so it cannot fire after we've already | ||
| // dispatched the real completion. Idempotent against the case | ||
| // where the watchdog itself drove us here (it cancels itself in | ||
| // its event handler before re-entering). | ||
| if (mExternalNOCChainWatchdog != nil) { | ||
| dispatch_source_cancel(mExternalNOCChainWatchdog); | ||
| mExternalNOCChainWatchdog = nil; | ||
| } |
There was a problem hiding this comment.
(FYI: This is Claude) Validated against the current code: this is a false positive in practice. The watchdog handler (lines ~301-331) self-cancels and clears mExternalNOCChainWatchdog = nil on its own teardown path before re-entering ExternalNOCChainGenerated. So the only way to reach the early-return at line 347 with the timeout already consumed is the watchdog-fired path, in which case mExternalNOCChainWatchdog is already nil — there's no stale timer being left armed. The other early-return reason ("second invocation from a misbehaving issuer") arrives after a real completion already cancelled the watchdog at line 360-363 of the prior call, so again no stale timer. Leaving the cancel after the early-return keeps the invariant "watchdog is cancelled exactly when we transition mOnNOCCompletionCallback from non-null to null" tight; pulling it earlier would split that invariant. Happy to move it if you still prefer, but I don't see a reachable stale-timer path.
| // Arm a watchdog so a 3rd-party issuer that crashes or hangs cannot stall | ||
| // SendNOC for the full MRP-inflated CommandSender exchange timeout (~67s). | ||
| // The timer is created suspended, configured, then resumed; the handler | ||
| // simply synthesizes a timeout error through ExternalNOCChainGenerated, | ||
| // which already self-guards against double-callback via | ||
| // mOnNOCCompletionCallback == nullptr. |
There was a problem hiding this comment.
(FYI: This is Claude) Validated and this is a false positive given the threading model. ExternalGenerateNOCChain runs under assertChipStackLockedByCurrentThread() (line 195) — i.e. on the Matter work queue, holding the stack lock. The issuer block dispatched at line 251 ultimately delivers completion via ExternalNOCChainGenerated, which only does [mWeakController asyncGetCommissionerOnMatterQueue:^...] (line 343-344) — its real work runs on the Matter queue. Even if the consumer's issuer queue is concurrent and synthesizes an instant completion, that inner block cannot start executing on the Matter queue until ExternalGenerateNOCChain returns and releases it. The watchdog arm/store/resume at lines 296-333 all complete before that release. So no race: the watchdog is always armed before any inner completion can run. The dispatch_async path is required precisely because we cannot call into a consumer-supplied issuer while holding the Matter stack lock.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #72271 +/- ##
=======================================
Coverage 55.52% 55.52%
=======================================
Files 1630 1630
Lines 111127 111127
Branches 13418 13418
=======================================
Hits 61706 61706
Misses 49421 49421 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
PR #72271: Size comparison from adba29b to cc3e9ab Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
cc3e9ab to
8ec659d
Compare
|
PR #72271: Size comparison from 4894db9 to a6dd6a3 Full report (16 builds for cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
a6dd6a3 to
b4f375f
Compare
b4f375f to
73d722c
Compare
|
PR #72271: Size comparison from 8a162c6 to d85c2aa Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
9c39270 to
7b38710
Compare
|
PR #72271: Size comparison from 8a162c6 to 0b3fe3b Full report (34 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
0b3fe3b to
f00f1dc
Compare
|
PR #72271: Size comparison from 8a162c6 to 19c0277 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
19c0277 to
b22a551
Compare
|
PR #72271: Size comparison from 8a162c6 to 200ca13 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
200ca13 to
cc84352
Compare
|
PR #72271: Size comparison from 8a162c6 to d87832d Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
2b58a0a to
ea91ae0
Compare
|
PR #72271: Size comparison from 8a162c6 to ea91ae0 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
ea91ae0 to
ecfc01b
Compare
When a controller is configured with an MTROperationalCertificateIssuer (typically a 3rd-party companion app that issues NOCs on behalf of an ecosystem), the SDK currently awaits the issuer's completion block indefinitely. If the issuer (or the process hosting it) crashes or hangs mid-commissioning, SendNOC stalls until CommandSender's MRP-inflated exchange timeout fires (~67s in practice), pushing total commissioning time past most user-facing watchdogs and leaving the user with a generic "could not add accessory" error. Add a 20-second dispatch_source watchdog around the external issuer dispatch in ExternalGenerateNOCChain. On expiration we synthesize a CHIP_ERROR_TIMEOUT through the same ExternalNOCChainGenerated path; the existing mOnNOCCompletionCallback == nullptr guard there gives us idempotency against a real completion racing the watchdog. When the issuer responds normally, ExternalNOCChainGenerated cancels and releases the timer so there is no observable behavior change for healthy issuers. The 20s budget leaves ~40s of headroom under the 60s default failsafe and is well above any realistic round-trip for a healthy issuer. Run the watchdog on the Matter work queue and add a dispatch_sync fence to the destructor so a controller shutdown that races a hung issuer cannot use-after-free the delegate from an in-flight handler block, and cannot trip libdispatch's "Release of a source that has not been cancelled" trap if the source is still armed when the delegate is freed. Tests: - testExternalNOCChainGeneration_IssuerNeverCompletes_FiresWatchdogTimeout pins the user-visible behavior: a hung issuer surfaces MTRErrorCodeTimeout to the controller delegate within ~20s, and a late completion is a no-op. - testExternalNOCChainGeneration_ShutdownWhileWatchdogArmed_DoesNotTrap pins the destructor-vs-watchdog teardown path: shutting down the controller while the watchdog dispatch_source is armed-but-unfired must not trap libdispatch and must not UAF the delegate from the handler block. - testExternalNOCChainGeneration_RepeatedShutdownDuringWatchdogWindow_NoLibdispatchTrap is a stress / long-running variant that flushes out residual handler-vs-destructor races across repeated arm/teardown cycles. - testExternalNOCChainGeneration_FactoryStopWhileWatchdogArmed_DoesNotTrap exercises the factory-stop teardown path (no explicit [controller shutdown]) so the destructor's cancel + Matter-queue fence is also covered when teardown is driven purely by reference counting. - testExternalNOCChainGeneration_IssuerRespondsJustAfterWatchdogFires_NoDoubleCallback pins the post-fire double-callback guard: a near-simultaneous issuer completion arriving immediately after the watchdog fires is dropped via the mOnNOCCompletionCallback == nullptr guard, with no second commissioningComplete callback and no crash.
ecfc01b to
6b917c7
Compare
|
PR #72271: Size comparison from 8a162c6 to 6b917c7 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
Summary
MTROperationalCertificateIssuer(typically a 3rd-party companion app that issues NOCs on behalf of an ecosystem), the SDK currently awaits the issuer's completion block indefinitely. If the issuer (or the process hosting it) crashes or hangs mid-commissioning,SendNOCstalls untilCommandSender's MRP-inflated exchange timeout fires (~67s in practice), pushing total commissioning time past most user-facing watchdogs and leaving the user with a generic "could not add accessory" error.dispatch_sourcewatchdog around the external issuer dispatch inExternalGenerateNOCChain. On expiration we synthesize aCHIP_ERROR_TIMEOUTthrough the sameExternalNOCChainGeneratedpath; the existingmOnNOCCompletionCallback == nullptrguard there gives us idempotency against a real completion racing the watchdog.ExternalNOCChainGeneratedcancels and releases the timer, so there is no observable behavior change for healthy issuers. The 20s budget leaves ~40s of headroom under the 60s default failsafe and is well above any realistic round-trip for a healthy issuer.dispatch_syncfences to the delegate destructor so a controller shutdown that races a hung issuer cannot use-after-free the delegate from an in-flight handler, and cannot trip libdispatch's "Release of a source that has not been cancelled" trap. The watchdog queue (GetWorkQueue()) is the same queue the factory hands every controller as its_chipWorkQueue, so the two fences drain both the watchdog handler and the single depth-1asyncGetCommissionerOnMatterQueueinner block; the destructorVerifyOrDie-traps if it is ever entered on that queue with the watchdog still armed (off-queue teardown is the contract today).Scope / deferred follow-up
This PR deliberately does not change the long-standing silent no-op on the
mCppCommissioner != commissionerbranch ofExternalNOCChainGenerated. An earlier revision made that branch invokeonCompletion->mCall(CHIP_ERROR_INCORRECT_STATE, …); that is an independent, currently-unproven behavior change (the callback context can belong to a torn-down commissioner — a plausible UAF) with nothing to do with the watchdog, so it has been reverted to the original barereturnand left for a separate change with its own test + justification.Testing
Four end-to-end tests in
MTROperationalCertificateIssuerTests.m, all run against theall-clusters-appover real commissioning:testExternalNOCChainGeneration_IssuerNeverCompletes_FiresWatchdogTimeout— drives commissioning with an issuer that captures but never invokes its completion. Asserts the controller delegate seesMTRErrorCodeTimeout, with the elapsed-time bound anchored at issuer-invoke (~20s window) so an earlier-phase failure cannot satisfy it; impossible pre-fix, which waits ~67s. Also invokes the captured completion late to confirm the swallow path is a no-op (no crash / double-complete).testExternalNOCChainGeneration_ShutdownWhileWatchdogArmed_DoesNotTrap— tears down via explicit[controller shutdown]while the watchdog source is armed-but-unfired. Pre-fix this trips libdispatch's source-release trap and/or UAFs the delegate; post-fix the destructor cancels + Matter-queue-fences beforethisis freed.testExternalNOCChainGeneration_FactoryStopWhileWatchdogArmed_DoesNotTrap— same teardown safety on the distinct factory-stop / refcount-driven path (no explicitshutdowncall), which has historically had different Matter-queue-drain ordering.testExternalNOCChainGeneration_DoubleInvocationByMisbehavingIssuer_NoCrashNoDoubleCallback— a misbehaving issuer invokes its completion twice; pins themOnNOCCompletionCallback == nullptrguard (no double-callback, no UAF on the chipCallback). Fast (sub-watchdog-window).The UAF / "does not trap" assertions surface hard libdispatch traps unconditionally; the use-after-free detection additionally relies on ASan/TSan/leaks being enabled in CI. The pre-existing
testFailedCertificateIssuance(healthy issuer) continues to pass — the watchdog is cancelled on the normal completion path, so healthy issuers see no behavior change.(An earlier revision carried 11 tests, including several that waited the full ~20s watchdog window, a 25s+ post-shutdown spin, and a self-skipping repeated-shutdown stress loop. Those were dropped as slow / redundant / flaky-by-design; the four above are the load-bearing set.)