Skip to content

Deterministic test proving WorkQueue destructor deadlock#146

Open
bghgary wants to merge 1 commit intoBabylonJS:mainfrom
bghgary:repro/appruntime-deadlock-v2
Open

Deterministic test proving WorkQueue destructor deadlock#146
bghgary wants to merge 1 commit intoBabylonJS:mainfrom
bghgary:repro/appruntime-deadlock-v2

Conversation

@bghgary
Copy link
Copy Markdown
Contributor

@bghgary bghgary commented Apr 2, 2026

Deterministic test proving the AppRuntime destructor has a deadlock race condition.

The Bug

WorkQueue::~WorkQueue() cancels from the main thread via cancel() + notify_all(). The notify fires without holding the queue mutex, so if the worker thread hasn't entered condition_variable::wait() yet, the signal is lost and join() hangs forever.

How This Test Works

Uses arcana testing hooks (microsoft/arcana.cpp#59) to sleep while holding the queue mutex right before wait(). This guarantees the worker is between the condition check and wait() when the destructor fires.

  • Current code (broken): cancel() + notify_all() don't need the mutex, so they fire and the notification is lost while the worker sleeps. Worker enters wait() with no pending signal. Deadlock.
  • Fixed code: Merges WorkQueue into AppRuntime and dispatches cancellation as a work item via Append(). push() needs the mutex, so it blocks until the worker enters wait(). Push completes, notifies, worker wakes. No deadlock.

Expected Result

This PR should FAIL CI -- the test will detect the deadlock via a 5-second timeout on iteration 0.

Changes

All additive, no production code modified:

  • CMakeLists.txt: Point arcana at fork with testing hooks
  • Tests/UnitTests/CMakeLists.txt: Enable ARCANA_TESTING_HOOKS for test target
  • Tests/UnitTests/Android/.../CMakeLists.txt: Enable ARCANA_TESTING_HOOKS for Android test target
  • Tests/UnitTests/Shared/Shared.cpp: Add DestroyDoesNotDeadlock test (guarded by ARCANA_TESTING_HOOKS)
  • Tests/UnitTests/Win32/App.cpp: Pass argc/argv to gtest for --gtest_filter support

Dependencies

@bghgary bghgary force-pushed the repro/appruntime-deadlock-v2 branch from 2cea4fa to 421a7ba Compare April 2, 2026 18:19
@bghgary bghgary force-pushed the repro/appruntime-deadlock-v2 branch 3 times, most recently from d806b08 to fe9c37f Compare April 2, 2026 18:55
bghgary added a commit to bghgary/JsRuntimeHost that referenced this pull request Apr 2, 2026
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all()
fired without the queue mutex, so the signal could be lost if the worker
thread hadn't entered condition_variable::wait() yet, causing join() to
hang forever.

This change merges WorkQueue into AppRuntime, eliminating split-lifetime
issues, and dispatches cancellation as a work item via Append(). Since
push() acquires the queue mutex, it blocks until the worker enters wait(),
guaranteeing the notification is delivered.

Changes:
- Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel
  source, env, suspension lock)
- Remove WorkQueue.h and WorkQueue.cpp
- Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch
- Add deterministic regression test using arcana testing hooks
- Fix member declaration order so m_options outlives worker thread

The regression test uses arcana::set_before_wait_callback() to sleep while
holding the queue mutex before wait(), deterministically triggering the race.
See BabylonJS#146 for the test running against the old broken code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary added a commit to bghgary/JsRuntimeHost that referenced this pull request Apr 2, 2026
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all()
fired without the queue mutex, so the signal could be lost if the worker
thread hadn't entered condition_variable::wait() yet, causing join() to
hang forever.

This change merges WorkQueue into AppRuntime, eliminating split-lifetime
issues, and dispatches cancellation as a work item via Append(). Since
push() acquires the queue mutex, it blocks until the worker enters wait(),
guaranteeing the notification is delivered.

Changes:
- Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel
  source, env, suspension lock)
- Remove WorkQueue.h and WorkQueue.cpp
- Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch
- Add deterministic regression test using arcana testing hooks
- Fix member declaration order so m_options outlives worker thread

The regression test uses arcana::set_before_wait_callback() to sleep while
holding the queue mutex before wait(), deterministically triggering the race.
See BabylonJS#146 for the test running against the old broken code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary added a commit to bghgary/JsRuntimeHost that referenced this pull request Apr 2, 2026
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all()
fired without the queue mutex, so the signal could be lost if the worker
thread hadn't entered condition_variable::wait() yet, causing join() to
hang forever.

This change merges WorkQueue into AppRuntime, eliminating split-lifetime
issues, and dispatches cancellation as a work item via Append(). Since
push() acquires the queue mutex, it blocks until the worker enters wait(),
guaranteeing the notification is delivered.

Changes:
- Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel
  source, env, suspension lock)
- Remove WorkQueue.h and WorkQueue.cpp
- Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch
- Add deterministic regression test using arcana testing hooks
- Fix member declaration order so m_options outlives worker thread

The regression test uses arcana::set_before_wait_callback() to sleep while
holding the queue mutex before wait(), deterministically triggering the race.
See BabylonJS#146 for the test running against the old broken code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary force-pushed the repro/appruntime-deadlock-v2 branch 2 times, most recently from 2c0dcab to 6763017 Compare April 2, 2026 20:33
bghgary added a commit to bghgary/JsRuntimeHost that referenced this pull request Apr 2, 2026
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all()
fired without the queue mutex, so the signal could be lost if the worker
thread hadn't entered condition_variable::wait() yet, causing join() to
hang forever.

This change merges WorkQueue into AppRuntime, eliminating split-lifetime
issues, and dispatches cancellation as a work item via Append(). Since
push() acquires the queue mutex, it blocks until the worker enters wait(),
guaranteeing the notification is delivered.

Changes:
- Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel
  source, env, suspension lock)
- Remove WorkQueue.h and WorkQueue.cpp
- Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch
- Add deterministic regression test using arcana testing hooks
- Fix member declaration order so m_options outlives worker thread

The regression test uses arcana::set_before_wait_callback() to sleep while
holding the queue mutex before wait(), deterministically triggering the race.
See BabylonJS#146 for the test running against the old broken code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary added a commit to bghgary/JsRuntimeHost that referenced this pull request Apr 2, 2026
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all()
fired without the queue mutex, so the signal could be lost if the worker
thread hadn't entered condition_variable::wait() yet, causing join() to
hang forever.

This change merges WorkQueue into AppRuntime, eliminating split-lifetime
issues, and dispatches cancellation as a work item via Append(). Since
push() acquires the queue mutex, it blocks until the worker enters wait(),
guaranteeing the notification is delivered.

Changes:
- Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel
  source, env, suspension lock)
- Remove WorkQueue.h and WorkQueue.cpp
- Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch
- Add deterministic regression test using arcana testing hooks
- Fix member declaration order so m_options outlives worker thread

The regression test uses arcana::set_before_wait_callback() to sleep while
holding the queue mutex before wait(), deterministically triggering the race.
See BabylonJS#146 for the test running against the old broken code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary force-pushed the repro/appruntime-deadlock-v2 branch 2 times, most recently from 41ade60 to 3c4d787 Compare April 2, 2026 22:45
@bghgary bghgary marked this pull request as ready for review April 2, 2026 22:55
Copilot AI review requested due to automatic review settings April 2, 2026 22:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a deterministic unit test that reproduces a WorkQueue/AppRuntime destructor deadlock by leveraging Arcana “before wait” testing hooks, and updates the Windows unit-test entrypoint to support standard gtest CLI arguments.

Changes:

  • Switches the arcana.cpp FetchContent dependency to a fork that includes testing hooks and defines ARCANA_TESTING_HOOKS when tests are enabled.
  • Adds a new AppRuntime.DestroyDoesNotDeadlock test (guarded by ARCANA_TESTING_HOOKS) that attempts to detect the destructor deadlock via timeouts.
  • Updates Win32 unit test main to pass argc/argv into GoogleTest initialization and run tests via RUN_ALL_TESTS().

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
CMakeLists.txt Points Arcana dependency to a fork and globally defines ARCANA_TESTING_HOOKS under JSRUNTIMEHOST_TESTS.
Tests/UnitTests/Shared/Shared.cpp Adds a new deterministic deadlock-repro test using Arcana testing hooks.
Tests/UnitTests/Win32/App.cpp Updates test entrypoint to initialize gtest with argc/argv and run tests directly.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@bghgary bghgary force-pushed the repro/appruntime-deadlock-v2 branch 2 times, most recently from 89322f6 to 9190b96 Compare April 2, 2026 23:26
bghgary added a commit to bghgary/JsRuntimeHost that referenced this pull request Apr 2, 2026
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all()
fired without the queue mutex, so the signal could be lost if the worker
thread hadn't entered condition_variable::wait() yet, causing join() to
hang forever.

This change merges WorkQueue into AppRuntime, eliminating split-lifetime
issues, and dispatches cancellation as a work item via Append(). Since
push() acquires the queue mutex, it blocks until the worker enters wait(),
guaranteeing the notification is delivered.

Changes:
- Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel
  source, env, suspension lock)
- Remove WorkQueue.h and WorkQueue.cpp
- Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch
- Add deterministic regression test using arcana testing hooks
- Fix member declaration order so m_options outlives worker thread

The regression test uses arcana::set_before_wait_callback() to sleep while
holding the queue mutex before wait(), deterministically triggering the race.
See BabylonJS#146 for the test running against the old broken code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary added a commit that referenced this pull request Apr 7, 2026
Fix the WorkQueue destructor deadlock by using a mutex-coordinated
wake-up.

## The Bug

WorkQueue::~WorkQueue() cancelled from the main thread via cancel() +
notify_all(). The notify fired **without holding the queue mutex**, so
if the worker thread hadn't entered condition_variable::wait() yet, the
signal was lost and join() hung forever. See #146 for a deterministic
repro of the deadlock against the old code.

```
Main Thread                          Worker Thread
                                     internal_drain():
                                       lock(mutex)
                                       check: empty? -> yes
                                       check: cancelled? -> no
                                       unlock(mutex)  <- about to call wait()
                                            |
~WorkQueue():                          RACE WINDOW
  cancel()                            (no mutex held)
  notify_all() ---- LOST! --------->
                                            |
                                       lock(mutex)
                                       wait(lock) ---- blocks forever

  join() ----------------------------- blocks forever
       |
   DEADLOCK
```

## The Fix

Cancel immediately (so pending work is dropped promptly), then append a
no-op work item to wake the worker. The no-op goes through push() which
acquires the same mutex that wait() releases, so the notification cannot
be lost.

```
Main Thread                          Worker Thread
                                     internal_drain():
                                       lock(mutex)
                                       check: empty? -> yes
                                       wait(lock) ---- releases mutex
                                            |
~WorkQueue():
  cancel()
  Append(no-op):
    push():
      lock(mutex) <---- acquires it (worker is in wait)
      queue.push(no-op)
      unlock(mutex)
      notify_one() ------------------> wakes up!
                                            |
                                       drain -> execute no-op
                                       loop check: cancelled? -> yes
                                       exit loop
  join() <----------------------------- thread exits
       |
   SUCCESS
```

Also fixes member declaration order in AppRuntime.h so m_options
outlives m_workQueue during destruction.

## Regression Test

Includes a deterministic test using arcana testing hooks
(microsoft/arcana.cpp#59, merged) that sleeps while holding the queue
mutex before wait(). This guarantees the worker is in the vulnerable
window when destruction fires. The test **passes** with this fix and
**deadlocks** with the old code (#146).

## Follow-up

Merging WorkQueue into AppRuntime will be done in a separate PR to
eliminate split-lifetime issues.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary
Copy link
Copy Markdown
Contributor Author

bghgary commented Apr 7, 2026

Superseded by #147 which is now merged. This PR served its purpose demonstrating the deadlock with the old code.

@bghgary bghgary closed this Apr 7, 2026
@bghgary bghgary reopened this Apr 9, 2026
Reverts the deadlock fix to demonstrate the test detects it.
Test updated with cleanup from BabylonJS#151.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary force-pushed the repro/appruntime-deadlock-v2 branch from 0d21742 to 15bb2f1 Compare April 9, 2026 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants