Skip to content

Fix race condition in websocket test#1185

Merged
bjsowa merged 2 commits intoRobotWebTools:ros2from
field-ai:hotfix/testing-race
Mar 17, 2026
Merged

Fix race condition in websocket test#1185
bjsowa merged 2 commits intoRobotWebTools:ros2from
field-ai:hotfix/testing-race

Conversation

@FieldSwan
Copy link
Contributor

@FieldSwan FieldSwan commented Mar 16, 2026

Public API Changes
None

Description
This fixes an issue with the websocket testing instrumentation which had a race condition inherent in it that was causing my test to sometimes fail incorrectly (false negative) which led me to believe that other solutions were helping with other issues (they were not).

Changes:

  1. Fix websocket testing instrumentation race condition when many clients requested at once. You can verify this fix by trying this before and after the fix:
colcon test --packages-select rosbridge_server --ctest-args -R "test_startup_race" --repeat until-fail:500 --event-handlers console_direct+ 2>&1 | tee /tmp/startup_race_retry.txt
# in another terminal:
stress --cpu 15

Eventually this should fail, especially with stress running at the same time (load your CPU as much as possible). The fix was to use the reactorThread for sending messages and to add retries to getting the ros port parameter from the ros node.
3. Removes the startup_race.test.py test which, it turns out, was mostly exercising the websocket test instrumentation and not any particularly useful code paths.
4. Reverts the previous "fix" I made to the server instantiation order. Upon further examination, I realized that the order does not matter since the websocket requests are not processed in rosbridge_websocket.py until we call await stop_event.wait(), which is already after the ros2 executor and node are all started. I can remove this change if we want to keep the order of instantiation I had changed to earlier (either way should work fine).

@FieldSwan FieldSwan changed the title Hotfix/testing race Fix race condition in websocket test Mar 16, 2026
@bjsowa
Copy link
Member

bjsowa commented Mar 16, 2026

You sure you want to remove the startup_race test? Didn't it help you detect the race condition you are trying to fix now?

@ikwilnaarhuisman
Copy link

ikwilnaarhuisman commented Mar 16, 2026

Tested this code, doesn't fix the underlying issue mentioned in. #1144

Happens because the RCLPY (SingleThreaded)Executor may crash, which then still throws [rosbridge_websocket-11] rclpy._rclpy_pybind11.RCLError: Failed to add 'rcl_action_client_t' to wait set: action client pointer is invalid, at ./src/rcl_action/action_client.c:510

Me and @YannickdeHoop are working on an actual fix which uses the executor directly so you may never have race conditions for your results ever again: https://github.com/eurogroep/rosbridge_suite/tree/fix/use-executor-callback-queue

@bjsowa LMK what you think of this.

@bjsowa
Copy link
Member

bjsowa commented Mar 17, 2026

@bjsowa LMK what you think of this.

A test, demonstrating where the underlying issue is, would be helpful.

@FieldSwan
Copy link
Contributor Author

FieldSwan commented Mar 17, 2026

You sure you want to remove the startup_race test? Didn't it help you detect the race condition you are trying to fix now?

It did help me identify a race condition, but specifically the race condition located in rosbridge_server/test/websocket/common.py, not in the functional code.

The test does not capture the original deadlock condition I found; rosbridge_library/test/internal/test_ros_loader_concurrent.py does though. There's technically no harm in leaving the test in so I could go either way on this, just wanted it to be clear that it's more similar to a smoke test that happened to catch the test instrumentation randomly than a functional test that is deterministically finding future issues.

@bjsowa bjsowa merged commit 85ff1cd into RobotWebTools:ros2 Mar 17, 2026
2 checks passed
@FieldSwan
Copy link
Contributor Author

FieldSwan commented Mar 17, 2026

Tested this code, doesn't fix the underlying issue mentioned in. #1144

Happens because the RCLPY (SingleThreaded)Executor may crash, which then still throws [rosbridge_websocket-11] rclpy._rclpy_pybind11.RCLError: Failed to add 'rcl_action_client_t' to wait set: action client pointer is invalid, at ./src/rcl_action/action_client.c:510

Me and @YannickdeHoop are working on an actual fix which uses the executor directly so you may never have race conditions for your results ever again: https://github.com/eurogroep/rosbridge_suite/tree/fix/use-executor-callback-queue

@bjsowa LMK what you think of this.

@ikwilnaarhuisman did you try PR #1183 ? This is the fix I suggested could fix Issue #1144

@RobotWebTools RobotWebTools deleted a comment from mergify bot Mar 17, 2026
@RobotWebTools RobotWebTools deleted a comment from mergify bot Mar 17, 2026
@bjsowa
Copy link
Member

bjsowa commented Mar 17, 2026

@Mergifyio backport kilted

@mergify
Copy link

mergify bot commented Mar 17, 2026

backport kilted

✅ Backports have been created

Details

mergify bot pushed a commit that referenced this pull request Mar 17, 2026
* Integration test instrumentation fix

* Revert unnecessary startup fix and remove corresponding always passing test

(cherry picked from commit 85ff1cd)
bjsowa pushed a commit that referenced this pull request Mar 17, 2026
* Integration test instrumentation fix

* Revert unnecessary startup fix and remove corresponding always passing test

(cherry picked from commit 85ff1cd)

Co-authored-by: FieldSwan <michael.swan@fieldai.com>
@bjsowa
Copy link
Member

bjsowa commented Mar 17, 2026

@Mergifyio backport jazzy

@mergify
Copy link

mergify bot commented Mar 17, 2026

backport jazzy

✅ Backports have been created

Details

Cherry-pick of 85ff1cd has failed:

On branch mergify/bp/jazzy/pr-1185
Your branch is up to date with 'origin/jazzy'.

You are currently cherry-picking commit 85ff1cd.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   rosbridge_server/scripts/rosbridge_websocket.py
	modified:   rosbridge_server/test/websocket/common.py

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	both modified:   rosbridge_server/CMakeLists.txt
	deleted by them: rosbridge_server/test/websocket/startup_race.test.py

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

mergify bot pushed a commit that referenced this pull request Mar 17, 2026
* Integration test instrumentation fix

* Revert unnecessary startup fix and remove corresponding always passing test

(cherry picked from commit 85ff1cd)

# Conflicts:
#	rosbridge_server/CMakeLists.txt
#	rosbridge_server/test/websocket/startup_race.test.py
bjsowa added a commit that referenced this pull request Mar 17, 2026
* Integration test instrumentation fix

* Revert unnecessary startup fix and remove corresponding always passing test

(cherry picked from commit 85ff1cd)

# Conflicts:
#	rosbridge_server/CMakeLists.txt
#	rosbridge_server/test/websocket/startup_race.test.py

* Fix conflicts

---------

Co-authored-by: FieldSwan <michael.swan@fieldai.com>
Co-authored-by: Błażej Sowa <bsowa123@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants