Hotfix/subscription race #1194

Open
FieldSwan wants to merge 2 commits into RobotWebTools:ros2 from field-ai:hotfix/subscription-race
FieldSwan commented Mar 18, 2026

Public API Changes
None

Background
rosbridge_server would sometimes partially crash (on the ROS 2 side) when multiple clients connected, disconnected, or reconnected during bootup. The executor's spin thread died with this traceback:

[INFO] [1773839146.399416586] [rosbridge_websocket]: [Client 706ee77c-025c-4b4c-aed6-cf68aab19bd7] Subscribed to /foo/dashboard/current_path
Exception in thread Thread-1 (spin):
Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
self.run()
File "/usr/lib/python3.12/threading.py", line 1010, in run
self._target(*self._args, **self._kwargs)
File "/opt/ros/kilted/lib/python3.12/site-packages/rclpy/executors.py", line 374, in spin
self.spin_once()
File "/opt/ros/kilted/lib/python3.12/site-packages/rclpy/executors.py", line 968, in spin_once
self._spin_once_impl(timeout_sec)
File "/opt/ros/kilted/lib/python3.12/site-packages/rclpy/executors.py", line 951, in _spin_once_impl
handler, entity, node = self.wait_for_ready_callbacks(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ros/kilted/lib/python3.12/site-packages/rclpy/executors.py", line 921, in wait_for_ready_callbacks
return next(self._cb_iter)
^^^^^^^^^^^^^^^^^^^
[INFO] [1773839146.413645247] [rosbridge_websocket]: [Client 706ee77c-025c-4b4c-aed6-cf68aab19bd7] Subscribed to /foo/vehicle_status
File "/opt/ros/kilted/lib/python3.12/site-packages/rclpy/executors.py", line 820, in _wait_for_ready_callbacks
waitable.add_to_wait_set(wait_set)
File "/opt/ros/kilted/lib/python3.12/site-packages/rclpy/event_handler.py", line 176, in add_to_wait_set
with self.__event:
rclpy._rclpy_pybind11.InvalidHandle: cannot use Destroyable because destruction was requested

Description

  1. This PR fixes the issue by scheduling subscription destruction as a task on the executor thread, instead of destroying subscriptions inside their own callbacks.
  2. In my stress testing (3 clients under high CPU and bandwidth load, each then reloaded), that fix alone left subscriptions racing to send to the websocket after it was already down (better than crashing!). This delayed the client side for a short while (up to ~30 seconds) while the server tried to clear the queue as some subscription messages were still coming in. Only once the queue was cleared did the server accept new websocket connections and the bridge become functional again. To reduce this, we now block outgoing websocket messages as soon as we know the client has disconnected, which seems to reduce the worst-case reconnect time to ~5 seconds.
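The two changes described above can be illustrated with a minimal, standard-library-only model. This is a sketch under stated assumptions, not the code in this PR: `FakeSubscription`, `MiniExecutor`, `ClientSession`, and the local `InvalidHandle` class are hypothetical stand-ins for the rclpy and rosbridge objects. The `create_task` name mirrors the method rclpy executors expose, which may be what the real fix uses, but the PR text does not show that.

```python
import queue
import threading


class InvalidHandle(Exception):
    """Hypothetical stand-in for the rclpy InvalidHandle error in the traceback."""


class FakeSubscription:
    """Hypothetical subscription whose handle becomes unusable once destroyed."""

    def __init__(self, name):
        self.name = name
        self.destroyed = False

    def add_to_wait_set(self):
        if self.destroyed:
            raise InvalidHandle(f"{self.name}: destruction was requested")

    def destroy(self):
        self.destroyed = True


class MiniExecutor:
    """Toy single-threaded executor: waiting and destruction share one thread."""

    def __init__(self):
        self.subscriptions = []
        self._tasks = queue.Queue()

    def create_task(self, fn):
        # Safe to call from any thread; fn runs later inside spin_once().
        self._tasks.put(fn)

    def spin_once(self):
        # Run deferred tasks first, so destruction never overlaps the wait set.
        while True:
            try:
                task = self._tasks.get_nowait()
            except queue.Empty:
                break
            task()
        # Build the "wait set" only from entities that are still alive.
        for sub in list(self.subscriptions):
            sub.add_to_wait_set()


class ClientSession:
    """Toy websocket client state with a send gate."""

    def __init__(self, executor):
        self.executor = executor
        self.connected = True
        self.sent = []
        self.dropped = 0

    def send(self, message):
        # Drop outgoing messages as soon as the client is known to be gone.
        if not self.connected:
            self.dropped += 1
            return False
        self.sent.append(message)
        return True

    def on_disconnect(self, subs):
        self.connected = False  # close the gate before any teardown
        for sub in subs:
            # Defer destruction to the executor thread instead of doing it here.
            self.executor.create_task(lambda s=sub: self._destroy(s))

    def _destroy(self, sub):
        self.executor.subscriptions.remove(sub)
        sub.destroy()


executor = MiniExecutor()
subs = [FakeSubscription("/foo/vehicle_status"),
        FakeSubscription("/foo/dashboard/current_path")]
executor.subscriptions.extend(subs)
session = ClientSession(executor)

session.send("before disconnect")
# The disconnect arrives on a different thread than the one that spins.
worker = threading.Thread(target=session.on_disconnect, args=(subs,))
worker.start()
worker.join()
session.send("after disconnect")  # dropped by the gate
executor.spin_once()              # destroys both subscriptions, then waits safely
```

In this model, calling `sub.destroy()` directly from `on_disconnect` while the subscription is still registered would make the next `spin_once()` raise, which loosely mirrors the failure in the traceback. Queuing the teardown keeps every use of the handle on one thread, and the `connected` flag stops late messages from piling up behind a dead websocket.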

Testing
I have not written a unit test for this, but the general reproduction steps seem to be:

  1. Have multiple clients waiting to connect to rosbridge_server at the same time.
  2. Put the system under CPU stress so that some clients time out and attempt to reconnect.
  3. Observe that rosbridge_server partially crashes: new websocket connections can still be made, but no responses are ever sent back.
