[kilted] Address flakiness in the "rosbag2_transport::test_record_services" tests (backport #2368) #2379
Merged
MichaelOrlov merged 2 commits into kilted on Apr 14, 2026
…sts (#2368)

* Use exec_->spin_until_future_complete() in successful_service_request()

- Add pub_node_ spinning while waiting for matched.
- Wait longer due to potential service latency after pause/resume operations.

Signed-off-by: Michael Orlov <morlovmr@gmail.com>
Co-authored-by: ap-apexai <anup.patel@apex.ai>

* Remove client_node_ from adding to the exec_

- Rationale: To avoid a race condition and undefined behavior when transitioning the client node from one executor to another, which leads to further flakiness in the tests.

A race condition and UB were possible because:
In setup, client_node_ is first attached to the background executor at test_record_services.cpp:142, then removed at test_record_services.cpp:185, and immediately afterwards added to a temporary executor at test_record_services.cpp:197.
In rclcpp, remove_node() only queues removal; the old executor detaches the node's callback groups later, when it processes its queue. The new executor, meanwhile, only auto-adds callback groups if they are not still marked as associated with some executor.
That means this can happen:
1. exec_->remove_node(client_node_) clears the node association early at executor_entities_collector.cpp:114, so the temp executor is allowed to add_node(client_node_).
2. But the client's default callback group is still associated with the old executor until that old spin thread processes the pending removal.
3. The temp executor never picks up the client waitable, so the service response is never taken on the client side.
4. spin_until_future_complete() times out.

It is intermittent because it depends on the timing between the old executor thread processing the queued removal and the new executor processing the queued add. The first request in tests like record_stop is especially exposed because it happens right after setup.

Signed-off-by: Michael Orlov <morlovmr@gmail.com>

---------

Signed-off-by: Michael Orlov <morlovmr@gmail.com>
Co-authored-by: ap-apexai <anup.patel@apex.ai>
(cherry picked from commit 22a408d)

# Conflicts:
#	rosbag2_transport/test/rosbag2_transport/test_record_services.cpp
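The ordering problem described in the commit message can be illustrated with a small, single-threaded toy model. This is only a sketch under stated assumptions: ToyExecutor, ToyNode, ToyCallbackGroup, and temp_executor_sees_client are hypothetical names invented for illustration, they are not rclcpp classes, and the model mirrors only the behavior as the commit message describes it (immediate clearing of the node association, queued detaching of the callback group). The real race involves a separate spin thread; here the "drain" step is called explicitly so both orderings can be reproduced deterministically.

```cpp
#include <algorithm>
#include <functional>
#include <memory>
#include <queue>
#include <stdexcept>
#include <vector>

// Toy stand-ins for illustration only; none of these are rclcpp types.
struct ToyCallbackGroup {
  bool associated_with_executor = false;
};

struct ToyNode {
  bool associated_with_executor = false;
  std::shared_ptr<ToyCallbackGroup> default_group =
    std::make_shared<ToyCallbackGroup>();
};

class ToyExecutor {
public:
  // Claims the node immediately; auto-adds its callback group only if no
  // executor is still marked as holding that group.
  void add_node(const std::shared_ptr<ToyNode> & node) {
    if (node->associated_with_executor) {
      throw std::runtime_error("node already associated with an executor");
    }
    node->associated_with_executor = true;
    if (!node->default_group->associated_with_executor) {
      node->default_group->associated_with_executor = true;
      groups_.push_back(node->default_group);
    }
  }

  // Clears the node association right away, but only queues the detaching
  // of the callback group (assumption taken from the commit message).
  void remove_node(const std::shared_ptr<ToyNode> & node) {
    node->associated_with_executor = false;
    auto group = node->default_group;
    pending_.push([this, group]() {
      group->associated_with_executor = false;
      groups_.erase(
        std::remove(groups_.begin(), groups_.end(), group), groups_.end());
    });
  }

  // Stands in for the old spin thread processing its queued removals.
  void process_pending() {
    while (!pending_.empty()) {
      pending_.front()();
      pending_.pop();
    }
  }

  // True if this executor would service entities in the given group.
  bool services(const std::shared_ptr<ToyCallbackGroup> & group) const {
    return std::find(groups_.begin(), groups_.end(), group) != groups_.end();
  }

private:
  std::vector<std::shared_ptr<ToyCallbackGroup>> groups_;
  std::queue<std::function<void()>> pending_;
};

// Moves a client node from a background executor to a temporary one and
// reports whether the temporary executor ends up servicing the client's group.
bool temp_executor_sees_client(bool old_executor_drained_first) {
  ToyExecutor background;
  ToyExecutor temporary;
  auto client = std::make_shared<ToyNode>();

  background.add_node(client);
  background.remove_node(client);
  if (old_executor_drained_first) {
    background.process_pending();
  }
  temporary.add_node(client);  // allowed in both cases: node flag is cleared
  return temporary.services(client->default_group);
}
```

In this model, temp_executor_sees_client(true) corresponds to the lucky ordering in which the old executor has already processed the removal, while temp_executor_sees_client(false) corresponds to the failing ordering in which the temporary executor accepts the node but never picks up its callback group. Never adding the client node to the background executor, as the second commit does, avoids the ambiguous ordering altogether.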
Cherry-pick of 22a408d has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally
Signed-off-by: Michael Orlov <morlovmr@gmail.com>
Pulls: #2379
https://github.com/Mergifyio backport jazzy
✅ Backports have been created
mergify bot added a commit that referenced this pull request on Apr 14, 2026
…vices" tests (backport #2368) (#2379)

* Address flakiness in the "rosbag2_transport::test_record_services" tests (#2368)

(cherry picked from commit 22a408d)

# Conflicts:
#	rosbag2_transport/test/rosbag2_transport/test_record_services.cpp

* Address merge conflicts

Signed-off-by: Michael Orlov <morlovmr@gmail.com>

---------

Signed-off-by: Michael Orlov <morlovmr@gmail.com>
Co-authored-by: Michael Orlov <morlovmr@gmail.com>
(cherry picked from commit 64050c3)
Description
The rosbag2_transport::test_record_services test is known to be flaky in CI runs. Please refer to the RCA in the relevant issue #2367 (comment).
Short description of changes:
- Use the exec->spin_until_future_complete() API in the successful_service_request() implementation with a temporarily created executor. Note: we can't reuse exec_ because spin_until_future_complete() requires that the node is not already spinning.
- Remove client_node_ from being added to the "main" exec_ executor.
rosbag2_transport::test_record_services tests is tend to be flaky #2367
Is this user-facing behavior change?
No.
Did you use Generative AI?
Yes. Codex 5.4
Additional Information
Can be backported.
This is an automatic backport of pull request #2368 done by [Mergify](https://mergify.com).