Skip to content

[kilted] Address flakiness in the "rosbag2_transport::test_record_services" tests (backport #2368)#2379

Merged
MichaelOrlov merged 2 commits intokiltedfrom
mergify/bp/kilted/pr-2368
Apr 14, 2026
Merged

[kilted] Address flakiness in the "rosbag2_transport::test_record_services" tests (backport #2368)#2379
MichaelOrlov merged 2 commits intokiltedfrom
mergify/bp/kilted/pr-2368

Conversation

@mergify
Copy link
Copy Markdown

@mergify mergify bot commented Mar 16, 2026

Description

The rosbag2_transport::test_record_services is known to be flaky on CI run.
Please refer to the RCA in the relevant issue #2367 (comment)

Short description of changes:

  1. Increase the default service_call_timeout_ to 10 seconds.
  2. Use recommended exec->spin_until_future_complete() API in the successful_service_request() implementation using temporary created executor. Note: we can't reuse exec_ because spin_until_future_complete() requires that the node shall not already be spinning.
  3. Removed client_node_ from adding to the "main" exec_ executor.

Is this user-facing behavior change?

No.

Did you use Generative AI?

Yes. Codex 5.4

Additional Information

Can be backported.


This is an automatic backport of pull request #2368 done by [Mergify](https://mergify.com).

…sts (#2368)

* Use exec_->spin_until_future_complete() in successful_service_request()

- Add pub_node_ spinning while waiting for matched.
- Wait longer due to potential service latency after pause/resume
 operations.

Signed-off-by: Michael Orlov <morlovmr@gmail.com>
Co-authored-by: ap-apexai <anup.patel@apex.ai>

* Remove client_node_ from adding to the exec_

- Rationale:
 To avoid a race condition and undefined behavior when transitioning
 client node from one executor to another which leads to the another
 flakiness in tests.

It was possible a race condition and UB because:
In setup, client_node_ is first attached to the background executor at
test_record_services.cpp:142, then removed at
test_record_services.cpp:185, and later immediately added to a temporary
executor at test_record_services.cpp:197. In rclcpp, remove_node() only
queues removal; the old executor detaches the node’s callback groups
later when it processes its queue. The new executor, meanwhile, only
auto-adds callback groups if they are not still marked associated with
some executor.
That means this can happen:
1. exec_->remove_node(client_node_) clears the node association early
 at executor_entities_collector.cpp:114, so the temp executor is
 allowed to add_node(client_node_).
2. But the client’s default callback group is still associated with the
 old executor until that old spin thread processes the pending removal.
3. The temp executor never picks up the client waitable, so the service
 response is never taken on the client side.
4. spin_until_future_complete() times out.

It is intermittent because it depends on timing between the old executor
 thread processing the queued removal and the new executor processing
 the queued add. The first request in tests like record_stop is
 especially exposed because it happens right after setup.

Signed-off-by: Michael Orlov <morlovmr@gmail.com>

---------

Signed-off-by: Michael Orlov <morlovmr@gmail.com>
Co-authored-by: ap-apexai <anup.patel@apex.ai>
(cherry picked from commit 22a408d)

# Conflicts:
#	rosbag2_transport/test/rosbag2_transport/test_record_services.cpp
@mergify mergify bot added the conflicts label Mar 16, 2026
@mergify
Copy link
Copy Markdown
Author

mergify bot commented Mar 16, 2026

Cherry-pick of 22a408d has failed:

On branch mergify/bp/kilted/pr-2368
Your branch is up to date with 'origin/kilted'.

You are currently cherry-picking commit 22a408d.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   rosbag2_transport/test/rosbag2_transport/test_record_services.cpp

no changes added to commit (use "git add" and/or "git commit -a")

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@MichaelOrlov MichaelOrlov changed the title Address flakiness in the "rosbag2_transport::test_record_services" tests (backport #2368) [kilted] Address flakiness in the "rosbag2_transport::test_record_services" tests (backport #2368) Mar 16, 2026
Signed-off-by: Michael Orlov <morlovmr@gmail.com>
Copy link
Copy Markdown
Contributor

@MichaelOrlov MichaelOrlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with green CI

@MichaelOrlov
Copy link
Copy Markdown
Contributor

Pulls: #2379
Gist: https://gist.githubusercontent.com/MichaelOrlov/7834d8c9d104b05285a2ef620b494365/raw/3a9aed7bb8df9502f19772466276ece4a8b6353d/ros2.repos
BUILD args: --packages-above-and-dependencies rosbag2_transport
TEST args: --packages-above rosbag2_transport
ROS Distro: kilted
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/18913

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@MichaelOrlov
Copy link
Copy Markdown
Contributor

  • Linux Build Status
  • Linux-rhel Build Status

@MichaelOrlov MichaelOrlov merged commit 64050c3 into kilted Apr 14, 2026
13 checks passed
@MichaelOrlov MichaelOrlov deleted the mergify/bp/kilted/pr-2368 branch April 14, 2026 06:22
@MichaelOrlov
Copy link
Copy Markdown
Contributor

https://github.com/Mergifyio backport jazzy

@mergify
Copy link
Copy Markdown
Author

mergify bot commented Apr 14, 2026

backport jazzy

✅ Backports have been created

Details

mergify bot added a commit that referenced this pull request Apr 14, 2026
…vices" tests (backport #2368) (#2379)

* Address flakiness in the "rosbag2_transport::test_record_services" tests (#2368)

* Use exec_->spin_until_future_complete() in successful_service_request()

- Add pub_node_ spinning while waiting for matched.
- Wait longer due to potential service latency after pause/resume
 operations.

Signed-off-by: Michael Orlov <morlovmr@gmail.com>
Co-authored-by: ap-apexai <anup.patel@apex.ai>

* Remove client_node_ from adding to the exec_

- Rationale:
 To avoid a race condition and undefined behavior when transitioning
 client node from one executor to another which leads to the another
 flakiness in tests.

It was possible a race condition and UB because:
In setup, client_node_ is first attached to the background executor at
test_record_services.cpp:142, then removed at
test_record_services.cpp:185, and later immediately added to a temporary
executor at test_record_services.cpp:197. In rclcpp, remove_node() only
queues removal; the old executor detaches the node’s callback groups
later when it processes its queue. The new executor, meanwhile, only
auto-adds callback groups if they are not still marked associated with
some executor.
That means this can happen:
1. exec_->remove_node(client_node_) clears the node association early
 at executor_entities_collector.cpp:114, so the temp executor is
 allowed to add_node(client_node_).
2. But the client’s default callback group is still associated with the
 old executor until that old spin thread processes the pending removal.
3. The temp executor never picks up the client waitable, so the service
 response is never taken on the client side.
4. spin_until_future_complete() times out.

It is intermittent because it depends on timing between the old executor
 thread processing the queued removal and the new executor processing
 the queued add. The first request in tests like record_stop is
 especially exposed because it happens right after setup.

Signed-off-by: Michael Orlov <morlovmr@gmail.com>

---------

Signed-off-by: Michael Orlov <morlovmr@gmail.com>
Co-authored-by: ap-apexai <anup.patel@apex.ai>
(cherry picked from commit 22a408d)

# Conflicts:
#	rosbag2_transport/test/rosbag2_transport/test_record_services.cpp

* Address merge conflicts

Signed-off-by: Michael Orlov <morlovmr@gmail.com>

---------

Signed-off-by: Michael Orlov <morlovmr@gmail.com>
Co-authored-by: Michael Orlov <morlovmr@gmail.com>
(cherry picked from commit 64050c3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant