Skip to content

Reduce flakiness in rosbag2 recorder end-to-end tests#2370

Open
MichaelOrlov wants to merge 1 commit intorollingfrom
morlov/address-flakiness-in-recorder-end-to-end-tests
Open

Reduce flakiness in rosbag2 recorder end-to-end tests#2370
MichaelOrlov wants to merge 1 commit intorollingfrom
morlov/address-flakiness-in-recorder-end-to-end-tests

Conversation

@MichaelOrlov
Copy link
Copy Markdown
Contributor

Description

Reduced flakiness in rosbag2 recorder tests by strengthening PublicationManager::wait_for_matched().

Please see RCA in the relevant issue #2369 (comment)

The helper PublicationManager::wait_for_matched() now waits on ROS graph events rather than relying solely on fixed sleep polling. It still preserves the existing API and boolean return behavior, but uses a hybrid loop:
• check the publisher’s current subscription count
• wait for a graph change
• re-check periodically with a short bounded wait slice
This improves readiness detection for recorder subscriptions that appear asynchronously during topic discovery, while still handling cases where graph notifications are delayed or not sufficient on their own.
No test call sites were changed. Existing recorder tests continue to use pub_manager.wait_for_matched(...), but they now benefit from the stronger synchronization automatically.

Is this user-facing behavior change?

No.

Did you use Generative AI?

Yes. Codex gpt-5.4

Additional Information

Can be backported.

Reduced flakiness in rosbag2 recorder tests by strengthening
PublicationManager::wait_for_matched().

The helper  `PublicationManager::wait_for_matched()` now waits on ROS
graph events instead of relying only on fixed sleep-polling.
It still preserves the existing API and boolean return behavior, but
uses a hybrid loop:
• check the publisher’s current subscription count
• wait for a graph change
• re-check periodically with a short bounded wait slice
This improves readiness detection for recorder subscriptions that appear
asynchronously during topic discovery, while still handling cases where
graph notifications are delayed or not sufficient on their own.
No test call sites were changed. Existing recorder tests continue to use
pub_manager.wait_for_matched(...), but they now benefit from the
stronger synchronization automatically.

Signed-off-by: Michael Orlov <morlovmr@gmail.com>
@MichaelOrlov MichaelOrlov marked this pull request as ready for review March 14, 2026 00:29
@MichaelOrlov
Copy link
Copy Markdown
Contributor Author

Pulls: #2370
Gist: https://gist.githubusercontent.com/MichaelOrlov/7c557018ca87702bacf9a206effb0f6f/raw/18cd6f3e5455a3948b9802cd9f636114a798356e/ros2.repos
BUILD args: --packages-above-and-dependencies rosbag2_tests
TEST args: --packages-above rosbag2_tests
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/18467

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

Copy link
Copy Markdown
Contributor

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just a minor comment which does not block this PR.

} while ((clock::now() - start) < timeout);

return false;
auto remaining = std::chrono::duration_cast<std::chrono::nanoseconds>(deadline -
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Between the while (clock::now() < deadline) check at the top of the loop and the deadline - clock::now() computation a few lines later, time passes. If the deadline is hit in that narrow window, remaining becomes negative, which when cast to std::chrono::nanoseconds (a signed type) would be a negative duration.

probably we would want to add if (remaining <= std::chrono::nanoseconds::zero()) break;?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The "test_rosbag2_record_end_to_end record_end_to_end_test_with_zstd_file_compression" is tend to be flaky

3 participants