[WIP] Add non-blocking asynchronous bag splitting to Rosbag2 recorder/writer#2382
MichaelOrlov wants to merge 8 commits into rolling from
Signed-off-by: Michael Orlov <morlovmr@gmail.com>
Preserve cache wakeup across bag split flushing

When SequentialWriter splits a bag while cache writing is enabled, writes that arrive after CacheConsumer::begin_flushing() can still be accepted into MessageCache::producer_buffer_. The split then flushes the pre-split buffer, opens the new storage, and restarts the consumer thread.

The race was that the old consumer path could clear data_ready_ while flushing only the pre-split buffer. If a message was queued into producer_buffer_ during that window, the restarted consumer could go back to sleep even though data was already buffered for the new bagfile. This made writer_with_cache_splits_when_storage_bagfile_size_gt_max_bagfile_size flaky, with fake_storage_size_ sometimes staying at 0 for the first post-split write.

Fix MessageCache::done_flushing() to restore readiness when producer_buffer_ still contains messages after flushing completes. This preserves the wakeup for the restarted consumer and ensures post-split cached messages are written to the new storage promptly.

Also add a regression test covering the flush handoff case where a message is pushed during flushing and must still wake the consumer after done_flushing().

Signed-off-by: Michael Orlov <morlovmr@gmail.com>
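The lost-wakeup scenario described in this commit message can be illustrated with a small standalone model. This is only a sketch under assumptions: `ToyMessageCache`, `drain_consumer_buffer()` and `wait_for_data()` are hypothetical names invented for illustration, and the class is not the rosbag2 `MessageCache`. It only mirrors the idea that `done_flushing()` re-arms the readiness flag when the producer buffer is non-empty.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model of a double-buffered cache.
// It is NOT the rosbag2 MessageCache; names only echo the commit message.
class ToyMessageCache
{
public:
  // Producer side: queue a message and mark data as ready.
  void push(std::string msg)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      producer_buffer_.push_back(std::move(msg));
      data_ready_ = true;
    }
    cv_.notify_one();
  }

  // Start of a split: move everything buffered so far to the consumer side.
  void begin_flushing()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    consumer_buffer_ = std::move(producer_buffer_);
    producer_buffer_.clear();
  }

  // Models the old consumer path: it drains only the pre-split buffer
  // and clears the readiness flag while doing so.
  std::vector<std::string> drain_consumer_buffer()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    data_ready_ = false;
    std::vector<std::string> out = std::move(consumer_buffer_);
    consumer_buffer_.clear();
    return out;
  }

  // The idea of the fix: if messages arrived during flushing,
  // restore readiness so the restarted consumer does not sleep.
  void done_flushing()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!producer_buffer_.empty()) {
        data_ready_ = true;
      }
    }
    cv_.notify_one();
  }

  // Consumer side: wait until data is ready or the timeout expires.
  bool wait_for_data(std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout, [this] {return data_ready_;});
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::vector<std::string> producer_buffer_;
  std::vector<std::string> consumer_buffer_;
  bool data_ready_ = false;
};
```

In this model, a message pushed between `begin_flushing()` and `done_flushing()` would leave the flag cleared if `done_flushing()` did not check the producer buffer, which corresponds to the restarted consumer going back to sleep.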
The cached SequentialWriter split test was still asserting a stronger condition than the implementation guarantees: that the 6th and 11th writes would already be visible as the first write in a newly opened bagfile.

That is not stable in cache mode. Split decisions are based on persisted storage size, while writes continue through the cache and bag splitting happens asynchronously. Depending on scheduling, the next split may not be observable at that exact message boundary even though the writer still drains correctly and produces the expected split files.

Drop the per-message `fake_storage_size_ == 1` assertion and keep the checks that match the actual contract:
- all expected messages are written
- the storage is reopened the expected number of times
- metadata reports the expected split file set

This keeps the regression coverage while removing a flaky timing assumption from the test.

Signed-off-by: Michael Orlov <morlovmr@gmail.com>
@MichaelOrlov I continued working on top of your commits and pushed our changes to the following branch: rolling...carlos-apex:rosbag2:csv/async-split-nonblocking-upstream

Here is a short description and justification for the new commits.

- This commit changes the order in which pending split timestamps are handled. It could potentially be moved to a follow-up PR with some rework.
- This commit addresses a problem where split_bagfile_shared_future_ was not cleaned up properly. The future was expected to be cleaned up as part of The commit does a lazy clean-up at the
- This commit addresses this TODO by adding a specific mutex:
  // TODO(morlov): Protect messages_dropped_per_topic_ with mutex since we can call write(msg) concurrently
- Added new tests to exercise split concurrency.
- This commit addresses a TSAN violation caught by the tests in the commit above.
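As an illustration of the dedicated-mutex approach mentioned for messages_dropped_per_topic_, here is a minimal sketch. The class `DropCounter` and its method names are hypothetical and are not taken from the rosbag2 code; only the member name messages_dropped_per_topic_ comes from the TODO quoted above.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical helper, not the rosbag2 writer: a per-topic drop counter
// guarded by its own mutex so concurrent write(msg) calls can update it.
class DropCounter
{
public:
  // Record one dropped message for the given topic.
  void on_message_dropped(const std::string & topic)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    ++messages_dropped_per_topic_[topic];
  }

  // Return how many messages were dropped for the topic (0 if unknown).
  std::uint64_t dropped(const std::string & topic) const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = messages_dropped_per_topic_.find(topic);
    return it == messages_dropped_per_topic_.end() ? 0 : it->second;
  }

private:
  // mutable so that the const reader can lock it as well.
  mutable std::mutex mutex_;
  std::unordered_map<std::string, std::uint64_t> messages_dropped_per_topic_;
};
```

Using a mutex that guards only this map, rather than reusing a broader writer lock, keeps the critical section short. Whether the actual commit scopes its lock this way is an assumption.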
@carlos-apex Thank you for following up and for your contribution.
Description
Add non-blocking asynchronous bag splitting to rosbag2 recorder/writer.
At a high level, the PR changes bag splitting from a blocking storage switch into an asynchronous operation that can run without stalling ongoing message writes. The concurrency work ensures the writer can continue accepting messages while the split is being processed, and the recorder now uses that async path directly.
The change also includes test updates for the async path and removes one timing-sensitive assertion that was no longer reliable with asynchronous execution.
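The difference between a blocking storage switch and an asynchronous split can be sketched with a toy writer. Everything here is an assumption made for illustration: `ToyWriter`, `split_async()` and the in-memory "files" are hypothetical and do not reflect the rosbag2 SequentialWriter API or its actual locking scheme.

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical toy writer; not the rosbag2 SequentialWriter.
// Each "file" is just an in-memory vector of messages.
class ToyWriter
{
public:
  ToyWriter()
  : files_(1) {}

  // Accept a message into the current file. Only a short lock is held,
  // so this call does not wait for a split that is still in progress.
  void write(std::string msg)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    files_.back().push_back(std::move(msg));
  }

  // Start a split in the background and return immediately.
  // The caller must keep the writer alive until the future is ready.
  std::shared_future<void> split_async()
  {
    return std::async(
      std::launch::async, [this]() {
        // Stand-in for slow work (closing old storage, opening new storage),
        // performed without holding the writer lock.
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        std::lock_guard<std::mutex> lock(mutex_);
        files_.emplace_back();  // switch to the new file
      }).share();
  }

  std::size_t file_count() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return files_.size();
  }

  std::size_t total_messages() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    std::size_t total = 0;
    for (const auto & file : files_) {
      total += file.size();
    }
    return total;
  }

private:
  mutable std::mutex mutex_;
  std::vector<std::vector<std::string>> files_;
};
```

In this sketch a message written while the split is pending may land in either the old or the new file depending on scheduling, but no message is lost. That mirrors why the test checks described above count total messages and split files rather than asserting an exact message boundary.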
Is this user-facing behavior change?
Yes.
From a user perspective, bag splitting during recording becomes more robust under load.
User-visible impact:
This is most relevant for users:
Backward Compatibility
This change is largely backward compatible at the API and workflow level, but there is an observable behavioral nuance.
Compatible aspects:
Behavioral difference to call out:
So the practical compatibility summary is:
Did you use Generative AI?
Yes. I used Codex gpt-5.4 to help with some tasks and analysis of the problems.
Additional Information
Not backportable due to the API/ABI breaking changes.
This PR depends on the: