Memory "leak" #2366

@acristoffers

Description


We observed that after hours of operation the node was killed by the OOM killer. I then tried to narrow down the problem and managed to reproduce it with an MWE, which shows it is not a fault in our code. I also tried a plain node to check whether the problem was in the RMW, but a bare node does not reproduce it; the problem only appears when the rosbag2_transport::Recorder class is used.

Expected Behavior

The node should free all memory it no longer uses.

Actual Behavior

The node's memory usage keeps climbing. Valgrind was not able to find the problem, but the issue can be replicated with a simple node.

To Reproduce

It may take a bit of retrying, but the following reproduces it.

vb_mem_sub.cpp:

#include <rosbag2_cpp/writer.hpp>
#include <rosbag2_storage/storage_options.hpp>
#include <rosbag2_transport/record_options.hpp>
#include <rosbag2_transport/recorder.hpp>

using namespace std::chrono_literals;

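// Write MCAP bags to /tmp/bag_test, splitting the bag every 10 seconds.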
static const rosbag2_storage::StorageOptions STORAGE_OPTIONS{
    .uri                    = "/tmp/bag_test",
    .storage_id             = "mcap",
    .max_bagfile_duration   = 10,
};

static const rosbag2_transport::RecordOptions RECORD_OPTIONS{
    .all_topics                   = true,
    .all_services                 = true,
    .rmw_serialization_format     = "cdr",
    .topic_polling_interval       = 100ms,
};

class MemSubNode : public rosbag2_transport::Recorder {
   public:
    explicit MemSubNode(const rclcpp::NodeOptions& options = rclcpp::NodeOptions())
        : rosbag2_transport::Recorder(std::make_shared<rosbag2_cpp::Writer>(),
                                      STORAGE_OPTIONS,
                                      RECORD_OPTIONS,
                                      "vb_mem_sub",
                                      options) {
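        // Start recording right away: sets up the writer and begins topic discovery.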
        record();
    }
};

int main(int argc, char* argv[]) {
    rclcpp::init(argc, argv);
    rclcpp::spin(std::make_shared<MemSubNode>());
    rclcpp::shutdown();
}

pub_big.py:

import rclpy
import os
import sys
from rclpy.node import Node
from std_msgs.msg import ByteMultiArray

class BigPub(Node):
    def __init__(self):
        super().__init__("big_pub")
        self.pub = self.create_publisher(ByteMultiArray, "/big_topic", 10)
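        # Create 100 short-lived topics and publish one small message on each,
        # so the recorder keeps discovering and subscribing to new topics.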
        for i in range(100):
            pub = self.create_publisher(ByteMultiArray, f"/many/t{i}", 1)
            msg = ByteMultiArray()
            msg.data = bytearray(os.urandom(16))
            pub.publish(msg)

        msg = ByteMultiArray()

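        # Publish 512 large messages; the payload size in MiB is taken from argv[1].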
        for _ in range(512):
            msg.data = bytearray(os.urandom(int(sys.argv[1]) * 1024**2))
            self.pub.publish(msg)
        self.get_logger().info("Published big message")
        exit(0)

rclpy.init()
rclpy.spin(BigPub())

pub_small.py:

import rclpy
import os
import sys
from rclpy.node import Node
from std_msgs.msg import ByteMultiArray

class SmallPub(Node):
    def __init__(self):
        super().__init__("small_pub")
        self.pub = self.create_publisher(ByteMultiArray, "/small_topic", 1)

        n = int(sys.argv[1])
        msg = ByteMultiArray()
        for _ in range(n):
            msg.data = bytearray(os.urandom(64))
            self.pub.publish(msg)
        self.get_logger().info("Published small message")
        exit(0)

rclpy.init()
rclpy.spin(SmallPub())

Compile the node (I called it vb_mem_sub), run it in one terminal, and watch its memory usage in another with the loop below (or just use [h]top):

while true; do
      ps --no-headers -o comm,rss,%mem --pid $(pidof vb_mem_sub)
      sleep 1
done

In another terminal, run python3 pub_small.py 1000000, and in yet another run python3 pub_big.py 10; python3 pub_big.py 1. You may need to stop and relaunch the Python nodes a few times, but at some point memory usage will climb unreasonably. The increase stops once both Python nodes are killed, but the claimed memory is never released.
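Since the climb seems to be triggered by repeatedly starting and stopping the publishers, the relaunch cycle can be automated with something like the script below. This is only a sketch I have not used as-is; the cycle count and sleep times are arbitrary, and it assumes the two publisher scripts are in the current directory.

cycle_pubs.py (sketch):

import subprocess
import time

# Repeatedly start and stop the publisher scripts; the creation and deletion
# of topics appears to be what triggers the memory growth in the recorder.
for i in range(20):
    small = subprocess.Popen(["python3", "pub_small.py", "1000000"])
    big = subprocess.Popen(["python3", "pub_big.py", "10"])
    time.sleep(30)   # let the recorder discover the topics and receive data
    for p in (small, big):
        p.terminate()
        p.wait()
    print(f"cycle {i + 1} done")
    time.sleep(5)    # let discovery settle before relaunching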

System (please complete the following information)

  • OS: Ubuntu 24.04.4 LTS
  • ROS 2 Distro: Jazzy
  • Install Method: APT
  • Version: 0.26.9-1noble.20260126.180134

Additional context

The problem was identified on a robot that had been running for about 10 hours. The scripts replicate the fastest way I found to trigger the problem. I can reproduce it 100% of the time by launching the pub_small.py script, then launching our nodes and stopping them, which led me to suspect that the creation and deletion of topics triggers the issue, and that seems to be the case. On the real robot the climb is much slower; the scripts only make it faster and more visible, but after ~10 h it caused many of our robots to stop because of the OOM kill. With our entire system running, triggering it this way makes memory usage skyrocket very quickly. It looks like messages are not being released, yet Valgrind found no leak, so the memory still seems to be reachable, as if a container that was supposed to be cleared never was.
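For the slow climb on the real robot it helps to log RSS over hours instead of watching ps by hand. A minimal sketch, assuming psutil is installed and the recorder binary is still called vb_mem_sub:

rss_logger.py (sketch):

import time
import psutil

def find_recorder(name="vb_mem_sub"):
    # Return the first process whose name matches the recorder binary.
    for p in psutil.process_iter(["name"]):
        if p.info["name"] == name:
            return p
    raise RuntimeError(f"no process named {name}")

proc = find_recorder()
with open("rss_log.csv", "w") as f:
    f.write("unix_time,rss_bytes\n")
    while proc.is_running():
        f.write(f"{time.time():.0f},{proc.memory_info().rss}\n")
        f.flush()
        time.sleep(10)   # one sample every 10 s is plenty for a multi-hour run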
