Memory "leak" #2366

@acristoffers

Description


We observed that after hours of operation the node was killed by the OOM killer. I then tried to narrow down the problem and managed to reproduce it with an MWE, which shows it is not a fault in our code. I also tried a plain node to check whether the problem was in the RMW, but a bare node does not reproduce it; the problem only appears when the rosbag2_transport::Recorder class is used.

Expected Behavior

The node should free all memory it no longer uses.

Actual Behavior

The node's memory usage keeps climbing. Valgrind was not able to find the problem, but the issue can be replicated with a simple node.

To Reproduce

It may take a bit of retrying, but the following reproduces it.

vb_mem_sub.cpp:

#include <rosbag2_cpp/writer.hpp>
#include <rosbag2_storage/storage_options.hpp>
#include <rosbag2_transport/record_options.hpp>
#include <rosbag2_transport/recorder.hpp>

using namespace std::chrono_literals;

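// Write MCAP bags to /tmp/bag_test, splitting the bag every 10 seconds.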
static const rosbag2_storage::StorageOptions STORAGE_OPTIONS{
    .uri                    = "/tmp/bag_test",
    .storage_id             = "mcap",
    .max_bagfile_duration   = 10,
};

static const rosbag2_transport::RecordOptions RECORD_OPTIONS{
    .all_topics                   = true,
    .all_services                 = true,
    .rmw_serialization_format     = "cdr",
    .topic_polling_interval       = 100ms,
};

class MemSubNode : public rosbag2_transport::Recorder {
   public:
    explicit MemSubNode(const rclcpp::NodeOptions& options = rclcpp::NodeOptions())
        : rosbag2_transport::Recorder(std::make_shared<rosbag2_cpp::Writer>(),
                                      STORAGE_OPTIONS,
                                      RECORD_OPTIONS,
                                      "vb_mem_sub",
                                      options) {
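        // Start recording right away: sets up the writer and begins topic discovery.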
        record();
    }
};

int main(int argc, char* argv[]) {
    rclcpp::init(argc, argv);
    rclcpp::spin(std::make_shared<MemSubNode>());
    rclcpp::shutdown();
}

pub_big.py:

import rclpy
import os
import sys
from rclpy.node import Node
from std_msgs.msg import ByteMultiArray

class BigPub(Node):
    def __init__(self):
        super().__init__("big_pub")
        self.pub = self.create_publisher(ByteMultiArray, "/big_topic", 10)
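        # Create 100 short-lived topics and publish one small message on each,
        # so the recorder keeps discovering and subscribing to new topics.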
        for i in range(100):
            pub = self.create_publisher(ByteMultiArray, f"/many/t{i}", 1)
            msg = ByteMultiArray()
            msg.data = bytearray(os.urandom(16))
            pub.publish(msg)

        msg = ByteMultiArray()

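        # Publish 512 large messages; the payload size in MiB is taken from argv[1].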
        for _ in range(512):
            msg.data = bytearray(os.urandom(int(sys.argv[1]) * 1024**2))
            self.pub.publish(msg)
        self.get_logger().info("Published big message")
        exit(0)

rclpy.init()
rclpy.spin(BigPub())

pub_small.py:

import rclpy
import os
import sys
from rclpy.node import Node
from std_msgs.msg import ByteMultiArray

class SmallPub(Node):
    def __init__(self):
        super().__init__("small_pub")
        self.pub = self.create_publisher(ByteMultiArray, "/small_topic", 1)

        n = int(sys.argv[1])
        msg = ByteMultiArray()
        for _ in range(n):
            msg.data = bytearray(os.urandom(64))
            self.pub.publish(msg)
        self.get_logger().info("Published small message")
        exit(0)

rclpy.init()
rclpy.spin(SmallPub())

Compile the node (I called it vb_mem_sub), run it in one terminal, and watch its memory usage in another with the loop below (or just use [h]top):

while true; do
      ps --no-headers -o comm,rss,%mem --pid $(pidof vb_mem_sub)
      sleep 1
done

In another terminal, run python3 pub_small.py 1000000, and in yet another run python3 pub_big.py 10; python3 pub_big.py 1. You may need to stop and relaunch the Python nodes a few times, but at some point memory usage will climb unreasonably. The increase stops once both Python nodes are killed, but the claimed memory is never released.
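Since the climb seems to be triggered by repeatedly starting and stopping the publishers, the relaunch cycle can be automated with something like the script below. This is only a sketch I have not used as-is; the cycle count and sleep times are arbitrary, and it assumes the two publisher scripts are in the current directory.

cycle_pubs.py (sketch):

import subprocess
import time

# Repeatedly start and stop the publisher scripts; the creation and deletion
# of topics appears to be what triggers the memory growth in the recorder.
for i in range(20):
    small = subprocess.Popen(["python3", "pub_small.py", "1000000"])
    big = subprocess.Popen(["python3", "pub_big.py", "10"])
    time.sleep(30)   # let the recorder discover the topics and receive data
    for p in (small, big):
        p.terminate()
        p.wait()
    print(f"cycle {i + 1} done")
    time.sleep(5)    # let discovery settle before relaunching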

System (please complete the following information)

  • OS: Ubuntu 24.04.4 LTS
  • ROS 2 Distro: Jazzy
  • Install Method: APT
  • Version: 0.26.9-1noble.20260126.180134

Additional context

The problem was identified on a robot that had been running for about 10 hours. The scripts replicate the fastest way I found to trigger the problem. I can reproduce it 100% of the time by launching the pub_small.py script, then launching our nodes and stopping them, which led me to suspect that the creation and deletion of topics triggers the issue, and that seems to be the case. On the real robot the climb is much slower; the scripts only make it faster and more visible, but after ~10 h it caused many of our robots to stop because of the OOM kill. With our entire system running, triggering it this way makes memory usage skyrocket very quickly. It looks like messages are not being released, yet Valgrind found no leak, so the memory still seems to be reachable, as if a container that was supposed to be cleared never was.
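For the slow climb on the real robot it helps to log RSS over hours instead of watching ps by hand. A minimal sketch, assuming psutil is installed and the recorder binary is still called vb_mem_sub:

rss_logger.py (sketch):

import time
import psutil

def find_recorder(name="vb_mem_sub"):
    # Return the first process whose name matches the recorder binary.
    for p in psutil.process_iter(["name"]):
        if p.info["name"] == name:
            return p
    raise RuntimeError(f"no process named {name}")

proc = find_recorder()
with open("rss_log.csv", "w") as f:
    f.write("unix_time,rss_bytes\n")
    while proc.is_running():
        f.write(f"{time.time():.0f},{proc.memory_info().rss}\n")
        f.flush()
        time.sleep(10)   # one sample every 10 s is plenty for a multi-hour run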
