
perftest: Fix double frees during RDMA CM connection retry #368

Merged
sshaulnv merged 1 commit into linux-rdma:master from SherrinZhou:fix/cm_retry_resource_leak
Mar 23, 2026

Conversation


@SherrinZhou commented on Dec 9, 2025

When running perftest with RDMA CM enabled in an environment where the server is under high load and likely to reject a CM connection issued by the client, a rejected connection request (RDMA_CM_EVENT_REJECTED) sends the client into a retry loop in rdma_cm_client_connection.

However, the previous retry logic contained multiple flaws causing segmentation faults, double frees, and heap corruption.

The error output looked like this:
RDMA CM event error:
Event: RDMA_CM_EVENT_REJECTED; error: 8.
ERRNO:Operation not supported.
Failed to handle RDMA CM event.
ERRNO: Operation not supported.
Failed to connect RDMA CM events.
ERRNO:Operation not supported.
Failed to resolve RDMA CM address.
ERRNO: Bad file descriptor.
Failed to destroy RDMA CM ID number 0.
ERRNO: Bad file descriptor.
Failed to destroy RDMA CM contexts.
ERRNO: Bad file descriptor.
free(): double free detected in tcache 2

The backtrace from the segfault-triggered core dump looked like this:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fcb90783db5 in __GI_abort () at abort.c:79
#2 0x00007fcb907dc4e7 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fcb908ebaae "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007fcb907e35ec in malloc_printerr (str=str@entry=0x7fcb908ed6d8 "free(): double free detected in tcache 2") at malloc.c:5374
#4 0x00007fcb907e535d in _int_free (av=0x7fcb90b21bc0 <main_arena>, p=0xf459c0, have_lock=) at malloc.c:4213
#5 0x000000000040dd86 in create_rdma_cm_connection (ctx=0x7ffcf6f18410, user_param=0x7ffcf6f17fe0, comm=0x7ffcf6f17fc0, my_dest=0xf41590, rem_dest=0xf41a00) at src/perftest_communication.c:2949
#6 0x0000000000404b83 in main (argc=26, argv=0x7ffcf6f186f8) at src/send_bw.c:273

The following issues were identified and fixed:

  1. Double Free and Heap Corruption:
    The cleanup logic inside the retry loop destroyed resources (Event Channel,
    IDs) without clearing their pointers. Subsequent error handling paths tried
    to free them again, triggering "double free" or "Bad file descriptor".
    Additionally, connection_index was incremented unconditionally on every
    attempt, eventually overflowing the nodes array and corrupting heap metadata.

  2. Invalid Argument / Context Mismatch:
    The previous logic destroyed the Protection Domain (PD) and Event Channel
    on every retry but failed to properly re-initialize them or update the
    Context pointer. This caused ibv_create_qp to fail with ENOENT (No such
    file or directory) because it attempted to use a stale PD with a new Context.

  3. Client/Server State Desynchronization:
    Resetting the connection flow completely on the client side caused state
    desynchronization with the server (which tracks connection indices linearly),
    leading to further rejections.

This patch implements a robust "incremental retry" strategy:

  • Only failed QP/CM nodes are cleaned up and retried; established connections
    are preserved.
  • Global resources (PD, Event Channel, Context) are preserved across retries
    to ensure resource validity.
  • ctx_init is guarded to run only when the PD is not yet initialized.
  • Pointers are explicitly set to NULL after destruction to prevent double frees (see the sketch after this list).
  • Memory leaks in hints->ai_src_addr allocation are fixed.
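
Below is a minimal sketch of the per-node cleanup and pointer-nulling described above. This is not the code from the patch itself: the helper name cleanup_failed_node is hypothetical, and the field names (ctx->qp[i], ctx->cma_master.nodes[i].cma_id, .connected) are taken from the discussion further down this thread.

/* Hypothetical helper: tear down only the resources of a node whose CM
 * connect was rejected, and NULL every pointer so that a later error
 * path cannot free the same resource twice. */
static void cleanup_failed_node(struct pingpong_context *ctx, int i)
{
	if (ctx->qp[i]) {
		ibv_destroy_qp(ctx->qp[i]);
		ctx->qp[i] = NULL;
	}
	if (ctx->cma_master.nodes[i].cma_id) {
		rdma_destroy_id(ctx->cma_master.nodes[i].cma_id);
		ctx->cma_master.nodes[i].cma_id = NULL;
	}
	/* The PD, event channel and device context are shared by all nodes
	 * and are preserved across retries, so they are not touched here. */
	ctx->cma_master.nodes[i].connected = 0;
}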

@SherrinZhou force-pushed the fix/cm_retry_resource_leak branch 3 times, most recently from 3f0bf12 to 3cf9835 on December 12, 2025 at 06:42
@sshaulnv
Contributor

Hi @SherrinZhou, thanks for the contribution.

I did some tests and encountered a segfault that does not reproduce on master.
Can you please review and fix?

ib_write_bw -d mlx5_0 1.1.1.1 -R

---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using Enhanced Reorder      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 CQE Poll Batch  : Dynamic
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x9033 PSN 0x98d674
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:01:01
 remote address: LID 0000 QPN 0x9034 PSN 0xa31992
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:01:01
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 1905.003000 != 1809.152000. CPU Frequency is not max.
 65536      5000             14117.13            13672.11                    0.218754
---------------------------------------------------------------------------------------

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000555555582758 in rdma_cm_destroy_cma (ctx=ctx@entry=0x7fffffffd9b0, user_param=user_param@entry=0x7fffffffdb80) at src/perftest_resources.c:6192
#2  0x0000555555583223 in destroy_ctx (ctx=0x7fffffffd9b0, user_param=0x7fffffffdb80) at src/perftest_resources.c:1592
#3  0x000055555555b396 in main (argc=<optimized out>, argv=<optimized out>) at src/write_bw.c:523

Comment thread on src/perftest_resources.c (removed lines 6179 to 6184):
rdma_destroy_event_channel(ctx->cma_master.channel);
if (ctx->cma_master.rai) {
rdma_freeaddrinfo(ctx->cma_master.rai);
int connected_count = 0;
for (i = 0; i < user_param->num_of_qps; i++) {
if (ctx->cma_master.nodes[i].connected) {
connected_count++;
}
}

free(ctx->cma_master.nodes);
@sshaulnv
Contributor

Why were these removed? We still need them in order to clean up resources for the non-retry flows.

@SherrinZhou
Contributor Author

Regarding the cleanup of cma_master.channel, cma_master.rai, and the nodes array: they were removed from rdma_cm_destroy_cma() because preserving them is necessary to keep the global RDMA CM context alive during connection retries. Following your suggestion, I have introduced a new function rdma_cm_destroy_master(struct pingpong_context *ctx) to manage the cleanup of these global resources. It is now called explicitly after rdma_cm_destroy_cma() inside destroy_ctx() and in the create_rdma_cm_connection() error path to ensure proper cleanup for non-retry flows.
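
As a rough illustration of this approach, a minimal sketch of such a function is shown below (sketch only: the void return type is an assumption, and the cma_master field names are taken from the hunk quoted above; the actual patch may differ):

/* Sketch: release the global CM resources that rdma_cm_destroy_cma()
 * no longer touches, so that non-retry flows still free everything. */
void rdma_cm_destroy_master(struct pingpong_context *ctx)
{
	if (ctx->cma_master.channel) {
		rdma_destroy_event_channel(ctx->cma_master.channel);
		ctx->cma_master.channel = NULL;
	}
	if (ctx->cma_master.rai) {
		rdma_freeaddrinfo(ctx->cma_master.rai);
		ctx->cma_master.rai = NULL;
	}
	free(ctx->cma_master.nodes);
	ctx->cma_master.nodes = NULL;
}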

@sshaulnv
Contributor

I would suggest creating a new rdma_cm_destroy_master(struct pingpong_context *ctx) that cleans up the event channel, rai, and nodes array. You can call it after rdma_cm_destroy_cma().

@SherrinZhou force-pushed the fix/cm_retry_resource_leak branch 5 times, most recently from 48a7102 to 426759f on March 18, 2026 at 05:48

SherrinZhou commented Mar 18, 2026

On top of the aforementioned fixes, and upon further review, I removed the flawed connection_index++ logic in rdma_cm_route_handler(). The client now precisely maps the incoming cma_id back to its designated node index via a loop lookup. This prevents the index from drifting out of bounds during sequential retries and guarantees state alignment.

In the original code, the client relied on ctx->cma_master.connection_index to track which QP was being initialized during the RDMA_CM_EVENT_ROUTE_RESOLVED event, and it unconditionally incremented this counter for every resolved route.
Consider a scenario with num_of_qps = 2:

  1. QP0 route resolved → uses connection_index = 0, increments to 1.
  2. QP1 route resolved → uses connection_index = 1, increments to 2.
  3. QP1 sends a connect request but gets RDMA_CM_EVENT_REJECTED (e.g., due to timeout or server backlog).
  4. The client enters the retry loop, creates a new cma_id for QP1, and resolves the route again.
  5. QP1 (retry) route resolved → uses the current connection_index = 2.

The code then attempts to access ctx->qp[2] and ctx->cma_master.nodes[2]. Since the array size is 2, this immediately causes out-of-bounds (OOB) memory corruption and leads to a SIGSEGV.

To fix this, the client must abandon the sequential counter. Since every cma_id is pre-allocated and statically bound to a specific node index inside rdma_cm_allocate_nodes(), the rdma_cm_route_handler should reverse-map the cma_id to find its true owner.

for (int i = 0; i < user_param->num_of_qps; i++) {
    if (ctx->cma_master.nodes[i].cma_id == cma_id) {
        connection_index = i;
        break;
    }
}

As for the server side, it's not affected. The server is strictly passive. Its connection_index only increments when a valid CONNECT_REQUEST actually arrives.

If a client's request is lost or rejected, the server's index does not advance. It simply waits for the next valid request to fill its current empty slot.

Because perftest configures all QPs symmetrically (same depth, same size, same attributes), it is fundamentally irrelevant whether the client's QP[1] connects to the server's QP[1] or to the server's QP[0].

The out-of-band data (like remote keys, addresses, and QP numbers) is exchanged via ctx_hand_shake() after all RDMA CM connections are fully established. The handshake iterates over the successfully paired connections, ensuring the matching QPs exchange the correct keys.

Thus, allowing the client to independently retry its specific failed QPs while the server sequentially accepts incoming requests guarantees both memory safety and robust connection recovery.

@SherrinZhou
Contributor Author

Hi @sshaulnv,
Thank you for the careful review. I have revised the code and made some adjustments. I would be glad if you could take a look at the revised V2 patch and let me know your opinion on it.

@sshaulnv
Contributor

@SherrinZhou, I tried the new patch, and it got stuck after printing the results, waiting for an event:

---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using Enhanced Reorder      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 CQE Poll Batch  : Dynamic
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0x02 QPN 0x0188 PSN 0x3e2e9e
 remote address: LID 0x01 QPN 0x0188 PSN 0xdbe10e
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      5000             47080.98            47079.00                    0.753264
---------------------------------------------------------------------------------------


(gdb) bt
#1  0x00007ffff7e1c0cc in rdma_get_cm_event () from /lib/x86_64-linux-gnu/librdmacm.so.1
#2  0x000055555556306c in rdma_cm_disconnect_nodes (ctx=ctx@entry=0x7fffffffdbc0, user_param=user_param@entry=0x7fffffffddc0) at src/perftest_communication.c:2932
#3  0x00005555555849a3 in destroy_ctx (ctx=ctx@entry=0x7fffffffdbc0, user_param=user_param@entry=0x7fffffffddc0) at src/perftest_resources.c:1458
#4  0x000055555555bd58 in main (argc=<optimized out>, argv=<optimized out>) at src/write_bw.c:575

When running perftest with RDMA CM enabled (-R), if the server rejects
a connection request (RDMA_CM_EVENT_REJECTED), the client enters a retry
loop. The initial retry logic suffered from severe state
desynchronization and memory corruption during resource teardown.

The following issues were identified and resolved:

1. Client State Desynchronization & Out-of-Bounds:
   The client's `rdma_cm_route_handler` previously used
`connection_index++` to track QP creation. During a retry, a failed node
would cause the index to increment anyway, eventually overflowing the
`nodes` array and writing resources into incorrect slots.
   Fix: Replaced the sequential counter with explicit `cma_id` reverse
mapping. The client now accurately identifies the node index bound to
the `cma_id`, keeping state strictly aligned even after multiple
retries.

2. Delayed Connection State Commitment:
   Nodes were marked as `connected = 1` too early (during route
resolution), shielding them from proper cleanup if they failed during
the final connect/accept phase.
   Fix: Moved the `connected = 1` assignment exclusively to the
`rdma_cm_connection_established_handler`.

Signed-off-by: Ruizhe Zhou <zhouruizhe@resnics.com>
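
As a rough sketch of the late state commitment described in item 2 above (assumptions: the handler signature mirrors the other CM event handlers quoted in this thread, and it reuses the same cma_id reverse mapping as the route handler; the merged patch may differ):

/* Sketch: mark a node connected only when RDMA_CM_EVENT_ESTABLISHED
 * arrives, so a node that fails during the connect/accept phase is
 * still eligible for cleanup and retry. */
int rdma_cm_connection_established_handler(struct pingpong_context *ctx,
					   struct perftest_parameters *user_param,
					   struct rdma_cm_id *cma_id)
{
	for (int i = 0; i < user_param->num_of_qps; i++) {
		if (ctx->cma_master.nodes[i].cma_id == cma_id) {
			ctx->cma_master.nodes[i].connected = 1;
			return 0;
		}
	}
	return -1; /* unknown cma_id; should not happen */
}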
@SherrinZhou force-pushed the fix/cm_retry_resource_leak branch from 426759f to 3d9d8d0 on March 23, 2026 at 03:01
@sshaulnv merged commit 1d989ee into linux-rdma:master on Mar 23, 2026
@sshaulnv
Contributor

@SherrinZhou, thanks for the great effort, merged!

@SherrinZhou
Contributor Author

Hi @sshaulnv. I have amended the code.
I tried to test this patch by manually rejecting the connection.

int rdma_cm_connection_request_handler(struct pingpong_context *ctx,
				       struct perftest_parameters *user_param,
				       struct rdma_cm_event *event, struct rdma_cm_id *cma_id)
{
	int rc, connection_index;
	char *error_message = "";
	struct cma_node *cm_node;
	struct rdma_conn_param conn_param;
	struct ibv_qp_attr rtr_attr = {
		.min_rnr_timer = MIN_RNR_TIMER,
	};

	static int inject_reject_cnt = 0;
	if (inject_reject_cnt < 2) {
		inject_reject_cnt++;
		printf("\n[DEBUG] Intentionally rejecting connection request #%d\n", inject_reject_cnt);
		rdma_reject(cma_id, NULL, 0);
		return 0;
	}
	//rest of the code

And there are still some problems in the clean-up flow. I have fixed them and run tests on my machines to confirm the fix.
I'm sure that the patch would work perfectly this time.
But I noticed that you have already merged the branch. Any ideas on what I should do to apply the fix?

@sshaulnv
Contributor

Hi @SherrinZhou, please open another PR with the latest fix. I will test and merge if everything passes successfully.

@SherrinZhou
Contributor Author

@sshaulnv ok, done.

