[TENT] Fix RDMA task memory corruption#1812

Open
alogfans wants to merge 2 commits into kvcache-ai:main from alogfans:fix-tent-task-memory-corruption

alogfans (Collaborator) commented Apr 3, 2026

Description

When a large request is split into multiple slices, a race condition occurs between
the worker thread's acknowledge() and the user thread's lazyFreeBatch():

  1. Worker thread processes the last slice and sets task->status_word = COMPLETED
  2. User thread's lazyFreeBatch() sees COMPLETED and immediately frees rdma_batch
  3. Worker thread's acknowledge() is still executing and accesses slice->task
  4. Since slice->task points to memory inside rdma_batch->task_list, this becomes
    a use-after-free, causing memory corruption

The root cause is that an RdmaTask's lifetime is tied to its RdmaSubBatch, while the
slices that point at the task may outlive the batch.
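
The lifetime coupling described above can be modeled without triggering undefined behavior. The sketch below is an illustration only, not code from this PR: `ModelTask`, `ModelBatch`, and `ModelSlice` are hypothetical names, and `std::weak_ptr` stands in for the raw `slice->task` pointer so that the dangling state can be observed safely.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for RdmaTask; only the status field is modeled.
struct ModelTask { int status_word = 0; };

// Hypothetical model of the pre-fix layout: the batch owns every task.
struct ModelBatch {
    std::vector<std::shared_ptr<ModelTask>> task_list;
};

// A slice observes its task without owning it, like the raw slice->task pointer.
struct ModelSlice {
    std::weak_ptr<ModelTask> task;
};

// Frees the batch while a slice still exists and reports whether the slice's
// view of its task is dead (the state in which a raw pointer would dangle).
inline bool sliceDanglesAfterBatchFree() {
    auto batch = std::make_unique<ModelBatch>();
    batch->task_list.push_back(std::make_shared<ModelTask>());
    ModelSlice slice{batch->task_list.front()};
    batch.reset();                // models lazyFreeBatch() freeing rdma_batch
    return slice.task.expired();  // true: a later access would hit freed memory
}
```

In this model `sliceDanglesAfterBatchFree()` returns `true`, which corresponds to step 3 above: the slice is still in use, but the task it refers to has already been destroyed together with the batch.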

Solution:

  • Make RdmaTask independently allocated from Slab with reference counting
  • Each slice holds a reference to its task
  • RdmaTask is only deallocated when all its slices are released
  • slice->task remains valid even after rdma_batch is freed

Changes:

  • Add std::atomic<int> ref_count and ref()/deref() methods to RdmaTask
  • Change RdmaSubBatch::task_list from vector<RdmaTask> to vector<RdmaTask*>
  • Allocate tasks from RdmaTaskStorage::Get().allocate()
  • Each slice calls task->ref() when created
  • freeSubBatch() calls task->deref() when releasing

This ensures slice->task has an independent lifecycle and is always valid during
acknowledge() execution, eliminating the race condition.
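
A minimal sketch of the per-slice reference counting described above, under stated assumptions: `SketchTask` and `SketchSlice` are hypothetical stand-ins for `RdmaTask` and its slices, plain `new`/`delete` replaces the slab allocation the PR uses, and `freed_flag` exists only so the example can observe deallocation.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical stand-in for RdmaTask; only the counting logic is modeled.
struct SketchTask {
    std::atomic<int> ref_count{0};
    bool* freed_flag = nullptr;  // illustration only: records deallocation

    void ref() { ref_count.fetch_add(1, std::memory_order_relaxed); }

    // Drops one reference; the caller that drops the last one frees the task.
    void deref() {
        if (ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            if (freed_flag) *freed_flag = true;
            delete this;  // assumption: the PR uses a slab, not new/delete
        }
    }
};

// Hypothetical stand-in for a slice that holds a reference to its task.
struct SketchSlice {
    SketchTask* task = nullptr;
};

// Creates a task with num_slices slices (num_slices >= 1), releases the slices
// one by one, and reports whether the task was freed by the last release only.
inline bool taskFreedAfterLastSlice(int num_slices) {
    bool freed = false;
    auto* task = new SketchTask;
    task->freed_flag = &freed;
    std::vector<SketchSlice> slices(num_slices);
    for (auto& s : slices) { s.task = task; task->ref(); }
    for (int i = 0; i < num_slices; ++i) {
        if (freed) return false;  // freed too early: a slice would dangle
        slices[i].task->deref();
    }
    return freed;
}
```

The sketch releases one reference per slice, so the count returns to zero exactly when the last slice lets go. It does not cover a task with zero slices, which would never be freed in this model.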

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request transitions RdmaTask to a pointer-based lifecycle managed by a reference counting mechanism and a Slab allocator. Key changes include updating RdmaSubBatch to store task pointers and implementing ref/deref logic. Review feedback identifies a compilation error regarding the undefined slice_dev_ids variable and critical flaws in the reference counting implementation, where a mismatch between slice-level increments and batch-level decrements results in memory leaks.

slice->next = nullptr;
slice->enqueue_ts = enqueue_ts;
task.num_slices++;
slice->source_dev_id = slice_dev_ids[slice_idx];

critical

The variable slice_dev_ids is used here but is not defined within the scope of the submitTransferTasks function. This will cause a compilation error.

Comment on lines +232 to +234
for (auto task : rdma_batch->task_list) {
task->deref();
}

high

There is a mismatch in the reference counting logic that will lead to memory leaks. In submitTransferTasks, task->ref() is called for every slice created (line 327). If a task has $N$ slices, its ref_count becomes $N$. However, in freeSubBatch, task->deref() is only called once per task. Since RdmaSlice does not currently call deref() when it is deallocated, $N-1$ references will remain, preventing the RdmaTask from ever being freed. Furthermore, if a task has 0 slices, ref_count starts at 0 and deref() will decrement it to -1, also failing to trigger deallocation.
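
The arithmetic in this comment can be checked with a counter-only model. This is a sketch, not the PR's code: `CountOnlyTask` and `leftoverAfterBatchDeref` are hypothetical names, and no memory is allocated or freed.

```cpp
#include <atomic>
#include <cassert>

// Counter-only model of the reviewed logic; nothing is allocated or freed.
struct CountOnlyTask {
    std::atomic<int> ref_count{0};
    void ref() { ref_count.fetch_add(1); }
    // Returns true when this call dropped the count to exactly zero.
    bool deref() { return ref_count.fetch_sub(1) == 1; }
};

// Applies one ref() per slice, then a single batch-level deref(), as the
// comment describes. Returns the leftover count; would_free reports whether
// the single deref() reached zero.
inline int leftoverAfterBatchDeref(int num_slices, bool& would_free) {
    CountOnlyTask task;
    for (int i = 0; i < num_slices; ++i) task.ref();
    would_free = task.deref();
    return task.ref_count.load();
}
```

With 4 slices the model leaves a count of 3 and never reaches zero; with 0 slices the count ends at -1 and also never reaches zero. Only the single-slice case happens to balance.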

task->success_slices = 0;
task->resolved_slices = 0;
task->first_error = PENDING;
task->ref_count = 0;

medium

Setting task->ref_count = 0 here is redundant as the RdmaTask struct already initializes it to 0. However, to fix the lifecycle logic, the batch itself should probably hold a reference to the task while it is in the task_list. Consider initializing ref_count to 1 to represent the batch's ownership, and then having each slice increment it further.

        task->ref_count = 1;
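
One way to read this suggestion, sketched under assumptions: the count starts at 1 for the batch, each slice adds one, and both the batch and every slice call `deref()` when they let go. `OwnedTask` and `releaseThatFrees` are hypothetical names, and having each slice call `deref()` on release is an assumption drawn from the previous comment rather than something the PR currently does.

```cpp
#include <atomic>
#include <cassert>

// Counter-only model: the count starts at 1 to represent the batch's reference.
struct OwnedTask {
    std::atomic<int> ref_count{1};
    void ref() { ref_count.fetch_add(1, std::memory_order_relaxed); }
    // Returns true when this call dropped the count to exactly zero.
    bool deref() { return ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1; }
};

// Adds one reference per slice, then releases the batch and every slice.
// Returns the 0-based index of the release call that reached zero, or -1.
// batch_first selects whether the batch lets go before or after the slices.
inline int releaseThatFrees(int num_slices, bool batch_first) {
    OwnedTask task;
    for (int i = 0; i < num_slices; ++i) task.ref();
    int call = 0;
    if (batch_first) {
        if (task.deref()) return call;
        ++call;
    }
    for (int i = 0; i < num_slices; ++i) {
        if (task.deref()) return call;
        ++call;
    }
    if (!batch_first) {
        if (task.deref()) return call;
    }
    return -1;
}
```

In this model the count reaches zero on the final release in either order, so a task survives an early batch free until its last slice is released, and a task with zero slices is freed by the batch's own `deref()`.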

@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
