Extend memory tracking resources #3051

Open: huuanhhuyn wants to merge 5 commits into NVIDIA:main from huuanhhuyn:fix-alloc-mislabel

huuanhhuyn (Contributor) commented Jun 8, 2026

Extend memory tracking resources with a second, queue-based approach, as an alternative to the current sampling-rate notification approach.

Usage:
If the sampling interval is given, the notification approach is selected:
raft::memory_tracking_resources tracked(res, oss, 1ms);
If it is NOT given, the queue-based approach is selected:
raft::memory_tracking_resources tracked(res, oss);
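
The selection rule above (interval given means notification, no interval means queue) can be pictured as two constructor overloads. The following is only a minimal sketch under that assumption: `tracking_sketch`, `tracking_mode`, and `describe` are hypothetical names invented for illustration and are not the `raft::memory_tracking_resources` implementation.

```cpp
#include <cassert>
#include <chrono>
#include <ostream>
#include <sstream>  // only needed for the usage shown below the block

// Hypothetical stand-in used only to illustrate the overload selection;
// this is not the raft::memory_tracking_resources implementation.
enum class tracking_mode { notification, queue };

class tracking_sketch {
 public:
  // No sampling interval: queue-based recording is selected.
  explicit tracking_sketch(std::ostream& out)
    : out_(out), mode_(tracking_mode::queue), interval_(0)
  {
  }

  // Sampling interval given: the notification approach is selected.
  tracking_sketch(std::ostream& out, std::chrono::microseconds interval)
    : out_(out), mode_(tracking_mode::notification), interval_(interval)
  {
    assert(interval.count() > 0);  // a non-positive interval makes no sense here
  }

  tracking_mode mode() const { return mode_; }
  std::chrono::microseconds interval() const { return interval_; }

  // Writes the selected mode to the output stream.
  void describe() const
  {
    out_ << (mode_ == tracking_mode::queue ? "queue" : "notification") << '\n';
  }

 private:
  std::ostream& out_;
  tracking_mode mode_;
  std::chrono::microseconds interval_;
};
```

With a `std::ostringstream oss`, `tracking_sketch t(oss, std::chrono::milliseconds(1))` picks the notification mode and `tracking_sketch t(oss)` picks the queue mode. Selecting behavior by overload keeps the call sites short, at the cost of hiding a significant behavioral difference behind one type name.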

Comparison of the two approaches:

  • The queue-based recording approach preserves every (de)allocation event in the queue and associates the full NVTX range with each event. It also labels each deallocation with the NVTX range in which the corresponding allocation occurred. Use it when accurate labels are essential and some overhead is acceptable for debugging.
  • The notification approach is less invasive and has low overhead. Use it when label accuracy is not important and dropped events are acceptable.
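
To make the queue-based behavior concrete, here is a minimal sketch of an event recorder that keeps every event and labels a deallocation with the range of its matching allocation. All names (`event`, `event_queue_sketch`, `on_allocate`, `on_deallocate`, `drain`) are hypothetical, the range is passed in as a plain string, and the single mutex is a simplification; the PR's actual data structures may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// One recorded (de)allocation together with the range label attached to it.
struct event {
  bool is_alloc;
  const void* ptr;
  std::size_t bytes;
  std::string label;
};

// Hypothetical recorder: every event is kept, none are sampled away.
class event_queue_sketch {
 public:
  void on_allocate(const void* ptr, std::size_t bytes, const std::string& current_range)
  {
    assert(ptr != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    live_[ptr] = current_range;  // remember where this pointer was allocated
    events_.push_back({true, ptr, bytes, current_range});
  }

  void on_deallocate(const void* ptr, std::size_t bytes)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    std::string label = "<unknown>";
    auto it           = live_.find(ptr);
    if (it != live_.end()) {
      label = it->second;  // reuse the label of the matching allocation
      live_.erase(it);
    }
    events_.push_back({false, ptr, bytes, label});
  }

  // Hands all recorded events to the caller and empties the queue.
  std::vector<event> drain()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    std::vector<event> out;
    out.swap(events_);
    return out;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<const void*, std::string> live_;
  std::vector<event> events_;
};
```

In this sketch, a pointer allocated under the label `build/index` and freed later, while some other range is active, still produces a deallocation event labeled `build/index`.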

Unit test benchmark (H100, 64 threads, 200 allocations per thread, 256 KiB per allocation):

  • recording (queue-based) approach: 34 ms, all 25600 events recorded
  • sampling (notification) approach: 7 ms, around 100 events recorded

copy-pr-bot (Bot) commented Jun 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

huuanhhuyn force-pushed the fix-alloc-mislabel branch 3 times, most recently from ef5b83f to e852f30 (June 16, 2026 11:43)
huuanhhuyn changed the title from "[WIP] Reproduce allocation mislabelling issue" to "[WIP] Extend memory tracking resources tool" (Jun 16, 2026)
huuanhhuyn force-pushed the fix-alloc-mislabel branch from e852f30 to f4e4cee (June 30, 2026 09:16)
huuanhhuyn requested review from a team as code owners (June 30, 2026 09:16)
huuanhhuyn changed the title from "[WIP] Extend memory tracking resources tool" to "[WIP] Extend memory tracking resources" (Jul 1, 2026)
huuanhhuyn force-pushed the fix-alloc-mislabel branch 2 times, most recently from 32d3c6c to 455f959 (July 1, 2026 11:52)
huuanhhuyn force-pushed the fix-alloc-mislabel branch from 455f959 to a315972 (July 1, 2026 12:43)
huuanhhuyn changed the title from "[WIP] Extend memory tracking resources" to "Extend memory tracking resources" (Jul 1, 2026)

achirkin (Contributor) left a comment

From the description it follows that the new queue approach attaches NVTX annotations to the allocation events one-to-one. If so, it seems logical that NVTX annotations should be pulled from the allocating thread at allocation time rather than from the main thread. Then you also wouldn't need mutex locking of the NVTX records, since they would not be accessed across threads.

Please refactor this as a separate resource type rather than changing the behavior of the existing resource, because the difference is significant.
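
The idea of reading annotations from the allocating thread, with no lock on the range records, could be sketched with a `thread_local` stack of range names. This is an illustration only: `range_stack`, `scoped_range`, and `current_range_path` are hypothetical helpers that stand in for however the range names are actually tracked, and they are not NVTX or RAFT APIs.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-thread stack of open range names. Because the storage is
// thread_local, each thread only ever touches its own stack, so no mutex is
// needed to read or modify it.
inline std::vector<std::string>& range_stack()
{
  thread_local std::vector<std::string> stack;
  return stack;
}

// RAII helper: pushes a range name on construction and pops it on destruction.
struct scoped_range {
  explicit scoped_range(std::string name) { range_stack().push_back(std::move(name)); }
  ~scoped_range()
  {
    assert(!range_stack().empty());
    range_stack().pop_back();
  }
  scoped_range(const scoped_range&)            = delete;
  scoped_range& operator=(const scoped_range&) = delete;
};

// Joins the calling thread's open ranges into one label, outermost first.
// An allocation hook would call this on the allocating thread.
inline std::string current_range_path()
{
  std::string path;
  for (const std::string& name : range_stack()) {
    if (!path.empty()) { path += '/'; }
    path += name;
  }
  return path;
}
```

An allocation hook that calls `current_range_path()` at allocation time gets the label of the thread doing the allocation, regardless of which ranges are open on the main thread.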
