
Improvement thread storage from std::array to stable_vector with dynamic growth#2542

Open
anujshuk-amd wants to merge 5 commits into develop from anujshuk/anujshuk-amd/dynamic-thread-storage

Conversation

@anujshuk-amd
Contributor

@anujshuk-amd anujshuk-amd commented Jan 8, 2026

Using: ROCm/timemory#21

Motivation

This pull request updates the way thread-local storage is managed for counter_data_tracker in the rocprofiler SDK, improving its scalability and flexibility. The most important changes include switching from a fixed-size array to a dynamically sized stable_vector, ensuring storage can grow as needed, and updating related initialization and access patterns.

Technical Details

Thread-local storage improvements:

  • Changed the thread storage container in set_storage<counter_data_tracker> from a fixed-size std::array to a dynamically sized stable_vector, allowing the storage to grow beyond a fixed thread limit.
  • Replaced the hardcoded maximum thread count (4096) with a configurable ROCPROFSYS_MAX_THREADS for initial capacity, improving flexibility.
  • Added an ensure_capacity helper function to dynamically expand the storage vector when accessing or assigning storage for a thread index, preventing out-of-bounds errors.
  • Updated the storage filling logic to iterate over the current vector size, ensuring all existing elements are set correctly, rather than relying on array fill.
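The ensure_capacity idea in the bullets above can be sketched as a container that grows in fixed-size chunks, so that existing slots never move in memory — the property a stable_vector provides. All names below (chunked_storage, ChunkSize) are illustrative, not the SDK's actual types:

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical sketch: storage grows in fixed-size heap-allocated chunks.
// Growing appends new chunks; existing chunks are never reallocated, so
// pointers/references into already-created slots stay valid.
template <typename T, std::size_t ChunkSize = 64>
class chunked_storage
{
public:
    // grow (under a lock) until slot `idx` exists
    void ensure_capacity(std::size_t idx)
    {
        std::lock_guard<std::mutex> lk{ m_mutex };
        while(idx >= m_chunks.size() * ChunkSize)
            m_chunks.emplace_back(std::make_unique<std::array<T, ChunkSize>>());
    }

    // caller must have ensured capacity for idx
    T& at(std::size_t idx) { return (*m_chunks[idx / ChunkSize])[idx % ChunkSize]; }

    std::size_t capacity() const { return m_chunks.size() * ChunkSize; }

private:
    std::mutex                                              m_mutex;
    std::vector<std::unique_ptr<std::array<T, ChunkSize>>>  m_chunks;
};
```

Because the outer vector holds unique_ptr chunks, a reallocation of the outer vector moves only the pointers, never the elements themselves.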

Submodule update

  • Updated the timemory submodule to a newer commit, likely to incorporate upstream improvements or fixes.

JIRA ID

TBA

Test Plan

TBA

Test Result

TBA

Submission Checklist

@anujshuk-amd anujshuk-amd self-assigned this Jan 8, 2026
@anujshuk-amd anujshuk-amd changed the title from "Improvement Convert thread storage from std::array to std::vector with dynamic growth" to "Improvement thread storage from std::array to std::vector with dynamic growth" Jan 8, 2026
@anujshuk-amd anujshuk-amd force-pushed the anujshuk/anujshuk-amd/dynamic-thread-storage branch 2 times, most recently from 4daacdc to 40f9b5e Compare January 8, 2026 19:56
Comment on lines +144 to +148
```cpp
static std::atomic<size_t>& get_capacity()
{
    static std::atomic<size_t> _cap{ max_threads };
    return _cap;
}
```
Contributor

Can we just have a static member instead of a static variable inside the function?
It would be great to avoid this style:

get_capacity().load(std::memory_order_acquire)
get_capacity().store(_v.size(), std::memory_order_release);

because it introduces confusion: get_* reads like a getter, but then we call store on it, which is a setter. It's contradictory.

With a static member, we can directly call:

m_capacity.load(std::memory_order_acquire)
m_capacity.store(_v.size(), std::memory_order_release);

which is much more readable.
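The suggested style could look roughly like this (a minimal sketch; thread_storage_base, m_capacity, and the free functions are hypothetical names — a C++17 inline static member keeps the one-time initialization without a function-local static):

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical base holding capacity as a static data member, so callers
// load/store on m_capacity directly instead of going through a get_*
// accessor that is also used for stores.
struct thread_storage_base
{
    // C++17 inline static: defined once, initialized before first use
    static inline std::atomic<std::size_t> m_capacity{ 4096 };
};

// lock-free read; pairs with the release store in the resize path
std::size_t read_capacity()
{
    return thread_storage_base::m_capacity.load(std::memory_order_acquire);
}

// publish a new capacity after the storage has actually been grown
void publish_capacity(std::size_t new_size)
{
    thread_storage_base::m_capacity.store(new_size, std::memory_order_release);
}
```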

Contributor Author

@mradosav-amd I have yet to implement your review comment; I am still considering it. I am waiting for all checks to pass first. The reason I put it inside a function is lazy initialization (on first call), and timemory uses function-local statics.

@anujshuk-amd anujshuk-amd force-pushed the anujshuk/anujshuk-amd/dynamic-thread-storage branch 4 times, most recently from 271064c to 1359274 Compare January 14, 2026 19:42
@anujshuk-amd anujshuk-amd marked this pull request as ready for review January 19, 2026 15:32
@anujshuk-amd anujshuk-amd requested review from a team and jrmadsen as code owners January 19, 2026 15:32
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors thread-local storage for counter_data_tracker from a fixed-size std::array to a dynamically resizing std::vector, eliminating the hard 4096 thread limit and reducing memory overhead while maintaining thread safety through atomic capacity tracking and mutex-protected resize operations.

Changes:

  • Replaced std::array with std::vector in thread storage, introducing dynamic capacity management with geometric growth
  • Added thread-safe capacity checks and resize operations using std::atomic<size_t> and std::mutex
  • Updated the timemory submodule to incorporate upstream changes supporting this refactoring

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Reviewed files:

  • projects/rocprofiler-systems/source/lib/rocprof-sys/library/rocprofiler-sdk/counters.hpp: implements dynamic thread storage with capacity management, replaces fixed array operations with vector operations, and adds thread-safe bounds checking
  • projects/rocprofiler-systems/external/timemory: updates the submodule to a newer commit with compatible dynamic storage infrastructure


Comment on lines +125 to +128
```diff
+using type = ::rocprofsys::rocprofiler_sdk::counter_data_tracker;
+
 template <>
-struct set_storage<::rocprofsys::rocprofiler_sdk::counter_data_tracker>
+struct set_storage<type>
```

Copilot AI Jan 19, 2026

The type alias at line 125 is defined outside the template specialization, making it accessible in the broader operation namespace. This could lead to naming conflicts or confusion. Consider moving the type alias inside the set_storage struct to keep it scoped to where it's used, or use a more specific name like counter_data_tracker_type.
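One way to address the scoping concern is to declare the alias inside the specialization itself, so it is invisible to the rest of the operation namespace. A sketch (the rocprofsys namespace below is a stub for illustration; the real definitions live in the SDK):

```cpp
#include <type_traits>

// stub standing in for the real SDK type, for illustration only
namespace rocprofsys::rocprofiler_sdk
{
struct counter_data_tracker
{};
}

namespace operation
{
template <typename T>
struct set_storage;  // stand-in for the primary template

template <>
struct set_storage<::rocprofsys::rocprofiler_sdk::counter_data_tracker>
{
    // alias is now scoped to the specialization, not the enclosing namespace
    using type = ::rocprofsys::rocprofiler_sdk::counter_data_tracker;
};
}
```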

```cpp
}

// Expose get_capacity for get_storage access
using base_type::get_capacity;
```

Copilot AI Jan 19, 2026

Exposing get_capacity from the protected base class breaks encapsulation. This allows external code to directly access the capacity, which should remain an internal implementation detail. Consider providing a controlled interface method in set_storage if capacity needs to be queried externally, or keep get_capacity access limited to friend classes.

Comment on lines +177 to +178
```cpp
if(_idx >=
   operation::set_storage<type>::get_capacity().load(std::memory_order_acquire))
```

Copilot AI Jan 19, 2026

The multi-line condition with get_capacity().load(std::memory_order_acquire) is hard to read. Consider extracting this into a local variable like const size_t current_capacity = operation::set_storage<type>::get_capacity().load(std::memory_order_acquire); and then using that in the condition for better readability.

Suggested change

```diff
-if(_idx >=
-   operation::set_storage<type>::get_capacity().load(std::memory_order_acquire))
+const size_t current_capacity =
+    operation::set_storage<type>::get_capacity().load(std::memory_order_acquire);
+if(_idx >= current_capacity)
```

…owth

- Replace fixed std::array with std::vector in counters.hpp
- Implement thread-safe dynamic resizing with geometric growth (2x)
- Add ensure_capacity() with double-checked locking pattern
- Use std::atomic<size_t> for lock-free capacity reads
- Add bounds checking in get_storage operation
- Initial capacity set to 4096, grows as needed
- Update timemory submodule to users/anujshuk/dynamic-thread-storage_ds (696a160d)
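The double-checked locking pattern listed in the commit message above can be sketched as follows (illustrative names; the elided growth step must also keep element addresses stable, as the maintainer's review points out):

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>

// hypothetical globals standing in for the storage's capacity state
std::atomic<std::size_t> g_capacity{ 4096 };
std::mutex               g_resize_mutex;

// returns the capacity after guaranteeing it covers `idx`
std::size_t ensure_capacity(std::size_t idx)
{
    // fast path: lock-free atomic read (the common case)
    std::size_t cap = g_capacity.load(std::memory_order_acquire);
    if(idx < cap) return cap;

    // slow path: take the lock, then re-check, since another thread may
    // have grown the storage between our load and the lock acquisition
    std::lock_guard<std::mutex> lk{ g_resize_mutex };
    cap = g_capacity.load(std::memory_order_acquire);
    while(idx >= cap) cap *= 2;  // geometric (2x) growth
    // ... grow the underlying (address-stable) storage to `cap` here ...
    g_capacity.store(cap, std::memory_order_release);
    return cap;
}
```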
@anujshuk-amd anujshuk-amd force-pushed the anujshuk/anujshuk-amd/dynamic-thread-storage branch from 1359274 to b68a11b Compare January 19, 2026 17:18
@@ -160,6 +174,9 @@ struct get_storage<::rocprofsys::rocprofiler_sdk::counter_data_tracker>

```cpp
auto operator()(size_t _idx) const
{
    // Thread-safe read using atomic capacity
    const size_t current_capacity = operation::set_storage<type>::get_capacity();
    if(_idx >= current_capacity) return static_cast<storage<type>*>(nullptr);
```
Contributor

I observed a potential data race edge case

The atomic capacity check on lines 178-179 doesn't fully protect against a concurrent resize. If ensure_capacity() reallocates the vector while this read is in progress, the reader could access freed memory.

Scenario:

  1. Reader checks get_capacity() -> returns valid index
  2. Writer calls ensure_capacity() -> vector reallocates
  3. Reader calls get().at(_idx) -> accesses freed memory (Undefined Behaviour)

Practical risk: this only occurs when the thread count exceeds ROCPROFSYS_MAX_THREADS (4096) during concurrent access. AI workloads (e.g., TensorFlow) easily exceed MAX_THREADS; most other workloads do not.

Not blocking, but worth noting for future consideration:

  • Could use std::shared_mutex (read/write lock) for full protection
  • Or document that exceeding max_threads during concurrent access has undefined behavior
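The std::shared_mutex option could look roughly like this (a minimal sketch with hypothetical names; readers hold a shared lock for the entire access, so a concurrent resize under the exclusive lock can never reallocate the vector out from under them — at the cost of per-access locking overhead, as the maintainer's review notes):

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical reader/writer-locked storage sketch (C++17).
template <typename T>
class guarded_storage
{
public:
    explicit guarded_storage(std::size_t n) : m_data(n) {}

    // reader: shared lock held across the whole access, not just the lookup
    T read(std::size_t idx) const
    {
        std::shared_lock lk{ m_mutex };
        return idx < m_data.size() ? m_data[idx] : T{};
    }

    // writer: exclusive lock; safe to reallocate while no readers are active
    void ensure_capacity(std::size_t idx)
    {
        std::unique_lock lk{ m_mutex };
        if(idx >= m_data.size())
            m_data.resize(std::max(idx + 1, m_data.size() * 2));
    }

    void write(std::size_t idx, T v)
    {
        std::unique_lock lk{ m_mutex };
        if(idx < m_data.size()) m_data[idx] = std::move(v);
    }

private:
    mutable std::shared_mutex m_mutex;
    std::vector<T>            m_data;
};
```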

@jrmadsen
Contributor

This is a huge mistake. You need something like stable_vector, which exists somewhere in rocprofiler-systems or timemory -- it is effectively a vector of arrays. std::vector guarantees contiguous memory -- which means that unless your vector stores pointers (and thus every access to the thread-local data requires pointer chasing), a reallocation of the underlying memory to expand the vector will cause extremely difficult-to-debug/reproduce memory errors. And trying to implement a scheme with R/W mutexes makes it unwieldy to use properly, since you have to hold the lock the entire time you access the data, not just hold the lock to get a reference/pointer... and holding the lock introduces overhead that doesn't exist currently. I knew exactly what I was doing when I used std::array instead of std::vector, and the ability to dynamically expand via stable_vector is the solution here.
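The address-stability property described here can be demonstrated with std::deque as a stand-in for stable_vector: like a vector of arrays, std::deque allocates in blocks, and the standard guarantees push_back never invalidates references to existing elements (std::vector makes no such guarantee once it reallocates):

```cpp
#include <deque>

// Returns true if a reference taken before heavy growth still points at
// the same element afterward. With std::deque this is guaranteed; with
// std::vector the first pointer would dangle after any reallocation.
bool addresses_stable_across_growth()
{
    std::deque<int> slots(1, 42);
    int* first = &slots[0];        // reference into the container
    for(int i = 0; i < 100000; ++i)
        slots.push_back(i);        // grows block-by-block; no relocation
    return first == &slots[0];     // still points at element 0
}
```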

Contributor

@jrmadsen jrmadsen left a comment

See previous comment. This PR suggests there is a fundamental misunderstanding about how the memory allocations of std::vector work under the hood and the dangers of resizing in multithreading scenarios.

@anujshuk-amd anujshuk-amd force-pushed the anujshuk/anujshuk-amd/dynamic-thread-storage branch from 2bae79b to e6072a4 Compare January 28, 2026 22:02
@anujshuk-amd
Contributor Author

This is a huge mistake. You need something like stable_vector, which exists somewhere in rocprofiler-systems or timemory -- it is effectively a vector of arrays. std::vector guarantees contiguous memory -- which means that unless your vector stores pointers (and thus every access to the thread-local data requires pointer chasing), a reallocation of the underlying memory to expand the vector will cause extremely difficult-to-debug/reproduce memory errors. And trying to implement a scheme with R/W mutexes makes it unwieldy to use properly, since you have to hold the lock the entire time you access the data, not just hold the lock to get a reference/pointer... and holding the lock introduces overhead that doesn't exist currently. I knew exactly what I was doing when I used std::array instead of std::vector, and the ability to dynamically expand via stable_vector is the solution here.

Thank you very much for taking the time to review the changes. Your insights are invaluable, and I truly appreciate your keen eye for detail. The suggestions regarding the data structures have been particularly helpful. I kindly request your further assistance in this. Your expertise would greatly enhance the final outcome.

@anujshuk-amd anujshuk-amd requested a review from jrmadsen January 28, 2026 22:14
@anujshuk-amd anujshuk-amd force-pushed the anujshuk/anujshuk-amd/dynamic-thread-storage branch from fdb4589 to e9a38d4 Compare January 28, 2026 22:56
@dgaliffiAMD dgaliffiAMD marked this pull request as draft January 29, 2026 03:06
@anujshuk-amd anujshuk-amd changed the title from "Improvement thread storage from std::array to std::vector with dynamic growth" to "Improvement thread storage from std::array to stable_vector with dynamic growth" Jan 29, 2026
@anujshuk-amd anujshuk-amd marked this pull request as ready for review January 30, 2026 14:44