Improvement thread storage from std::array to stable_vector with dynamic growth#2542
anujshuk-amd wants to merge 5 commits into develop from
Conversation
Force-pushed from 4daacdc to 40f9b5e
```cpp
static std::atomic<size_t>& get_capacity()
{
    static std::atomic<size_t> _cap{ max_threads };
    return _cap;
}
```
Can we just have a static member instead of a static variable inside the function?
It would be great to avoid this style:
`get_capacity().load(std::memory_order_acquire)`
`get_capacity().store(_v.size(), std::memory_order_release);`
because it introduces confusion: calling `get_*` and then calling `store`, which is a setter, is contradictory.
With a static member, we can directly call:
`m_capacity.load(std::memory_order_acquire)`
`m_capacity.store(_v.size(), std::memory_order_release);`
which will be much more readable.
@mradosav-amd I have not implemented your review comment yet; I am still considering it, and I am waiting for all checks to pass first. The reason I put it inside a function is lazy initialization (on first call), and timemory itself uses function-local statics.
Force-pushed from 271064c to 1359274
Pull request overview
This PR refactors thread-local storage for counter_data_tracker from a fixed-size std::array to a dynamically resizing std::vector, eliminating the hard 4096 thread limit and reducing memory overhead while maintaining thread safety through atomic capacity tracking and mutex-protected resize operations.
Changes:
- Replaced `std::array` with `std::vector` in thread storage, introducing dynamic capacity management with geometric growth
- Added thread-safe capacity checks and resize operations using `std::atomic<size_t>` and `std::mutex`
- Updated the timemory submodule to incorporate upstream changes supporting this refactoring
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| projects/rocprofiler-systems/source/lib/rocprof-sys/library/rocprofiler-sdk/counters.hpp | Implements dynamic thread storage with capacity management, replaces fixed array operations with vector operations, and adds thread-safe bounds checking |
| projects/rocprofiler-systems/external/timemory | Updates submodule to newer commit with compatible dynamic storage infrastructure |
```diff
  using type = ::rocprofsys::rocprofiler_sdk::counter_data_tracker;

  template <>
- struct set_storage<::rocprofsys::rocprofiler_sdk::counter_data_tracker>
+ struct set_storage<type>
```
The `type` alias at line 125 is defined outside the template specialization, making it accessible in the broader `operation` namespace. This could lead to naming conflicts or confusion. Consider moving the alias inside the `set_storage` struct to keep it scoped to where it's used, or use a more specific name like `counter_data_tracker_type`.
```cpp
}

// Expose get_capacity for get_storage access
using base_type::get_capacity;
```
Exposing get_capacity from the protected base class breaks encapsulation. This allows external code to directly access the capacity, which should remain an internal implementation detail. Consider providing a controlled interface method in set_storage if capacity needs to be queried externally, or keep get_capacity access limited to friend classes.
```cpp
if(_idx >=
   operation::set_storage<type>::get_capacity().load(std::memory_order_acquire))
```
The multi-line condition with get_capacity().load(std::memory_order_acquire) is hard to read. Consider extracting this into a local variable like const size_t current_capacity = operation::set_storage<type>::get_capacity().load(std::memory_order_acquire); and then using that in the condition for better readability.
```diff
- if(_idx >=
-    operation::set_storage<type>::get_capacity().load(std::memory_order_acquire))
+ const size_t current_capacity =
+     operation::set_storage<type>::get_capacity().load(std::memory_order_acquire);
+ if(_idx >= current_capacity)
```
…owth

- Replace fixed std::array with std::vector in counters.hpp
- Implement thread-safe dynamic resizing with geometric growth (2x)
- Add ensure_capacity() with double-checked locking pattern
- Use std::atomic<size_t> for lock-free capacity reads
- Add bounds checking in get_storage operation
- Initial capacity set to 4096, grows as needed
- Update timemory submodule to users/anujshuk/dynamic-thread-storage_ds (696a160d)
Force-pushed from 1359274 to b68a11b
```cpp
// @@ -160,6 +174,9 @@ struct get_storage<::rocprofsys::rocprofiler_sdk::counter_data_tracker>

auto operator()(size_t _idx) const
{
    // Thread-safe read using atomic capacity
    const size_t current_capacity = operation::set_storage<type>::get_capacity();
    if(_idx >= current_capacity) return static_cast<storage<type>*>(nullptr);
```
I observed a potential data race edge case
The atomic capacity check on lines 178-179 doesn't fully protect against a concurrent resize. If ensure_capacity() reallocates the vector while this read is in progress, the reader could access freed memory.
Scenario:
- Reader checks get_capacity() -> returns valid index
- Writer calls ensure_capacity() -> vector reallocates
- Reader calls get().at(_idx) -> accesses freed memory (Undefined Behaviour)
Practical risk: this only occurs when exceeding ROCPROFSYS_MAX_THREADS (4096) during concurrent access. AI workloads such as TensorFlow easily exceed MAX_THREADS, but most other workloads do not.
Not blocking, but worth noting for future consideration:
- Could use std::shared_mutex (read/write lock) for full protection
- Or document that exceeding max_threads during concurrent access has undefined behavior
This is a huge mistake. You need something like stable_vector, which exists somewhere in rocprofiler-systems or timemory -- it is effectively a vector of arrays.
jrmadsen left a comment:
See previous comment. This PR suggests there is a fundamental misunderstanding about how the memory allocations of std::vector work under the hood and the dangers of resizing in multithreading scenarios.
Force-pushed from 2bae79b to e6072a4
Thank you very much for taking the time to review the changes. Your insights are invaluable, and I truly appreciate your keen eye for detail. The suggestions regarding the data structures have been particularly helpful. I kindly request your further assistance in this. Your expertise would greatly enhance the final outcome.
Force-pushed from fdb4589 to e9a38d4
Using: ROCm/timemory#21
Motivation
This pull request updates the way thread-local storage is managed for `counter_data_tracker` in the rocprofiler SDK, improving its scalability and flexibility. The most important changes include switching from a fixed-size array to a dynamically sized `stable_vector`, ensuring storage can grow as needed, and updating related initialization and access patterns.
Technical Details
Thread-local storage improvements:
- Switched `set_storage<counter_data_tracker>` from a fixed-size `std::array` to a dynamically sized `stable_vector`, allowing the storage to grow beyond a fixed thread limit.
- Replaced the hard-coded limit (`4096`) with a configurable `ROCPROFSYS_MAX_THREADS` for initial capacity, improving flexibility.
- Added an `ensure_capacity` helper function to dynamically expand the storage vector when accessing or assigning storage for a thread index, preventing out-of-bounds errors. [1] [2]
- Updated `timemory` to a new commit, which may bring in upstream improvements or fixes.

Submodule update
- Updated the `timemory` submodule to a newer commit, likely to incorporate upstream improvements or fixes.

JIRA ID
TBA
Test Plan
TBA
Test Result
TBA
Submission Checklist