-
Notifications
You must be signed in to change notification settings - Fork 520
Feature/systemds 3730 multithreaded roll operation #2376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feature/systemds 3730 multithreaded roll operation #2376
Conversation
This patch introduces multi-threading support for the roll operation to improve performance. The RollTest.java has been updated to cover both single and multithreaded execution modes. Furthermore, this update adds comprehensive consistency checks to ensure mathematical correctness. New tests were created to validate both dense and sparse matrix inputs. Additionally, cross-verification tests were added to confirm that sparse and dense rolling for single and multithreaded executions produce identical results.
This commit adds logs to measure the speedup between the single threaded and multithreaded roll operation (on dense and sparse matrices).
|
Thank you for the very good PR @Biranavan-Parameswaran and thanks for addressing the normalization issue for negative shifts in the roll operation. I particularly like that you added new correctness tests to increase test coverage and that you provide benchmarks to verify speedups for larger / dense matrices. I suggest that It would be interesting to see the performance of larger dense matrices with few numbers of columns (including column vectors). My guess is that performance drops in that scenario because you then do many array copies on small ranges. If the underlying row-major |
src/main/java/org/apache/sysds/runtime/matrix/data/LibMatrixReorg.java
Outdated
Show resolved
Hide resolved
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2376 +/- ##
============================================
+ Coverage 72.29% 72.33% +0.03%
- Complexity 46937 47025 +88
============================================
Files 1513 1513
Lines 178421 178806 +385
Branches 35036 35092 +56
============================================
+ Hits 128993 129341 +348
- Misses 39665 39686 +21
- Partials 9763 9779 +16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Introduced a new test class to properly measure the performance of single-threaded and multithreaded rolling and remove unnecessary if condition
|
This patch introduces multi threading support for the roll operation to improve performance across a wide range of matrix shapes and sizes. The new RollOperation. Single vs Multithreaded Performance SummaryTest Coverage and MotivationPerformance was evaluated across multiple matrix configurations to understand how matrix shape, density, and size affect multithreaded scaling. The goal was to validate expected speedups and identify edge cases where multithreading may not be beneficial. General Performance ObservationsDense Matrices with Moderate Aspect RatiosFor dense matrices with moderate to large column counts, multithreaded execution consistently improves performance.
These results confirm that the multithreaded roll implementation behaves as expected for typical dense workloads. Sparse MatricesSparse matrices show more nuanced behavior.
This aligns with my expectations. Sparse workloads require sufficient actual work per thread to justify parallel execution. Matrices with Few Columns and Large Row CountsAdditional experiments were conducted for very tall matrices with few columns, including the extreme case of Initial BehaviorFor column vectors, the original multithreaded implementation performed significantly worse than the single threaded version. This was due to many small copy operations and excessive threading overhead. Attempted OptimizationAn optimized multithreaded implementation was prototyped for the case where the underlying block is This optimization significantly improved the multithreaded behavior compared to the previous implementation. Final ResultsDespite the improvement, benchmarking shows that even with this optimization:
This indicates that the operation is memory bandwidth bound and already near optimal when using one or two large Because the optimized multithreaded path for Conclusion and Proposed Direction
Based on these findings, a reasonable follow up improvement would be to explicitly force a single threaded roll path for If there is agreement on this approach, I can follow up with a targeted change implementing this specialization. |
|
Thank you for the changes and the detailed performance analysis @Biranavan-Parameswaran. There is still a missing license that should be added. |
|
Thank you for the reminder @janniklinde . I have added the missing license. |
|
LGTM @mboehm7 - one possible concern I have both with the single-threaded and multi-threaded implementation is that Thank you again for your contribution, bugfixes, and detailed tests and benchmarks @Biranavan-Parameswaran. It appears that |
This PR implements a multithreaded roll operation as issued in SYSTEMDS-3730
This patch introduces multi-threading support for the roll operation to
improve performance. The RollTest.java has been updated to cover
both single and multithreaded execution modes.
Furthermore, this update adds comprehensive consistency checks to ensure
mathematical correctness. New tests were created to validate both dense
and sparse matrix inputs. Additionally, cross-verification tests were
added to confirm that sparse and dense rolling for single and
multithreaded executions produce identical results.
RollOperation: Single vs Multithreaded Performance Summary
Test Matrix Sizes
Two ranges of matrix sizes were evaluated:
Rows = 2017–2516, Cols = 1001–1500
(Defined as:
MIN_ROWS = 2017,MIN_COLS = 1001, plus up to +500 random)Rows ≈ 8000–8500, Cols ≈ 4000–4500
(Same generation logic but higher baseline)
These two ranges allow observing how workload size impacts single-threaded (ST) vs multithreaded (MT) performance.
Dense Matrices
Sparse Matrices (1% sparsity)
Key Point
This shows that very sparse matrices should stay single-threaded, while denser workloads benefit from multithreading.
@janniklinde will review this PR.