
Conversation

G-Ragghianti (Contributor) commented Apr 21, 2025

Need this PR for testing; do not merge.

This adds a GitHub workflow YAML to test PaRSEC via DPLASMA. The YAML comes from the DPLASMA CI with slight changes: from the checked-out PaRSEC source, we clone DPLASMA and run its CI tests from that subdirectory. To use the newest PaRSEC sources instead of the pinned DPLASMA submodule, we remove the parsec subdirectory in dplasma and replace it with a symlink to the parent directory.
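Roughly, the setup step looks like this (a sketch; the exact clone URL and working directory are assumptions):

```sh
# Run from the checked-out PaRSEC source root (assumed layout).
git clone --recursive https://github.com/ICLDisco/dplasma.git
rm -rf dplasma/parsec    # drop the pinned submodule checkout
ln -s .. dplasma/parsec  # point DPLASMA at the PaRSEC under test
```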

@abouteiller Is this what you had in mind?

G-Ragghianti requested a review from bosilca as a code owner April 21, 2025 19:09
G-Ragghianti requested a review from a team as a code owner May 7, 2025 17:29
bosilca marked this pull request as draft May 7, 2025 19:03

G-Ragghianti (Contributor, Author) commented

I'm hitting an error where DPLASMA can't find a header from PaRSEC; the PaRSEC build isn't installing it:

2025-05-07T20:55:12.6393336Z /tmp/parsec/parsec/dplasma/build/src/dplrnt_wrapper.c:14:10: fatal error: parsec/data_dist/matrix/apply.h: No such file or directory
2025-05-07T20:55:12.6394495Z    14 | #include "parsec/data_dist/matrix/apply.h"

bosilca (Contributor) commented May 7, 2025

It's an auto-generated header from parsec/data_dist/matrix/apply.jdf, and these don't get installed by default. I need to look a little at the code to see how we can get at least this one installed.
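For reference, a minimal sketch of what installing the generated header could look like (the binary-dir path is an assumption about where the JDF output lands, not the actual PaRSEC build logic):

```cmake
# Sketch only: install the header generated from apply.jdf so external
# consumers can #include "parsec/data_dist/matrix/apply.h".
# ${CMAKE_CURRENT_BINARY_DIR}/apply.h as the location is an assumption.
install(FILES ${CMAKE_CURRENT_BINARY_DIR}/apply.h
        DESTINATION include/parsec/data_dist/matrix)
```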

G-Ragghianti (Contributor, Author) commented May 7, 2025

I had a look at the CMake config, and I think I see where it would need to be added. I can create a patch file to include in this PR (just for testing), and if that's all that's required, we can make the change in the DPLASMA repo. Sound good?

Edit: I forgot which repo this was in, so no patch is needed. I just added something that results in apply.h being installed.

abouteiller (Contributor) commented

This feature is already finding issues, great :D

G-Ragghianti force-pushed the gragghia/ci_dplasma branch 2 times, most recently from 86fa51b to 1a01c3c (May 8, 2025 16:07)

G-Ragghianti (Contributor, Author) commented

I also found a problem in the CMakeLists.txt for DPLASMA where MPIEXEC_NUMPROC_FLAGS was used instead of MPIEXEC_NUMPROC_FLAG (resulting in a failure to launch the MPI job). Now it looks like the only errors are due to "suspicious" solutions.
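For reference, CMake's FindMPI defines the singular MPIEXEC_NUMPROC_FLAG; the plural variant expands to nothing, so mpiexec was never told how many ranks to start. A sketch of a corrected test line (the test name and arguments are illustrative):

```cmake
# MPIEXEC_NUMPROC_FLAG (no trailing S) is what CMake's FindMPI provides;
# MPIEXEC_NUMPROC_FLAGS is undefined and expands to an empty string.
add_test(NAME testing_dgetrf_1d
         COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 4
                 $<TARGET_FILE:testing_dgetrf_1d> -N 378 -t 19 -P 1)
```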

G-Ragghianti (Contributor, Author) commented

Found another problem: DPLASMA can't find HIP when using an external PaRSEC, because the default ROCm directory /opt/rocm isn't added to CMAKE_SYSTEM_PREFIX_PATH in PaRSECConfig.cmake. I just added it to verify the source of the problem, but I'm not sure that is the "correct" way.
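What I added amounts to roughly this in PaRSECConfig.cmake (a sketch; the PARSEC_HAVE_HIP guard is my assumption, not necessarily what the config actually exports):

```cmake
# Sketch: make the default ROCm install location visible to
# find_package(hip) in projects consuming an external PaRSEC.
# PARSEC_HAVE_HIP as the guard variable is an assumption.
if(PARSEC_HAVE_HIP AND NOT "/opt/rocm" IN_LIST CMAKE_SYSTEM_PREFIX_PATH)
  list(APPEND CMAKE_SYSTEM_PREFIX_PATH "/opt/rocm")
endif()
```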

All tests are now running, and the only remaining problems are incorrect tester results or tester segfaults.

G-Ragghianti (Contributor, Author) commented

Here is an example of one of the "suspicious" results:

60: Test command: /apps/spacks/2024-07-19/github_env/var/spack/environments/dplasma/.spack-env/view/bin/mpiexec "-n" "4" "./testing_dgetrf_1d" "-N" "378" "-t" "19" "-P" "1" "-x" "-v=5"
60: Working Directory: /tmp/parsec/build/dplasma/build/tests
60: Environment variables: 
60:  PARSEC_MCA_device_cuda_enabled=0
60:  PARSEC_MCA_device_hip_enabled=0
60:  PARSEC_MCA_device_level_zero_enabled=0
60:  PARSEC_MCA_device_cuda_memory_use=10
60:  PARSEC_MCA_device_hip_memory_use=10
60:  PARSEC_MCA_device_level_zero_memory_use=10
60: Test timeout computed to be: 1500
60: #+++++ cores detected       : 36
60: #+++++ nodes x cores + gpu  : 4 x 36 + 0 (144+0)
60: #+++++ thread mode          : THREAD_SERIALIZED
60: #+++++ P x Q                : 1 x 4 (4/4)
60: #+++++ M x N x K|NRHS       : 378 x 378 x 1
60: #+++++ LDA , LDB            : 378 , 378
60: #+++++ MB x NB , IB         : 19 x 19 , 40
60: [   0] TIME(s)      0.03314 : PaRSEC initialized
60: [   2] TIME(s)      0.03323 : PaRSEC initialized
60: [   3] TIME(s)      0.03325 : PaRSEC initialized
60: [   1] TIME(s)      0.03325 : PaRSEC initialized
60: W@00000 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
60: 	This is often unintentional, and will perform poorly.
60: 	Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
60: 	and hide the real binding from PaRSEC; if you verified that the binding is correct,
60: 	this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
60: +++ Generate matrices ... Done
60: +++ Generate matrices ... Done
60: +++ Generate matrices ... Done
60: +++ Generate matrices ... Done
60: +++ Computing getrf ... Done.
60: +++ Computing getrf ... Done.
60: +++ Computing getrf ... Done.
60: +++ Computing getrf ... [****] TIME(s)      0.73037 : dgetrf_1d	PxQxg=   1 4   0 NB=   19 N=     378 :       0.049202 gflops - ENQ&PROG&DEST      0.73120 :       0.049146 gflops - ENQ      0.00074 - DEST      0.00008
60: <DartMeasurement name="performance" type="numeric/double"
60:                  encoding="none" compression="none">
60: 0.0492018
60: </DartMeasurement>
60: Done.
60: ============
60: Checking the Residual of the solution 
60: -- ||A||_oo = 1.025373e+02, ||X||_oo = 2.768754e+00, ||B||_oo= 5.000000e-01, ||A X - B||_oo = 4.776786e+00
60: -- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 4.002243e+11 
60: -- Solution is suspicious ! 
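
For context, the check printed above is the standard scaled backward-error residual (a reading of the log, not new tester logic):

$$ r = \frac{\|A\hat{x} - b\|_\infty}{\left(\|A\|_\infty \|\hat{x}\|_\infty + \|b\|_\infty\right)\, N\, \varepsilon} $$

For a backward-stable solve this should be of order one (LAPACK-style testers typically accept up to a few tens); here r is about 4.0e+11, so the computed solution is genuinely wrong rather than a tolerance artifact.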
