
Conversation

@bosilca
Contributor

@bosilca bosilca commented Apr 5, 2023

Rework the bindings. The main idea is to inherit the bindings from the batch scheduler, and then work from there.
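
For reference, the "inherit" step can be expressed directly with hwloc: read the cpuset the launcher (batch scheduler, mpirun, or a cgroup) has already imposed on the process and use it as the starting point. A minimal standalone sketch, not the PR's code; the fallback to the machine-wide allowed cpuset is my assumption:

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t inherited = hwloc_bitmap_alloc();
    char *str = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Read the cpuset the batch scheduler / mpirun already applied to this process. */
    if (hwloc_get_cpubind(topo, inherited, HWLOC_CPUBIND_PROCESS) != 0) {
        /* No external binding readable: fall back to everything allowed on the machine. */
        hwloc_bitmap_copy(inherited, hwloc_topology_get_allowed_cpuset(topo));
    }

    hwloc_bitmap_asprintf(&str, inherited);
    printf("inherited cpuset: %s\n", str);

    free(str);
    hwloc_bitmap_free(inherited);
    hwloc_topology_destroy(topo);
    return 0;
}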

@bosilca bosilca added the blocker Blocking release or critical use case label Apr 5, 2023
@bosilca bosilca added this to the v4.0 milestone Apr 5, 2023
@bosilca bosilca self-assigned this Apr 5, 2023
@bosilca bosilca requested a review from a team as a code owner April 5, 2023 18:52
@bosilca bosilca marked this pull request as draft April 5, 2023 18:53
Contributor

@devreal devreal left a comment


Much needed 👍 two comments inline, otherwise LGTM

@bosilca bosilca force-pushed the topic/revised_binding branch 2 times, most recently from d9006b7 to f5af19f Compare April 6, 2023 21:19
@therault
Contributor

I'm trying it on leconte.

mpirun -np 2 --map-by socket hwloc-info --restrict binding package:0
Package L#0
 [...]
 cpuset = 0x0000ffff,0xf00000ff,0xfff00000
 complete cpuset = 0x0000ffff,0xf00000ff,0xfff00000
 allowed cpuset = 0x0000ffff,0xf00000ff,0xfff00000
 nodeset = 0x00000002
 complete nodeset = 0x00000002
 allowed nodeset = 0x00000002
 [...]
Package L#0
 [...]
 cpuset = 0x0fffff00,0x000fffff
 complete cpuset = 0x0fffff00,0x000fffff
 allowed cpuset = 0x0fffff00,0x000fffff
 nodeset = 0x00000001
 complete nodeset = 0x00000001
 allowed nodeset = 0x00000001
  [...]

I interpret this as: when running mpirun -np 2 --map-by socket on this machine (with module load openmpi), there are two processes, and rank 0 should use a different set of cores than rank 1 (the cpusets look complicated, but they do appear mutually exclusive).
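
As a quick sanity check of that reading, the two cpusets above can be compared with hwloc's bitmap API; a standalone sketch (not part of this PR), with the hex strings copied from the output above:

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_bitmap_t a = hwloc_bitmap_alloc(), b = hwloc_bitmap_alloc();
    hwloc_bitmap_sscanf(a, "0x0000ffff,0xf00000ff,0xfff00000");  /* first block above  */
    hwloc_bitmap_sscanf(b, "0x0fffff00,0x000fffff");             /* second block above */
    printf("cpusets overlap: %s\n", hwloc_bitmap_intersects(a, b) ? "yes" : "no");
    hwloc_bitmap_free(a);
    hwloc_bitmap_free(b);
    return 0;
}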

Now, I run a parsec test with this PR:

mpirun -np 2 --map-by socket ./tests/apps/stencil/testing_stencil_1D -M 40960 -N 40960 -t 16 -T 16 -P 2
Process binding [rank 0]: cpuset [ALLOWED  ]: 0x0fffff00,0x000fffff
Process binding [rank 0]: cpuset [USED     ]: 0x000fffff
Process binding [rank 0]: cpuset [FREE     ]: 0x0fffff00,0x0
W@00000 parsec_hwloc: couldn't bind to mask cpuset  0x0
Process binding [rank 0]: cpuset [ALLOWED  ]: 0x0000ffff,0xf00000ff,0xfff00000
Process binding [rank 0]: cpuset [USED     ]: 0x000000ff,0xfff00000
Process binding [rank 0]: cpuset [FREE     ]: 0x0000ffff,0xf0000000,0x0
W@00001 parsec_hwloc: couldn't bind to mask cpuset  0x0
i@00000 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
	Parsec Streams     : 20
	clockRate (GHz)    : 2.20
	peak Gflops        : double 176.0000, single 352.0000
i@00001 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
	Parsec Streams     : 20
	clockRate (GHz)    : 2.20
	peak Gflops        : double 176.0000, single 352.0000
i@00000 Virtual Process Map with 1 VPs...
i@00000    Virtual Process of index 0 has 20 threads and cpuset 0x000fffff
i@00000     Thread 0 of VP 0 can be bound on cores 0x00000001
i@00000     Thread 1 of VP 0 can be bound on cores 0x00000002
i@00000     Thread 2 of VP 0 can be bound on cores 0x00000004
i@00000     Thread 3 of VP 0 can be bound on cores 0x00000008
i@00000     Thread 4 of VP 0 can be bound on cores 0x00000010
i@00000     Thread 5 of VP 0 can be bound on cores 0x00000020
i@00000     Thread 6 of VP 0 can be bound on cores 0x00000040
i@00000     Thread 7 of VP 0 can be bound on cores 0x00000080
i@00000     Thread 8 of VP 0 can be bound on cores 0x00000100
i@00000     Thread 9 of VP 0 can be bound on cores 0x00000200
i@00000     Thread 10 of VP 0 can be bound on cores 0x00000400
i@00000     Thread 11 of VP 0 can be bound on cores 0x00000800
i@00000     Thread 12 of VP 0 can be bound on cores 0x00001000
i@00000     Thread 13 of VP 0 can be bound on cores 0x00002000
i@00000     Thread 14 of VP 0 can be bound on cores 0x00004000
i@00000     Thread 15 of VP 0 can be bound on cores 0x00008000
i@00000     Thread 16 of VP 0 can be bound on cores 0x00010000
i@00000     Thread 17 of VP 0 can be bound on cores 0x00020000
i@00000     Thread 18 of VP 0 can be bound on cores 0x00040000
i@00000     Thread 19 of VP 0 can be bound on cores 0x00080000
i@00001 Virtual Process Map with 1 VPs...
i@00001    Virtual Process of index 0 has 20 threads and cpuset 0x000fffff
i@00001     Thread 0 of VP 0 can be bound on cores 0x00000001
i@00001     Thread 1 of VP 0 can be bound on cores 0x00000002
i@00001     Thread 2 of VP 0 can be bound on cores 0x00000004
i@00001     Thread 3 of VP 0 can be bound on cores 0x00000008
i@00001     Thread 4 of VP 0 can be bound on cores 0x00000010
i@00001     Thread 5 of VP 0 can be bound on cores 0x00000020
i@00001     Thread 6 of VP 0 can be bound on cores 0x00000040
i@00001     Thread 7 of VP 0 can be bound on cores 0x00000080
i@00001     Thread 8 of VP 0 can be bound on cores 0x00000100
i@00001     Thread 9 of VP 0 can be bound on cores 0x00000200
i@00001     Thread 10 of VP 0 can be bound on cores 0x00000400
i@00001     Thread 11 of VP 0 can be bound on cores 0x00000800
i@00001     Thread 12 of VP 0 can be bound on cores 0x00001000
i@00001     Thread 13 of VP 0 can be bound on cores 0x00002000
i@00001     Thread 14 of VP 0 can be bound on cores 0x00004000
i@00001     Thread 15 of VP 0 can be bound on cores 0x00008000
i@00001     Thread 16 of VP 0 can be bound on cores 0x00010000
i@00001     Thread 17 of VP 0 can be bound on cores 0x00020000
i@00001     Thread 18 of VP 0 can be bound on cores 0x00040000
i@00001     Thread 19 of VP 0 can be bound on cores 0x00080000

From this output, it looks like the two ranks are sharing the same cores?

Running htop in a parallel terminal shows that only cores 0, 1, 13, 20, 21, 40, 41 and 65 have work to do (plus a bit on 77 and 46 sometimes); definitely not all cores are active, and the run is pretty slow.

Also, I tried to rebase the PR on the current master, but there are conflicts I'm not sure how to solve.

@bosilca
Contributor Author

bosilca commented Aug 30, 2023

I was not able to find a way to translate between relative and absolute core numbering, so the reported bindings are relative to the allowed procs, and not absolute (as one would expect). Let me look again at the documentation to see if there is a way.
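
For reference, hwloc's cpuset helpers can express such a translation: the i-th PU inside the allowed cpuset gives the relative index, and its os_index member is the absolute (physical) number. A sketch, assuming topo is the loaded topology and allowed is the cpuset returned by hwloc_get_cpubind():

/* Map the i-th allowed PU (relative index) to its absolute OS index. */
unsigned relative_idx = 3; /* example: fourth PU in the allowed set */
hwloc_obj_t pu = hwloc_get_obj_inside_cpuset_by_type(topo, allowed,
                                                     HWLOC_OBJ_PU, relative_idx);
if (NULL != pu)
    printf("relative PU %u -> physical P#%u (machine-wide logical L#%u)\n",
           relative_idx, pu->os_index, pu->logical_index);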

Follow the process binding provided by the batch scheduler or process manager.
If the application is started with mpirun, follow the bindings provided by the
mpirun command.

Add an MCA parameter (runtime_report_bindings) to report the final bindings for
each virtual process and thread. Report the binding in both logical and
physical notation.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This option allows each thread to run on a single physical resource. The
singlification can be done early (for negative values of the MCA parameter), in
which case the threads are packed onto the resources, or late (for positive
values of the MCA parameter), in which case the threads are spread across the
resources.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
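
A sketch of the pack-vs-spread behavior described in the commit message above (illustration only, using hwloc directly rather than the PR's internals; topo, allowed, nbth and pack are assumed to be set up by the caller):

int nb_pu = hwloc_get_nbobjs_inside_cpuset_by_type(topo, allowed, HWLOC_OBJ_PU);
for (int t = 0; t < nbth; t++) {
    int slot = pack ? (t % nb_pu)                    /* early: pack threads onto the first PUs */
                    : ((t * nb_pu) / nbth) % nb_pu;  /* late: spread threads across all PUs    */
    hwloc_obj_t pu = hwloc_get_obj_inside_cpuset_by_type(topo, allowed, HWLOC_OBJ_PU, slot);
    /* each thread would then bind itself to pu->cpuset, e.g. with hwloc_set_cpubind() */
}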
@bosilca bosilca force-pushed the topic/revised_binding branch from d6334a6 to be2ee69 Compare October 12, 2023 22:50
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@evaleev
Contributor

evaleev commented Nov 1, 2023

@bosilca ping ... need it badly :)

@evaleev
Contributor

evaleev commented Dec 21, 2023

@bosilca ping again .. real showstopper

@abouteiller abouteiller self-assigned this Feb 15, 2024
@abouteiller
Contributor

doing the merge now

static int parsec_nb_total_threads = 0;
static int parse_binding_parameter(int vp, int nbth, char * binding);

int vpmap_get_nb_total_threads(void)
Contributor


This is used by dplasma and is removed without replacement. The new parsec_context_query(parsec_context_t*) could be used instead, but we don't have the parsec context at the caller site. I substituted it with parsec_vpmap_get_vp_thread(0) in dplasma, as that was probably the intent anyway (assuming all VPs have the same number of threads). Should we reintroduce the function?
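
If it is reintroduced, one way to avoid the symmetry assumption would be to sum the per-VP counts; a sketch only, where both accessor names are hypothetical stand-ins for whatever the new API exposes:

int vpmap_get_nb_total_threads(void)
{
    int total = 0;
    /* parsec_vpmap_get_nb_vp() and parsec_vpmap_get_vp_threads() are assumed names */
    for (int vp = 0; vp < parsec_vpmap_get_nb_vp(); vp++)
        total += parsec_vpmap_get_vp_threads(vp);
    return total;
}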

parsec/parsec.c Outdated
int core = atoi(option);
if( (core > -1) && (core < parsec_hwloc_nb_real_cores()) )
    context->comm_th_core = core;
/* negative core allowed to force an absolute core selection */
Contributor


This is not true; passing a negative value to find_core_by_idx will cause an infinite loop.

Contributor


@bosilca is the comment wrong or the code wrong?
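
If find_core_by_idx really cannot handle negative values, one possible resolution is to reject them explicitly and drop the stale comment; a sketch of that direction, not the PR's actual fix (parsec_warning is assumed to be the usual PaRSEC logging helper):

int core = atoi(option);
if( (core >= 0) && (core < parsec_hwloc_nb_real_cores()) ) {
    context->comm_th_core = core;
} else {
    /* negative or out-of-range values are refused instead of looping forever */
    parsec_warning("comm thread core %d outside the valid range [0, %d); ignored",
                   core, parsec_hwloc_nb_real_cores());
}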

@abouteiller
Contributor

The following command produces the correct binding for testing (as witnessed by hwloc-ls, the binding is correctly restricted by mpiexec to the respective sockets).

PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 -whexane  "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output --oversubscribe -n 2 --bind-to socket --map-by socket --report-bindings hwloc-ls --restrict binding -c --no-io 
salloc: Granted job allocation 5788
[1,0]<stderr>:[hexane:3980861] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[1,1]<stderr>:[hexane:3980861] MCW rank 1 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[1,0]<stdout>:Machine (62GB total) cpuset=0x55555555
[1,0]<stdout>:  Package L#0 cpuset=0x55555555
[1,0]<stdout>:    NUMANode L#0 (P#0 31GB) cpuset=0x55555555
[1,0]<stdout>:    L3 L#0 (12MB) cpuset=0x55555555
[1,0]<stdout>:      L2 L#0 (1280KB) cpuset=0x00010001
[1,0]<stdout>:        L1d L#0 (48KB) cpuset=0x00010001
[1,0]<stdout>:          L1i L#0 (32KB) cpuset=0x00010001
[1,0]<stdout>:            Core L#0 cpuset=0x00010001
[1,0]<stdout>:              PU L#0 (P#0) cpuset=0x00000001
[1,0]<stdout>:              PU L#1 (P#16) cpuset=0x00010000
[1,0]<stdout>:      L2 L#1 (1280KB) cpuset=0x00040004
[1,0]<stdout>:        L1d L#1 (48KB) cpuset=0x00040004
[1,0]<stdout>:          L1i L#1 (32KB) cpuset=0x00040004
[1,0]<stdout>:            Core L#1 cpuset=0x00040004
[1,0]<stdout>:              PU L#2 (P#2) cpuset=0x00000004
[1,0]<stdout>:              PU L#3 (P#18) cpuset=0x00040000
[1,0]<stdout>:      L2 L#2 (1280KB) cpuset=0x00100010
[1,0]<stdout>:        L1d L#2 (48KB) cpuset=0x00100010
[1,0]<stdout>:          L1i L#2 (32KB) cpuset=0x00100010
[1,0]<stdout>:            Core L#2 cpuset=0x00100010
[1,0]<stdout>:              PU L#4 (P#4) cpuset=0x00000010
[1,0]<stdout>:              PU L#5 (P#20) cpuset=0x00100000
[1,0]<stdout>:      L2 L#3 (1280KB) cpuset=0x00400040
[1,0]<stdout>:        L1d L#3 (48KB) cpuset=0x00400040
[1,0]<stdout>:          L1i L#3 (32KB) cpuset=0x00400040
[1,0]<stdout>:            Core L#3 cpuset=0x00400040
[1,0]<stdout>:              PU L#6 (P#6) cpuset=0x00000040
[1,0]<stdout>:              PU L#7 (P#22) cpuset=0x00400000
[1,0]<stdout>:      L2 L#4 (1280KB) cpuset=0x01000100
[1,0]<stdout>:        L1d L#4 (48KB) cpuset=0x01000100
[1,0]<stdout>:          L1i L#4 (32KB) cpuset=0x01000100
[1,0]<stdout>:            Core L#4 cpuset=0x01000100
[1,0]<stdout>:              PU L#8 (P#8) cpuset=0x00000100
[1,0]<stdout>:              PU L#9 (P#24) cpuset=0x01000000
[1,0]<stdout>:      L2 L#5 (1280KB) cpuset=0x04000400
[1,0]<stdout>:        L1d L#5 (48KB) cpuset=0x04000400
[1,0]<stdout>:          L1i L#5 (32KB) cpuset=0x04000400
[1,0]<stdout>:            Core L#5 cpuset=0x04000400
[1,0]<stdout>:              PU L#10 (P#10) cpuset=0x00000400
[1,0]<stdout>:              PU L#11 (P#26) cpuset=0x04000000
[1,0]<stdout>:      L2 L#6 (1280KB) cpuset=0x10001000
[1,0]<stdout>:        L1d L#6 (48KB) cpuset=0x10001000
[1,0]<stdout>:          L1i L#6 (32KB) cpuset=0x10001000
[1,0]<stdout>:            Core L#6 cpuset=0x10001000
[1,0]<stdout>:              PU L#12 (P#12) cpuset=0x00001000
[1,0]<stdout>:              PU L#13 (P#28) cpuset=0x10000000
[1,0]<stdout>:      L2 L#7 (1280KB) cpuset=0x40004000
[1,0]<stdout>:        L1d L#7 (48KB) cpuset=0x40004000
[1,0]<stdout>:          L1i L#7 (32KB) cpuset=0x40004000
[1,0]<stdout>:            Core L#7 cpuset=0x40004000
[1,0]<stdout>:              PU L#14 (P#14) cpuset=0x00004000
[1,0]<stdout>:              PU L#15 (P#30) cpuset=0x40000000
[1,0]<stdout>:  Package L#1 cpuset=0x0
[1,0]<stdout>:    NUMANode L#1 (P#1 31GB) cpuset=0x0
[1,1]<stdout>:Machine (62GB total) cpuset=0xaaaaaaaa
[1,1]<stdout>:  Package L#0 cpuset=0xaaaaaaaa
[1,1]<stdout>:    NUMANode L#0 (P#1 31GB) cpuset=0xaaaaaaaa
[1,1]<stdout>:    L3 L#0 (12MB) cpuset=0xaaaaaaaa
[1,1]<stdout>:      L2 L#0 (1280KB) cpuset=0x00020002
[1,1]<stdout>:        L1d L#0 (48KB) cpuset=0x00020002
[1,1]<stdout>:          L1i L#0 (32KB) cpuset=0x00020002
[1,1]<stdout>:            Core L#0 cpuset=0x00020002
[1,1]<stdout>:              PU L#0 (P#1) cpuset=0x00000002
[1,1]<stdout>:              PU L#1 (P#17) cpuset=0x00020000
[1,1]<stdout>:      L2 L#1 (1280KB) cpuset=0x00080008
[1,1]<stdout>:        L1d L#1 (48KB) cpuset=0x00080008
[1,1]<stdout>:          L1i L#1 (32KB) cpuset=0x00080008
[1,1]<stdout>:            Core L#1 cpuset=0x00080008
[1,1]<stdout>:              PU L#2 (P#3) cpuset=0x00000008
[1,1]<stdout>:              PU L#3 (P#19) cpuset=0x00080000
[1,1]<stdout>:      L2 L#2 (1280KB) cpuset=0x00200020
[1,1]<stdout>:        L1d L#2 (48KB) cpuset=0x00200020
[1,1]<stdout>:          L1i L#2 (32KB) cpuset=0x00200020
[1,1]<stdout>:            Core L#2 cpuset=0x00200020
[1,1]<stdout>:              PU L#4 (P#5) cpuset=0x00000020
[1,1]<stdout>:              PU L#5 (P#21) cpuset=0x00200000
[1,1]<stdout>:      L2 L#3 (1280KB) cpuset=0x00800080
[1,1]<stdout>:        L1d L#3 (48KB) cpuset=0x00800080
[1,1]<stdout>:          L1i L#3 (32KB) cpuset=0x00800080
[1,1]<stdout>:            Core L#3 cpuset=0x00800080
[1,1]<stdout>:              PU L#6 (P#7) cpuset=0x00000080
[1,1]<stdout>:              PU L#7 (P#23) cpuset=0x00800000
[1,1]<stdout>:      L2 L#4 (1280KB) cpuset=0x02000200
[1,1]<stdout>:        L1d L#4 (48KB) cpuset=0x02000200
[1,1]<stdout>:          L1i L#4 (32KB) cpuset=0x02000200
[1,1]<stdout>:            Core L#4 cpuset=0x02000200
[1,1]<stdout>:              PU L#8 (P#9) cpuset=0x00000200
[1,1]<stdout>:              PU L#9 (P#25) cpuset=0x02000000
[1,1]<stdout>:      L2 L#5 (1280KB) cpuset=0x08000800
[1,1]<stdout>:        L1d L#5 (48KB) cpuset=0x08000800
[1,1]<stdout>:          L1i L#5 (32KB) cpuset=0x08000800
[1,1]<stdout>:            Core L#5 cpuset=0x08000800
[1,1]<stdout>:              PU L#10 (P#11) cpuset=0x00000800
[1,1]<stdout>:              PU L#11 (P#27) cpuset=0x08000000
[1,1]<stdout>:      L2 L#6 (1280KB) cpuset=0x20002000
[1,1]<stdout>:        L1d L#6 (48KB) cpuset=0x20002000
[1,1]<stdout>:          L1i L#6 (32KB) cpuset=0x20002000
[1,1]<stdout>:            Core L#6 cpuset=0x20002000
[1,1]<stdout>:              PU L#12 (P#13) cpuset=0x00002000
[1,1]<stdout>:              PU L#13 (P#29) cpuset=0x20000000
[1,1]<stdout>:      L2 L#7 (1280KB) cpuset=0x80008000
[1,1]<stdout>:        L1d L#7 (48KB) cpuset=0x80008000
[1,1]<stdout>:          L1i L#7 (32KB) cpuset=0x80008000
[1,1]<stdout>:            Core L#7 cpuset=0x80008000
[1,1]<stdout>:              PU L#14 (P#15) cpuset=0x00008000
[1,1]<stdout>:              PU L#15 (P#31) cpuset=0x80000000
[1,1]<stdout>:  Package L#1 cpuset=0x0
[1,1]<stdout>:    NUMANode L#1 (P#0 31GB) cpuset=0x0
salloc: Relinquishing job allocation 5788

The resulting binding in PaRSEC does not appear correct:

PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 -whexane  "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output --oversubscribe -n 2 --bind-to socket --map-by socket --report-bindings build.cuda/parsec/tests/api/init_fini --mca runtime_report_bindings 1
salloc: Granted job allocation 5790
[1,0]<stderr>:[hexane:3981005] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[1,1]<stderr>:[hexane:3981005] MCW rank 1 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[1,1]<stdout>:Process binding [rank 0]: cpuset [ALLOWED  ]: 0xaaaaaaaa
[1,1]<stdout>:Process binding [rank 0]: cpuset [USED     ]: 0x0000aaaa
[1,1]<stdout>:Process binding [rank 0]: cpuset [FREE     ]: 0xaaaa0000
[1,0]<stdout>:Process binding [rank 0]: cpuset [ALLOWED  ]: 0x55555555
[1,0]<stdout>:Process binding [rank 0]: cpuset [USED     ]: 0x00005555
[1,0]<stdout>:Process binding [rank 0]: cpuset [FREE     ]: 0x55550000
[1,0]<stderr>:W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
[1,0]<stderr>:i@00000 Virtual Process Map with 1 VPs...
[1,0]<stderr>:i@00000    Virtual Process of index 0 has 8 threads and logical cpuset 0x000000ff
[1,0]<stderr>:           physical cpuset 0x55555555
[1,0]<stderr>:i@00000     Thread 0 of VP 0 can be bound on logical cores 0x00000001 (physical cores 0x00010001)
[1,0]<stderr>:i@00000     Thread 1 of VP 0 can be bound on logical cores 0x00000002 (physical cores 0x00040004)
[1,0]<stderr>:i@00000     Thread 2 of VP 0 can be bound on logical cores 0x00000004 (physical cores 0x00100010)
[1,0]<stderr>:i@00000     Thread 3 of VP 0 can be bound on logical cores 0x00000008 (physical cores 0x00400040)
[1,0]<stderr>:i@00000     Thread 4 of VP 0 can be bound on logical cores 0x00000010 (physical cores 0x01000100)
[1,0]<stderr>:i@00000     Thread 5 of VP 0 can be bound on logical cores 0x00000020 (physical cores 0x04000400)
[1,0]<stderr>:i@00000     Thread 6 of VP 0 can be bound on logical cores 0x00000040 (physical cores 0x10001000)
[1,0]<stderr>:i@00000     Thread 7 of VP 0 can be bound on logical cores 0x00000080 (physical cores 0x40004000)
[1,1]<stderr>:i@00001 Virtual Process Map with 1 VPs...
[1,1]<stderr>:i@00001    Virtual Process of index 0 has 8 threads and logical cpuset 0x000000ff
[1,1]<stderr>:           physical cpuset 0x55555555
[1,1]<stderr>:i@00001     Thread 0 of VP 0 can be bound on logical cores 0x00000001 (physical cores 0x00010001)
[1,1]<stderr>:i@00001     Thread 1 of VP 0 can be bound on logical cores 0x00000002 (physical cores 0x00040004)
[1,1]<stderr>:i@00001     Thread 2 of VP 0 can be bound on logical cores 0x00000004 (physical cores 0x00100010)
[1,1]<stderr>:i@00001     Thread 3 of VP 0 can be bound on logical cores 0x00000008 (physical cores 0x00400040)
[1,1]<stderr>:i@00001     Thread 4 of VP 0 can be bound on logical cores 0x00000010 (physical cores 0x01000100)
[1,1]<stderr>:i@00001     Thread 5 of VP 0 can be bound on logical cores 0x00000020 (physical cores 0x04000400)
[1,1]<stderr>:i@00001     Thread 6 of VP 0 can be bound on logical cores 0x00000040 (physical cores 0x10001000)
[1,1]<stderr>:i@00001     Thread 7 of VP 0 can be bound on logical cores 0x00000080 (physical cores 0x40004000)
salloc: Relinquishing job allocation 5790

@abouteiller
Contributor

abouteiller commented Mar 7, 2024

It looks like the message above is partially misleading: we initialize the vpmap before we extract the ALLOWED mask, so vpmap_from_flat initializes something that does not match the real final binding. This is a problem because using VPs will behave differently than not using VPs, and the reported output is misleading; however, the actual binding produced should be correct when not using VPs.

A potential solution is to move the initialization of the ALLOWED_MASK earlier in parsec_init, and use the ALLOWED mask the same way we use it in the parse_binding_parameters function to restrict vpmap creation; however, that may be too restrictive for some vpmaps built from a file or otherwise. Thoughts?
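
Concretely, the proposed ordering could look like this (a sketch with assumed loop bounds, building on the parsec_vpmap[vp].threads[].cpuset fields named elsewhere in this thread; not the PR's code):

hwloc_bitmap_t allowed = hwloc_bitmap_alloc();
/* 1. extract the ALLOWED mask before any vpmap is built */
hwloc_get_cpubind(topology, allowed, HWLOC_CPUBIND_PROCESS);
/* 2. then clamp every per-thread cpuset of the vpmap to that mask
 *    (nb_vp and nb_threads_in_vp() are assumed names) */
for (int vp = 0; vp < nb_vp; vp++)
    for (int t = 0; t < nb_threads_in_vp(vp); t++)
        hwloc_bitmap_and(parsec_vpmap[vp].threads[t].cpuset,
                         parsec_vpmap[vp].threads[t].cpuset, allowed);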

@abouteiller
Contributor

Looks like the inherited binding is correct (aside from the vpmap problem above): comparing --bind-to none -c 8 vs --bind-to socket, the second case yields double the performance, uses 16 hardware cores, and does not emit the message about running oversubscribed.

 PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 --ntasks-per-node=2 -whexane  "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output -n 2 --bind-to none --map-by socket --report-bindings build.cuda/tests/testing_dpotrf -g 0 -N 30000 -x -v=4 -t 384 -c 8

[1,1]<stderr>:W@00001 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
[1,1]<stderr>:  This is often unintentional, and will perform poorly.
[1,1]<stderr>:  Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
[1,1]<stderr>:  and hide the real binding from PaRSEC; if you verified that the binding is correct,
[1,1]<stderr>:  this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
[1,0]<stderr>:#+++++ cores detected       : 8
[1,0]<stderr>:#+++++ nodes x cores + gpu  : 2 x 8 + 0 (16+0)
[1,0]<stdout>:[****] TIME(s)     31.77802 : dpotrf      PxQxg=   2 1   0 NB=  384 N=   30000 :     283.228825 gflops - ENQ&PROG&DEST     31.80433 :     282.994460 gflops - ENQ      0.02591 - DEST      0.00040

...

 PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 --ntasks-per-node=2 -whexane  "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output -n 2 --bind-to socket --map-by socket --report-bindings build.cuda/tests/testing_dpotrf -g 0 -N 30000 -x -v=4 -t 384
[1,0]<stderr>:#+++++ cores detected       : 8
[1,0]<stderr>:#+++++ nodes x cores + gpu  : 2 x 8 + 0 (16+0)
[1,0]<stdout>:[****] TIME(s)     19.26431 : dpotrf      PxQxg=   2 1   0 NB=  384 N=   30000 :     467.208496 gflops - ENQ&PROG&DEST     19.26601 :     467.167396 gflops - ENQ      0.00008 - DEST      0.00161

@abouteiller
Contributor

abouteiller commented Mar 7, 2024

vpmap initialization creates and fills parsec_vpmap[vp].threads[t+ht].cpuset = HWLOC_ALLOC(); with all sorts of intricate things (that do not abide by the restricted mask), but these are write-only variables.

At this point I propose we merge this PR with the broken vpmap and create a tracking issue to fix it later in v4.1. Poll below.

I will excise the rework on the vpmap, merge the rework that is effective in the flat case, and defer completion of complex vpmap process binding to v4.1.

write only atm. Use the bindthread to print the actual binding effected, if
requested.
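
A sketch of what that reporting could look like from inside a worker thread, once it has bound itself (plain hwloc, with topology, vp and t assumed to be in scope; not the PR's code):

hwloc_bitmap_t actual = hwloc_bitmap_alloc();
char *str = NULL;
/* read back the binding actually in effect for the calling thread */
if (0 == hwloc_get_cpubind(topology, actual, HWLOC_CPUBIND_THREAD)) {
    hwloc_bitmap_asprintf(&str, actual);
    printf("thread %d of VP %d effectively bound to %s\n", t, vp, str);
    free(str);
}
hwloc_bitmap_free(actual);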

Labels

blocker Blocking release or critical use case


Development

Successfully merging this pull request may close these issues.

Thread pinning issues running on Cray with Open MPI?

5 participants