
Conversation

@bosilca
Contributor

@bosilca bosilca commented Apr 5, 2023

Rework the bindings. The main idea is to inherit the bindings from the batch scheduler, and then work from there.
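
For reference, the "inherit" step can be expressed directly with hwloc: read the cpuset the launcher (batch scheduler, mpirun, or a cgroup) has already imposed on the process and use it as the starting point. A minimal standalone sketch, not the PR's code; the fallback to the machine-wide allowed cpuset is my assumption:

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t inherited = hwloc_bitmap_alloc();
    char *str = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Read the cpuset the batch scheduler / mpirun already applied to this process. */
    if (hwloc_get_cpubind(topo, inherited, HWLOC_CPUBIND_PROCESS) != 0) {
        /* No external binding readable: fall back to everything allowed on the machine. */
        hwloc_bitmap_copy(inherited, hwloc_topology_get_allowed_cpuset(topo));
    }

    hwloc_bitmap_asprintf(&str, inherited);
    printf("inherited cpuset: %s\n", str);

    free(str);
    hwloc_bitmap_free(inherited);
    hwloc_topology_destroy(topo);
    return 0;
}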

@bosilca bosilca added the blocker Blocking release or critical use case label Apr 5, 2023
@bosilca bosilca added this to the v4.0 milestone Apr 5, 2023
@bosilca bosilca self-assigned this Apr 5, 2023
@bosilca bosilca requested a review from a team as a code owner April 5, 2023 18:52
@bosilca bosilca marked this pull request as draft April 5, 2023 18:53
Contributor

@devreal devreal left a comment


Much needed 👍 two comments inline, otherwise LGTM

@bosilca bosilca force-pushed the topic/revised_binding branch 2 times, most recently from d9006b7 to f5af19f Compare April 6, 2023 21:19
@therault
Contributor

I'm trying it on leconte.

mpirun -np 2 --map-by socket hwloc-info --restrict binding package:0
Package L#0
 [...]
 cpuset = 0x0000ffff,0xf00000ff,0xfff00000
 complete cpuset = 0x0000ffff,0xf00000ff,0xfff00000
 allowed cpuset = 0x0000ffff,0xf00000ff,0xfff00000
 nodeset = 0x00000002
 complete nodeset = 0x00000002
 allowed nodeset = 0x00000002
 [...]
Package L#0
 [...]
 cpuset = 0x0fffff00,0x000fffff
 complete cpuset = 0x0fffff00,0x000fffff
 allowed cpuset = 0x0fffff00,0x000fffff
 nodeset = 0x00000001
 complete nodeset = 0x00000001
 allowed nodeset = 0x00000001
  [...]

I interpret this as: when running mpirun -np 2 --map-by socket on this machine (with module load openmpi), there are two processes, and rank 0 should use a different set of cores than rank 1 (the cpusets look complicated, but they do appear mutually exclusive).
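
As a quick sanity check of that reading, the two cpusets above can be compared with hwloc's bitmap API; a standalone sketch (not part of this PR), with the hex strings copied from the output above:

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_bitmap_t a = hwloc_bitmap_alloc(), b = hwloc_bitmap_alloc();
    hwloc_bitmap_sscanf(a, "0x0000ffff,0xf00000ff,0xfff00000");  /* first block above  */
    hwloc_bitmap_sscanf(b, "0x0fffff00,0x000fffff");             /* second block above */
    printf("cpusets overlap: %s\n", hwloc_bitmap_intersects(a, b) ? "yes" : "no");
    hwloc_bitmap_free(a);
    hwloc_bitmap_free(b);
    return 0;
}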

Now, I run a parsec test with this PR:

mpirun -np 2 --map-by socket ./tests/apps/stencil/testing_stencil_1D -M 40960 -N 40960 -t 16 -T 16 -P 2
Process binding [rank 0]: cpuset [ALLOWED  ]: 0x0fffff00,0x000fffff
Process binding [rank 0]: cpuset [USED     ]: 0x000fffff
Process binding [rank 0]: cpuset [FREE     ]: 0x0fffff00,0x0
W@00000 parsec_hwloc: couldn't bind to mask cpuset  0x0
Process binding [rank 0]: cpuset [ALLOWED  ]: 0x0000ffff,0xf00000ff,0xfff00000
Process binding [rank 0]: cpuset [USED     ]: 0x000000ff,0xfff00000
Process binding [rank 0]: cpuset [FREE     ]: 0x0000ffff,0xf0000000,0x0
W@00001 parsec_hwloc: couldn't bind to mask cpuset  0x0
i@00000 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
	Parsec Streams     : 20
	clockRate (GHz)    : 2.20
	peak Gflops        : double 176.0000, single 352.0000
i@00001 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
	Parsec Streams     : 20
	clockRate (GHz)    : 2.20
	peak Gflops        : double 176.0000, single 352.0000
i@00000 Virtual Process Map with 1 VPs...
i@00000    Virtual Process of index 0 has 20 threads and cpuset 0x000fffff
i@00000     Thread 0 of VP 0 can be bound on cores 0x00000001
i@00000     Thread 1 of VP 0 can be bound on cores 0x00000002
i@00000     Thread 2 of VP 0 can be bound on cores 0x00000004
i@00000     Thread 3 of VP 0 can be bound on cores 0x00000008
i@00000     Thread 4 of VP 0 can be bound on cores 0x00000010
i@00000     Thread 5 of VP 0 can be bound on cores 0x00000020
i@00000     Thread 6 of VP 0 can be bound on cores 0x00000040
i@00000     Thread 7 of VP 0 can be bound on cores 0x00000080
i@00000     Thread 8 of VP 0 can be bound on cores 0x00000100
i@00000     Thread 9 of VP 0 can be bound on cores 0x00000200
i@00000     Thread 10 of VP 0 can be bound on cores 0x00000400
i@00000     Thread 11 of VP 0 can be bound on cores 0x00000800
i@00000     Thread 12 of VP 0 can be bound on cores 0x00001000
i@00000     Thread 13 of VP 0 can be bound on cores 0x00002000
i@00000     Thread 14 of VP 0 can be bound on cores 0x00004000
i@00000     Thread 15 of VP 0 can be bound on cores 0x00008000
i@00000     Thread 16 of VP 0 can be bound on cores 0x00010000
i@00000     Thread 17 of VP 0 can be bound on cores 0x00020000
i@00000     Thread 18 of VP 0 can be bound on cores 0x00040000
i@00000     Thread 19 of VP 0 can be bound on cores 0x00080000
i@00001 Virtual Process Map with 1 VPs...
i@00001    Virtual Process of index 0 has 20 threads and cpuset 0x000fffff
i@00001     Thread 0 of VP 0 can be bound on cores 0x00000001
i@00001     Thread 1 of VP 0 can be bound on cores 0x00000002
i@00001     Thread 2 of VP 0 can be bound on cores 0x00000004
i@00001     Thread 3 of VP 0 can be bound on cores 0x00000008
i@00001     Thread 4 of VP 0 can be bound on cores 0x00000010
i@00001     Thread 5 of VP 0 can be bound on cores 0x00000020
i@00001     Thread 6 of VP 0 can be bound on cores 0x00000040
i@00001     Thread 7 of VP 0 can be bound on cores 0x00000080
i@00001     Thread 8 of VP 0 can be bound on cores 0x00000100
i@00001     Thread 9 of VP 0 can be bound on cores 0x00000200
i@00001     Thread 10 of VP 0 can be bound on cores 0x00000400
i@00001     Thread 11 of VP 0 can be bound on cores 0x00000800
i@00001     Thread 12 of VP 0 can be bound on cores 0x00001000
i@00001     Thread 13 of VP 0 can be bound on cores 0x00002000
i@00001     Thread 14 of VP 0 can be bound on cores 0x00004000
i@00001     Thread 15 of VP 0 can be bound on cores 0x00008000
i@00001     Thread 16 of VP 0 can be bound on cores 0x00010000
i@00001     Thread 17 of VP 0 can be bound on cores 0x00020000
i@00001     Thread 18 of VP 0 can be bound on cores 0x00040000
i@00001     Thread 19 of VP 0 can be bound on cores 0x00080000

From this output, it looks like the two ranks are sharing the same cores?

Running htop in a parallel terminal shows that only cores 0, 1, 13, 20, 21, 40, 41 and 65 have work to do (plus a bit on 77 and 46 sometimes); definitely not all cores are active, and the run is pretty slow.

Also, I tried to rebase the PR on the current master, but there are conflicts I'm not sure how to solve.

@bosilca
Contributor Author

bosilca commented Aug 30, 2023

I was not able to find a way to translate between relative and absolute core numbering, so the reported bindings are relative to the allowed procs, and not absolute (as one would expect). Let me look again at the documentation to see if there is a way.
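
For reference, hwloc's cpuset helpers can express such a translation: the i-th PU inside the allowed cpuset gives the relative index, and its os_index member is the absolute (physical) number. A sketch, assuming topo is the loaded topology and allowed is the cpuset returned by hwloc_get_cpubind():

/* Map the i-th allowed PU (relative index) to its absolute OS index. */
unsigned relative_idx = 3; /* example: fourth PU in the allowed set */
hwloc_obj_t pu = hwloc_get_obj_inside_cpuset_by_type(topo, allowed,
                                                     HWLOC_OBJ_PU, relative_idx);
if (NULL != pu)
    printf("relative PU %u -> physical P#%u (machine-wide logical L#%u)\n",
           relative_idx, pu->os_index, pu->logical_index);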

Follow the process binding provided by the batch scheduler or process manager.
If the application is started with mpirun, follow the bindings provided by the
mpirun command.

Add an MCA parameter (runtime_report_bindings) to report the final bindings for
each virtual process and thread. Report the binding in both logical and
physical notation.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This option allows each thread to run on a single physical resource. The
singlification can be done early (for negative values of the MCA parameter), in
which case the threads are packed onto the resources, or late (for positive
values of the MCA parameter), in which case the threads are spread across the
resources.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
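
A sketch of the pack-vs-spread behavior described in the commit message above (illustration only, using hwloc directly rather than the PR's internals; topo, allowed, nbth and pack are assumed to be set up by the caller):

int nb_pu = hwloc_get_nbobjs_inside_cpuset_by_type(topo, allowed, HWLOC_OBJ_PU);
for (int t = 0; t < nbth; t++) {
    int slot = pack ? (t % nb_pu)                    /* early: pack threads onto the first PUs */
                    : ((t * nb_pu) / nbth) % nb_pu;  /* late: spread threads across all PUs    */
    hwloc_obj_t pu = hwloc_get_obj_inside_cpuset_by_type(topo, allowed, HWLOC_OBJ_PU, slot);
    /* each thread would then bind itself to pu->cpuset, e.g. with hwloc_set_cpubind() */
}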
@bosilca bosilca force-pushed the topic/revised_binding branch from d6334a6 to be2ee69 Compare October 12, 2023 22:50
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@evaleev
Contributor

evaleev commented Nov 1, 2023

@bosilca ping ... need it badly :)

@evaleev
Contributor

evaleev commented Dec 21, 2023

@bosilca ping again .. real showstopper

@abouteiller abouteiller self-assigned this Feb 15, 2024
@abouteiller
Contributor

doing the merge now

static int parsec_nb_total_threads = 0;
static int parse_binding_parameter(int vp, int nbth, char * binding);

int vpmap_get_nb_total_threads(void)
Contributor


This is used by dplasma and is removed without replacement. The new parsec_context_query(parsec_context_t*) could be used instead, but we don't have the parsec context at the caller site. I substituted it with parsec_vpmap_get_vp_thread(0) in dplasma, as that was probably the intent anyway (assuming all VPs have the same number of threads). Should we reintroduce the function?
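
If it is reintroduced, one way to avoid the symmetry assumption would be to sum the per-VP counts; a sketch only, where both accessor names are hypothetical stand-ins for whatever the new API exposes:

int vpmap_get_nb_total_threads(void)
{
    int total = 0;
    /* parsec_vpmap_get_nb_vp() and parsec_vpmap_get_vp_threads() are assumed names */
    for (int vp = 0; vp < parsec_vpmap_get_nb_vp(); vp++)
        total += parsec_vpmap_get_vp_threads(vp);
    return total;
}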

parsec/parsec.c Outdated
int core = atoi(option);
if( (core > -1) && (core < parsec_hwloc_nb_real_cores()) )
    context->comm_th_core = core;
/* negative core allowed to force an absolute core selection */
Contributor


This is not true; passing a negative value to find_core_by_idx will cause an infinite loop.

Contributor


@bosilca is the comment wrong or the code wrong?
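
If find_core_by_idx really cannot handle negative values, one possible resolution is to reject them explicitly and drop the stale comment; a sketch of that direction, not the PR's actual fix (parsec_warning is assumed to be the usual PaRSEC logging helper):

int core = atoi(option);
if( (core >= 0) && (core < parsec_hwloc_nb_real_cores()) ) {
    context->comm_th_core = core;
} else {
    /* negative or out-of-range values are refused instead of looping forever */
    parsec_warning("comm thread core %d outside the valid range [0, %d); ignored",
                   core, parsec_hwloc_nb_real_cores());
}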

@abouteiller
Contributor

The following command produces the correct binding for testing (as witnessed by hwloc-ls, the binding is correctly restricted by mpiexec to the respective sockets).

PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 -whexane  "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output --oversubscribe -n 2 --bind-to socket --map-by socket --report-bindings hwloc-ls --restrict binding -c --no-io 
salloc: Granted job allocation 5788
[1,0]<stderr>:[hexane:3980861] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[1,1]<stderr>:[hexane:3980861] MCW rank 1 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[1,0]<stdout>:Machine (62GB total) cpuset=0x55555555
[1,0]<stdout>:  Package L#0 cpuset=0x55555555
[1,0]<stdout>:    NUMANode L#0 (P#0 31GB) cpuset=0x55555555
[1,0]<stdout>:    L3 L#0 (12MB) cpuset=0x55555555
[1,0]<stdout>:      L2 L#0 (1280KB) cpuset=0x00010001
[1,0]<stdout>:        L1d L#0 (48KB) cpuset=0x00010001
[1,0]<stdout>:          L1i L#0 (32KB) cpuset=0x00010001
[1,0]<stdout>:            Core L#0 cpuset=0x00010001
[1,0]<stdout>:              PU L#0 (P#0) cpuset=0x00000001
[1,0]<stdout>:              PU L#1 (P#16) cpuset=0x00010000
[1,0]<stdout>:      L2 L#1 (1280KB) cpuset=0x00040004
[1,0]<stdout>:        L1d L#1 (48KB) cpuset=0x00040004
[1,0]<stdout>:          L1i L#1 (32KB) cpuset=0x00040004
[1,0]<stdout>:            Core L#1 cpuset=0x00040004
[1,0]<stdout>:              PU L#2 (P#2) cpuset=0x00000004
[1,0]<stdout>:              PU L#3 (P#18) cpuset=0x00040000
[1,0]<stdout>:      L2 L#2 (1280KB) cpuset=0x00100010
[1,0]<stdout>:        L1d L#2 (48KB) cpuset=0x00100010
[1,0]<stdout>:          L1i L#2 (32KB) cpuset=0x00100010
[1,0]<stdout>:            Core L#2 cpuset=0x00100010
[1,0]<stdout>:              PU L#4 (P#4) cpuset=0x00000010
[1,0]<stdout>:              PU L#5 (P#20) cpuset=0x00100000
[1,0]<stdout>:      L2 L#3 (1280KB) cpuset=0x00400040
[1,0]<stdout>:        L1d L#3 (48KB) cpuset=0x00400040
[1,0]<stdout>:          L1i L#3 (32KB) cpuset=0x00400040
[1,0]<stdout>:            Core L#3 cpuset=0x00400040
[1,0]<stdout>:              PU L#6 (P#6) cpuset=0x00000040
[1,0]<stdout>:              PU L#7 (P#22) cpuset=0x00400000
[1,0]<stdout>:      L2 L#4 (1280KB) cpuset=0x01000100
[1,0]<stdout>:        L1d L#4 (48KB) cpuset=0x01000100
[1,0]<stdout>:          L1i L#4 (32KB) cpuset=0x01000100
[1,0]<stdout>:            Core L#4 cpuset=0x01000100
[1,0]<stdout>:              PU L#8 (P#8) cpuset=0x00000100
[1,0]<stdout>:              PU L#9 (P#24) cpuset=0x01000000
[1,0]<stdout>:      L2 L#5 (1280KB) cpuset=0x04000400
[1,0]<stdout>:        L1d L#5 (48KB) cpuset=0x04000400
[1,0]<stdout>:          L1i L#5 (32KB) cpuset=0x04000400
[1,0]<stdout>:            Core L#5 cpuset=0x04000400
[1,0]<stdout>:              PU L#10 (P#10) cpuset=0x00000400
[1,0]<stdout>:              PU L#11 (P#26) cpuset=0x04000000
[1,0]<stdout>:      L2 L#6 (1280KB) cpuset=0x10001000
[1,0]<stdout>:        L1d L#6 (48KB) cpuset=0x10001000
[1,0]<stdout>:          L1i L#6 (32KB) cpuset=0x10001000
[1,0]<stdout>:            Core L#6 cpuset=0x10001000
[1,0]<stdout>:              PU L#12 (P#12) cpuset=0x00001000
[1,0]<stdout>:              PU L#13 (P#28) cpuset=0x10000000
[1,0]<stdout>:      L2 L#7 (1280KB) cpuset=0x40004000
[1,0]<stdout>:        L1d L#7 (48KB) cpuset=0x40004000
[1,0]<stdout>:          L1i L#7 (32KB) cpuset=0x40004000
[1,0]<stdout>:            Core L#7 cpuset=0x40004000
[1,0]<stdout>:              PU L#14 (P#14) cpuset=0x00004000
[1,0]<stdout>:              PU L#15 (P#30) cpuset=0x40000000
[1,0]<stdout>:  Package L#1 cpuset=0x0
[1,0]<stdout>:    NUMANode L#1 (P#1 31GB) cpuset=0x0
[1,1]<stdout>:Machine (62GB total) cpuset=0xaaaaaaaa
[1,1]<stdout>:  Package L#0 cpuset=0xaaaaaaaa
[1,1]<stdout>:    NUMANode L#0 (P#1 31GB) cpuset=0xaaaaaaaa
[1,1]<stdout>:    L3 L#0 (12MB) cpuset=0xaaaaaaaa
[1,1]<stdout>:      L2 L#0 (1280KB) cpuset=0x00020002
[1,1]<stdout>:        L1d L#0 (48KB) cpuset=0x00020002
[1,1]<stdout>:          L1i L#0 (32KB) cpuset=0x00020002
[1,1]<stdout>:            Core L#0 cpuset=0x00020002
[1,1]<stdout>:              PU L#0 (P#1) cpuset=0x00000002
[1,1]<stdout>:              PU L#1 (P#17) cpuset=0x00020000
[1,1]<stdout>:      L2 L#1 (1280KB) cpuset=0x00080008
[1,1]<stdout>:        L1d L#1 (48KB) cpuset=0x00080008
[1,1]<stdout>:          L1i L#1 (32KB) cpuset=0x00080008
[1,1]<stdout>:            Core L#1 cpuset=0x00080008
[1,1]<stdout>:              PU L#2 (P#3) cpuset=0x00000008
[1,1]<stdout>:              PU L#3 (P#19) cpuset=0x00080000
[1,1]<stdout>:      L2 L#2 (1280KB) cpuset=0x00200020
[1,1]<stdout>:        L1d L#2 (48KB) cpuset=0x00200020
[1,1]<stdout>:          L1i L#2 (32KB) cpuset=0x00200020
[1,1]<stdout>:            Core L#2 cpuset=0x00200020
[1,1]<stdout>:              PU L#4 (P#5) cpuset=0x00000020
[1,1]<stdout>:              PU L#5 (P#21) cpuset=0x00200000
[1,1]<stdout>:      L2 L#3 (1280KB) cpuset=0x00800080
[1,1]<stdout>:        L1d L#3 (48KB) cpuset=0x00800080
[1,1]<stdout>:          L1i L#3 (32KB) cpuset=0x00800080
[1,1]<stdout>:            Core L#3 cpuset=0x00800080
[1,1]<stdout>:              PU L#6 (P#7) cpuset=0x00000080
[1,1]<stdout>:              PU L#7 (P#23) cpuset=0x00800000
[1,1]<stdout>:      L2 L#4 (1280KB) cpuset=0x02000200
[1,1]<stdout>:        L1d L#4 (48KB) cpuset=0x02000200
[1,1]<stdout>:          L1i L#4 (32KB) cpuset=0x02000200
[1,1]<stdout>:            Core L#4 cpuset=0x02000200
[1,1]<stdout>:              PU L#8 (P#9) cpuset=0x00000200
[1,1]<stdout>:              PU L#9 (P#25) cpuset=0x02000000
[1,1]<stdout>:      L2 L#5 (1280KB) cpuset=0x08000800
[1,1]<stdout>:        L1d L#5 (48KB) cpuset=0x08000800
[1,1]<stdout>:          L1i L#5 (32KB) cpuset=0x08000800
[1,1]<stdout>:            Core L#5 cpuset=0x08000800
[1,1]<stdout>:              PU L#10 (P#11) cpuset=0x00000800
[1,1]<stdout>:              PU L#11 (P#27) cpuset=0x08000000
[1,1]<stdout>:      L2 L#6 (1280KB) cpuset=0x20002000
[1,1]<stdout>:        L1d L#6 (48KB) cpuset=0x20002000
[1,1]<stdout>:          L1i L#6 (32KB) cpuset=0x20002000
[1,1]<stdout>:            Core L#6 cpuset=0x20002000
[1,1]<stdout>:              PU L#12 (P#13) cpuset=0x00002000
[1,1]<stdout>:              PU L#13 (P#29) cpuset=0x20000000
[1,1]<stdout>:      L2 L#7 (1280KB) cpuset=0x80008000
[1,1]<stdout>:        L1d L#7 (48KB) cpuset=0x80008000
[1,1]<stdout>:          L1i L#7 (32KB) cpuset=0x80008000
[1,1]<stdout>:            Core L#7 cpuset=0x80008000
[1,1]<stdout>:              PU L#14 (P#15) cpuset=0x00008000
[1,1]<stdout>:              PU L#15 (P#31) cpuset=0x80000000
[1,1]<stdout>:  Package L#1 cpuset=0x0
[1,1]<stdout>:    NUMANode L#1 (P#0 31GB) cpuset=0x0
salloc: Relinquishing job allocation 5788

The resulting binding in PaRSEC does not appear correct:

PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 -whexane  "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output --oversubscribe -n 2 --bind-to socket --map-by socket --report-bindings build.cuda/parsec/tests/api/init_fini --mca runtime_report_bindings 1
salloc: Granted job allocation 5790
[1,0]<stderr>:[hexane:3981005] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[1,1]<stderr>:[hexane:3981005] MCW rank 1 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[1,1]<stdout>:Process binding [rank 0]: cpuset [ALLOWED  ]: 0xaaaaaaaa
[1,1]<stdout>:Process binding [rank 0]: cpuset [USED     ]: 0x0000aaaa
[1,1]<stdout>:Process binding [rank 0]: cpuset [FREE     ]: 0xaaaa0000
[1,0]<stdout>:Process binding [rank 0]: cpuset [ALLOWED  ]: 0x55555555
[1,0]<stdout>:Process binding [rank 0]: cpuset [USED     ]: 0x00005555
[1,0]<stdout>:Process binding [rank 0]: cpuset [FREE     ]: 0x55550000
[1,0]<stderr>:W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
[1,0]<stderr>:i@00000 Virtual Process Map with 1 VPs...
[1,0]<stderr>:i@00000    Virtual Process of index 0 has 8 threads and logical cpuset 0x000000ff
[1,0]<stderr>:           physical cpuset 0x55555555
[1,0]<stderr>:i@00000     Thread 0 of VP 0 can be bound on logical cores 0x00000001 (physical cores 0x00010001)
[1,0]<stderr>:i@00000     Thread 1 of VP 0 can be bound on logical cores 0x00000002 (physical cores 0x00040004)
[1,0]<stderr>:i@00000     Thread 2 of VP 0 can be bound on logical cores 0x00000004 (physical cores 0x00100010)
[1,0]<stderr>:i@00000     Thread 3 of VP 0 can be bound on logical cores 0x00000008 (physical cores 0x00400040)
[1,0]<stderr>:i@00000     Thread 4 of VP 0 can be bound on logical cores 0x00000010 (physical cores 0x01000100)
[1,0]<stderr>:i@00000     Thread 5 of VP 0 can be bound on logical cores 0x00000020 (physical cores 0x04000400)
[1,0]<stderr>:i@00000     Thread 6 of VP 0 can be bound on logical cores 0x00000040 (physical cores 0x10001000)
[1,0]<stderr>:i@00000     Thread 7 of VP 0 can be bound on logical cores 0x00000080 (physical cores 0x40004000)
[1,1]<stderr>:i@00001 Virtual Process Map with 1 VPs...
[1,1]<stderr>:i@00001    Virtual Process of index 0 has 8 threads and logical cpuset 0x000000ff
[1,1]<stderr>:           physical cpuset 0x55555555
[1,1]<stderr>:i@00001     Thread 0 of VP 0 can be bound on logical cores 0x00000001 (physical cores 0x00010001)
[1,1]<stderr>:i@00001     Thread 1 of VP 0 can be bound on logical cores 0x00000002 (physical cores 0x00040004)
[1,1]<stderr>:i@00001     Thread 2 of VP 0 can be bound on logical cores 0x00000004 (physical cores 0x00100010)
[1,1]<stderr>:i@00001     Thread 3 of VP 0 can be bound on logical cores 0x00000008 (physical cores 0x00400040)
[1,1]<stderr>:i@00001     Thread 4 of VP 0 can be bound on logical cores 0x00000010 (physical cores 0x01000100)
[1,1]<stderr>:i@00001     Thread 5 of VP 0 can be bound on logical cores 0x00000020 (physical cores 0x04000400)
[1,1]<stderr>:i@00001     Thread 6 of VP 0 can be bound on logical cores 0x00000040 (physical cores 0x10001000)
[1,1]<stderr>:i@00001     Thread 7 of VP 0 can be bound on logical cores 0x00000080 (physical cores 0x40004000)
salloc: Relinquishing job allocation 5790

@abouteiller
Contributor

abouteiller commented Mar 7, 2024

It looks like the message above is partially misleading: we initialize the vpmap before we extract the ALLOWED mask, so vpmap_from_flat initializes something that does not match the real final binding. This is a problem because using VPs will behave differently than not using VPs, and the reported output is misleading; however, the actual binding produced should be correct when not using VPs.

A potential solution is to move the initialization of the ALLOWED_MASK earlier in parsec_init, and use the ALLOWED mask the same way we use it in the parse_binding_parameters function to restrict vpmap creation; however, that may be too restrictive for some vpmaps built from a file or otherwise. Thoughts?
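
Concretely, the proposed ordering could look like this (a sketch with assumed loop bounds, building on the parsec_vpmap[vp].threads[].cpuset fields named elsewhere in this thread; not the PR's code):

hwloc_bitmap_t allowed = hwloc_bitmap_alloc();
/* 1. extract the ALLOWED mask before any vpmap is built */
hwloc_get_cpubind(topology, allowed, HWLOC_CPUBIND_PROCESS);
/* 2. then clamp every per-thread cpuset of the vpmap to that mask
 *    (nb_vp and nb_threads_in_vp() are assumed names) */
for (int vp = 0; vp < nb_vp; vp++)
    for (int t = 0; t < nb_threads_in_vp(vp); t++)
        hwloc_bitmap_and(parsec_vpmap[vp].threads[t].cpuset,
                         parsec_vpmap[vp].threads[t].cpuset, allowed);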

@abouteiller
Contributor

Looks like the inherited binding is correct (aside from the vpmap problem above): comparing --bind-to none -c 8 vs --bind-to socket, the second case yields double the performance, uses 16 hardware cores, and does not emit the message about running oversubscribed.

 PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 --ntasks-per-node=2 -whexane  "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output -n 2 --bind-to none --map-by socket --report-bindings build.cuda/tests/testing_dpotrf -g 0 -N 30000 -x -v=4 -t 384 -c 8

[1,1]<stderr>:W@00001 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
[1,1]<stderr>:  This is often unintentional, and will perform poorly.
[1,1]<stderr>:  Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
[1,1]<stderr>:  and hide the real binding from PaRSEC; if you verified that the binding is correct,
[1,1]<stderr>:  this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
[1,0]<stderr>:#+++++ cores detected       : 8
[1,0]<stderr>:#+++++ nodes x cores + gpu  : 2 x 8 + 0 (16+0)
[1,0]<stdout>:[****] TIME(s)     31.77802 : dpotrf      PxQxg=   2 1   0 NB=  384 N=   30000 :     283.228825 gflops - ENQ&PROG&DEST     31.80433 :     282.994460 gflops - ENQ      0.02591 - DEST      0.00040

...

 PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 --ntasks-per-node=2 -whexane  "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output -n 2 --bind-to socket --map-by socket --report-bindings build.cuda/tests/testing_dpotrf -g 0 -N 30000 -x -v=4 -t 384
[1,0]<stderr>:#+++++ cores detected       : 8
[1,0]<stderr>:#+++++ nodes x cores + gpu  : 2 x 8 + 0 (16+0)
[1,0]<stdout>:[****] TIME(s)     19.26431 : dpotrf      PxQxg=   2 1   0 NB=  384 N=   30000 :     467.208496 gflops - ENQ&PROG&DEST     19.26601 :     467.167396 gflops - ENQ      0.00008 - DEST      0.00161

@abouteiller
Contributor

abouteiller commented Mar 7, 2024

vpmap initialization creates and fills parsec_vpmap[vp].threads[t+ht].cpuset = HWLOC_ALLOC(); with all sorts of intricate things (that do not abide by the restricted mask), but these are write-only variables.

At this point I propose we merge this PR with the broken vpmap and create a tracking issue to fix it later in v4.1. Poll below.

I will excise the rework on the vpmap, merge the rework that is effective in the flat case, and defer completion of complex vpmap process binding to v4.1.

write only atm. Use the bindthread to print the actual binding effected, if
requested.
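
A sketch of what that reporting could look like from inside a worker thread, once it has bound itself (plain hwloc, with topology, vp and t assumed to be in scope; not the PR's code):

hwloc_bitmap_t actual = hwloc_bitmap_alloc();
char *str = NULL;
/* read back the binding actually in effect for the calling thread */
if (0 == hwloc_get_cpubind(topology, actual, HWLOC_CPUBIND_THREAD)) {
    hwloc_bitmap_asprintf(&str, actual);
    printf("thread %d of VP %d effectively bound to %s\n", t, vp, str);
    free(str);
}
hwloc_bitmap_free(actual);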

Labels

blocker Blocking release or critical use case


Development

Successfully merging this pull request may close these issues.

Thread pinning issues running on Cray with Open MPI?

5 participants