Environment:
- OS: CentOS Linux 7.9.2009 (kernel 3.10.0-1160.119.1.el7.x86_64)
- MPI implementation: MPICH 3.3.2 (built with --with-device=ch3:sock)
- MANA commit: current master (private copy from https://github.com/mpickpt/mana)
- Build used the provided mpicc_mana wrapper to compile apps.
Problem:
A simple two-process MPI hello-world program launches and checkpoints successfully using mana_launch and mana_coordinator. However, when the job is restarted with mana_restart under mpirun, the MPI processes are immediately killed by Hydra with exit code 9 (SIGKILL). The lower half reports that memory has been restored, and the crash appears to occur inside the postRestart callback invoked by the lower half.
Reproduction steps:
- Install MPICH 3.3.2:
  ```sh
  ./configure --prefix=/usr/local --with-device=ch3:sock
  make && make install
  ```
- Clone MANA and build it.
- Compile the test program and check its linkage:
  ```sh
  ./bin/mpicc_mana ./mpi-proxy-split/test/mpi_hello_world.c -o mpi_hello_world
  ldd mpi_hello_world
  ```
  ldd output:
  ```
  linux-vdso.so.1 => (0x00007fffcc1da000)
  libmpistub.so => /home/artur/mana/lib/dmtcp/libmpistub.so (0x00007f2b5e165000)
  libc.so.6 => /lib64/libc.so.6 (0x00007f2b5db90000)
  /lib64/ld-linux-x86-64.so.2 (0x00007f2b5df5e000)
  ```
- Start a coordinator:
  ```sh
  ./bin/mana_coordinator --verbose
  ```
- Launch the application and checkpoint manually:
  ```sh
  mpirun -np 2 ./bin/mana_launch --timing --verbose ./mpi_hello_world
  ```
  Wait for the program to print its hello messages and sleep; then, in a second console attached to the coordinator, issue:
  ```
  dmtcp> c    # checkpoint
  dmtcp> k    # kill the original run
  ```
- Attempt restart (under strace):
  ```sh
  strace -ff -s256 -o mpirun_restart.%p.log \
    -e trace=network,desc,process,dup2,close,execve,signal \
    mpirun -np 2 ./bin/mana_restart --verbose 2>&1 | tee restart_run.log
  ```
Observed behavior:
- The lower half prints debug messages showing memory mapping and "Finished restoring memory data for process with PID X" for each rank.
- Immediately after restoration, Hydra reports:
  ```
  = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
  = PID 12650 RUNNING AT localhost.localdomain
  = EXIT CODE: 9
  = CLEANING UP REMAINING PROCESSES
  = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
  =
  YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
  ```
- No segmentation faults or file descriptor errors appear in the strace logs; sockets and descriptors seem to be restored correctly.
- The crash seems to happen immediately after the postRestart callback (postRestart(0,0)) is invoked in the lower-half code.
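For what it's worth, the "EXIT CODE: 9 ... Killed (signal 9)" pair is Hydra reporting death by SIGKILL, not a voluntary exit(9). A small shell demo (unrelated to MANA, just illustrating how a SIGKILL-ed child appears to its parent):

```sh
# Unrelated demo process; any long-running command would do.
sleep 30 &
pid=$!
kill -9 "$pid"           # deliver SIGKILL, as Hydra reports for the ranks
wait "$pid"              # reap it; the shell encodes "killed by signal N"
echo "shell status: $?"  # prints 137 = 128 + 9 (SIGKILL)
```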
Hypothesis / analysis:
- The checkpoint file and memory maps appear valid: the lower half reports complete restoration.
- Hydra (the mpirun launcher in MPICH) is killing the process. This could indicate that the process is not responding to Hydra's handshake (perhaps due to unexpected state after restoration), or that the process is calling exit(9) itself.
- There is no obvious segfault; the postRestart handler may be returning to corrupted state, or performing an operation that Hydra interprets as failure.
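If the handshake theory is right, one thing worth checking is whether the restarted ranks still see the PMI environment that Hydra gave the original processes. The variable names below follow MPICH/Hydra conventions and are my assumption, not something confirmed from the logs; a minimal sketch to dump them from inside a rank:

```python
# Minimal sketch: print the PMI-related environment a rank sees after restart.
# Variable names (PMI_FD, PMI_PORT, PMI_RANK, PMI_SIZE) assume MPICH/Hydra
# conventions; a stale PMI_FD after restore would explain a failed handshake.
import os

for var in ("PMI_FD", "PMI_PORT", "PMI_RANK", "PMI_SIZE"):
    print(f"{var}={os.environ.get(var, '<unset>')}")
```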
Strace attachments:
See the attached mpirun_restart.*.log files generated by strace -ff. They contain the system-call traces leading up to the kill. (Additional DEBUG prints were enabled during restore, so those messages also appear in the logs.)
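To save others time reading the traces, here is a sketch of the kind of scan one could run over the strace output to find which process issued the SIGKILL. The regex and the sample line are mine, fabricated for illustration; the real input would be the mpirun_restart.*.log files:

```python
import re

# Match kill/tgkill/tkill syscalls delivering SIGKILL in strace output,
# capturing the syscall name and the first (target) PID argument.
KILL_RE = re.compile(r"\b(kill|tgkill|tkill)\((\d+)[^)]*SIGKILL")

# Fabricated sample line standing in for a real mpirun_restart.*.log entry.
sample = "12644 tgkill(12650, 12650, SIGKILL) = 0"

for line in sample.splitlines():
    m = KILL_RE.search(line)
    if m:
        print(f"syscall={m.group(1)} target={m.group(2)}")
```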
Request for help:
- Can the developers reproduce this basic restart failure on CentOS 7 / MPICH 3.3.2, or is it specific to my configuration?
- Is there a known issue with Hydra killing restarted processes? What conditions cause Hydra to SIGKILL a process after checkpoint/restore?
- Are there additional debug points that should be added around postRestart or in the MPI wrapper to understand why Hydra is unhappy?
- If this is a bug in MANA's lower-half or in coordinator/restart interaction with MPICH, guidance on a fix would be appreciated.
Additional notes:
- The simple mpi_hello_world program is identical to the one supplied in mpi-proxy-split/test/.
- The failure occurs on every restart attempt; the original launch and checkpoint stages succeed reliably.
- I can provide complete log files or run additional instrumentation as needed.
Thank you for any assistance resolving this restart problem. A successful restart should "just work" for this simple case, so this seems like a regression or configuration issue.
mpirun_restart.12644.log
mpirun_restart.12648.log
mpirun_restart.12649.log
mpirun_restart.12650.log
mpirun_restart.12651.log
mpirun_restart.12652.log
mpirun_restart.12653.log
mpirun_restart.12654.log
mpirun_restart.12655.log
mpirun_restart.12656.log
mpirun_restart.12657.log
mpirun_restart.12658.log
restart_run.log