
MANA Restart Failure on CentOS 7.9 #468

@ArturRSoda

Description


Environment:

  • OS: CentOS Linux 7.9.2009 (kernel 3.10.0-1160.119.1.el7.x86_64)
  • MPI implementation: MPICH 3.3.2 (built with --with-device=ch3:sock)
  • MANA commit: current master (private copy from https://github.com/mpickpt/mana)
  • Build used the provided mpicc_mana wrapper to compile apps.

Problem:
A simple two-process MPI hello-world program launches and checkpoints successfully with mana_launch and mana_coordinator. However, when the job is restarted with mana_restart under mpirun, the MPI processes are immediately killed by Hydra with exit code 9 (SIGKILL). The lower half reports that memory has been restored, and the crash appears to occur inside the postRestart callback invoked by the lower-half code.

Reproduction steps:

  1. Install MPICH 3.3.2 as shown below:
    ./configure --prefix=/usr/local --with-device=ch3:sock
    make && make install
    
  2. Clone MANA and build it:
    ./configure-mana
    make
    
  3. Compile the test program:
    ./bin/mpicc_mana ./mpi-proxy-split/test/mpi_hello_world.c -o mpi_hello_world
    ldd mpi_hello_world
    % linux-vdso.so.1 => (0x00007fffcc1da000)
    % libmpistub.so => /home/artur/mana/lib/dmtcp/libmpistub.so (0x00007f2b5e165000)
    % libc.so.6 => /lib64/libc.so.6 (0x00007f2b5db90000)
    % /lib64/ld-linux-x86-64.so.2 (0x00007f2b5df5e000)
  4. Start a coordinator:
    ./bin/mana_coordinator --verbose
  5. Launch the application and checkpoint manually:
    mpirun -np 2 ./bin/mana_launch --timing --verbose ./mpi_hello_world
    # wait for the program to print its hello messages and sleep
    # in second console, send checkpoint:
    dmtcp> c
    dmtcp> k   # kill the original run
  6. Attempt restart:
    strace -ff -s256 -o mpirun_restart.%p.log -e trace=network,desc,process,dup2,close,execve,signal \
        mpirun -np 2 ./bin/mana_restart --verbose 2>&1 | tee restart_run.log
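Once the per-PID strace logs exist, a quick way to narrow things down is to find which log recorded the fatal signal: strace marks a tracee that dies from SIGKILL with a "+++ killed by SIGKILL +++" line. A sketch of the triage (the two sample log files below are fabricated purely so the grep has something to match; in practice, run it against the real mpirun_restart.*.log files):

```shell
# Fabricate two tiny sample logs standing in for real strace -ff output.
printf 'exit_group(0)            = ?\n+++ exited with 0 +++\n' > mpirun_restart.sampleA.log
printf 'rt_sigreturn()           = 0\n+++ killed by SIGKILL +++\n' > mpirun_restart.sampleB.log

# Which traced process died from SIGKILL?
grep -l 'killed by SIGKILL' mpirun_restart.*.log   # prints mpirun_restart.sampleB.log
```

This tells you whether the SIGKILL landed on the restarted ranks themselves or on one of Hydra's proxy processes, which changes where to look next.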

Observed behavior:

  • The lower-half prints debug messages showing memory mapping and "Finished restoring memory data for process with PID X" for each rank.
  • Immediately after restoration, Hydra reports:
    =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
    =   PID 12650 RUNNING AT localhost.localdomain
    =   EXIT CODE: 9
    =   CLEANING UP REMAINING PROCESSES
    =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
    =
    YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
    
  • No segmentation faults or file descriptor errors appear in the strace logs; sockets and descriptors seem to be restored correctly.
  • The crash seems to happen immediately after the postRestart callback (postRestart(0,0)) is invoked in lower-half code.

Hypothesis / analysis:

  • The checkpoint file and memory maps appear valid: the lower-half reports complete restoration.
  • Hydra (the process manager behind mpirun in MPICH) is killing the process, or at least reporting its death. This could indicate that the restarted process stops responding to Hydra's handshake (perhaps due to unexpected state after restoration); note that the "Killed (signal 9)" exit string points to death by SIGKILL rather than to the process calling exit(9) itself.
  • There is no obvious segfault; the postRestart handler may be returning to a corrupted state or performing an operation that Hydra interprets as failure.
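On reading the "EXIT CODE: 9" line: a process killed by SIGKILL never reaches exit(), so it has no exit code of its own; the wait status encodes the signal number, and a POSIX shell reports it as 128 + 9 = 137. A quick sanity check, independent of MANA or MPICH:

```shell
# A subshell killed by SIGKILL is reaped by the parent shell, which
# decodes the wait status as 128 + signal number.
sh -c 'kill -KILL $$' 2>/dev/null
echo "status: $?"   # prints "status: 137" (137 = 128 + 9, i.e. SIGKILL)
```

So Hydra's "EXIT CODE: 9" is its own rendering of "terminated by signal 9", consistent with the "Killed (signal 9)" exit string, not evidence of an exit(9) call in the application.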

Strace attachments:
See the attached mpirun_restart.*.log files generated by strace -ff. They contain the system call traces leading up to the kill, interleaved with output from additional DEBUG prints added during restore.

Request for help:

  1. Can the developers reproduce this basic restart failure on CentOS 7 / MPICH 3.3.2, or is it specific to my configuration?
  2. Is there a known issue with Hydra killing restarted processes? What conditions cause Hydra to SIGKILL a process after checkpoint/restore?
  3. Are there additional debug points that should be added around postRestart or in the MPI wrapper to understand why Hydra is unhappy?
  4. If this is a bug in MANA's lower-half or in coordinator/restart interaction with MPICH, guidance on a fix would be appreciated.

Additional notes:

  • The simple mpi_hello_world program is identical to the one supplied in mpi-proxy-split/test/.
  • The failure occurs on every restart attempt; the original launch and checkpoint stages succeed reliably.
  • I can provide complete log files or run additional instrumentation as needed.

Thank you for any assistance resolving this restart problem. A successful restart should "just work" for this simple case, so this seems like a regression or configuration issue.

mpirun_restart.12644.log
mpirun_restart.12648.log
mpirun_restart.12649.log
mpirun_restart.12650.log
mpirun_restart.12651.log
mpirun_restart.12652.log
mpirun_restart.12653.log
mpirun_restart.12654.log
mpirun_restart.12655.log
mpirun_restart.12656.log
mpirun_restart.12657.log
mpirun_restart.12658.log
restart_run.log
