
MANA Restart Failure on CentOS 7.9 #468

@ArturRSoda

Description


Environment:

  • OS: CentOS Linux 7.9.2009 (kernel 3.10.0-1160.119.1.el7.x86_64)
  • MPI implementation: MPICH 3.3.2 (built with --with-device=ch3:sock)
  • MANA commit: current master (private copy from https://github.com/mpickpt/mana)
  • Build used the provided mpicc_mana wrapper to compile apps.

Problem:
A simple two-process MPI hello-world program launches and checkpoints successfully with mana_launch and mana_coordinator. However, when the job is restarted with mana_restart under mpirun, the MPI processes are immediately killed by Hydra with exit code 9 (SIGKILL). The lower half reports that memory has been restored, and the crash appears to occur inside the postRestart callback invoked by the lower-half code.

Reproduction steps:

  1. Install MPICH 3.3.2 as shown below:
    ./configure --prefix=/usr/local --with-device=ch3:sock
    make && make install
    
  2. Clone MANA and build it:
    ./configure-mana
    make
    
  3. Compile the test program:
    ./bin/mpicc_mana ./mpi-proxy-split/test/mpi_hello_world.c -o mpi_hello_world
    ldd mpi_hello_world
    % linux-vdso.so.1 => (0x00007fffcc1da000)
    % libmpistub.so => /home/artur/mana/lib/dmtcp/libmpistub.so (0x00007f2b5e165000)
    % libc.so.6 => /lib64/libc.so.6 (0x00007f2b5db90000)
    % /lib64/ld-linux-x86-64.so.2 (0x00007f2b5df5e000)
  4. Start a coordinator:
    ./bin/mana_coordinator --verbose
  5. Launch the application and checkpoint manually:
    mpirun -np 2 ./bin/mana_launch --timing --verbose ./mpi_hello_world
    # wait for the program to print its hello messages and sleep
    # in second console, send checkpoint:
    dmtcp> c
    dmtcp> k   # kill the original run
  6. Attempt restart:
    strace -ff -s256 -o mpirun_restart.%p.log -e trace=network,desc,process,dup2,close,execve,signal \
        mpirun -np 2 ./bin/mana_restart --verbose 2>&1 | tee restart_run.log
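Once the per-PID strace logs exist, a quick way to narrow things down is to find which log recorded the fatal signal: strace marks a tracee that dies from SIGKILL with a "+++ killed by SIGKILL +++" line. A sketch of the triage (the two sample log files below are fabricated purely so the grep has something to match; in practice, run it against the real mpirun_restart.*.log files):

```shell
# Fabricate two tiny sample logs standing in for real strace -ff output.
printf 'exit_group(0)            = ?\n+++ exited with 0 +++\n' > mpirun_restart.sampleA.log
printf 'rt_sigreturn()           = 0\n+++ killed by SIGKILL +++\n' > mpirun_restart.sampleB.log

# Which traced process died from SIGKILL?
grep -l 'killed by SIGKILL' mpirun_restart.*.log   # prints mpirun_restart.sampleB.log
```

This tells you whether the SIGKILL landed on the restarted ranks themselves or on one of Hydra's proxy processes, which changes where to look next.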

Observed behavior:

  • The lower-half prints debug messages showing memory mapping and "Finished restoring memory data for process with PID X" for each rank.
  • Immediately after restoration, Hydra reports:
    =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
    =   PID 12650 RUNNING AT localhost.localdomain
    =   EXIT CODE: 9
    =   CLEANING UP REMAINING PROCESSES
    =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
    =
    YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
    
  • No segmentation faults or file descriptor errors appear in the strace logs; sockets and descriptors seem to be restored correctly.
  • The crash seems to happen immediately after the postRestart callback (postRestart(0,0)) is invoked in lower-half code.

Hypothesis / analysis:

  • The checkpoint file and memory maps appear valid: the lower-half reports complete restoration.
  • Hydra (the process manager behind mpirun in MPICH) is killing the process, or at least reporting its death. This could indicate that the restarted process stops responding to Hydra's handshake (perhaps due to unexpected state after restoration); note that the "Killed (signal 9)" exit string points to death by SIGKILL rather than to the process calling exit(9) itself.
  • There is no obvious segfault; the postRestart handler may be returning to a corrupted state or performing an operation that Hydra interprets as failure.
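On reading the "EXIT CODE: 9" line: a process killed by SIGKILL never reaches exit(), so it has no exit code of its own; the wait status encodes the signal number, and a POSIX shell reports it as 128 + 9 = 137. A quick sanity check, independent of MANA or MPICH:

```shell
# A subshell killed by SIGKILL is reaped by the parent shell, which
# decodes the wait status as 128 + signal number.
sh -c 'kill -KILL $$' 2>/dev/null
echo "status: $?"   # prints "status: 137" (137 = 128 + 9, i.e. SIGKILL)
```

So Hydra's "EXIT CODE: 9" is its own rendering of "terminated by signal 9", consistent with the "Killed (signal 9)" exit string, not evidence of an exit(9) call in the application.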

Strace attachments:
See the attached mpirun_restart.*.log files generated by strace -ff. They contain the system call traces leading up to the kill, interleaved with output from additional DEBUG prints added during restore.

Request for help:

  1. Can the developers reproduce this basic restart failure on CentOS 7 / MPICH 3.3.2, or is it specific to my configuration?
  2. Is there a known issue with Hydra killing restarted processes? What conditions cause Hydra to SIGKILL a process after checkpoint/restore?
  3. Are there additional debug points that should be added around postRestart or in the MPI wrapper to understand why Hydra is unhappy?
  4. If this is a bug in MANA's lower-half or in coordinator/restart interaction with MPICH, guidance on a fix would be appreciated.

Additional notes:

  • The simple mpi_hello_world program is identical to the one supplied in mpi-proxy-split/test/.
  • The failure occurs on every restart attempt; the original launch and checkpoint stages succeed reliably.
  • I can provide complete log files or run additional instrumentation as needed.

Thank you for any assistance resolving this restart problem. A successful restart should "just work" for this simple case, so this seems like a regression or configuration issue.

mpirun_restart.12644.log
mpirun_restart.12648.log
mpirun_restart.12649.log
mpirun_restart.12650.log
mpirun_restart.12651.log
mpirun_restart.12652.log
mpirun_restart.12653.log
mpirun_restart.12654.log
mpirun_restart.12655.log
mpirun_restart.12656.log
mpirun_restart.12657.log
mpirun_restart.12658.log
restart_run.log
