Commit ec79c6c

A hack-ish fix to DMTCP not unmapping mtcp_restart
* On the second checkpoint (ckpt-restart-ckpt), the checkpoint image keeps a copy of mtcp_restart.
* On the second restart, we try to restore the mtcp_restart from the ckpt image on top of the mtcp_restart currently being used for restart. This then segfaults, due to failure of mmap with MAP_FIXED_NOREPLACE.
* So now we munmap it within MANA.
1 parent f2f7518 commit ec79c6c
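
The failure mode named in the commit message can be reproduced in miniature outside of DMTCP. The following is only a sketch under stated assumptions: it uses the libc `mmap` wrapper rather than `mtcp_sys_mmap`, a kernel-chosen address rather than 0x11200000, and the helper names `map_exactly_at` and `restore_needs_prior_munmap` are hypothetical. It shows that a `MAP_FIXED_NOREPLACE` request over an occupied range is rejected, and that the same request succeeds once the old mapping has been unmapped.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
/* Assumption: value used by Linux on x86-64; older libc headers lack it. */
#define MAP_FIXED_NOREPLACE 0x100000
#endif

#define REGION_LEN 0x30000 /* same size the commit assumes for mtcp_restart */

/* Try to create an anonymous mapping at exactly addr without clobbering
 * anything already mapped there. Returns 0 on success, -1 otherwise. */
int map_exactly_at(void *addr, size_t len)
{
  assert(len > 0);
  void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
  if (p == MAP_FAILED) {
    return -1; /* typically EEXIST when the range is already occupied */
  }
  if (p != addr) {
    /* Kernels older than 4.17 ignore the flag and treat addr as a hint. */
    munmap(p, len);
    return -1;
  }
  return 0;
}

/* Returns 1 if the "restore" is rejected while the old mapping exists and
 * is accepted once the old mapping has been removed; 0 otherwise. */
int restore_needs_prior_munmap(void)
{
  /* Stand-in for the mtcp_restart that is currently running. */
  void *running = mmap(NULL, REGION_LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (running == MAP_FAILED) {
    return 0;
  }

  /* Second restart: the image's copy wants the same address range. */
  int rejected = (map_exactly_at(running, REGION_LEN) != 0);

  /* The workaround: unmap the old copy first, then restore. */
  munmap(running, REGION_LEN);
  int accepted = (map_exactly_at(running, REGION_LEN) == 0);
  if (accepted) {
    munmap(running, REGION_LEN);
  }
  return rejected && accepted;
}
```

Calling `restore_needs_prior_munmap()` from a single-threaded test program is enough to see both outcomes. In a multi-threaded process another thread could claim the range between the `munmap` and the second `mmap`, so the sketch is not a general-purpose pattern.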

1 file changed

Lines changed: 32 additions & 0 deletions

restart_plugin/mtcp_restart_plugin.c

@@ -682,6 +682,15 @@ mtcp_plugin_hook(RestoreInfo *rinfo)
 void
 mtcp_plugin_hook(RestoreInfo *rinfo)
 {
+  // FIXME: DMTCP should remove text/data/heap of mtcp_restart.
+  // For now, MANA has this workaround, in conjunction with
+  // mpi-proxy-split/mpi_plugin.cpp:computeUnionOfCkptImageAddresses
+  // When mtcp_restart starts at main, there is already a heap.
+  // In case anyone else calls sbrk(), this will create a gap after
+  // the text/data/heap of mtcp_restart. So, computeUnionOfCkptImageAddresses
+  // will munmap the text/data/heap, but nothing more.
+  mtcp_sys_brk((char *)0x11200000 + 0x30000);
+
   remap_vdso_and_vvar_regions(rinfo);
   mysetauxval(rinfo->environ, AT_SYSINFO_EHDR,
               (unsigned long int) rinfo->currentVdsoStart);
@@ -845,6 +854,29 @@ mtcp_plugin_hook(RestoreInfo *rinfo)
 int
 mtcp_plugin_skip_memory_region_munmap(Area *area, RestoreInfo *rinfo)
 {
+  // FIXME: All of this is a temporary workaround, until the DMTCP restart
+  // plugin can be re-designed. See the conversation in PR #357.
+  // After the DMTCP re-design, we should delete all of the code
+  // of this paragraph.
+  // NOTE: 0x11200000 is the address for mtcp_restart.
+  // See LINKER_FLAGS= -Wl,-Ttext-segment=11200000
+  // in dmtcp/src/mtcp/Makefile, for why this hard-wired address exists.
+  // NOTE: This is the originally loaded mtcp_restart (text/data/heap),
+  // before we copied it to the DMTCP "hole" and execute from there.
+  if (is_overlap(area->addr, area->endAddr,
+                 (char *)0x11200000, (char *)0x11200000 + 0x30000)) {
+    // Range [0x11200000, nextPageAddr) should cover mtcp_restart text/data/heap
+    void *nextPageAddr = (char *)0x11200000 + 0x30000;
+    void *testIfEmpty = mtcp_sys_mmap(nextPageAddr, 4096,
+                          PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    // MANA panic: The next page after the assumed mtcp_restart memory regions
+    // was occupied. Is mtcp_restart larger than expected?
+    MTCP_ASSERT(testIfEmpty == nextPageAddr);
+    mtcp_sys_munmap(nextPageAddr, 4096); // The test passed. Free it again.
+    mtcp_sys_munmap((void *)0x11200000, 0x30000); // Unmap the old mtcp_restart.
+    return 0;
+  }
+
   LowerHalfInfo_t *lh_info = &rinfo->pluginInfo;
   LhCoreRegions_t *lh_regions_list = NULL;
   int total_lh_regions = lh_info->numCoreRegions;
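
The second hunk relies on two ideas: an overlap test between address ranges, and a probe that checks whether the page just past the assumed end of mtcp_restart is still unmapped. The sketch below illustrates both with ordinary libc calls. It is not the DMTCP code: `ranges_overlap` is a hypothetical stand-in for `is_overlap` (whose definition is not part of this diff) and assumes half-open ranges, while `page_is_free` and `probe_demo` are illustrative names. The probe assumes the Linux behavior that a page-aligned `mmap` hint is honored when the page is free.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for is_overlap(): do the half-open ranges
 * [s1, e1) and [s2, e2) share at least one byte? */
int ranges_overlap(const char *s1, const char *e1,
                   const char *s2, const char *e2)
{
  return (uintptr_t)s1 < (uintptr_t)e2 && (uintptr_t)s2 < (uintptr_t)e1;
}

/* Probe one page at addr (which must be page-aligned) by asking for it as
 * a hint. Returns 1 if the page was free, 0 if occupied, -1 on error.
 * The probe mapping is always removed again before returning. */
int page_is_free(void *addr)
{
  size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
  void *p = mmap(addr, pagesz, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) {
    return -1;
  }
  munmap(p, pagesz);
  /* If the hint was occupied, the kernel placed the mapping elsewhere. */
  return p == addr;
}

/* Self-check: map two pages, free the second, then probe both.
 * Returns 1 if the first is seen as occupied and the second as free. */
int probe_demo(void)
{
  size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
  char *base = mmap(NULL, 2 * pagesz, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED) {
    return 0;
  }
  munmap(base + pagesz, pagesz); /* leave only the first page mapped */

  int first_occupied = (page_is_free(base) == 0);
  int second_free = (page_is_free(base + pagesz) == 1);

  munmap(base, pagesz);
  return first_occupied && second_free;
}
```

The probe cannot prove that mtcp_restart fits in 0x30000 bytes. It only detects the case where something is mapped immediately after that range, which is why the commit treats a failed probe as a panic rather than trying to recover.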
