esm_runscripts does not stop on model crashes #179

@seb-wahl

If the model (e.g. echam or any component of the coupled setup) crashes in the Fortran code, e.g.

569: echam6             0000000000AD8051  MAIN__                    270  echam6.f90
 569: echam6             00000000004178E2  Unknown               Unknown  Unknown
 569: libc-2.17.so       00002AAAAE4AC555  __libc_start_main     Unknown  Unknown
 569: echam6             00000000004177E9  Unknown               Unknown  Unknown
srun: error: gcn2459: task 473: Exited with exit code 66
srun: error: gcn2443: tasks 12,36,60,84,108,132,156,180,204,228,276,372,396,420,444,468,564: Exited with exit code 66
srun: error: gcn2445: tasks 13,37,61,85,109,133,157,181,205,229,277,373,397,421,445,469,493,565: Exited with exit code 66
srun: error: gcn2447: tasks 14,38,62,86,110,134,158,182,206,230,254,398,410,422,434,446,458,470,494,566: Exited with exit code 66
srun: error: gcn2449: tasks 15,27,39,63,87,111,135,147,159,171,183,207,219,231,255,387,399,411,423,435,447,471,495,567: Exited with exit code 66
srun: error: gcn2455: tasks 16,40,64,76,88,100,112,124,136,148,160,172,184,196,208,220,232,244,256,364,388,400,412,424,436,448,472,484,496,568: Exited with exit code 66
srun: error: gcn2469: tasks 10,22,34,46,58,70,82,94,106,130,142,154,166,178,190,202,214,226,238,370,394,418,442,454,466,478,490,502,526,562,574: Exited with exit code 66
srun: error: gcn2471: tasks 11,23,35,47,59,71,83,95,107,131,155,167,179,191,203,215,227,239,275,371,395,419,443,455,467,479,491,503,527,563: Exited with exit code 66
srun: error: gcn2465: tasks 8,20,32,44,56,68,80,92,104,128,140,152,164,176,188,200,212,224,236,248,320,344,368,392,416,440,452,464,476,500,524,548,560,572: Exited with exit code 66
srun: error: gcn2463: tasks 7,19,31,43,55,67,79,91,103,115,127,139,151,163,175,187,199,211,223,235,247,271,319,343,367,391,415,427,439,451,463,475,499,523,571: Exited with exit code 66
srun: error: gcn2461: tasks 6,18,30,42,54,66,78,90,102,114,126,138,150,162,174,186,198,210,222,234,246,270,294,318,342,366,390,414,426,438,450,462,474,486,498,522,534,570: Exited with exit code 66
   0: slurmstepd: error: *** STEP 2965369.0 ON gcn2443 CANCELLED AT 2021-08-19T12:23:10 ***
 691: forrtl: error (78): process killed (SIGTERM)
 691: Image              PC                Routine            Line        Source
 691: oceanx             00000000017C6574  Unknown               Unknown  Unknown
 691: libpthread-2.17.s  00002AAAADE18630  Unknown               Unknown  Unknown

esm_runscripts continues and tries to move files, sets up the next leg of the run, etc. In bash, something like

if [[ $? -ne 0 ]] ; then
   # report the failure and stop before any files are moved
   echo "ERROR: model exited with a non-zero return code, stopping." >&2
   exit 1
fi

(of course we need to handle echam's possible return code of 127) would do the trick. This has long been an issue for us and can be annoying at times.
Is there a way we can solve this in esm_runscripts?
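
To make the idea a bit more concrete, here is a minimal sketch of such a check as a bash function. The name `check_model_exit` is hypothetical and not part of esm_runscripts, and treating 127 as a tolerated code is only my assumption about what "handling" echam's return code should mean; the list of tolerated codes is passed in as arguments so it can be adjusted per model.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of esm_runscripts.
# Usage: check_model_exit <return code> [tolerated code ...]
# Returns 0 if the return code is 0 or one of the tolerated codes, 1 otherwise.
check_model_exit() {
    local rc="$1"
    shift
    local ok

    if [[ "$rc" -eq 0 ]]; then
        return 0
    fi

    # Assumption: some models may exit with a known, harmless non-zero code
    # (e.g. 127 for echam), which the caller lists explicitly.
    for ok in "$@"; do
        if [[ "$rc" -eq "$ok" ]]; then
            return 0
        fi
    done

    echo "ERROR: model exited with code ${rc}, stopping before files are moved." >&2
    return 1
}
```

In a run script this could be called right after the model launch, for example `check_model_exit "$?" 127 || exit 1`, so that a crash such as the exit code 66 in the log above stops the job before the next leg is set up.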
