Open
Description
If the model (echam or any other component of the coupled setup) crashes in the Fortran code, for example:
569: echam6 0000000000AD8051 MAIN__ 270 echam6.f90
569: echam6 00000000004178E2 Unknown Unknown Unknown
569: libc-2.17.so 00002AAAAE4AC555 __libc_start_main Unknown Unknown
569: echam6 00000000004177E9 Unknown Unknown Unknown
srun: error: gcn2459: task 473: Exited with exit code 66
srun: error: gcn2443: tasks 12,36,60,84,108,132,156,180,204,228,276,372,396,420,444,468,564: Exited with exit code 66
srun: error: gcn2445: tasks 13,37,61,85,109,133,157,181,205,229,277,373,397,421,445,469,493,565: Exited with exit code 66
srun: error: gcn2447: tasks 14,38,62,86,110,134,158,182,206,230,254,398,410,422,434,446,458,470,494,566: Exited with exit code 66
srun: error: gcn2449: tasks 15,27,39,63,87,111,135,147,159,171,183,207,219,231,255,387,399,411,423,435,447,471,495,567: Exited with exit code 66
srun: error: gcn2455: tasks 16,40,64,76,88,100,112,124,136,148,160,172,184,196,208,220,232,244,256,364,388,400,412,424,436,448,472,484,496,568: Exited with exit code 66
srun: error: gcn2469: tasks 10,22,34,46,58,70,82,94,106,130,142,154,166,178,190,202,214,226,238,370,394,418,442,454,466,478,490,502,526,562,574: Exited with exit code 66
srun: error: gcn2471: tasks 11,23,35,47,59,71,83,95,107,131,155,167,179,191,203,215,227,239,275,371,395,419,443,455,467,479,491,503,527,563: Exited with exit code 66
srun: error: gcn2465: tasks 8,20,32,44,56,68,80,92,104,128,140,152,164,176,188,200,212,224,236,248,320,344,368,392,416,440,452,464,476,500,524,548,560,572: Exited with exit code 66
srun: error: gcn2463: tasks 7,19,31,43,55,67,79,91,103,115,127,139,151,163,175,187,199,211,223,235,247,271,319,343,367,391,415,427,439,451,463,475,499,523,571: Exited with exit code 66
srun: error: gcn2461: tasks 6,18,30,42,54,66,78,90,102,114,126,138,150,162,174,186,198,210,222,234,246,270,294,318,342,366,390,414,426,438,450,462,474,486,498,522,534,570: Exited with exit code 66
0: slurmstepd: error: *** STEP 2965369.0 ON gcn2443 CANCELLED AT 2021-08-19T12:23:10 ***
691: forrtl: error (78): process killed (SIGTERM)
691: Image PC Routine Line Source
691: oceanx 00000000017C6574 Unknown Unknown Unknown
691: libpthread-2.17.s 00002AAAADE18630 Unknown Unknown Unknown
esm_runscripts continues and tries to move files, sets up the next leg of the run, etc. In bash, something like
rc=$?  # exit code of the model step that just ran
if [[ $rc -ne 0 ]]; then
    echo "ERROR: model exited with code $rc, stopping." >&2
    exit "$rc"
fi
would do the trick (of course, we would need to handle echam's possible return code of 127). This has been a long-standing issue for us and can be annoying at times.
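To make the idea a bit more concrete, here is a minimal sketch of such a check as a bash function. The function name `check_model_exit` and the variable `TOLERATED_CODES` are hypothetical and not part of esm_runscripts. Whether echam's return code of 127 should count as acceptable is an assumption left to the caller, so by default only 0 is tolerated.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of esm_runscripts.
# TOLERATED_CODES is an assumed, space-separated list of exit codes
# that should NOT abort the run. Default: only 0.
TOLERATED_CODES="${TOLERATED_CODES:-0}"

check_model_exit() {
    local rc="$1"
    local code
    # Accept the exit code if it appears in the tolerated list.
    for code in $TOLERATED_CODES; do
        if [[ "$rc" -eq "$code" ]]; then
            return 0
        fi
    done
    # Anything else is treated as a crash: report it and signal failure.
    echo "ERROR: model exited with code $rc; not moving files or setting up the next leg." >&2
    return 1
}
```

Called directly after the model step as `check_model_exit $? || exit 1`, this would stop the run before any tidy-up happens. If a return code of 127 from echam turns out to be harmless in a given setup, it could be allowed with `TOLERATED_CODES="0 127"`.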
Is there a way we can solve this in esm_runscripts?