@steven-albanese and I encountered a weird issue over the weekend. Looking at our logs, it looks like a job [6400657] running on 4 nodes (using 4 GPUs/node) slowed down tremendously at some point:
2015-11-21 05:54:18,834: Iteration took 32.709 s.
2015-11-21 05:54:51,243: Iteration took 32.403 s.
2015-11-21 05:55:24,073: Iteration took 32.824 s.
2015-11-21 05:55:55,827: Iteration took 31.747 s.
2015-11-21 05:56:26,363: Iteration took 30.530 s.
2015-11-21 05:56:58,877: Iteration took 32.507 s.
2015-11-21 05:57:31,882: Iteration took 32.999 s.
2015-11-21 05:58:02,966: Iteration took 31.077 s.
2015-11-21 05:59:51,511: Iteration took 108.539 s.
2015-11-21 06:01:33,997: Iteration took 102.479 s.
2015-11-21 06:03:15,544: Iteration took 101.541 s.
2015-11-21 06:04:58,324: Iteration took 102.773 s.
2015-11-21 06:06:44,061: Iteration took 105.731 s.
2015-11-21 06:08:25,503: Iteration took 101.436 s.
2015-11-21 06:10:07,424: Iteration took 101.914 s.
2015-11-21 06:11:48,712: Iteration took 101.282 s.
and even more later on:
2015-11-23 13:20:23,487: Iteration took 142.929 s.
2015-11-23 13:23:39,080: Iteration took 195.587 s.
2015-11-23 13:36:03,349: Iteration took 744.263 s.
2015-11-23 13:51:24,746: Iteration took 921.390 s.
2015-11-23 14:08:36,148: Iteration took 1031.396 s.
2015-11-23 14:27:49,693: Iteration took 1153.540 s.
2015-11-23 14:46:24,388: Iteration took 1114.689 s.
2015-11-23 15:06:27,059: Iteration took 1202.664 s.
2015-11-23 15:26:42,597: Iteration took 1215.532 s.
though it gets faster again later
2015-11-23 17:00:52,199: Iteration took 141.521 s.
2015-11-23 17:03:17,178: Iteration took 144.973 s.
2015-11-23 17:05:57,265: Iteration took 160.080 s.
The execution time per iteration should vary by maybe as much as ~10% from iteration to iteration, but not by orders of magnitude like this. Because this is a symmetric multiprocessing code---each iteration waits for all GPUs to complete their uniform-size tasks---I suspect someone was accidentally running on a GPU outside their queue allocation, which would have the effect of greatly slowing down @steven-albanese's job. (Our scripts run the GPUs in "shared" compute mode, which permits this.)
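As a quick sanity check on future runs, something like the following sketch could pull the per-iteration times out of our logs and flag iterations far outside the normal ~10% variation (the log format and the 2x-median threshold are my assumptions, not anything blessed):

```python
import re
from statistics import median

# Matches lines like "2015-11-21 05:54:18,834: Iteration took 32.709 s."
LOG_LINE = re.compile(r"Iteration took ([\d.]+) s\.")

def iteration_times(lines):
    """Extract per-iteration wall-clock times (seconds) from log lines."""
    return [float(m.group(1)) for line in lines if (m := LOG_LINE.search(line))]

def flag_slow(times, factor=2.0):
    """Return (index, time) pairs for iterations exceeding `factor` times
    the median -- far outside the ~10% variation we normally expect."""
    med = median(times)
    return [(i, t) for i, t in enumerate(times) if t > factor * med]
```

Running `flag_slow` over the log excerpt above would pick out the 108 s and slower iterations against the ~32 s baseline.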
We'll do some more careful monitoring to see if we can establish that this really is the cause and not something else. We'll also try "exclusive" mode, though I recall we ran into problems with it at some point (I can't quite remember why).
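For the monitoring side, one option is to periodically poll `nvidia-smi` for compute processes on our allocated GPUs and flag any PIDs that don't belong to the job. A rough sketch, assuming the standard `--query-compute-apps` CSV interface (the helper names and the allowed-PID bookkeeping are my own invention):

```python
import subprocess

def gpu_compute_apps():
    """Query nvidia-smi for all compute processes currently on the GPUs.
    Assumes nvidia-smi is on PATH and supports --query-compute-apps."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-compute-apps=gpu_uuid,pid,process_name,used_memory",
         "--format=csv,noheader"],
        text=True)
    return parse_compute_apps(out)

def parse_compute_apps(csv_text):
    """Parse nvidia-smi's 'field, field, ...' CSV output into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line:
            continue
        gpu, pid, name, mem = [f.strip() for f in line.split(", ")]
        rows.append({"gpu": gpu, "pid": int(pid), "name": name, "mem": mem})
    return rows

def interlopers(rows, allowed_pids):
    """Processes on our GPUs whose PIDs aren't part of our job."""
    return [r for r in rows if r["pid"] not in allowed_pids]
```

If shared mode turns out to be the culprit, switching the devices to exclusive-process compute mode (`nvidia-smi -c EXCLUSIVE_PROCESS`, run with sufficient privileges) should prevent stray processes from landing on our GPUs at all.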