
Accidental usage of GPUs? #342

@jchodera

@steven-albanese and I encountered a weird issue over the weekend. Our logs show that a job [6400657] running on 4 nodes (using 4 GPUs/node) slowed down tremendously at some point:

2015-11-21 05:54:18,834: Iteration took 32.709 s.
2015-11-21 05:54:51,243: Iteration took 32.403 s.
2015-11-21 05:55:24,073: Iteration took 32.824 s.
2015-11-21 05:55:55,827: Iteration took 31.747 s.
2015-11-21 05:56:26,363: Iteration took 30.530 s.
2015-11-21 05:56:58,877: Iteration took 32.507 s.
2015-11-21 05:57:31,882: Iteration took 32.999 s.
2015-11-21 05:58:02,966: Iteration took 31.077 s.
2015-11-21 05:59:51,511: Iteration took 108.539 s.
2015-11-21 06:01:33,997: Iteration took 102.479 s.
2015-11-21 06:03:15,544: Iteration took 101.541 s.
2015-11-21 06:04:58,324: Iteration took 102.773 s.
2015-11-21 06:06:44,061: Iteration took 105.731 s.
2015-11-21 06:08:25,503: Iteration took 101.436 s.
2015-11-21 06:10:07,424: Iteration took 101.914 s.
2015-11-21 06:11:48,712: Iteration took 101.282 s.

and even more later on:

2015-11-23 13:20:23,487: Iteration took 142.929 s.
2015-11-23 13:23:39,080: Iteration took 195.587 s.
2015-11-23 13:36:03,349: Iteration took 744.263 s.
2015-11-23 13:51:24,746: Iteration took 921.390 s.
2015-11-23 14:08:36,148: Iteration took 1031.396 s.
2015-11-23 14:27:49,693: Iteration took 1153.540 s.
2015-11-23 14:46:24,388: Iteration took 1114.689 s.
2015-11-23 15:06:27,059: Iteration took 1202.664 s.
2015-11-23 15:26:42,597: Iteration took 1215.532 s.

though it gets faster again later:

2015-11-23 17:00:52,199: Iteration took 141.521 s.
2015-11-23 17:03:17,178: Iteration took 144.973 s.
2015-11-23 17:05:57,265: Iteration took 160.080 s.

The execution time per iteration should vary by perhaps ~10% from one iteration to the next, not by orders of magnitude like this. Because it's a symmetric multiprocessing code (each iteration waits for all GPUs to complete their uniform-size tasks), a single slow GPU holds up the whole iteration. That leads me to suspect someone was accidentally running on a GPU outside of their queue allocation, which would greatly slow down @steven-albanese's job. (The GPUs run in "shared" mode in our scripts, permitting this.)
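To put a number on the slowdown, here is a minimal Python sketch that parses the "Iteration took" lines and flags any iteration slower than a baseline by more than a tolerance. The function names, the choice of the first 8 iterations as the baseline, and the 10% tolerance (taken from the rough estimate above) are all assumptions for illustration, not part of our actual scripts.

```python
import re
from statistics import median

# Matches lines such as "2015-11-21 05:54:18,834: Iteration took 32.709 s."
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+: Iteration took ([\d.]+) s\."
)

# A few lines copied from the log excerpt in this issue.
SAMPLE_LOG = [
    "2015-11-21 05:54:18,834: Iteration took 32.709 s.",
    "2015-11-21 05:54:51,243: Iteration took 32.403 s.",
    "2015-11-21 05:55:24,073: Iteration took 32.824 s.",
    "2015-11-21 05:55:55,827: Iteration took 31.747 s.",
    "2015-11-21 05:56:26,363: Iteration took 30.530 s.",
    "2015-11-21 05:56:58,877: Iteration took 32.507 s.",
    "2015-11-21 05:57:31,882: Iteration took 32.999 s.",
    "2015-11-21 05:58:02,966: Iteration took 31.077 s.",
    "2015-11-21 05:59:51,511: Iteration took 108.539 s.",
    "2015-11-21 06:01:33,997: Iteration took 102.479 s.",
]


def parse_iteration_times(lines):
    """Return (timestamp, seconds) pairs for every matching log line."""
    records = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            records.append((match.group(1), float(match.group(2))))
    return records


def flag_slow_iterations(records, n_baseline=8, tolerance=0.10):
    """Flag iterations slower than the baseline median by more than `tolerance`.

    The baseline is the median of the first `n_baseline` records (an assumed
    choice; it only makes sense if the job starts out running normally).
    Returns the baseline and a list of (timestamp, seconds, ratio) tuples.
    """
    baseline = median(seconds for _, seconds in records[:n_baseline])
    limit = baseline * (1.0 + tolerance)
    flagged = [(ts, s, s / baseline) for ts, s in records if s > limit]
    return baseline, flagged


if __name__ == "__main__":
    baseline, flagged = flag_slow_iterations(parse_iteration_times(SAMPLE_LOG))
    print(f"baseline: {baseline:.3f} s")
    for timestamp, seconds, ratio in flagged:
        print(f"{timestamp}: {seconds:.3f} s ({ratio:.1f}x baseline)")
```

Running this over the full log would give the timestamps at which the slowdown starts and stops, which could then be compared against the scheduler's records of other jobs on the same nodes.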

We'll do some more careful monitoring to establish whether this really is the cause. We'll also try "exclusive" mode, though I recall we had problems with it at some point (I can't quite remember why).
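One possible way to do the monitoring is to periodically list the compute processes on each node's GPUs and report any that do not belong to our job. The Python sketch below assumes `nvidia-smi` is on the path and that the installed driver supports the `gpu_uuid`, `pid`, and `process_name` fields of `--query-compute-apps` (worth checking with `nvidia-smi --help-query-compute-apps`). The helper names and the idea of passing in our own PIDs are hypothetical, and this is not something we currently run.

```python
import subprocess

# Assumed query; field availability can depend on the driver version.
QUERY_CMD = [
    "nvidia-smi",
    "--query-compute-apps=gpu_uuid,pid,process_name",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Parse 'gpu_uuid, pid, process_name' CSV lines into dictionaries."""
    apps = []
    for line in csv_text.splitlines():
        # Split on the first two commas only, in case the name contains one.
        parts = [field.strip() for field in line.split(",", 2)]
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        gpu_uuid, pid, name = parts
        apps.append({"gpu_uuid": gpu_uuid, "pid": int(pid), "process_name": name})
    return apps


def find_foreign_processes(apps, own_pids):
    """Return the compute processes whose PID is not one of ours."""
    own = set(own_pids)
    return [app for app in apps if app["pid"] not in own]


def query_compute_apps():
    """Run nvidia-smi on the local node and parse its output."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

This only sees the node it runs on, so it would have to be called on each of the 4 nodes, for example from each MPI rank with that rank's own PIDs passed as `own_pids`.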
