RFC Adding Job ID to Docker runs #426

@tatarsky

Folks, I've been finding LOTS of orphaned Docker containers out on the nodes lately. Killing them off is somewhat automated, but because I'm a very careful person I still manually review each case before I kill anything.

My main method is to try to trace each container to a job in progress on the node and look at docker top. Most of the orphans, BTW, seem to be sitting there running bash, and tensorflow is the main image in this state right now.
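A rough sketch of that manual review, for anyone who wants to look at their own containers on a node. The function name `review_containers` is made up for illustration; it only wraps `docker ps --format` and `docker top`, and it does not kill anything.

```shell
#!/bin/sh
# Print each running container's ID, name, and image, followed by the
# processes inside it, so a person can eyeball likely orphans.
# review_containers is a hypothetical helper name, not an existing tool.
review_containers() {
    docker ps --format '{{.ID}} {{.Names}} {{.Image}}' |
    while read -r id name image; do
        printf '== %s  name=%s  image=%s\n' "$id" "$name" "$image"
        # Keep docker top away from the loop's stdin.
        docker top "$id" </dev/null
    done
}
```

Calling `review_containers` on a node lists every running container with its process table; matching those against jobs in progress is still a manual step.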

I think one very helpful way to confirm my method would be for people to pass the contents of the qsub $PBS_JOBID variable to docker run via the --name argument in their docker execution scripts.

For example:

docker run --name "$PBS_JOBID" something something

Or perhaps more elaborate:

docker run --name "$USER-$PBS_JOBID"

Just something besides the randomly generated name to help me determine the state of the orphan.
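One possible wrinkle: Docker container names may only contain letters, digits, "_", "." and "-", and must start with a letter or digit, so a job ID with other characters (array job IDs with brackets, for instance) would be rejected. A minimal sketch of a helper that builds a safe name follows. The name `job_container_name` and the "nojob" fallback are assumptions for illustration, and it assumes the user name starts with a letter or digit.

```shell
#!/bin/sh
# Build a Docker-safe container name from a user name and a PBS job ID.
# Any character outside [a-zA-Z0-9_.-] is replaced with "_".
# job_container_name is a hypothetical helper, not part of Docker or PBS.
job_container_name() {
    # $1 = user name, $2 = job ID ("nojob" is used when it is empty)
    printf '%s-%s' "$1" "${2:-nojob}" | tr -c 'a-zA-Z0-9_.-' '_'
}
```

Inside a job script this could be used as follows (image and command are placeholders):

docker run --name "$(job_container_name "$USER" "$PBS_JOBID")" something something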

It would be really helpful, assuming the root cause of this cannot be fixed or it's sporadic. I believe when we discussed this once, in some cases the signal to end the docker run was not making it all the way to the container due to the way it was run. I can dig out the old Git discussion.
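If the lost-signal theory is right, one possible mitigation is for the job script to trap the signal itself and stop the container by name. This is only a sketch under that assumption, not a confirmed fix: `run_named_container` is a made-up helper, and it cannot help if the script is killed with SIGKILL before the trap runs.

```shell
#!/bin/sh
# Run a container under a known name and ask Docker to stop it if the
# job script receives TERM, INT, or HUP.
# run_named_container is a hypothetical helper for illustration only.
run_named_container() {
    name="$1"; shift
    # The trap reads $name when the signal arrives.
    trap 'docker stop "$name" >/dev/null 2>&1' TERM INT HUP
    # Run in the background so the shell can handle signals while waiting.
    docker run --rm --name "$name" "$@" &
    wait "$!"
    status=$?
    trap - TERM INT HUP
    return "$status"
}
```

Usage would look like: run_named_container "$USER-$PBS_JOBID" something something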
