Deadlock when process killed #109

@phargogh

When a subprocess managed by taskgraph is killed by the operating system, the pool automatically spawns a replacement process, but it does not recreate the events that had been allocated to the killed process. Because of this, the graph deadlocks waiting on events that will never be triggered.

To reproduce:

  1. Create a new graph in multiprocessed mode (n_workers >= 1)
  2. Execute a task
  3. Kill the worker process running that task before the task completes
  4. Observe graph hanging
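
The same failure mode can be shown without taskgraph. This is a minimal sketch, assuming taskgraph's multiprocessed mode sits on top of a `multiprocessing.Pool` (which the respawning behavior described above suggests). The names `_kill_self` and `result_is_lost` are hypothetical, and `SIGKILL` plus the `fork` start method make this POSIX-only. A timeout stands in for the hang so that the script actually returns.

```python
import multiprocessing
import os
import signal


def _kill_self():
    """Simulate the OS killing the worker process in the middle of a task."""
    os.kill(os.getpid(), signal.SIGKILL)


def result_is_lost(timeout=2.0):
    """Return True if the result of the killed task never arrives."""
    # "fork" is POSIX-only; it is used here so the sketch can run as a plain
    # script without an ``if __name__ == "__main__"`` guard.
    context = multiprocessing.get_context("fork")
    with context.Pool(processes=1) as pool:
        async_result = pool.apply_async(_kill_self)
        try:
            async_result.get(timeout=timeout)
        except multiprocessing.TimeoutError:
            # The pool replaced the dead worker, but nothing ever marks this
            # result as finished, so a wait without a timeout would block
            # forever.
            return True
    return False
```

Calling `result_is_lost()` should return `True`: the pool stays alive and usable, but whatever was waiting on the killed task is never notified.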

A practical way to trigger this is to run in a memory-constrained environment such as Sherlock. On Sherlock, just make sure at least one task uses more memory than was requested for the SLURM job.

Although I suppose it might be ideal to have the appropriate events recreated so the graph can continue to execute, I think it might be better to simply detect that the process has been terminated and then terminate the graph.
