STAT/LaunchMON occasionally hangs #44

@dongahn

As noted in #43, @lee218llnl reported:

1. Occasionally I get hangs with STAT, particularly after running it multiple times. It appears to be in lmon_fe.cxx on line 4601 in a pthread_cond_timedwait. I don’t know if this is an actual effect or just a correlation, but it seems that if I subsequently attach TV to the job and then detach it, I am able to attach again with STAT.

2. I have also seen hang-like behavior (looping) in cobo, in cobo_connect_hostname. This also appears to happen if I aggressively attach/detach/attach STAT multiple times.

I suspect the first hang (in pthread_cond_timedwait) is due to FIFO handling within jsrun, but I need a simple reproducer to confirm or rule this out.

I suspect the second (the looping in cobo_connect_hostname) is due to a problem with the colocation service within jsrun, but again I need a simple reproducer to confirm or rule this out.
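
A harness for such a reproducer might look like the sketch below. It runs an attach command repeatedly and records every run that exceeds a timeout, which is how an intermittent hang would show up. Everything here is an assumption for illustration: `run_repeatedly` is a hypothetical helper, the command to run is left to the caller (the exact STAT invocation is not given in this issue), and the default timeout is arbitrary.

```python
import subprocess


def run_repeatedly(cmd, iterations=10, timeout_s=60.0):
    """Run `cmd` `iterations` times and return the indices of runs that timed out.

    `cmd` is an argument list, e.g. the tool's attach command followed by the
    launcher PID. It is a placeholder: this sketch does not know the real
    invocation. A run counts as "hung" if it does not exit within `timeout_s`.
    """
    hung = []
    for i in range(iterations):
        try:
            # subprocess.run kills the child itself when the timeout expires,
            # so a hung run does not leak a process into the next iteration.
            subprocess.run(
                cmd,
                timeout=timeout_s,
                check=False,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            hung.append(i)
    return hung
```

Calling it with the attach command as an argument list and a larger `iterations` value gives a rough hang rate. Running the same loop against a job launched without jsrun would help separate a jsrun-side cause from one inside LaunchMON or cobo.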
