Poseidon doesn't wait for allocation start #717

@MrSerth

Today, Oct 14, 2024 at 14:03:20 UTC, we saw another occurrence of POSEIDON-G.

This time, we got the following message:

communication with executor failed: nomad error during file copy: error executing command in job 29-0b12ae9d-8a35-11ef-9212-fa163eb9b043: error executing command in allocation: task "default-task" not started yet.

Hence, we should investigate when this occurs and how to handle it properly. Luckily, the issue affected only a single learner and resolved itself shortly afterwards (once the task had started).

Labels: bug (Something isn't working)