The CTS uses job state tracking not only to inform users about the state of their job, but also to ensure that parallel requests to the CTS can't split a job into multiple executions. This matters because the job state completion endpoints don't require authorization, which avoids having to store CTS admin credentials off site. Allowing each specific job state transition to occur only once lets a single CTS instance "own" handling the next transition. Any further request for the same transition fails, which prevents the execution from being split.
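The "only once" rule is essentially a compare-and-set on the job state. The following is a minimal sketch of that idea, not the CTS implementation: `JobStateStore`, `transition`, and `race` are hypothetical names, and an in-memory lock stands in for whatever atomic update the real state store provides.

```python
import threading


class JobStateStore:
    """Hypothetical in-memory stand-in for the CTS job state store."""

    def __init__(self, initial_state: str) -> None:
        self._state = initial_state
        self._lock = threading.Lock()

    def transition(self, expected: str, new: str) -> bool:
        """Move to `new` only if the job is still in `expected`.

        Returns True for the single caller that wins the transition and
        False for everyone else.
        """
        with self._lock:
            if self._state != expected:
                return False
            self._state = new
            return True


def race(store: JobStateStore, expected: str, new: str, contenders: int = 8) -> list[bool]:
    """Issue the same transition request from several threads at once."""
    results: list[bool] = []

    def attempt() -> None:
        results.append(store.transition(expected, new))

    threads = [threading.Thread(target=attempt) for _ in range(contenders)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

However many requests arrive in parallel, exactly one of them sees `True` and goes on to handle the next phase. The rest see `False` and stop.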
For the HTCondor based jobs, this state management was duplicated so that state is communicated to the user in a uniform way. However, the HTC jobs use different endpoints for job state updates than the JAWS based jobs. The HTC endpoints are authenticated, and the HTC jobs manage their own state by informing the server of it. For JAWS based jobs, the server manages the job state and calls the NERSC SFAPI after each state transition to start the next job phase.
Normally this doesn't cause any issues. However, if HTCondor restarts a sub job that has already transitioned into the job_submitting state, the restart will always fail. The restarted sub job tries to transition to job_submitting again, and the server disallows the transition because the job must be in the download_submitted state for it to succeed.
This is not ideal, and we should think about how to allow HTCondor to restart CTS jobs successfully. One possibility is to relax the requirement that HTCondor jobs transition exactly from one state to the next, but that would also allow the same job to run more than once at the same time.
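The trade-off can be made concrete with a small sketch. It is illustrative only: `Job`, `TransitionError`, and `start_subjob` are hypothetical names, and the "relaxed" rule shown (also accepting job_submitting as a starting state) is just one assumed way of loosening the check, not a proposed design.

```python
class TransitionError(Exception):
    """Raised when a requested state transition is not allowed."""


class Job:
    """Hypothetical stand-in for a CTS job record."""

    def __init__(self, state: str) -> None:
        self.state = state

    def transition(self, allowed_from: set[str], new: str) -> None:
        """Move to `new` if the current state is one of `allowed_from`."""
        if self.state not in allowed_from:
            raise TransitionError(f"cannot move from {self.state!r} to {new!r}")
        self.state = new


def start_subjob(job: Job, relaxed: bool) -> bool:
    """Simulate a sub job reporting job_submitting; True means it may proceed."""
    allowed = {"download_submitted"}
    if relaxed:
        # Assumed relaxation: a job already in job_submitting may report it again.
        allowed.add("job_submitting")
    try:
        job.transition(allowed, "job_submitting")
    except TransitionError:
        return False
    return True
```

With `relaxed=False`, the first call succeeds and a restarted sub job is rejected, which is the failure described above. With `relaxed=True`, the restart succeeds, but the server can no longer tell a restart from a second copy of the same sub job running concurrently, since both make the same request.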