replaced squeue calls with sacct for more robust interface #100
| out_search = re.search(out_pattern, str(squeue_output)) | ||
| if out_search: | ||
| return out_search.group(1) | ||
| # state_command = f'squeue -j {str(jobid)} -o "%T"' |
Can we delete the old code instead of commenting it out? If it's ever needed again, we can always go back through the git history and see what we had before.
| # 29319673|COMPLETED| | ||
| # 29319673.batch|COMPLETED| | ||
| # 29319673.0|COMPLETED| | ||
| pattern = f'{jobid}\|([A-Z]+)\|' |
Is `jobid` the Slurm job number or the Slurm job name? What if someone uses the same name for multiple jobs? I don't think Slurm technically forbids having the same name twice (even if it is confusing).
Looking at the function header, it seems to be the Slurm job number, which corresponds to:
SLURM_JOB_ID (and SLURM_JOBID for backwards compatibility)
The ID of the job allocation.
Maybe we can add a link in the docstring and refer to the section "OUTPUT ENVIRONMENT VARIABLES" here: https://slurm.schedmd.com/sbatch.html
| # deniz: sacct is much better and persistent compared to squeue. Also | ||
| # getoutput returns standard strings compared to byte strings. This | ||
| # allows easier regex | ||
| command = f'sacct -j {str(jobid)} --parsable --format=jobid,State' |
This looks much cleaner than what I hacked together this morning for one of my utilities:
`squeue -u $(whoami) -o "%Z %30j %T %M" | grep ${PROJECT_BASE} | cut -f 2- -d' ' | sort`
I wish I had learned sacct earlier... 😭
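For reference, here is a minimal sketch of how the `sacct` output quoted in the diff comments could be parsed. It is not the code in this PR: the function names are hypothetical, the pattern is anchored to the start of a line so that the `.batch` and `.0` step records are skipped, and it assumes the state may carry trailing text (for example `CANCELLED by <uid>`), of which only the first word is kept.

```python
import re
import subprocess

# Sample text in the shape shown in the diff comments
# (sacct --parsable --format=jobid,State).
SAMPLE_SACCT_OUTPUT = """JobID|State|
29319673|COMPLETED|
29319673.batch|COMPLETED|
29319673.0|COMPLETED|
"""


def parse_job_state(sacct_output, jobid):
    """Return the state of the main job record, or None if it is not listed."""
    # Anchor at the start of a line and require "|" right after the ID, so
    # that "29319673.batch" and IDs that merely contain jobid do not match.
    pattern = re.compile(rf"^{re.escape(str(jobid))}\|([^|]+)\|", re.MULTILINE)
    match = pattern.search(sacct_output)
    if match is None:
        return None
    # Keep only the first word (assumption: extra text may follow the state).
    return match.group(1).split()[0]


def query_job_state(jobid):
    """Hypothetical wrapper: run sacct with the flags used in the PR and parse it."""
    output = subprocess.getoutput(
        f"sacct -j {jobid} --parsable --format=jobid,State"
    )
    return parse_job_state(output, jobid)
```

Keeping the parsing separate from the `sacct` call means the regex can be unit tested against canned output without a Slurm cluster.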
| @staticmethod | ||
| def job_is_still_running(jobid): | ||
| """Returns a boolean if the job is still running""" |
| def job_is_still_running(jobid): | ||
| """Returns a boolean if the job is still running""" | ||
| return psutil.pid_exists(jobid) | ||
| # """Returns a boolean if the job is still running""" |
Same as above: maybe we can remove the old code directly rather than leaving it commented out. That might also make the diffs easier to read.
As discussed today with @mandresm, step 1 of the more robust Slurm interface is done.
Basically, these additions use the more persistent `sacct` command instead of the `squeue` commands. I also replaced some of the `subprocess` calls for easier regex parsing.
Of course, it is important to consider the recent chunk work from @dbarbi before taking everything for granted.
I tested these functions in a unit test suite. Here are the results:
Some takeaways for @dbarbi:
I am not happy with `esm_runscripts` submitting the `tidy` job even if the run fails for some reason. E.g. the first month fails and `esm_runscripts` keeps re-submitting the later months. There is a neat way of overcoming this, and this PR is the first step towards it.
In the second stage, I will implement a check in the `tidy` method that looks at the output of the previous Slurm job and then decides whether or not to submit the next job.
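To make the second stage concrete, here is a rough sketch of the kind of check I have in mind. Nothing here exists in `esm_runscripts` yet: both function names are hypothetical, and the sketch assumes the decision is based on the job state reported by `sacct`, with an unknown state treated like a failure.

```python
def should_submit_next_job(previous_state):
    """Return True only if the previous job is known to have completed."""
    # An unknown state (None) is treated like a failure, so a broken run
    # is not silently continued.
    return previous_state == "COMPLETED"


def plan_submissions(states_by_month):
    """Return the months that would be submitted, stopping after the first failure.

    states_by_month is a list of (month, final_state) pairs in run order.
    """
    submitted = []
    previous_state = "COMPLETED"  # nothing has failed before the first month
    for month, state in states_by_month:
        if not should_submit_next_job(previous_state):
            break
        submitted.append(month)
        previous_state = state
    return submitted
```

With this policy, a failure in the first month means the later months are never submitted, instead of being re-submitted one after another.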