Add retry logic to sacctmgr ping in start-services.sh #509

Open: azreenz wants to merge 2 commits into master from azreenzaman/sacct-retry
azreenz (Collaborator) commented May 21, 2026

start-services.sh would exit with an error code when the initial sacctmgr ping failed, then exit successfully on the second pass of jetpack converge. This happens when the database is in a different region and latency is high. This PR adds retry logic so the cluster does not go red during the initial converge.
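
For illustration, here is a minimal sketch of the kind of bounded retry described above. It is not the code in this PR: `ping_dbd` is a hypothetical stand-in for `sacctmgr ping` (it simulates a daemon that starts responding on the third call), and the attempt count and delay are assumed values.

```shell
#!/usr/bin/env bash
# Illustrative only: ping_dbd is a hypothetical stand-in for `sacctmgr ping`.
attempts=5   # assumed value, not taken from the PR
delay=0      # seconds between attempts; a real script would wait longer
calls=0

ping_dbd() {
    # Simulate a daemon that starts responding on the third call.
    calls=$((calls + 1))
    [ "$calls" -ge 3 ]
}

wait_for_dbd() {
    local i=1 rc=1
    while [ "$i" -le "$attempts" ]; do
        ping_dbd
        rc=$?                      # save the status before running anything else
        if [ "$rc" -eq 0 ]; then
            echo "slurmdbd responded on attempt $i"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "ERROR: slurmdbd not responding to ping after $attempts attempts (last rc=$rc)"
    return 2
}

wait_for_dbd
```

Run as written, the sketch prints `slurmdbd responded on attempt 3`. Returning a non-zero status only after the final attempt is what keeps a slow first response from failing the whole converge.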

Copilot AI review requested due to automatic review settings May 21, 2026 17:58
Copilot AI (Contributor) left a comment

Pull request overview

This PR updates start-services.sh to reduce false startup failures when slurmdbd is slow to become responsive (e.g., due to cross-region DB latency) by retrying sacctmgr ping before failing converge.

Changes:

  • Add a retry loop around sacctmgr ping after starting slurmdbd via systemctl.
  • Improve the failure message to indicate retries were attempted.

| Criterion | Max Points | Points Awarded | Notes |
|---|---|---|---|
| PR Description Accuracy | 20 | 20 | Description matches the change (retrying initial sacctmgr ping). |
| PR Atomicity | 20 | 20 | Single focused change. |
| Logical Implementation | 10 | 10 | Retry approach is straightforward. |
| Regression Risk | 10 | 0 | Current `$?` usage after the loop is fragile/misleading and could regress diagnosability/behavior if the loop body changes. |
| Exception Handling | 10 | 0 | The final failure condition does not reliably reference the sacctmgr ping exit status. |
| Code Comments | 10 | 10 | Comment accurately describes intent. |
| Repetitive Code | 10 | 10 | No meaningful duplication introduced. |
| Spelling | 5 | 5 | No spelling issues found in the diff. |
| Logging Quality | 5 | 5 | Messages are clear enough for operators. |

FINAL SCORE: 80/100

RECOMMENDATION: MERGE WITH FOLLOW-UPS
RATIONALE: The retry behavior aligns with the PR goal, but the exit-status handling should be tightened to ensure the code is checking and reporting the real sacctmgr ping result.
BLOCKERS:

  • Fix the post-loop exit-status handling so it is based on the sacctmgr ping result (not the exit code of a subsequent [ ... ] test).
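
The general pitfall behind this blocker can be shown with a small sketch. It does not reproduce the reviewed diff; `failing_ping` is a hypothetical stand-in for a failing `sacctmgr ping`, and the `[ 1 -eq 1 ]` test stands in for any command that runs between the ping and the read of `$?`.

```shell
#!/usr/bin/env bash
# Illustrative only: failing_ping is a hypothetical stand-in for `sacctmgr ping`.
failing_ping() { return 3; }

# Fragile: by the time $? is read, it holds the status of the [ ... ] test,
# not the status of the ping.
failing_ping
[ 1 -eq 1 ]
fragile_rc=$?

# Safer: save the status immediately after the command of interest.
failing_ping
ping_rc=$?
[ 1 -eq 1 ]

echo "fragile_rc=$fragile_rc ping_rc=$ping_rc"
```

This prints `fragile_rc=0 ping_rc=3`: the first variant reports success even though the ping failed, which is why the status is best captured in a variable right after the ping and checked from that variable afterwards.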

Comment thread azure-slurm-install/start-services.sh Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment on lines +27 to 29
if [ "$i" == "$attempts" ] && [ "$ping_rc" -ne 0 ]; then
    echo "ERROR: slurmdbd started but is not responding to sacctmgr ping after $attempts attempts"
    exit 2