Add retry logic to sacctmgr ping in start-services.sh#509
Open
azreenz wants to merge 2 commits into
Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates start-services.sh to reduce false startup failures when slurmdbd is slow to become responsive (e.g., due to cross-region DB latency) by retrying sacctmgr ping before failing converge.
Changes:
- Add a retry loop around
sacctmgr pingafter startingslurmdbdviasystemctl. - Improve the failure message to indicate retries were attempted.
| Criterion | Max Points | Points Awarded | Notes |
|---|---|---|---|
| PR Description Accuracy | 20 | 20 | Description matches the change (retrying initial sacctmgr ping). |
| PR Atomicity | 20 | 20 | Single focused change. |
| Logical Implementation | 10 | 10 | Retry approach is straightforward. |
| Regression Risk | 10 | 0 | Current $? usage after the loop is fragile/misleading and could regress diagnosability/behavior if the loop body changes. |
| Exception Handling | 10 | 0 | The final failure condition does not reliably reference the sacctmgr ping exit status. |
| Code Comments | 10 | 10 | Comment accurately describes intent. |
| Repetitive Code | 10 | 10 | No meaningful duplication introduced. |
| Spelling | 5 | 5 | No spelling issues found in the diff. |
| Logging Quality | 5 | 5 | Messages are clear enough for operators. |
FINAL SCORE: 80/100
RECOMMENDATION: MERGE WITH FOLLOW-UPS
RATIONALE: The retry behavior aligns with the PR goal, but the exit-status handling should be tightened to ensure the code is checking and reporting the real sacctmgr ping result.
BLOCKERS:
- Fix the post-loop exit-status handling so it is based on the
sacctmgr pingresult (not the exit code of a subsequent[ ... ]test).
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Comment on lines
+27
to
29
| if [ "$i" == "$attempts" ] && [ "$ping_rc" -ne 0 ]; then | ||
| echo "ERROR: slurmdbd started but is not responding to sacctmgr ping after $attempts attempts" | ||
| exit 2 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Start-services.sh would exit with error code when pinging sacctmgr initially but then exits successfully after second pass of jetpack converge if database is in a different region and there is high latency. Added retry logic to prevent cluster from going red initially during converge