CASM-5771 Fix for RR Ceph issue #6537
Draft
sravani-sanigepalli wants to merge 5 commits into release/1.7
jpdavis-prof
reviewed
Mar 4, 2026
Comment on lines +146 to +151
    echo "Recreating secret..."

    kubectl delete secret -n loftsman site-init --ignore-not-found

    kubectl create secret -n loftsman generic site-init \
        --from-file="${tmpdir}/customizations.yaml"
Contributor
Suggested change
    - echo "Recreating secret..."
    - kubectl delete secret -n loftsman site-init --ignore-not-found
    - kubectl create secret -n loftsman generic site-init \
    -     --from-file="${tmpdir}/customizations.yaml"
    + echo "Updating secret..."
    + kubectl create secret -n loftsman generic site-init \
    +     --from-file="${tmpdir}/customizations.yaml" \
    +     --dry-run=client -o yaml | kubectl apply -f -
    echo "Daemons still transitioning..."

    # Ensure all MONs are in quorum
    elif ! ceph quorum_status --format json | grep -q '"quorum_names"'; then
Contributor
I think the "quorum_names" key always exists in the output, so this check only confirms that the key is present, not that all MONs are in quorum.
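One possible way to address this, sketched under assumptions: compare the monitor names listed in the monmap against the names in `quorum_names`, instead of grepping for the key. The helper name `all_mons_in_quorum` is hypothetical and not part of this PR, and the sketch assumes that `jq` is available on the node and that the `ceph quorum_status --format json` output carries monitor names under `.monmap.mons[].name`.

```bash
# Hypothetical helper (not part of the PR): reads the JSON printed by
# `ceph quorum_status --format json` on stdin and succeeds only when the
# set of monitor names in the monmap equals the set in quorum_names.
# Assumes jq is installed and that the monmap lists names at .monmap.mons[].name.
all_mons_in_quorum() {
    jq -e '(.monmap.mons | map(.name) | sort) == (.quorum_names | sort)' > /dev/null
}
```

In the wait loop this could stand in for the grep, for example `elif ! ceph quorum_status --format json | all_mons_in_quorum; then`. Because `jq -e` exits non-zero when the comparison yields `false`, a monitor that is in the monmap but missing from quorum keeps the loop waiting.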
Comment on lines +163 to +165
    updated_manifest=$(echo "$manifest_yaml" \
        | NEW_MONS="$new_mons" yq e \
        '.spec.charts[].values.cephExporter.endpoints = (strenv(NEW_MONS) | split("\n") | map(select(length > 0)))' -)
Contributor
I think this creates .values.cephExporter.endpoints on all charts, even charts that don't already have it.
Suggested change
    - updated_manifest=$(echo "$manifest_yaml" \
    -     | NEW_MONS="$new_mons" yq e \
    -     '.spec.charts[].values.cephExporter.endpoints = (strenv(NEW_MONS) | split("\n") | map(select(length > 0)))' -)
    + updated_manifest=$(echo "$manifest_yaml" \
    +     | NEW_MONS="$new_mons" yq e \
    +     '(.spec.charts[] | select(.values.cephExporter) | .values.cephExporter.endpoints) = (strenv(NEW_MONS) | split("\n") | map(select(length > 0)))' -)
Comment on lines +50 to +52
    # Check for any non-running ceph daemons
    if ceph orch ps | grep -E 'starting|stopped|error|unknown' > /dev/null; then
        echo "Daemons still transitioning..."
Contributor
Suggested change
    - # Check for any non-running ceph daemons
    - if ceph orch ps | grep -E 'starting|stopped|error|unknown' > /dev/null; then
    -     echo "Daemons still transitioning..."
    + # Check for any non-running ceph daemons
    + if ceph orch ps --format json | jq -e '[.[].status_desc | select(test("starting|stopped|error|unknown"))] | length > 0' > /dev/null; then
    +     echo "Daemons still transitioning..."
Comment on lines +299 to +305
    ```bash
    for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
        if [ "$host" != "ncn-s001" ]; then
            ceph orch host label rm $host _admin
        fi
    done
    ```
Contributor
Suggested change
    - ```bash
    - for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
    -     if [ "$host" != "ncn-s001" ]; then
    -         ceph orch host label rm $host _admin
    -     fi
    - done
    - ```
    + ```bash
    + for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
    +     if [ "$host" != "ncn-s001" ]; then
    +         ceph orch host label rm "$host" _admin
    +     fi
    + done
    + ```
    ```bash
    for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
        ceph orch host label add $host _admin
Contributor
Suggested change
    - ceph orch host label add $host _admin
    + ceph orch host label add "$host" _admin
    ```

    - The desired value for `configuration_status` is `configured`. If it is `pending`, then wait for the status to change to `configured`.
    + The desired value for `configuration_status` is `configured`. If it is `pending`, then wait for the status to change to be `configured`.
Contributor
Suggested change
    - The desired value for `configuration_status` is `configured`. If it is `pending`, then wait for the status to change to be `configured`.
    + The desired value for `configuration_status` is `configured`. If it is `pending`, then wait for the status to become `configured`.
Contributor
This pull-request has not had activity in over 20 days and is being marked as stale.
Description
Resolves
An issue observed on CSCS ALPS:
When Rack Resiliency is enabled, the canary storage node rollout (ncn-s001) changed the k8s monitors from 3 to 5 and picked a completely new set of nodes to run them (from s001, s002, s003 to s008, s009, s010, s011, s012). This caused the following issues:
- Ceph commands hang on the management master and storage nodes (except for ncn-s002 in the case of ALPS).
- Grafana dashboards show wrong information.
- When k8s pods are restarted, they are stuck in the init state because they cannot mount their PVCs (output added below).
This happens because the storage node rollout (RR Ansible plays) fails to update the Ceph configuration on the storage nodes (/etc/ceph/ceph.conf), customizations.yaml, and the k8s config maps shown below:
The following changes are made as part of the fix:
Checklist
.github/CODEOWNERS with the corresponding team in Cray-HPE.