CASM-5771 Fix for RR Ceph issue#6537

Draft
sravani-sanigepalli wants to merge 5 commits into release/1.7 from CASM-5771
Conversation


@sravani-sanigepalli sravani-sanigepalli commented Mar 3, 2026

Description

Resolves

An issue observed on CSCS ALPS -

When Rack Resiliency is enabled, the canary storage node rollout (ncn-s001) changed the number of Ceph monitors known to k8s from 3 to 5 and picked a completely new set of nodes to run the monitors (from ncn-s001/2/3 to ncn-s008/9/10/11/12). This caused the following issues:

  • Ceph commands hang on management master and storage nodes (except for ncn-s002 in the case of ALPS)
  • Grafana dashboards show incorrect information
  • When k8s pods are restarted, they are stuck in the init state because they cannot mount PVCs (output added below)

Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    14m                  default-scheduler  Successfully assigned services/cray-power-control-bitnami-etcd-2 to ncn-w003
  Warning  FailedMount  4m24s (x9 over 12m)  kubelet            MountVolume.MountDevice failed for volume "pvc-ce7deb3a-e97c-45c1-a931-2853a0cab7ac" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0024-4e2bc4d8-1658-11f1-b57e-42010afc0100-000000000000000b-c071f8a9-0125-4004-a852-f7ac928c6e2d already exists
  Warning  FailedMount  22s (x3 over 12m)    kubelet            MountVolume.MountDevice failed for volume "pvc-ce7deb3a-e97c-45c1-a931-2853a0cab7ac" : rpc error: code = DeadlineExceeded desc = context deadline exceeded 

This is because the storage node rollout (RR Ansible plays) failed to update the Ceph configuration on the storage nodes (/etc/ceph/ceph.conf), customizations.yaml, and the k8s configmaps shown below:

kubectl -n ceph-rdb edit cm ceph-csi-config
kubectl -n ceph-cephfs edit cm ceph-csi-config
kubectl -n default edit cm ceph-csi-config
kubectl -n services edit cm ceph-csi-config
kubectl -n backups edit cm ceph-etc
kubectl edit -n loftsman cm loftsman-cray-sysmgmt-health
kubectl get secrets -n loftsman site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d > customizations.yaml
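
To illustrate what updating those configmaps involves, here is a hedged sketch of rewriting the monitor list inside a ceph-csi config.json payload. The clusterID, IP addresses, and jq filter are illustrative examples, not taken from the actual fix script, which operates on the live configmaps listed above.

```shell
# Hypothetical sketch: replace the monitor list in a ceph-csi config.json
# payload. The clusterID and IPs are made-up examples.
csi_config='[{"clusterID":"4e2bc4d8-1658-11f1-b57e-42010afc0100","monitors":["10.252.1.4:6789","10.252.1.5:6789","10.252.1.6:6789"]}]'
new_mons='["10.252.1.11:6789","10.252.1.12:6789","10.252.1.13:6789"]'

# Replace the monitors array while leaving the clusterID untouched
updated=$(echo "$csi_config" | jq --argjson mons "$new_mons" '.[0].monitors = $mons')

echo "$updated" | jq -r '.[0].monitors[0]'   # → 10.252.1.11:6789
```

In the real fix the updated JSON would then be written back with `kubectl apply` against each of the configmaps above.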

The following changes were made as part of the fix:

  • Add the _admin label to all storage nodes before starting the storage node rollouts. With this, any configuration changes automatically propagate to all storage nodes (including updates to /etc/ceph/ceph.conf). The label is removed again once all storage node rollouts are finished.
  • Added a script that runs after the first storage node rollout, which:
    1. Waits for the Ceph services to settle (they are modified as part of the rr.ceph_zoning role in the rack_resiliency_for_mgmt_nodes.yml playbook)
    2. Updates the various configmaps with the latest monitor configuration, which fixes the PVC mount failures on pod restart and the Grafana dashboard issues
    3. Copies the latest ceph.conf to all master nodes
  • Added troubleshooting steps for a potential timing-related corner case that may result in the CFS session created during the canary storage node rollout becoming stuck.
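
The "wait for Ceph services to settle" step can be sketched as a small predicate over `ceph orch ps --format json` output. The function name and sample JSON below are illustrative, not taken from the fix script:

```shell
# Illustrative helper (not the actual fix script): succeeds when no daemon
# reported by `ceph orch ps --format json` is in a transitioning state.
daemons_settled() {
  echo "$1" | jq -e \
    '[.[] | select(.status_desc | test("starting|stopped|error|unknown"))] | length == 0' \
    > /dev/null
}

# In the real loop this input would come from: ceph orch ps --format json
sample='[{"daemon_name":"mon.ncn-s001","status_desc":"running"}]'
if daemons_settled "$sample"; then
  echo "settled"   # → settled
fi
```

Matching on the JSON `status_desc` field avoids accidental matches elsewhere in the human-readable table output.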

Checklist

  • If I added any command snippets, the steps they belong to follow the prompt conventions (see example).
  • If I added a new directory, I also updated .github/CODEOWNERS with the corresponding team in Cray-HPE.
  • My commits or Pull-Request Title contain my JIRA information, or I do not have a JIRA.

Comment on lines +146 to +151
echo "Recreating secret..."

kubectl delete secret -n loftsman site-init --ignore-not-found

kubectl create secret -n loftsman generic site-init \
--from-file="${tmpdir}/customizations.yaml"

Suggested change
echo "Recreating secret..."
kubectl delete secret -n loftsman site-init --ignore-not-found
kubectl create secret -n loftsman generic site-init \
--from-file="${tmpdir}/customizations.yaml"
echo "Updating secret..."
kubectl create secret -n loftsman generic site-init \
--from-file="${tmpdir}/customizations.yaml" \
--dry-run=client -o yaml | kubectl apply -f -

echo "Daemons still transitioning..."

# Ensure all MONs are in quorum
elif ! ceph quorum_status --format json | grep -q '"quorum_names"'; then

I think the "quorum_names" key always exists in the output, so I think this is just confirming that the key exists, not that "all MONs are in quorum".
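
A stricter check, sketched here against a made-up payload standing in for `ceph quorum_status --format json`, would compare the quorum size to the monmap size rather than only grepping for the key:

```shell
# Illustrative only: this sample JSON stands in for the live output of
# `ceph quorum_status --format json`.
status='{"quorum_names":["a","b","c"],"monmap":{"mons":[{"name":"a"},{"name":"b"},{"name":"c"}]}}'

in_quorum=$(echo "$status" | jq '.quorum_names | length')
total=$(echo "$status" | jq '.monmap.mons | length')

if [ "$in_quorum" -eq "$total" ]; then
  echo "all MONs in quorum"   # → all MONs in quorum
fi
```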

Comment on lines +163 to +165
updated_manifest=$(echo "$manifest_yaml" \
| NEW_MONS="$new_mons" yq e \
'.spec.charts[].values.cephExporter.endpoints = (strenv(NEW_MONS) | split("\n") | map(select(length > 0)))' -)

I think this creates .values.cephExporter.endpoints on all charts, even charts that don't already have it.

Suggested change
updated_manifest=$(echo "$manifest_yaml" \
| NEW_MONS="$new_mons" yq e \
'.spec.charts[].values.cephExporter.endpoints = (strenv(NEW_MONS) | split("\n") | map(select(length > 0)))' -)
updated_manifest=$(echo "$manifest_yaml" \
| NEW_MONS="$new_mons" yq e \
'(.spec.charts[] | select(.values.cephExporter) | .values.cephExporter.endpoints) = (strenv(NEW_MONS) | split("\n") | map(select(length > 0)))' -)

Comment on lines +50 to +52
# Check for any non-running ceph daemons
if ceph orch ps | grep -E 'starting|stopped|error|unknown' > /dev/null; then
echo "Daemons still transitioning..."

Suggested change
# Check for any non-running ceph daemons
if ceph orch ps | grep -E 'starting|stopped|error|unknown' > /dev/null; then
echo "Daemons still transitioning..."
# Check for any non-running ceph daemons
if ceph orch ps --format json | jq -e '[.[].status_desc | select(test("starting|stopped|error|unknown"))] | length > 0' > /dev/null; then
echo "Daemons still transitioning..."

Comment on lines +299 to +305
```bash
for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
if [ "$host" != "ncn-s001" ]; then
ceph orch host label rm $host _admin
fi
done
```

Suggested change
```bash
for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
if [ "$host" != "ncn-s001" ]; then
ceph orch host label rm $host _admin
fi
done
```
```bash
for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
if [ "$host" != "ncn-s001" ]; then
ceph orch host label rm "$host" _admin
fi
done
```


```bash
for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
ceph orch host label add $host _admin

Suggested change
ceph orch host label add $host _admin
ceph orch host label add "$host" _admin

```

The desired value for `configuration_status` is `configured`. If it is `pending`, then wait for the status to change to `configured`.
The desired value for `configuration_status` is `configured`. If it is `pending`, then wait for the status to change to be `configured`.

Suggested change
The desired value for `configuration_status` is `configured`. If it is `pending`, then wait for the status to change to be `configured`.
The desired value for `configuration_status` is `configured`. If it is `pending`, then wait for the status to become `configured`.

@jpdavis-prof jpdavis-prof changed the title CASM-5771 Fix for CEPH issue faced after storage node rollout when RR… CASM-5771 Fix for RR Ceph issue Mar 4, 2026
@jpdavis-prof jpdavis-prof marked this pull request as draft March 23, 2026 14:07
@github-actions

This pull-request has not had activity in over 20 days and is being marked as stale.

@github-actions github-actions bot added the Stale Hasn't had activity in over 30 days label Apr 13, 2026