CASM-5771 Fix for RR Ceph issue#6537

Draft
sravani-sanigepalli wants to merge 5 commits into release/1.7 from CASM-5771
Conversation


@sravani-sanigepalli sravani-sanigepalli commented Mar 3, 2026

Description

Resolves

An issue observed on CSCS ALPS -

When Rack Resiliency is enabled, the canary storage node rollout (ncn-s001) changed the number of Ceph monitors known to k8s from 3 to 5 and picked a completely new set of nodes to run the monitors (from ncn-s001/2/3 to ncn-s008/9/10/11/12). This caused the following issues:

  • Ceph commands hang on management master and storage nodes (except for ncn-s002 in the case of ALPS)
  • Grafana dashboards show incorrect information
  • When k8s pods are restarted, they are stuck in the init state because they cannot mount PVCs (output added below)

Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    14m                  default-scheduler  Successfully assigned services/cray-power-control-bitnami-etcd-2 to ncn-w003
  Warning  FailedMount  4m24s (x9 over 12m)  kubelet            MountVolume.MountDevice failed for volume "pvc-ce7deb3a-e97c-45c1-a931-2853a0cab7ac" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0024-4e2bc4d8-1658-11f1-b57e-42010afc0100-000000000000000b-c071f8a9-0125-4004-a852-f7ac928c6e2d already exists
  Warning  FailedMount  22s (x3 over 12m)    kubelet            MountVolume.MountDevice failed for volume "pvc-ce7deb3a-e97c-45c1-a931-2853a0cab7ac" : rpc error: code = DeadlineExceeded desc = context deadline exceeded 

This is because the storage node rollout (RR Ansible plays) failed to update the Ceph configuration on the storage nodes (/etc/ceph/ceph.conf), customizations.yaml, and the k8s configmaps shown below:

kubectl -n ceph-rdb edit cm ceph-csi-config
kubectl -n ceph-cephfs edit cm ceph-csi-config
kubectl -n default edit cm ceph-csi-config
kubectl -n services edit cm ceph-csi-config
kubectl -n backups edit cm ceph-etc
kubectl edit -n loftsman cm loftsman-cray-sysmgmt-health
kubectl get secrets -n loftsman site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d > customizations.yaml
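
To illustrate what updating those configmaps involves, here is a hedged sketch of rewriting the monitor list inside a ceph-csi config.json payload. The clusterID, IP addresses, and jq filter are illustrative examples, not taken from the actual fix script, which operates on the live configmaps listed above.

```shell
# Hypothetical sketch: replace the monitor list in a ceph-csi config.json
# payload. The clusterID and IPs are made-up examples.
csi_config='[{"clusterID":"4e2bc4d8-1658-11f1-b57e-42010afc0100","monitors":["10.252.1.4:6789","10.252.1.5:6789","10.252.1.6:6789"]}]'
new_mons='["10.252.1.11:6789","10.252.1.12:6789","10.252.1.13:6789"]'

# Replace the monitors array while leaving the clusterID untouched
updated=$(echo "$csi_config" | jq --argjson mons "$new_mons" '.[0].monitors = $mons')

echo "$updated" | jq -r '.[0].monitors[0]'   # → 10.252.1.11:6789
```

In the real fix the updated JSON would then be written back with `kubectl apply` against each of the configmaps above.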

The following changes were made as part of the fix:

  • Add the _admin label to all storage nodes before starting the storage node rollouts. With this, any configuration changes automatically propagate to all storage nodes (including updates to /etc/ceph/ceph.conf). The label is removed again once all storage node rollouts are finished.
  • Added a script that runs after the first storage node rollout, which:
    1. Waits for the Ceph services to settle (they are modified as part of the rr.ceph_zoning role in the rack_resiliency_for_mgmt_nodes.yml playbook)
    2. Updates the various configmaps with the latest monitor configuration, which fixes the PVC mount failures on pod restart and the Grafana dashboard issues
    3. Copies the latest ceph.conf to all master nodes
  • Added troubleshooting steps for a potential timing-related corner case that may result in the CFS session created during the canary storage node rollout becoming stuck.
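
The "wait for Ceph services to settle" step can be sketched as a small predicate over `ceph orch ps --format json` output. The function name and sample JSON below are illustrative, not taken from the fix script:

```shell
# Illustrative helper (not the actual fix script): succeeds when no daemon
# reported by `ceph orch ps --format json` is in a transitioning state.
daemons_settled() {
  echo "$1" | jq -e \
    '[.[] | select(.status_desc | test("starting|stopped|error|unknown"))] | length == 0' \
    > /dev/null
}

# In the real loop this input would come from: ceph orch ps --format json
sample='[{"daemon_name":"mon.ncn-s001","status_desc":"running"}]'
if daemons_settled "$sample"; then
  echo "settled"   # → settled
fi
```

Matching on the JSON `status_desc` field avoids accidental matches elsewhere in the human-readable table output.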

Checklist

  • If I added any command snippets, the steps they belong to follow the prompt conventions (see example).
  • If I added a new directory, I also updated .github/CODEOWNERS with the corresponding team in Cray-HPE.
  • My commits or Pull-Request Title contain my JIRA information, or I do not have a JIRA.

Comment on lines +146 to +151
echo "Recreating secret..."

kubectl delete secret -n loftsman site-init --ignore-not-found

kubectl create secret -n loftsman generic site-init \
--from-file="${tmpdir}/customizations.yaml"

Suggested change
echo "Recreating secret..."
kubectl delete secret -n loftsman site-init --ignore-not-found
kubectl create secret -n loftsman generic site-init \
--from-file="${tmpdir}/customizations.yaml"
echo "Updating secret..."
kubectl create secret -n loftsman generic site-init \
--from-file="${tmpdir}/customizations.yaml" \
--dry-run=client -o yaml | kubectl apply -f -

echo "Daemons still transitioning..."

# Ensure all MONs are in quorum
elif ! ceph quorum_status --format json | grep -q '"quorum_names"'; then

I think the "quorum_names" key always exists in the output, so I think this is just confirming that the key exists, not that "all MONs are in quorum".
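
A stricter check, sketched here against a made-up payload standing in for `ceph quorum_status --format json`, would compare the quorum size to the monmap size rather than only grepping for the key:

```shell
# Illustrative only: this sample JSON stands in for the live output of
# `ceph quorum_status --format json`.
status='{"quorum_names":["a","b","c"],"monmap":{"mons":[{"name":"a"},{"name":"b"},{"name":"c"}]}}'

in_quorum=$(echo "$status" | jq '.quorum_names | length')
total=$(echo "$status" | jq '.monmap.mons | length')

if [ "$in_quorum" -eq "$total" ]; then
  echo "all MONs in quorum"   # → all MONs in quorum
fi
```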

Comment on lines +163 to +165
updated_manifest=$(echo "$manifest_yaml" \
| NEW_MONS="$new_mons" yq e \
'.spec.charts[].values.cephExporter.endpoints = (strenv(NEW_MONS) | split("\n") | map(select(length > 0)))' -)

I think this creates .values.cephExporter.endpoints on all charts, even charts that don't already have it.

Suggested change
updated_manifest=$(echo "$manifest_yaml" \
| NEW_MONS="$new_mons" yq e \
'.spec.charts[].values.cephExporter.endpoints = (strenv(NEW_MONS) | split("\n") | map(select(length > 0)))' -)
updated_manifest=$(echo "$manifest_yaml" \
| NEW_MONS="$new_mons" yq e \
'(.spec.charts[] | select(.values.cephExporter) | .values.cephExporter.endpoints) = (strenv(NEW_MONS) | split("\n") | map(select(length > 0)))' -)

Comment on lines +50 to +52
# Check for any non-running ceph daemons
if ceph orch ps | grep -E 'starting|stopped|error|unknown' > /dev/null; then
echo "Daemons still transitioning..."

Suggested change
# Check for any non-running ceph daemons
if ceph orch ps | grep -E 'starting|stopped|error|unknown' > /dev/null; then
echo "Daemons still transitioning..."
# Check for any non-running ceph daemons
if ceph orch ps --format json | jq -e '[.[].status_desc | select(test("starting|stopped|error|unknown"))] | length > 0' > /dev/null; then
echo "Daemons still transitioning..."

Comment on lines +299 to +305
```bash
for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
if [ "$host" != "ncn-s001" ]; then
ceph orch host label rm $host _admin
fi
done
```

Suggested change
```bash
for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
if [ "$host" != "ncn-s001" ]; then
ceph orch host label rm $host _admin
fi
done
```
```bash
for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
if [ "$host" != "ncn-s001" ]; then
ceph orch host label rm "$host" _admin
fi
done
```


```bash
for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
ceph orch host label add $host _admin

Suggested change
ceph orch host label add $host _admin
ceph orch host label add "$host" _admin

```

The desired value for `configuration_status` is `configured`. If it is `pending`, then wait for the status to change to `configured`.
The desired value for `configuration_status` is `configured`. If it is `pending`, then wait for the status to change to be `configured`.

Suggested change
The desired value for `configuration_status` is `configured`. If it is `pending`, then wait for the status to change to be `configured`.
The desired value for `configuration_status` is `configured`. If it is `pending`, then wait for the status to become `configured`.

@jpdavis-prof jpdavis-prof changed the title CASM-5771 Fix for CEPH issue faced after storage node rollout when RR… CASM-5771 Fix for RR Ceph issue Mar 4, 2026
@jpdavis-prof jpdavis-prof marked this pull request as draft March 23, 2026 14:07
@github-actions

This pull-request has not had activity in over 20 days and is being marked as stale.

@github-actions github-actions bot added the Stale Hasn't had activity in over 30 days label Apr 13, 2026