NO-JIRA: Fixed certificate verification error after cluster shutdown by vimauro · Pull Request #67 · openshift-eng/two-node-toolbox

vimauro · 2026-05-07T09:14:39Z

When shutting down a cluster, we rotate all the certificates, including the loadbalancer-serving-signer used by the KubeConfig.
This causes access issues when you startup the cluster again (can be bypassed via --insecure-skip-tls-verify )
Initially I thought about having a script to update the kubeconfig and proxy.env , but then I opted for a different solution where we rotate only certificates that are expiring within the next 7 days (the root certs have 10years validity by default).
Usually only aggregator-client-signer has a 24h expiration set by default

Added also some extra timer logging for Ansible

openshift-ci-robot · 2026-05-07T09:14:43Z

@vimauro: This pull request explicitly references no jira issue.

Details

In response to this:

When shutting down a cluster, we rotate all the certificates, including the loadbalancer-serving-signer used by the KubeConfig.
This causes access issues when you startup the cluster again (can be bypassed via --insecure-skip-tls-verify )
Initially I thought about having a script to update the kubeconfig and proxy.env , but then I opted for a different solution where we rotate only certificates that are expiring within the next 7 days (the root certs have 10years validity by default).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-05-07T09:14:43Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2026-05-07T09:14:47Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vimauro
Once this PR has been reviewed and has the lgtm label, please assign jerpeter1 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-07T09:14:47Z

Walkthrough

Enables Ansible task profiling callbacks, clarifies shutdown messaging about short-lived certificates, and changes the certificate refresh script to selectively refresh signer secrets whose expiry is within a configurable threshold (default 168 hours), with targeted deletion, polling, and updated reporting.

Changes

Certificate Refresh Flow (shutdown → refresh script)

Layer / File(s)	Summary
CLI / Usage `helpers/refresh-certificates.sh`	Usage/help updated to describe refreshing expiring signer certificates and adds `--threshold HOURS` (default 168).
Configuration / Constants `helpers/refresh-certificates.sh`	Adds `DEFAULT_THRESHOLD_HOURS=168`, input validation for `--threshold`, and computes `THRESHOLD_SECONDS`.
Expiry Evaluation `helpers/refresh-certificates.sh`	Replaces unconditional renewal with logic that reads each signer's `auth.openshift.io/certificate-not-after` annotation, parses to epoch, computes remaining time, and builds `SIGNERS_TO_REFRESH` for those within threshold.
Control Flow / Regeneration `helpers/refresh-certificates.sh`	Early-exits when no signers qualify; deletes only selected signer secrets; polls up to 60s (5s interval) for recreation of each secret.
Reporting / Messaging `helpers/refresh-certificates.sh`	Prints updated expiry timestamps for refreshed signers and conditions completion/warnings on actual regeneration results (removes blanket "fresh 24-hour validity" claim).
Invocation Messaging `deploy/openshift-clusters/scripts/shutdown-cluster.sh`	Console message changed to reference "Refreshing short-lived certificates before shutdown..." when invoking the refresh script.

Ansible Profiling Configuration

Layer / File(s)	Summary
Workflow Configuration `deploy/openshift-clusters/ansible.cfg`	Adds `callbacks_enabled = ansible.posix.profile_tasks, timer` alongside existing `stdout_callback = debug`. Removes trailing whitespace from `stdout_callback` line.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Operator
    participant ShutdownScript as "Shutdown Script"
    participant RefreshScript as "refresh-certificates.sh"
    participant K8sAPI as "Kubernetes API"

    Operator->>ShutdownScript: run shutdown sequence
    ShutdownScript->>Operator: prints "Refreshing short-lived certificates..."
    ShutdownScript->>RefreshScript: invoke refresh with threshold
    RefreshScript->>K8sAPI: list signer secrets & read annotations
    K8sAPI-->>RefreshScript: return signer list + certificate-not-after
    RefreshScript->>RefreshScript: compute remaining time, select signers to refresh
    alt no signers selected
        RefreshScript->>ShutdownScript: exit early (no action)
    else signers selected
        RefreshScript->>K8sAPI: delete selected signer secrets
        K8sAPI-->>RefreshScript: deletion confirmed
        RefreshScript->>K8sAPI: poll for recreated secrets (up to 60s)
        K8sAPI-->>RefreshScript: secrets recreated / or timeout
        RefreshScript->>ShutdownScript: print updated expiries or warnings
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Microsoft Presidio Analyzer (2.2.362)

helpers/refresh-certificates.sh

Microsoft Presidio Analyzer failed to scan this file

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

helpers/refresh-certificates.sh (1)

153-173: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refresh completion check is too weak and can produce false success.

Line 155 ignores oc delete failure, and Line 166-Line 173 only checks if secrets exist. That can pass without actual rotation. Verify deletion success and wait until certificate-not-after changes for each signer.

Suggested hardening

+declare -A OLD_EXPIRY
+# (populate OLD_EXPIRY when adding signer to SIGNERS_TO_REFRESH)

 echo "Refreshing ${`#SIGNERS_TO_REFRESH`[@]} signer(s)..."
 for signer in "${SIGNERS_TO_REFRESH[@]}"; do
     echo "  Deleting ${signer}..."
-    oc delete secret "${signer}" -n openshift-kube-apiserver-operator --ignore-not-found=true
+    if ! oc delete secret "${signer}" -n openshift-kube-apiserver-operator --ignore-not-found=true; then
+        echo "Error: failed to delete ${signer}"
+        exit 1
+    fi
 done
@@
-    for signer in "${SIGNERS_TO_REFRESH[@]}"; do
-        if ! oc get secret "${signer}" -n openshift-kube-apiserver-operator &>/dev/null; then
+    for signer in "${SIGNERS_TO_REFRESH[@]}"; do
+        NEW_EXPIRY=$(oc get secret "${signer}" -n openshift-kube-apiserver-operator \
+            -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}' 2>/dev/null || true)
+        if [[ -z "${NEW_EXPIRY}" || "${NEW_EXPIRY}" == "${OLD_EXPIRY[${signer}]}" ]]; then
             ALL_EXIST=false
             break
         fi
     done

Also applies to: 182-187

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helpers/refresh-certificates.sh` around lines 153 - 173, The current refresh
loop only ignores delete failures and merely checks for secret existence so it
can report a false success; update the flow around SIGNERS_TO_REFRESH so you (1)
fail if oc delete secret returns a non-zero status for any signer instead of
silently continuing, (2) before deleting each secret read and save the current
certificate-not-after (e.g. from oc get secret ... -o
jsonpath='{.data.tls\\.crt}' | base64 -d | openssl x509 -noout -enddate') into a
map keyed by signer, then after deletion poll each signer's secret and compare
the new certificate-not-after to the saved value, and only consider that signer
rotated when the timestamp changes (use the existing TIMEOUT/ELAPSED loop and
ALL_EXIST logic to wait but replace the existence check with
certificate-not-after comparison); apply the same stronger checks to the similar
block referenced around lines 182-187.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helpers/refresh-certificates.sh`:
- Around line 31-40: The script reads $2 into PROXY_ENV and THRESHOLD_HOURS in
the case branches and later uses THRESHOLD_HOURS in arithmetic without
validation; add explicit checks in the --proxy-env and --threshold handlers (in
the case block handling those options) to ensure a following argument exists and
does not start with '-' (treat as missing), and for --threshold validate the
value is a non-empty integer (e.g., regex '^[0-9]+$') before assigning to
THRESHOLD_HOURS; if validation fails, print a clear usage/error and exit (or set
a safe default), and before any arithmetic conversion of THRESHOLD_HOURS (where
it's currently used) re-check/normalize it to a numeric value to avoid crashes.

---

Outside diff comments:
In `@helpers/refresh-certificates.sh`:
- Around line 153-173: The current refresh loop only ignores delete failures and
merely checks for secret existence so it can report a false success; update the
flow around SIGNERS_TO_REFRESH so you (1) fail if oc delete secret returns a
non-zero status for any signer instead of silently continuing, (2) before
deleting each secret read and save the current certificate-not-after (e.g. from
oc get secret ... -o jsonpath='{.data.tls\\.crt}' | base64 -d | openssl x509
-noout -enddate') into a map keyed by signer, then after deletion poll each
signer's secret and compare the new certificate-not-after to the saved value,
and only consider that signer rotated when the timestamp changes (use the
existing TIMEOUT/ELAPSED loop and ALL_EXIST logic to wait but replace the
existence check with certificate-not-after comparison); apply the same stronger
checks to the similar block referenced around lines 182-187.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: e0e649fd-a476-4ab4-8d88-f949c8792f0f

📥 Commits

Reviewing files that changed from the base of the PR and between 3415c82 and 6c7fa75.

📒 Files selected for processing (3)

deploy/openshift-clusters/ansible.cfg
deploy/openshift-clusters/scripts/shutdown-cluster.sh
helpers/refresh-certificates.sh

vimauro · 2026-05-07T09:23:01Z

/label tide/merge-method-squash

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

helpers/refresh-certificates.sh (1)

198-203: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing exit 1 on timeout/failure path causes silent failures in callers.

When ALL_EXIST is still false after the 60 s polling window, the script prints a warning but exits 0. The Ansible playbook (and any other caller) will treat this as success, masking a real regeneration failure.

🐛 Proposed fix

 if [[ "${ALL_EXIST}" == "true" ]]; then
     echo "Certificate refresh completed successfully!"
 else
     echo "Warning: Some certificates may still be regenerating."
     echo "Check kube-apiserver-operator logs if issues persist."
+    exit 1
 fi

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helpers/refresh-certificates.sh` around lines 198 - 203, The else branch
currently only prints warnings when ALL_EXIST is false, causing the script to
exit with status 0; change the failure path in the final if that checks
ALL_EXIST to exit with a non-zero status (e.g., exit 1) after printing the
warning messages so callers detect the timeout/failure. Ensure you modify the
if/else block that references the ALL_EXIST variable (the final certificate
refresh completion check) to keep the existing warning text and add the exit 1
on the failure branch.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@helpers/refresh-certificates.sh`:
- Around line 198-203: The else branch currently only prints warnings when
ALL_EXIST is false, causing the script to exit with status 0; change the failure
path in the final if that checks ALL_EXIST to exit with a non-zero status (e.g.,
exit 1) after printing the warning messages so callers detect the
timeout/failure. Ensure you modify the if/else block that references the
ALL_EXIST variable (the final certificate refresh completion check) to keep the
existing warning text and add the exit 1 on the failure branch.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 4bc097b8-eb93-40b6-88c3-14560db8e012

📥 Commits

Reviewing files that changed from the base of the PR and between 6c7fa75 and d6b90c0.

📒 Files selected for processing (1)

helpers/refresh-certificates.sh

lucaconsalvi

Nice fix — selective rotation is the right approach. Three minor observations below.

vimauro · 2026-05-07T13:33:53Z

/hold

vimauro · 2026-05-07T14:36:54Z

/unhold

vimauro · 2026-05-07T14:38:39Z

Tested locally, without forcing the kube-apiserver-operator patching, we can get the operator updating the cert within 2 minutes approximately

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helpers/refresh-certificates.sh`:
- Around line 130-139: The script prints "No refresh needed." and exits 0 even
when signer inspection failed (oc get secret/date parsing), which hides errors;
modify helpers/refresh-certificates.sh to track failures by introducing a
failure flag (e.g. INSPECTION_FAILED) and set it when EXPIRY is empty or
EXPIRY_EPOCH parsing yields 0 in the blocks around EXPIRY and EXPIRY_EPOCH, and
also in the analogous block at lines ~154-157; after processing signers, if
INSPECTION_FAILED is true exit with a non-zero status (instead of 0) and a clear
error message, otherwise continue to the existing "No refresh needed." success
path.
- Around line 160-176: The current loop using SIGNERS_TO_REFRESH, TIMEOUT,
ELAPSED and ALL_EXIST only verifies that secrets exist (oc get) and can
false-pass if delete failed or old secrets remain; change the check to record
each signer's pre-delete identity (e.g., metadata.resourceVersion or annotations
from oc get secret "${signer}" -o jsonpath='{.metadata.resourceVersion}') before
deletion and then in the wait loop require that the secret both exists and has a
different resourceVersion (or other cert-specific field) than the recorded value
for that signer; update the loop that uses ALL_EXIST to fetch the new
resourceVersion for each signer and only set ALL_EXIST=true when every signer
exists and has a changed resourceVersion (or cert data), with the same
TIMEOUT/ELAPSED logic retained.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 9e39f534-c7bd-4335-a98d-3717a2cd5cf1

📥 Commits

Reviewing files that changed from the base of the PR and between d6b90c0 and 06c883b.

📒 Files selected for processing (1)

helpers/refresh-certificates.sh

lucaconsalvi · 2026-05-07T15:01:59Z

/lgtm

fix: fixed certificate expiration after cluster shutdown

6c7fa75

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 7, 2026

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2026

vimauro marked this pull request as ready for review May 7, 2026 09:15

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2026

openshift-ci Bot requested review from jaypoulz and slintes May 7, 2026 09:15

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

Comment thread helpers/refresh-certificates.sh

openshift-ci Bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label May 7, 2026

addressed CR comment

d6b90c0

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

Comment thread helpers/refresh-certificates.sh

lucaconsalvi reviewed May 7, 2026

View reviewed changes

Comment thread helpers/refresh-certificates.sh

Comment thread helpers/refresh-certificates.sh

Comment thread helpers/refresh-certificates.sh

openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2026

addessed PR comments

06c883b

openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2026

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

Comment thread helpers/refresh-certificates.sh

Comment thread helpers/refresh-certificates.sh

openshift-ci Bot assigned lucaconsalvi May 7, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 7, 2026

vimauro changed the title ~~NO-JIRA: Fixed certificate expiration after cluster shutdown~~ NO-JIRA: Fixed certificate verification error after cluster shutdown May 7, 2026

Conversation

vimauro commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci-robot commented May 7, 2026

Uh oh!

openshift-ci Bot commented May 7, 2026

Uh oh!

openshift-ci Bot commented May 7, 2026

Uh oh!

coderabbitai Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

vimauro commented May 7, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lucaconsalvi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vimauro commented May 7, 2026

Uh oh!

vimauro commented May 7, 2026

Uh oh!

vimauro commented May 7, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

lucaconsalvi commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vimauro commented May 7, 2026 •

edited

Loading

coderabbitai Bot commented May 7, 2026 •

edited

Loading