
NO-JIRA: Fixed certificate verification error after cluster shutdown #67

Open
vimauro wants to merge 3 commits into openshift-eng:main from vimauro:update-cert-rotation-logic

Conversation

@vimauro
Contributor

@vimauro vimauro commented May 7, 2026

When shutting down a cluster, we rotate all the certificates, including the loadbalancer-serving-signer used by the kubeconfig.
This causes TLS verification errors when you start the cluster up again (they can be bypassed with --insecure-skip-tls-verify).
Initially I considered a script to update the kubeconfig and proxy.env, but I opted for a different solution: rotate only the certificates that expire within the next 7 days (the root certs have a 10-year validity by default).
Usually only aggregator-client-signer falls inside that window, since it has a 24h expiration by default.
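
For illustration, the selection rule described above could look roughly like the sketch below. The `needs_refresh` helper and its arguments are hypothetical names, not taken from the script, and treating a certificate that expires exactly at the threshold as "expiring" is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when a certificate should be refreshed.
# Arguments: expiry in epoch seconds, current time in epoch seconds, threshold in hours.
needs_refresh() {
    local expiry_epoch="$1" now_epoch="$2" threshold_hours="$3"
    local threshold_seconds=$(( threshold_hours * 3600 ))
    local remaining=$(( expiry_epoch - now_epoch ))
    # Already-expired certificates (negative remaining time) also qualify.
    (( remaining <= threshold_seconds ))
}

now=1000000
# Expires in 24 hours: inside the 168-hour (7-day) window, so it is selected.
if needs_refresh $(( now + 24 * 3600 )) "${now}" 168; then
    echo "short-lived signer: refresh"
fi
# Expires in roughly 10 years: far outside the window, so it is skipped.
if ! needs_refresh $(( now + 3650 * 24 * 3600 )) "${now}" 168; then
    echo "long-lived signer: skip"
fi
```

With a 7-day threshold, a 24h signer is always selected while a 10-year root cert is left alone, which is what keeps the kubeconfig valid across a shutdown.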

Also added some extra timer logging for Ansible.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 7, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2026
@openshift-ci-robot

@vimauro: This pull request explicitly references no jira issue.


@openshift-ci

openshift-ci Bot commented May 7, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vimauro
Once this PR has been reviewed and has the lgtm label, please assign jerpeter1 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented May 7, 2026

Walkthrough

Enables Ansible task profiling callbacks, clarifies shutdown messaging about short-lived certificates, and changes the certificate refresh script to selectively refresh signer secrets whose expiry is within a configurable threshold (default 168 hours), with targeted deletion, polling, and updated reporting.

Changes

Certificate Refresh Flow (shutdown → refresh script)

| Layer | File(s) | Summary |
|---|---|---|
| CLI / Usage | helpers/refresh-certificates.sh | Usage/help updated to describe refreshing expiring signer certificates and adds --threshold HOURS (default 168). |
| Configuration / Constants | helpers/refresh-certificates.sh | Adds DEFAULT_THRESHOLD_HOURS=168, input validation for --threshold, and computes THRESHOLD_SECONDS. |
| Expiry Evaluation | helpers/refresh-certificates.sh | Replaces unconditional renewal with logic that reads each signer's auth.openshift.io/certificate-not-after annotation, parses to epoch, computes remaining time, and builds SIGNERS_TO_REFRESH for those within threshold. |
| Control Flow / Regeneration | helpers/refresh-certificates.sh | Early-exits when no signers qualify; deletes only selected signer secrets; polls up to 60s (5s interval) for recreation of each secret. |
| Reporting / Messaging | helpers/refresh-certificates.sh | Prints updated expiry timestamps for refreshed signers and conditions completion/warnings on actual regeneration results (removes blanket "fresh 24-hour validity" claim). |
| Invocation Messaging | deploy/openshift-clusters/scripts/shutdown-cluster.sh | Console message changed to reference "Refreshing short-lived certificates before shutdown..." when invoking the refresh script. |
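
The "Expiry Evaluation" row above (parse the annotation to epoch, compute remaining time) might be sketched as follows. The helper names `to_epoch` and `hours_remaining` are hypothetical, the timestamps are illustrative values rather than real cluster data, and the sketch assumes GNU `date` (the `-d` option behaves differently on BSD/macOS).

```shell
#!/usr/bin/env bash
# Convert an RFC 3339 timestamp to epoch seconds; prints 0 when parsing fails.
# Assumption: GNU date is available.
to_epoch() {
    # GNU date treats an empty string as "today", so reject it explicitly.
    if [[ -z "$1" ]]; then
        echo 0
        return
    fi
    date -u -d "$1" +%s 2>/dev/null || echo 0
}

# Whole hours between a reference time and the expiry timestamp.
hours_remaining() {
    local expiry_epoch now_epoch
    expiry_epoch=$(to_epoch "$1")
    now_epoch=$(to_epoch "$2")
    echo $(( (expiry_epoch - now_epoch) / 3600 ))
}

# Illustrative values; in the script the first argument would come from the
# auth.openshift.io/certificate-not-after annotation of the signer secret.
hours_remaining "2026-05-08T09:15:00Z" "2026-05-07T09:15:00Z"
```

Returning 0 on a parse failure gives the caller a single sentinel to check before doing arithmetic, which is the condition one of the later review comments asks the script to treat as an inspection error.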

Ansible Profiling Configuration

| Layer | File(s) | Summary |
|---|---|---|
| Workflow Configuration | deploy/openshift-clusters/ansible.cfg | Adds callbacks_enabled = ansible.posix.profile_tasks, timer alongside existing stdout_callback = debug. Removes trailing whitespace from stdout_callback line. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Operator
    participant ShutdownScript as "Shutdown Script"
    participant RefreshScript as "refresh-certificates.sh"
    participant K8sAPI as "Kubernetes API"

    Operator->>ShutdownScript: run shutdown sequence
    ShutdownScript->>Operator: prints "Refreshing short-lived certificates..."
    ShutdownScript->>RefreshScript: invoke refresh with threshold
    RefreshScript->>K8sAPI: list signer secrets & read annotations
    K8sAPI-->>RefreshScript: return signer list + certificate-not-after
    RefreshScript->>RefreshScript: compute remaining time, select signers to refresh
    alt no signers selected
        RefreshScript->>ShutdownScript: exit early (no action)
    else signers selected
        RefreshScript->>K8sAPI: delete selected signer secrets
        K8sAPI-->>RefreshScript: deletion confirmed
        RefreshScript->>K8sAPI: poll for recreated secrets (up to 60s)
        K8sAPI-->>RefreshScript: secrets recreated / or timeout
        RefreshScript->>ShutdownScript: print updated expiries or warnings
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Microsoft Presidio Analyzer (2.2.362)
helpers/refresh-certificates.sh

Microsoft Presidio Analyzer failed to scan this file


Comment @coderabbitai help to get the list of available commands and usage tips.

@vimauro vimauro marked this pull request as ready for review May 7, 2026 09:15
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2026
@openshift-ci openshift-ci Bot requested review from jaypoulz and slintes May 7, 2026 09:15

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
helpers/refresh-certificates.sh (1)

153-173: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refresh completion check is too weak and can produce false success.

Line 155 ignores a failed oc delete, and Lines 166-173 only check whether the secrets exist. That check can pass without any actual rotation. Verify that each deletion succeeded, and wait until certificate-not-after changes for each signer.

Suggested hardening
+declare -A OLD_EXPIRY
+# (populate OLD_EXPIRY when adding signer to SIGNERS_TO_REFRESH)

 echo "Refreshing ${#SIGNERS_TO_REFRESH[@]} signer(s)..."
 for signer in "${SIGNERS_TO_REFRESH[@]}"; do
     echo "  Deleting ${signer}..."
-    oc delete secret "${signer}" -n openshift-kube-apiserver-operator --ignore-not-found=true
+    if ! oc delete secret "${signer}" -n openshift-kube-apiserver-operator --ignore-not-found=true; then
+        echo "Error: failed to delete ${signer}"
+        exit 1
+    fi
 done
@@
-    for signer in "${SIGNERS_TO_REFRESH[@]}"; do
-        if ! oc get secret "${signer}" -n openshift-kube-apiserver-operator &>/dev/null; then
+    for signer in "${SIGNERS_TO_REFRESH[@]}"; do
+        NEW_EXPIRY=$(oc get secret "${signer}" -n openshift-kube-apiserver-operator \
+            -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}' 2>/dev/null || true)
+        if [[ -z "${NEW_EXPIRY}" || "${NEW_EXPIRY}" == "${OLD_EXPIRY[${signer}]}" ]]; then
             ALL_EXIST=false
             break
         fi
     done

Also applies to: 182-187

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helpers/refresh-certificates.sh` around lines 153 - 173, The current refresh
loop only ignores delete failures and merely checks for secret existence so it
can report a false success; update the flow around SIGNERS_TO_REFRESH so you (1)
fail if oc delete secret returns a non-zero status for any signer instead of
silently continuing, (2) before deleting each secret read and save the current
certificate-not-after (e.g. from oc get secret ... -o
jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate) into a
map keyed by signer, then after deletion poll each signer's secret and compare
the new certificate-not-after to the saved value, and only consider that signer
rotated when the timestamp changes (use the existing TIMEOUT/ELAPSED loop and
ALL_EXIST logic to wait but replace the existence check with
certificate-not-after comparison); apply the same stronger checks to the similar
block referenced around lines 182-187.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helpers/refresh-certificates.sh`:
- Around line 31-40: The script reads $2 into PROXY_ENV and THRESHOLD_HOURS in
the case branches and later uses THRESHOLD_HOURS in arithmetic without
validation; add explicit checks in the --proxy-env and --threshold handlers (in
the case block handling those options) to ensure a following argument exists and
does not start with '-' (treat as missing), and for --threshold validate the
value is a non-empty integer (e.g., regex '^[0-9]+$') before assigning to
THRESHOLD_HOURS; if validation fails, print a clear usage/error and exit (or set
a safe default), and before any arithmetic conversion of THRESHOLD_HOURS (where
it's currently used) re-check/normalize it to a numeric value to avoid crashes.
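
A minimal sketch of the kind of validation requested for --threshold, under stated assumptions: `parse_threshold` is a hypothetical function, and the real script handles more options (such as --proxy-env) inside its own case block. Only DEFAULT_THRESHOLD_HOURS and the `^[0-9]+$` check come from the review text.

```shell
#!/usr/bin/env bash
DEFAULT_THRESHOLD_HOURS=168

# Hypothetical parser: prints the validated threshold, or fails with a message.
parse_threshold() {
    local threshold="${DEFAULT_THRESHOLD_HOURS}"
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --threshold)
                # Reject a missing value or one that looks like another option.
                if [[ $# -lt 2 || "$2" == -* ]]; then
                    echo "Error: --threshold requires a value" >&2
                    return 1
                fi
                # Accept only non-negative integers.
                if [[ ! "$2" =~ ^[0-9]+$ ]]; then
                    echo "Error: --threshold must be an integer number of hours" >&2
                    return 1
                fi
                # Force base 10 so that a value like "08" is not read as octal.
                threshold=$(( 10#$2 ))
                shift 2
                ;;
            *)
                # Other options are ignored in this sketch.
                shift
                ;;
        esac
    done
    echo "${threshold}"
}

parse_threshold --threshold 48
parse_threshold
parse_threshold --threshold abc || echo "rejected"
```

Normalizing with `10#` before any later arithmetic avoids the crash the comment warns about when the value has a leading zero.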


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: e0e649fd-a476-4ab4-8d88-f949c8792f0f

📥 Commits

Reviewing files that changed from the base of the PR and between 3415c82 and 6c7fa75.

📒 Files selected for processing (3)
  • deploy/openshift-clusters/ansible.cfg
  • deploy/openshift-clusters/scripts/shutdown-cluster.sh
  • helpers/refresh-certificates.sh

Comment thread helpers/refresh-certificates.sh
@vimauro
Contributor Author

vimauro commented May 7, 2026

/label tide/merge-method-squash

@openshift-ci openshift-ci Bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label May 7, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
helpers/refresh-certificates.sh (1)

198-203: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing exit 1 on timeout/failure path causes silent failures in callers.

When ALL_EXIST is still false after the 60 s polling window, the script prints a warning but exits 0. The Ansible playbook (and any other caller) will treat this as success, masking a real regeneration failure.

🐛 Proposed fix
 if [[ "${ALL_EXIST}" == "true" ]]; then
     echo "Certificate refresh completed successfully!"
 else
     echo "Warning: Some certificates may still be regenerating."
     echo "Check kube-apiserver-operator logs if issues persist."
+    exit 1
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helpers/refresh-certificates.sh` around lines 198 - 203, The else branch
currently only prints warnings when ALL_EXIST is false, causing the script to
exit with status 0; change the failure path in the final if that checks
ALL_EXIST to exit with a non-zero status (e.g., exit 1) after printing the
warning messages so callers detect the timeout/failure. Ensure you modify the
if/else block that references the ALL_EXIST variable (the final certificate
refresh completion check) to keep the existing warning text and add the exit 1
on the failure branch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 4bc097b8-eb93-40b6-88c3-14560db8e012

📥 Commits

Reviewing files that changed from the base of the PR and between 6c7fa75 and d6b90c0.

📒 Files selected for processing (1)
  • helpers/refresh-certificates.sh

Comment thread helpers/refresh-certificates.sh
Contributor

@lucaconsalvi lucaconsalvi left a comment


Nice fix — selective rotation is the right approach. Three minor observations below.

Comment thread helpers/refresh-certificates.sh
Comment thread helpers/refresh-certificates.sh
Comment thread helpers/refresh-certificates.sh
@vimauro
Contributor Author

vimauro commented May 7, 2026

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2026
@vimauro
Contributor Author

vimauro commented May 7, 2026

/unhold

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2026
@vimauro
Contributor Author

vimauro commented May 7, 2026

Tested locally: without forcing a patch of the kube-apiserver-operator, the operator regenerates the cert within approximately 2 minutes.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helpers/refresh-certificates.sh`:
- Around line 130-139: The script prints "No refresh needed." and exits 0 even
when signer inspection failed (oc get secret/date parsing), which hides errors;
modify helpers/refresh-certificates.sh to track failures by introducing a
failure flag (e.g. INSPECTION_FAILED) and set it when EXPIRY is empty or
EXPIRY_EPOCH parsing yields 0 in the blocks around EXPIRY and EXPIRY_EPOCH, and
also in the analogous block at lines ~154-157; after processing signers, if
INSPECTION_FAILED is true exit with a non-zero status (instead of 0) and a clear
error message, otherwise continue to the existing "No refresh needed." success
path.
- Around line 160-176: The current loop using SIGNERS_TO_REFRESH, TIMEOUT,
ELAPSED and ALL_EXIST only verifies that secrets exist (oc get) and can
false-pass if delete failed or old secrets remain; change the check to record
each signer's pre-delete identity (e.g., metadata.resourceVersion or annotations
from oc get secret "${signer}" -o jsonpath='{.metadata.resourceVersion}') before
deletion and then in the wait loop require that the secret both exists and has a
different resourceVersion (or other cert-specific field) than the recorded value
for that signer; update the loop that uses ALL_EXIST to fetch the new
resourceVersion for each signer and only set ALL_EXIST=true when every signer
exists and has a changed resourceVersion (or cert data), with the same
TIMEOUT/ELAPSED logic retained.
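
The second point above (require a changed identity, not just existence) could be sketched like this. `wait_for_change` is a hypothetical helper, and `fake_resource_version` is a stand-in for the real `oc get secret ... -o jsonpath='{.metadata.resourceVersion}'` call so that the example runs without a cluster; the version numbers are made up.

```shell
#!/usr/bin/env bash
# Poll until the command in "$@" prints a non-empty value that differs from
# old_value. Returns 0 on change, 1 on timeout. Timeout/interval are in seconds.
wait_for_change() {
    local old_value="$1" timeout="$2" interval="$3"
    shift 3
    local elapsed=0 current
    while (( elapsed <= timeout )); do
        current=$("$@" 2>/dev/null || true)
        # An empty value means the secret is missing or could not be read.
        if [[ -n "${current}" && "${current}" != "${old_value}" ]]; then
            return 0
        fi
        sleep "${interval}"
        elapsed=$(( elapsed + interval ))
    done
    return 1
}

# Stand-in for the oc call: reports the old version twice, then a new one.
# A file holds the call count because command substitution runs in a subshell.
COUNTER_FILE=$(mktemp)
echo 0 > "${COUNTER_FILE}"
fake_resource_version() {
    local n
    n=$(( $(cat "${COUNTER_FILE}") + 1 ))
    echo "${n}" > "${COUNTER_FILE}"
    if (( n < 3 )); then echo "1001"; else echo "1002"; fi
}

if wait_for_change "1001" 5 1 fake_resource_version; then
    echo "secret was recreated"
else
    echo "timed out waiting for rotation" >&2
fi
rm -f "${COUNTER_FILE}"
```

Comparing against the pre-delete value means a failed delete, which leaves the old secret in place, can no longer be mistaken for a successful rotation.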

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 9e39f534-c7bd-4335-a98d-3717a2cd5cf1

📥 Commits

Reviewing files that changed from the base of the PR and between d6b90c0 and 06c883b.

📒 Files selected for processing (1)
  • helpers/refresh-certificates.sh

Comment thread helpers/refresh-certificates.sh
Comment thread helpers/refresh-certificates.sh
@lucaconsalvi
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 7, 2026
@vimauro vimauro changed the title from "NO-JIRA: Fixed certificate expiration after cluster shutdown" to "NO-JIRA: Fixed certificate verification error after cluster shutdown" May 7, 2026

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants