NO-JIRA: Fixed certificate verification error after cluster shutdown #67
vimauro wants to merge 3 commits into openshift-eng:main from
Conversation
@vimauro: This pull request explicitly references no Jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vimauro. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Walkthrough

Enables Ansible task profiling callbacks, clarifies shutdown messaging about short-lived certificates, and changes the certificate refresh script to selectively refresh signer secrets whose expiry is within a configurable threshold (default 168 hours), with targeted deletion, polling, and updated reporting.

Changes

- Certificate Refresh Flow (shutdown → refresh script)
- Ansible Profiling Configuration
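The "Ansible Profiling Configuration" change is not shown in this thread; as a reference point, a minimal sketch of how task profiling is typically enabled in `ansible.cfg` (the actual diff may differ):

```ini
[defaults]
# Built-in callbacks: "timer" prints total playbook runtime,
# "profile_tasks" prints per-task timings at the end of a run.
callbacks_enabled = ansible.builtin.timer, ansible.builtin.profile_tasks
```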
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Operator
    participant ShutdownScript as "Shutdown Script"
    participant RefreshScript as "refresh-certificates.sh"
    participant K8sAPI as "Kubernetes API"
    Operator->>ShutdownScript: run shutdown sequence
    ShutdownScript->>Operator: prints "Refreshing short-lived certificates..."
    ShutdownScript->>RefreshScript: invoke refresh with threshold
    RefreshScript->>K8sAPI: list signer secrets & read annotations
    K8sAPI-->>RefreshScript: return signer list + certificate-not-after
    RefreshScript->>RefreshScript: compute remaining time, select signers to refresh
    alt no signers selected
        RefreshScript->>ShutdownScript: exit early (no action)
    else signers selected
        RefreshScript->>K8sAPI: delete selected signer secrets
        K8sAPI-->>RefreshScript: deletion confirmed
        RefreshScript->>K8sAPI: poll for recreated secrets (up to 60s)
        K8sAPI-->>RefreshScript: secrets recreated / or timeout
        RefreshScript->>ShutdownScript: print updated expiries or warnings
    end
```
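The selection step in the diagram (compute remaining time, refresh only signers whose `certificate-not-after` falls within the threshold) can be sketched as a small bash helper. The function name, variable names, and return codes below are illustrative assumptions, not code from the PR; it assumes GNU `date`:

```shell
# Decide whether a signer needs a refresh. Arguments:
#   $1 - expiry timestamp (e.g. the certificate-not-after annotation value)
#   $2 - threshold in hours (defaults to 168h = 7 days)
# Returns 0 (refresh), 1 (keep), or 2 (unparseable timestamp).
needs_refresh() {
  local expiry="$1" threshold_hours="${2:-168}"
  local expiry_epoch now_epoch remaining_hours
  expiry_epoch=$(date -d "${expiry}" +%s 2>/dev/null) || return 2
  now_epoch=$(date +%s)
  remaining_hours=$(( (expiry_epoch - now_epoch) / 3600 ))
  (( remaining_hours < threshold_hours ))
}

# Demo with a far-future expiry: well outside the 168h window
needs_refresh "2099-01-01T00:00:00Z" 168 && echo "refresh" || echo "keep"  # prints "keep"
```

A signer that has already expired yields a negative remaining time, which correctly falls inside the threshold and gets refreshed.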
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

✨ Finishing Touches: 🧪 Generate unit tests (beta)
Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Microsoft Presidio Analyzer (2.2.362): helpers/refresh-certificates.sh: Microsoft Presidio Analyzer failed to scan this file.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
helpers/refresh-certificates.sh (1)
Lines 153-173: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refresh completion check is too weak and can produce false success.

Line 155 ignores `oc delete` failure, and lines 166-173 only check whether the secrets exist. That can pass without an actual rotation. Verify deletion success and wait until `certificate-not-after` changes for each signer.

Suggested hardening

```diff
+declare -A OLD_EXPIRY
+# (populate OLD_EXPIRY when adding signer to SIGNERS_TO_REFRESH)
 echo "Refreshing ${#SIGNERS_TO_REFRESH[@]} signer(s)..."
 for signer in "${SIGNERS_TO_REFRESH[@]}"; do
   echo "  Deleting ${signer}..."
-  oc delete secret "${signer}" -n openshift-kube-apiserver-operator --ignore-not-found=true
+  if ! oc delete secret "${signer}" -n openshift-kube-apiserver-operator --ignore-not-found=true; then
+    echo "Error: failed to delete ${signer}"
+    exit 1
+  fi
 done
@@
-  for signer in "${SIGNERS_TO_REFRESH[@]}"; do
-    if ! oc get secret "${signer}" -n openshift-kube-apiserver-operator &>/dev/null; then
+  for signer in "${SIGNERS_TO_REFRESH[@]}"; do
+    NEW_EXPIRY=$(oc get secret "${signer}" -n openshift-kube-apiserver-operator \
+      -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}' 2>/dev/null || true)
+    if [[ -z "${NEW_EXPIRY}" || "${NEW_EXPIRY}" == "${OLD_EXPIRY[${signer}]}" ]]; then
       ALL_EXIST=false
       break
     fi
   done
```

Also applies to: lines 182-187
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@helpers/refresh-certificates.sh` around lines 153-173: The current refresh loop ignores delete failures and merely checks for secret existence, so it can report a false success. Update the flow around SIGNERS_TO_REFRESH so you (1) fail if `oc delete secret` returns a non-zero status for any signer instead of silently continuing, (2) before deleting each secret read and save the current certificate-not-after (e.g. from `oc get secret ... -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate`) into a map keyed by signer, then after deletion poll each signer's secret and compare the new certificate-not-after to the saved value, and only consider that signer rotated when the timestamp changes (use the existing TIMEOUT/ELAPSED loop and ALL_EXIST logic to wait, but replace the existence check with the certificate-not-after comparison); apply the same stronger checks to the similar block referenced around lines 182-187.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@helpers/refresh-certificates.sh`:
- Around lines 31-40: The script reads $2 into PROXY_ENV and THRESHOLD_HOURS in
  the case branches and later uses THRESHOLD_HOURS in arithmetic without
  validation. Add explicit checks in the --proxy-env and --threshold handlers
  to ensure a following argument exists and does not start with '-' (treat it
  as missing), and for --threshold validate that the value is a non-empty
  integer (e.g. regex '^[0-9]+$') before assigning it to THRESHOLD_HOURS. If
  validation fails, print a clear usage/error message and exit (or fall back
  to a safe default), and before any arithmetic use of THRESHOLD_HOURS
  re-check/normalize it to a numeric value to avoid crashes.
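The validation the comment asks for can be sketched as follows. The option names (`--proxy-env`, `--threshold`) and variable names come from the review; the surrounding parsing loop is an assumption, not a copy of the script:

```shell
# Sketch of defensive option parsing: each option that takes a value
# verifies that a value is actually present before consuming it.
THRESHOLD_HOURS=168  # default: 7 days
PROXY_ENV=""         # optional proxy environment file (illustrative)
while [[ $# -gt 0 ]]; do
  case "$1" in
    --proxy-env)
      # Reject a missing value or one that looks like the next option
      if [[ -z "${2:-}" || "$2" == -* ]]; then
        echo "Error: --proxy-env requires a value" >&2; exit 1
      fi
      PROXY_ENV="$2"; shift 2 ;;
    --threshold)
      # Require a non-empty, purely numeric value
      if [[ ! "${2:-}" =~ ^[0-9]+$ ]]; then
        echo "Error: --threshold requires a non-negative integer" >&2; exit 1
      fi
      THRESHOLD_HOURS="$2"; shift 2 ;;
    *)
      echo "Unknown option: $1" >&2; exit 1 ;;
  esac
done
```

Because THRESHOLD_HOURS is guaranteed numeric after this loop, later arithmetic such as `(( remaining < THRESHOLD_HOURS ))` cannot crash on a malformed value.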
---
Outside diff comments:
In `@helpers/refresh-certificates.sh`:
- Around line 153-173: The current refresh loop only ignores delete failures and
merely checks for secret existence so it can report a false success; update the
flow around SIGNERS_TO_REFRESH so you (1) fail if oc delete secret returns a
non-zero status for any signer instead of silently continuing, (2) before
deleting each secret read and save the current certificate-not-after (e.g. from
oc get secret ... -o jsonpath='{.data.tls\\.crt}' | base64 -d | openssl x509
-noout -enddate) into a map keyed by signer, then after deletion poll each
signer's secret and compare the new certificate-not-after to the saved value,
and only consider that signer rotated when the timestamp changes (use the
existing TIMEOUT/ELAPSED loop and ALL_EXIST logic to wait but replace the
existence check with certificate-not-after comparison); apply the same stronger
checks to the similar block referenced around lines 182-187.
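The expiry comparison above hinges on extracting `notAfter` from the certificate in each signer secret. A self-contained sketch of that parsing step; the helper name and the throwaway demo certificate are assumptions (against a cluster you would feed in the secret's decoded `tls.crt` instead):

```shell
# Extract the notAfter date from a PEM certificate file with openssl,
# as the review proposes doing for each signer secret's tls.crt.
cert_not_after() {
  openssl x509 -noout -enddate -in "$1" | cut -d'=' -f2
}

# Demo: generate a short-lived self-signed certificate and read its expiry
tmpdir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo-signer" \
  -days 1 -keyout "${tmpdir}/key.pem" -out "${tmpdir}/cert.pem" 2>/dev/null
cert_not_after "${tmpdir}/cert.pem"   # e.g. "Jan  1 12:00:00 2026 GMT"
```

Saving this value per signer before deletion, then requiring it to change after recreation, is what distinguishes a real rotation from a secret that was never deleted.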
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: e0e649fd-a476-4ab4-8d88-f949c8792f0f
📒 Files selected for processing (3)
- deploy/openshift-clusters/ansible.cfg
- deploy/openshift-clusters/scripts/shutdown-cluster.sh
- helpers/refresh-certificates.sh
/label tide/merge-method-squash
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
helpers/refresh-certificates.sh (1)
Lines 198-203: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing `exit 1` on the timeout/failure path causes silent failures in callers.

When ALL_EXIST is still false after the 60 s polling window, the script prints a warning but exits 0. The Ansible playbook (and any other caller) will treat this as success, masking a real regeneration failure.

🐛 Proposed fix

```diff
 if [[ "${ALL_EXIST}" == "true" ]]; then
   echo "Certificate refresh completed successfully!"
 else
   echo "Warning: Some certificates may still be regenerating."
   echo "Check kube-apiserver-operator logs if issues persist."
+  exit 1
 fi
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@helpers/refresh-certificates.sh` around lines 198 - 203, The else branch currently only prints warnings when ALL_EXIST is false, causing the script to exit with status 0; change the failure path in the final if that checks ALL_EXIST to exit with a non-zero status (e.g., exit 1) after printing the warning messages so callers detect the timeout/failure. Ensure you modify the if/else block that references the ALL_EXIST variable (the final certificate refresh completion check) to keep the existing warning text and add the exit 1 on the failure branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@helpers/refresh-certificates.sh`:
- Around line 198-203: The else branch currently only prints warnings when
ALL_EXIST is false, causing the script to exit with status 0; change the failure
path in the final if that checks ALL_EXIST to exit with a non-zero status (e.g.,
exit 1) after printing the warning messages so callers detect the
timeout/failure. Ensure you modify the if/else block that references the
ALL_EXIST variable (the final certificate refresh completion check) to keep the
existing warning text and add the exit 1 on the failure branch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 4bc097b8-eb93-40b6-88c3-14560db8e012
📒 Files selected for processing (1)
helpers/refresh-certificates.sh
lucaconsalvi left a comment
Nice fix — selective rotation is the right approach. Three minor observations below.
/hold
/unhold
Tested locally, without forcing the
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@helpers/refresh-certificates.sh`:
- Around line 130-139: The script prints "No refresh needed." and exits 0 even
when signer inspection failed (oc get secret/date parsing), which hides errors;
modify helpers/refresh-certificates.sh to track failures by introducing a
failure flag (e.g. INSPECTION_FAILED) and set it when EXPIRY is empty or
EXPIRY_EPOCH parsing yields 0 in the blocks around EXPIRY and EXPIRY_EPOCH, and
also in the analogous block at lines ~154-157; after processing signers, if
INSPECTION_FAILED is true exit with a non-zero status (instead of 0) and a clear
error message, otherwise continue to the existing "No refresh needed." success
path.
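The failure-tracking the comment asks for can be sketched as below. The variable names EXPIRY-style inspection and INSPECTION_FAILED mirror those quoted in the review; the helper function and the standalone demo calls are illustrative assumptions (assumes GNU `date`):

```shell
# Track signer-inspection failures instead of silently treating them
# as "no refresh needed".
INSPECTION_FAILED=false

# Flag the failure cases the review names: an empty expiry annotation,
# or an expiry string that does not parse to a valid epoch.
check_signer() {
  local expiry="$1"
  if [[ -z "${expiry}" ]]; then
    INSPECTION_FAILED=true; return
  fi
  local epoch
  epoch=$(date -d "${expiry}" +%s 2>/dev/null || echo 0)
  if (( epoch == 0 )); then
    INSPECTION_FAILED=true
  fi
}

check_signer ""            # missing annotation -> flagged
check_signer "not-a-date"  # unparseable timestamp -> flagged
if [[ "${INSPECTION_FAILED}" == "true" ]]; then
  echo "Error: could not inspect one or more signer secrets" >&2
fi
```

After the signer loop, the script would exit non-zero when the flag is set, and only otherwise fall through to the existing "No refresh needed." success path.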
- Around line 160-176: The current loop using SIGNERS_TO_REFRESH, TIMEOUT,
ELAPSED and ALL_EXIST only verifies that secrets exist (oc get) and can
false-pass if delete failed or old secrets remain; change the check to record
each signer's pre-delete identity (e.g., metadata.resourceVersion or annotations
from oc get secret "${signer}" -o jsonpath='{.metadata.resourceVersion}') before
deletion and then in the wait loop require that the secret both exists and has a
different resourceVersion (or other cert-specific field) than the recorded value
for that signer; update the loop that uses ALL_EXIST to fetch the new
resourceVersion for each signer and only set ALL_EXIST=true when every signer
exists and has a changed resourceVersion (or cert data), with the same
TIMEOUT/ELAPSED logic retained.
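The resourceVersion-based wait described above can be sketched with the fetch step injected as a function, so the loop logic is visible without a cluster. In the real script the fetcher would be `oc get secret "${signer}" -n openshift-kube-apiserver-operator -o jsonpath='{.metadata.resourceVersion}'`; all names here are illustrative:

```shell
# Wait until the fetched resourceVersion differs from the pre-delete
# value, i.e. the secret was actually recreated, not merely present.
#   $1 - resourceVersion recorded before deletion
#   $2 - name of a function that prints the current resourceVersion
#   $3 - timeout in seconds (default 60)
wait_for_rotation() {
  local old_rv="$1" fetch_rv="$2" timeout="${3:-60}" elapsed=0
  local new_rv
  while (( elapsed < timeout )); do
    new_rv=$("${fetch_rv}" 2>/dev/null || true)
    if [[ -n "${new_rv}" && "${new_rv}" != "${old_rv}" ]]; then
      return 0   # recreated with a new identity
    fi
    sleep 1; elapsed=$(( elapsed + 1 ))
  done
  return 1       # timed out without observing a change
}

# Demo fetcher: pretends the secret was recreated with a new resourceVersion
fake_fetch() { echo "1002"; }
wait_for_rotation "1001" fake_fetch 5 && echo "rotated"   # prints "rotated"
```

A secret whose delete silently failed keeps its old resourceVersion, so this check cannot false-pass the way a bare existence check can.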
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 9e39f534-c7bd-4335-a98d-3717a2cd5cf1
📒 Files selected for processing (1)
helpers/refresh-certificates.sh
/lgtm
When shutting down a cluster, we rotate all the certificates, including the loadbalancer-serving-signer used by the kubeconfig. This causes access issues when you start the cluster up again (it can be bypassed via --insecure-skip-tls-verify).

Initially I thought about having a script to update the kubeconfig and proxy.env, but then I opted for a different solution where we rotate only the certificates that expire within the next 7 days (the root certs have 10 years of validity by default). Usually only aggregator-client-signer has a 24h expiration set by default.

Also added some extra timer logging for Ansible.
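One hedged way to eyeball which signers are close to expiry before and after a run (namespace and annotation key as referenced in the review comments above; verify against your cluster):

```shell
oc get secrets -n openshift-kube-apiserver-operator \
  -o custom-columns='NAME:.metadata.name,NOT-AFTER:.metadata.annotations.auth\.openshift\.io/certificate-not-after'
```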