OCPBUGS-60588: [TNF] support restore for pacemaker-managed etcd#1521
Conversation
|
Skipping CI for Draft Pull Request. |
WalkthroughAdds TNF-specific etcd restore/disable scripts and helpers, extends etcd-common-tools with stop/wait and completion messaging, makes ScriptController deploy topology-aware scripts using an Infrastructure lister and EnvVarGetter, and adds unit tests for the script deployment behavior. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes ✨ Finishing touches
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
| return 0 | ||
| fi | ||
|
|
||
| if systemctl status pacemaker >/dev/null 2>&1 ; then |
There was a problem hiding this comment.
this is smart, until somebody else comes around and uses pacemaker on the control plane 🤣
Can we make this more specific? there is likely the infra CRD somewhere on /etc/kubernetes/manifest.
Otherwise, this is also a advantage of having your own restore script installed by the operator only on TNF - that gives you the guarantee that you definitely run on TNF, allowing you to remove this altogether
| function get_etcd_advertise_ip() { | ||
| # Try to detect IP from etcd.env (NODE_<nodename>_IP variable) | ||
| NODENAME_UNDERSCORE=$(echo "${NODENAME}" | tr '-' '_') | ||
| NODE_IP_VAR="NODE_${NODENAME_UNDERSCORE}_IP" | ||
| IP="${!NODE_IP_VAR}" | ||
| echo "$IP" | ||
| } |
There was a problem hiding this comment.
wouldn't it always be localhost:2379?
| done | ||
|
|
||
| echo "timed out waiting for etcd resources to start (timeout: $timeout seconds)" | ||
| return 1 |
There was a problem hiding this comment.
looks like you would want to exit 1 here directly, right?
There was a problem hiding this comment.
you also don't exit on the return of a 1 below
| function ensure_etcd_container_stopped() { | ||
| echo "ensure ${ETCD_CONTAINER_NAME} container is stopped" | ||
| local start=$SECONDS | ||
| local timeout=120 | ||
| local disable_etcd_called=0 | ||
| while [ $((SECONDS - start)) -lt "$timeout" ]; do | ||
| if podman container exists "${ETCD_CONTAINER_NAME}" && \ | ||
| [[ "$(podman inspect -f '{{.State.Status}}' "${ETCD_CONTAINER_NAME}")" != "exited" ]]; then | ||
|
|
||
| if [ "$disable_etcd_called" -eq 1 ]; then | ||
| sleep 1 | ||
| continue | ||
| fi | ||
|
|
||
| if ! pcs resource disable etcd; then | ||
| echo "could not disable podman-etcd. Please ensure the ${ETCD_CONTAINER_NAME} container is stopped" | ||
| exit 1 | ||
| fi | ||
|
|
||
| disable_etcd_called=1 | ||
| continue | ||
| fi | ||
|
|
||
| echo "${ETCD_CONTAINER_NAME} container is stopped" | ||
| return 0 | ||
| done | ||
|
|
||
| echo "timed out waiting for ${ETCD_CONTAINER_NAME} to stop" | ||
| return 1 | ||
| } |
There was a problem hiding this comment.
probably a good candidate for ./disable-etcd.sh as a TNF specific script
|
|
||
| # Extend restore flags with Pacemaker-specific arguments | ||
| ETCDCTL_RESTORE_FLAGS+=( | ||
| --skip-hash-check=true |
There was a problem hiding this comment.
this is definitely something you should keep and not skip
There was a problem hiding this comment.
podman-etcd also makes copies (actually using cp) of the etcd db when a learner starts, which seemed to be incompatible with the hash check.
I was indeed considering to give some "special" name to such backups and let the script decide whether to use or not the hash check
There was a problem hiding this comment.
which seemed to be incompatible with the hash check.
oof, that's probably something for us to take a deeper look at - I assume those are corrupted?
There was a problem hiding this comment.
It seems you were right about the snapshot being corrupted, as I haven't reproduced the error since
| --name="$NODENAME" | ||
| --initial-cluster="$NODENAME=https://$IP:2380" | ||
| --initial-advertise-peer-urls="https://$IP:2380" |
There was a problem hiding this comment.
I wonder if you really need this, given that you just go and force-new-cluster in the end
There was a problem hiding this comment.
yes, I tried first without and failed with some "mismatch" error, IIRC
| echo "could not wait for etcd resources to restart" | ||
| fi | ||
|
|
||
| cat << 'EOF' |
There was a problem hiding this comment.
this is actually a really awesome idea, thanks for adding it
| $ oc get co/etcd -w | ||
| (Wait to see if the etcd operator completes the rollout successfully) |
There was a problem hiding this comment.
there's also oc adm wait-for-stable-cluster
| Then monitor again: | ||
| $ oc get co/etcd -w | ||
|
|
||
| 3. IF revision.json is still not being re-created on this node after the patch, manually restart the static pods: |
There was a problem hiding this comment.
pkill cluster-etcd-operator might also do the trick, even though this is slightly less nice than what you're doing below ;)
| $ oc get co/etcd -w | ||
| (Wait to see if the etcd operator completes the rollout successfully) | ||
|
|
||
| 2. IF the rollout does NOT complete, apply the forceRedeploymentReason patch: |
There was a problem hiding this comment.
should we also apply the oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'
There was a problem hiding this comment.
TNF clusters already have this set to allow external Etcd management
17397af to
ec62339
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/operator/scriptcontroller/scriptcontroller.go (1)
132-134: Use%wfor proper error wrapping.The error at line 133 uses
%vwhich loses the original error type. Using%wenables callers to useerrors.Is()anderrors.Unwrap().- if isTNF, err := ceohelpers.IsExternalEtcdCluster(ctx, c.infraLister); err != nil { - return nil, false, fmt.Errorf("failed to detect cluster topology: %v", err) + if isTNF, err := ceohelpers.IsExternalEtcdCluster(ctx, c.infraLister); err != nil { + return nil, false, fmt.Errorf("failed to detect cluster topology: %w", err)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (7)
bindata/etcd/cluster-restore-tnf.sh(1 hunks)bindata/etcd/cluster-restore.sh(1 hunks)bindata/etcd/disable-etcd-tnf.sh(1 hunks)bindata/etcd/etcd-common-tools(1 hunks)pkg/operator/scriptcontroller/scriptcontroller.go(3 hunks)pkg/operator/scriptcontroller/scriptcontroller_test.go(1 hunks)pkg/operator/starter.go(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
bindata/etcd/etcd-common-toolspkg/operator/scriptcontroller/scriptcontroller_test.gobindata/etcd/disable-etcd-tnf.shbindata/etcd/cluster-restore.shpkg/operator/starter.gobindata/etcd/cluster-restore-tnf.shpkg/operator/scriptcontroller/scriptcontroller.go
🧬 Code graph analysis (2)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
source_required_dependency(21-29)usage(34-40)bindata/etcd/disable-etcd-tnf.sh (1)
source_required_dependency(17-25)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
Enqueueable(65-67)pkg/operator/ceohelpers/external_etcd_status.go (1)
IsExternalEtcdCluster(23-35)bindata/assets.go (1)
MustAsset(15-21)
🔇 Additional comments (16)
bindata/etcd/cluster-restore.sh (1)
143-143: LGTM!Clean addition of the post-restore completion message hook. The function is properly sourced from
etcd-common-toolsearlier in the script.bindata/etcd/disable-etcd-tnf.sh (1)
1-28: LGTM!Script structure follows the established patterns from other etcd scripts. Proper root check, strict mode settings, and dependency sourcing.
bindata/etcd/etcd-common-tools (2)
96-115: Robust implementation with clear timeout handling.The function properly handles the detection flow and timeout. One consideration: the
crm_resource --locatecheck may return an error when the resource doesn't exist vs. when it's stopped. The current logic handles this by continuing to retry, which is reasonable.
117-144: LGTM!Good use of single-quoted
'EOF'to prevent variable expansion in the heredoc, keeping the$()constructs literal for user execution. The step-by-step remediation instructions are helpful.pkg/operator/starter.go (1)
471-478: LGTM!Clean integration of the infrastructure lister into the script controller. This enables topology-aware script deployment (TNF vs standard) while reusing the existing informer infrastructure already initialized in
RunOperator.pkg/operator/scriptcontroller/scriptcontroller_test.go (3)
20-35: LGTM!Clean helper function that properly constructs test dependencies with fakes. Good separation of test setup from assertions.
37-117: Well-structured table-driven tests covering topology variants.Good coverage of the three topology modes with both positive (expected content) and negative (unexpected content) assertions. The verification of common scripts presence ensures shared functionality is not accidentally removed.
119-131: LGTM!Good error case coverage. Verifying the specific error message ensures the error is from the expected code path.
bindata/etcd/cluster-restore-tnf.sh (5)
43-49: Indirect variable expansion is valid but verifyNODENAMEis set.The
${!NODE_IP_VAR}indirect expansion works correctly in Bash. The function assumesNODENAMEis set before being called; this is handled insetup_pacemaker_restoreat line 73-77 which sets it with a default fromhostname.
51-68: Consider making the expected node count configurable.The check
[ "$(echo "$output" | wc -l)" -eq 2 ]assumes exactly 2 nodes for TNF. This is correct for dual-replica topology but hardcodes the assumption. Since this is a TNF-specific script, this is acceptable.
85-92: Verify--skip-hash-check=trueis intentional.Past review discussion indicates this was added due to podman-etcd making copies that are "incompatible with the hash check." If this is due to backup corruption, that's a deeper issue to investigate. If the copies are valid but have different hashes due to the copy mechanism, this is acceptable.
Based on past review comments, this appears to be a known workaround. Please confirm the backups are not corrupted and this is expected behavior for podman-etcd copies.
1-42: LGTM!Script header, strict mode settings, root check, and dependency sourcing follow the established patterns from other etcd scripts. The simplified usage message correctly reflects that TNF restore only needs the snapshot (not static pod resources).
113-165: LGTM!The main restore flow is well-structured with proper validation and error handling. Good safeguard at line 114-117 preventing BACKUP_DIR from being the same as ETCD_DATA_DIR.
pkg/operator/scriptcontroller/scriptcontroller.go (3)
28-32: Good abstraction for testability.The
EnvVarGetterinterface cleanly abstracts the environment variable controller, enabling easier unit testing through mocking.
129-140: LGTM - topology-aware script selection is well structured.The conditional logic correctly selects TNF-specific or standard scripts based on the cluster topology, and the scripts are deployed under consistent names for downstream consumers.
152-154: All required script assets are present in bindata/etcd/: both TNF-specific (cluster-restore-tnf.sh,disable-etcd-tnf.sh) and standard (cluster-restore.sh,disable-etcd.sh) scripts exist and are properly referenced by the conditional topology detection logic.
ec62339 to
e46a48e
Compare
|
/retest-required |
e46a48e to
4873375
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
bindata/etcd/cluster-restore-tnf.sh (1)
131-134: Simplify ETCD_CLIENT initialization.The expression
${ETCD_ETCDCTL_BIN+etcdctl}always evaluates to"etcdctl"whenETCD_ETCDCTL_BINis set (which it always is frometcd-common-tools). This seems like it should reference the variable value instead.🔎 Apply this diff to fix the initialization:
-ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}" +ETCD_CLIENT="${ETCD_ETCDCTL_BIN:-etcdctl}" if [ -n "${ETCD_ETCDUTL_BIN}" ]; then ETCD_CLIENT="${ETCD_ETCDUTL_BIN}" fiThis uses the variable's value if set, or defaults to "etcdctl" if unset.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (7)
bindata/etcd/cluster-restore-tnf.sh(1 hunks)bindata/etcd/cluster-restore.sh(1 hunks)bindata/etcd/disable-etcd-tnf.sh(1 hunks)bindata/etcd/etcd-common-tools(1 hunks)pkg/operator/scriptcontroller/scriptcontroller.go(3 hunks)pkg/operator/scriptcontroller/scriptcontroller_test.go(1 hunks)pkg/operator/starter.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- bindata/etcd/cluster-restore.sh
- bindata/etcd/disable-etcd-tnf.sh
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
bindata/etcd/etcd-common-toolspkg/operator/scriptcontroller/scriptcontroller.gobindata/etcd/cluster-restore-tnf.shpkg/operator/scriptcontroller/scriptcontroller_test.gopkg/operator/starter.go
🧬 Code graph analysis (2)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
Enqueueable(65-67)pkg/operator/ceohelpers/external_etcd_status.go (1)
IsExternalEtcdCluster(23-35)bindata/assets.go (1)
MustAsset(15-21)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
source_required_dependency(21-29)usage(34-40)bindata/etcd/disable-etcd-tnf.sh (1)
source_required_dependency(17-25)
🔇 Additional comments (11)
bindata/etcd/etcd-common-tools (2)
96-115: LGTM! The wait function implementation is solid.The polling logic correctly handles three states: command failure (retry), non-empty location (etcd still running), and empty location (stopped). The 120-second timeout is reasonable for Pacemaker-managed etcd stop operations.
117-143: Clear and actionable completion message.The step-by-step instructions with concrete commands will help operators verify and troubleshoot the restore process. Good use of a heredoc for multi-line output.
pkg/operator/starter.go (1)
471-478: LGTM! Correct wiring of infrastructure lister and env var controller.The infrastructure lister is properly obtained from the config informers and passed alongside the existing
envVarControllerto enable topology-aware script deployment.pkg/operator/scriptcontroller/scriptcontroller_test.go (3)
20-35: LGTM! Clean test helper setup.The helper correctly initializes the controller with fake dependencies and a no-op enqueue function for testing.
37-117: Good topology coverage with bidirectional assertions.The test cases effectively verify that each topology gets its specific scripts while excluding content from other topologies. Testing both
expectedContentandunexpectedContentis a solid approach to catch incorrect script selection.
119-131: LGTM! Error handling test validates graceful failure.Good coverage for the missing environment variables scenario.
bindata/etcd/cluster-restore-tnf.sh (3)
143-155: LGTM! Data directory backup and restore sequence is well-structured.The script correctly backs up the existing member directory before removal, and creates a fresh data directory for the restore. The sequence prevents accidental data loss.
167-182: LGTM! Pacemaker integration for cluster bootstrap.The use of
crm_attributeforforce_new_clusterand subsequentpcs resource enablecorrectly implements the Pacemaker-based etcd restart flow. The decision to continue toprint_restore_completion_messageeven on timeout (per past review discussion) provides operators with guidance regardless of wait outcome.
51-68: The hardcoded check for exactly 2 etcd instances may not align with single-node initialization and lacks documentation.Line 58 checks
[ "$(echo "$output" | wc -l)" -eq 2 ], but the script initializes with a single-node cluster (--initial-cluster="$NODENAME=..."). This mismatch is unexplained. Either document why exactly 2 lines are expected (e.g., "TNF always has 2 nodes"), or make this check more robust by verifying>= 1if the goal is confirming at least one instance started.pkg/operator/scriptcontroller/scriptcontroller.go (2)
28-32: LGTM! Clean interface design.The
EnvVarGetterinterface properly abstracts the environment variable functionality needed by the controller, following Go interface conventions.
129-155: LGTM! Topology-aware script selection is well-implemented.The logic correctly:
- Detects TNF topology via
IsExternalEtcdCluster- Handles the error case appropriately
- Selects the correct script variants based on topology
- Deploys them under standard names (
cluster-restore.sh,disable-etcd.sh)This allows consumers to use consistent script names regardless of the underlying topology.
4873375 to
5804bac
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
bindata/etcd/cluster-restore-tnf.sh (1)
19-27: Consider consolidating duplicate helper functions.The
source_required_dependencyfunction is duplicated acrosscluster-restore-tnf.sh,disable-etcd-tnf.sh(lines 17-25), andcluster-restore.sh(lines 20-28). While this duplication ensures self-contained scripts, consider moving such common helpers toetcd-common-toolsif these functions evolve or grow in complexity.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (7)
bindata/etcd/cluster-restore-tnf.sh(1 hunks)bindata/etcd/cluster-restore.sh(1 hunks)bindata/etcd/disable-etcd-tnf.sh(1 hunks)bindata/etcd/etcd-common-tools(1 hunks)pkg/operator/scriptcontroller/scriptcontroller.go(3 hunks)pkg/operator/scriptcontroller/scriptcontroller_test.go(1 hunks)pkg/operator/starter.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- bindata/etcd/etcd-common-tools
- pkg/operator/scriptcontroller/scriptcontroller_test.go
- bindata/etcd/cluster-restore.sh
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
bindata/etcd/disable-etcd-tnf.shpkg/operator/scriptcontroller/scriptcontroller.gobindata/etcd/cluster-restore-tnf.shpkg/operator/starter.go
🧬 Code graph analysis (2)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
Enqueueable(65-67)pkg/operator/ceohelpers/external_etcd_status.go (1)
IsExternalEtcdCluster(23-35)bindata/assets.go (1)
MustAsset(15-21)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
source_required_dependency(21-29)usage(34-40)bindata/etcd/disable-etcd-tnf.sh (1)
source_required_dependency(17-25)
🔇 Additional comments (10)
bindata/etcd/cluster-restore-tnf.sh (4)
43-49: LGTM!The variable indirection pattern with
${!NODE_IP_VAR}correctly retrieves the IP address from dynamically constructed environment variable names.
51-68: LGTM!The polling logic with a 5-minute timeout and XML parsing to detect etcd startup is appropriate. The check for exactly 2 lines on line 58 appears to be a topology-specific validation.
94-102: LGTM!The resource disable flow with
pcs resource disablefollowed bywait_for_podman_etcd_to_stop(from etcd-common-tools) properly ensures a clean shutdown before proceeding with the restore.
178-182: LGTM!The approach of printing the completion message even on timeout (lines 178-180) is intentional per previous discussion. This ensures operators always receive verification instructions, which is appropriate since timeout doesn't necessarily indicate restore failure.
pkg/operator/starter.go (1)
471-478: LGTM!The addition of the infrastructure lister at line 475 correctly wires topology information into the ScriptController, enabling topology-aware script deployment. This follows the established pattern of passing listers to controllers.
bindata/etcd/disable-etcd-tnf.sh (2)
1-15: LGTM!The script setup with strict shell options (
errexit,pipefail,errtrace) and root check follows best practices for operational scripts.
27-34: LGTM!The script correctly sources required dependencies and calls
wait_for_podman_etcd_to_stopfrom etcd-common-tools with appropriate error handling. The past review concern about the undefined function has been resolved.pkg/operator/scriptcontroller/scriptcontroller.go (3)
28-32: LGTM!The
EnvVarGetterinterface provides a clean abstraction over environment variable access and listener management, improving testability and decoupling from concrete implementations.
34-40: LGTM!The refactoring to use
InfraListerandEnvVarGetterinterface (lines 37-38, 46-47, 53-54) properly enables topology detection while maintaining clean separation of concerns. The initialization is correct and follows established patterns.Also applies to: 42-55
129-154: LGTM!The topology-aware script selection logic is well-implemented:
- Line 132: Proper error handling for topology detection failure
- Lines 135-140: Clear conditional selection of TNF vs standard scripts
- Lines 142-150: Common scripts (quorum-restore.sh, cluster-backup.sh, etcd-common-tools) deployed regardless of topology
- Lines 153-154: Topology-specific scripts deployed under standardized names (cluster-restore.sh, disable-etcd.sh), ensuring consistent paths for consuming code
This approach maintains backward compatibility while supporting TNF deployments.
5804bac to
62a5b4c
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
bindata/etcd/disable-etcd-tnf.sh (1)
17-25: Consider extracting shared helper to reduce duplication.The
source_required_dependencyfunction is duplicated acrosscluster-restore.sh,cluster-restore-tnf.sh, and this file. While consistency is valuable, this duplication could lead to maintenance issues if the error handling logic needs to change.🔎 Optional: Move to etcd-common-tools
Consider moving this helper to
etcd-common-toolsso it can be shared across all scripts. However, this creates a chicken-and-egg problem (you need the helper to source the file that contains it), so the current duplication may be acceptable.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (7)
bindata/etcd/cluster-restore-tnf.sh(1 hunks)bindata/etcd/cluster-restore.sh(1 hunks)bindata/etcd/disable-etcd-tnf.sh(1 hunks)bindata/etcd/etcd-common-tools(1 hunks)pkg/operator/scriptcontroller/scriptcontroller.go(3 hunks)pkg/operator/scriptcontroller/scriptcontroller_test.go(1 hunks)pkg/operator/starter.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- bindata/etcd/etcd-common-tools
- pkg/operator/scriptcontroller/scriptcontroller_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
pkg/operator/starter.gobindata/etcd/cluster-restore.shpkg/operator/scriptcontroller/scriptcontroller.gobindata/etcd/disable-etcd-tnf.shbindata/etcd/cluster-restore-tnf.sh
🧬 Code graph analysis (2)
bindata/etcd/disable-etcd-tnf.sh (2)
bindata/etcd/cluster-restore-tnf.sh (1)
source_required_dependency(19-27)bindata/etcd/cluster-restore.sh (1)
source_required_dependency(21-29)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
source_required_dependency(21-29)usage(34-40)bindata/etcd/disable-etcd-tnf.sh (1)
source_required_dependency(17-25)
🔇 Additional comments (10)
bindata/etcd/cluster-restore.sh (1)
143-143: LGTM!The addition of
print_restore_completion_messageat the end of the restore flow appropriately provides operators with post-restore guidance and verification instructions. The function is properly sourced frometcd-common-tools(line 32).pkg/operator/starter.go (1)
471-478: LGTM!Threading the Infrastructure lister into
ScriptControllerinitialization properly enables topology-aware script deployment. The lister is sourced from the already-initializedconfigInformersand aligns with the pattern used by other controllers in this file.bindata/etcd/disable-etcd-tnf.sh (1)
30-34: LGTM!The script correctly calls
wait_for_podman_etcd_to_stop(frometcd-common-tools) and handles failures appropriately with a clear error message and non-zero exit code.bindata/etcd/cluster-restore-tnf.sh (4)
70-110: LGTM!The
setup_pacemaker_restorefunction properly orchestrates the TNF-specific restore preparation:
- Validates required environment (hostname, IP address)
- Constructs appropriate etcdctl restore flags for single-member bootstrap
- Disables the Pacemaker etcd resource and waits for clean shutdown
- Safely moves configuration files with proper error handling
The array initialization at line 86 uses assignment (
=), not append, so no pre-initialization is needed.
142-164: LGTM!The backup and restore logic properly:
- Backs up existing data-dir with cleanup of old backups
- Completely removes and recreates data-dir for a clean restore
- Uses the appropriate etcd client binary (etcdctl or etcdutl)
- Includes a clear comment explaining why revision bumping is skipped
The error handling and success messaging provide good operator feedback.
166-181: LGTM!The restart and completion logic appropriately:
- Sets the
force_new_clusterattribute for single-member bootstrap- Re-enables the Pacemaker etcd resource
- Shows completion message regardless of wait timeout to provide verification instructions
The intentional non-exit on
wait_for_podman_etcd_startfailure (line 177-179) was confirmed in past reviews as correct behavior for providing operators with verification guidance even on timeout.
51-68: No action required. The 2-line check is appropriate for TNF's constrained topology.TNF clusters are validated to contain exactly 2 nodes by the configuration layer (
pkg/tnf/pkg/config/cluster.goenforceslen(nodes.Items) == 2). The pacemaker cluster setup is also hardcoded for 2 nodes. The check for exactly 2 lines in thepcs status xmloutput correctly verifies that both nodes have their etcd resource started. This is not a fragile assumption but an appropriate implementation for a topology that is constrained by design, not convention.pkg/operator/scriptcontroller/scriptcontroller.go (3)
28-32: LGTM!The
EnvVarGetterinterface properly abstracts environment variable access and listener management, enabling clean dependency injection and testability. The interface methods align with their usage at lines 56 and 119.
42-56: LGTM!The constructor signature and initialization properly incorporate the infrastructure lister and environment variable getter. The listener registration (line 56) ensures the controller receives updates when environment variables change.
129-154: LGTM!The topology-aware script deployment logic properly:
- Detects cluster topology using
IsExternalEtcdClusterwith appropriate error handling- Selects TNF-specific scripts (
cluster-restore-tnf.sh,disable-etcd-tnf.sh) for external etcd clusters- Falls back to standard scripts for traditional deployments
- Deploys common scripts (quorum-restore, cluster-backup, etcd-common-tools) for both topologies
- Maps topology-specific scripts to standard names (
cluster-restore.sh,disable-etcd.sh) so consumers don't need topology awarenessThis design cleanly separates topology concerns from script consumers.
|
/retest-required |
|
There are some typos, found by CodeRabbit. I'll fix them next week |
6b1d510 to
207ce68
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @bindata/etcd/cluster-restore-tnf.sh:
- Around line 51-68: The wait_for_podman_etcd_start function currently requires
exactly two "Started" lines which fails for single-node restores; update the
check in wait_for_podman_etcd_start (the branch that inspects output from pcs
status xml | grep 'id="etcd".*role="Started"') to accept at least one started
instance (e.g., test that the line count is >= 1) instead of == 2 so the
function succeeds when only the local node has etcd enabled during restore.
- Around line 130-133: ETCD_CLIENT is set using the parameter expansion
`${ETCD_ETCDCTL_BIN+etcdctl}`, which is inconsistent with the simpler
direct-assignment pattern used by the check_snapshot_status function; change the
assignment to use a direct default value (e.g., set ETCD_CLIENT="etcdctl" first)
and then override it when ETCD_ETCDUTL_BIN is present (keep the existing if [ -n
"${ETCD_ETCDUTL_BIN}" ] branch), referencing the ETCD_CLIENT variable and the
ETCD_ETCDCTL_BIN / ETCD_ETCDUTL_BIN symbols so the logic matches
check_snapshot_status.
In @bindata/etcd/disable-etcd-tnf.sh:
- Around line 30-34: The script calls wait_for_podman_etcd_to_stop without first
disabling the Pacemaker resource; insert a pcs resource disable etcd call before
invoking wait_for_podman_etcd_to_stop so the etcd resource is instructed to stop
(match the pattern used in cluster-restore-tnf.sh), then proceed to echo and
exit on failure as currently done; update the block that contains
wait_for_podman_etcd_to_stop to run pcs resource disable etcd first.
🧹 Nitpick comments (2)
bindata/etcd/cluster-restore-tnf.sh (1)
118-124: Minor: Improve error message when snapshot not found.If the
lscommand finds no matching files,SNAPSHOT_FILEwill be empty, and the error message at line 122 will show an empty path, making it harder for operators to debug.♻️ Suggested improvement
# shellcheck disable=SC2012 SNAPSHOT_FILE=$(ls -vd "${BACKUP_DIR}"/snapshot*.db | tail -1) || true if [ ! -f "${SNAPSHOT_FILE}" ]; then - echo "etcd snapshot ${SNAPSHOT_FILE} does not exist" + echo "etcd snapshot not found in ${BACKUP_DIR} (expected snapshot*.db)" exit 1 fipkg/operator/scriptcontroller/scriptcontroller_test.go (1)
119-131: Consider adding a test for topology detection failure.The current tests cover missing env vars but not the case where
ceohelpers.IsExternalEtcdClusterreturns an error (e.g., infrastructure resource not found). This would improve test coverage for the error path at lines 132-133 inscriptcontroller.go.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (7)
bindata/etcd/cluster-restore-tnf.shbindata/etcd/cluster-restore.shbindata/etcd/disable-etcd-tnf.shbindata/etcd/etcd-common-toolspkg/operator/scriptcontroller/scriptcontroller.gopkg/operator/scriptcontroller/scriptcontroller_test.gopkg/operator/starter.go
🚧 Files skipped from review as they are similar to previous changes (2)
- bindata/etcd/cluster-restore.sh
- bindata/etcd/etcd-common-tools
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
bindata/etcd/cluster-restore-tnf.shpkg/operator/scriptcontroller/scriptcontroller.gopkg/operator/starter.gobindata/etcd/disable-etcd-tnf.shpkg/operator/scriptcontroller/scriptcontroller_test.go
🧬 Code graph analysis (4)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
source_required_dependency(21-29)usage(34-40)bindata/etcd/disable-etcd-tnf.sh (1)
source_required_dependency(17-25)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
Enqueueable(65-67)pkg/operator/ceohelpers/external_etcd_status.go (1)
IsExternalEtcdCluster(23-35)bindata/assets.go (1)
MustAsset(15-21)
bindata/etcd/disable-etcd-tnf.sh (2)
bindata/etcd/cluster-restore-tnf.sh (1)
source_required_dependency(19-27)bindata/etcd/cluster-restore.sh (1)
source_required_dependency(21-29)
pkg/operator/scriptcontroller/scriptcontroller_test.go (5)
pkg/operator/scriptcontroller/scriptcontroller.go (1)
ScriptController(34-40)pkg/testutils/testutils.go (1)
FakeInfrastructureLister(563-571)pkg/etcdenvvar/fake_envvar.go (1)
FakeEnvVar(3-6)pkg/operator/operatorclient/interfaces.go (1)
TargetNamespace(7-7)pkg/version/version.go (1)
Get(29-41)
🔇 Additional comments (4)
pkg/operator/starter.go (1)
471-478: LGTM!The infrastructure lister is correctly passed to enable topology-aware script deployment. The lister is already instantiated and used by other controllers in this file, so this addition is consistent with the existing patterns.
pkg/operator/scriptcontroller/scriptcontroller_test.go (1)
37-117: LGTM!Well-structured table-driven tests covering the three topology modes. The content assertions verify that topology-specific scripts are correctly selected based on infrastructure mode.
pkg/operator/scriptcontroller/scriptcontroller.go (2)
28-40: LGTM!Clean interface extraction for
EnvVarGetterthat improves testability. The struct fields are properly typed and the interface methods align with the actual usage.
129-154: LGTM!Topology-aware script deployment is cleanly implemented. TNF clusters get Pacemaker-specific scripts while standard topologies get the original scripts, both deployed under consistent names (
cluster-restore.sh,disable-etcd.sh).
| ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}" | ||
| if [ -n "${ETCD_ETCDUTL_BIN}" ]; then | ||
| ETCD_CLIENT="${ETCD_ETCDUTL_BIN}" | ||
| fi |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's locate the file and examine its context
fd "cluster-restore" --type f
# Then search for ETCD_CLIENT assignments to understand the pattern
rg -n "ETCD_CLIENT=" --type sh -A 2 -B 2
# Also search for ETCD_ETCDCTL_BIN usage
rg -n "ETCD_ETCDCTL_BIN" --type sh -C 2Repository: openshift/cluster-etcd-operator
Length of output: 1688
🏁 Script executed:
#!/bin/bash
# Now let's understand how ETCD_CLIENT is used and where these variables are set
echo "=== How ETCD_CLIENT is used ==="
rg -n "ETCD_CLIENT" --type sh -A 1 -B 1
echo "=== Where ETCD_ETCDCTL_BIN is set ==="
rg -n "ETCD_ETCDCTL_BIN=" --type sh -A 1 -B 1
echo "=== Where ETCD_ETCDUTL_BIN is set ==="
rg -n "ETCD_ETCDUTL_BIN=" --type sh -A 1 -B 1Repository: openshift/cluster-etcd-operator
Length of output: 1560
🏁 Script executed:
#!/bin/bash
# Let's look at the entire files to understand context and see where these variables come from
echo "=== First 50 lines of cluster-restore.sh ==="
head -50 bindata/etcd/cluster-restore.sh
echo ""
echo "=== First 50 lines of cluster-restore-tnf.sh ==="
head -50 bindata/etcd/cluster-restore-tnf.sh
echo ""
echo "=== Search for where ETCD_ETCDCTL_BIN and ETCD_ETCDUTL_BIN might be exported or defined ==="
rg -n "export.*ETCD_ETCD" --type sh
rg -n "ETCD_ETCD.*BIN" --type sh | head -20Repository: openshift/cluster-etcd-operator
Length of output: 4040
🏁 Script executed:
#!/bin/bash
# Let's find and examine the etcd-common-tools file that defines these variables
fd "etcd-common-tools" --type f
# Also look for any files in the repo that define these binaries
rg -n "ETCD_ETCDCTL_BIN" --type sh -B 2 -A 2
rg -n "ETCD_ETCDUTL_BIN" --type sh -B 2 -A 2Repository: openshift/cluster-etcd-operator
Length of output: 1494
🏁 Script executed:
#!/bin/bash
# Let's examine the etcd-common-tools file to understand where these variables are defined
cat -n bindata/etcd/etcd-common-tools | head -200Repository: openshift/cluster-etcd-operator
Length of output: 5877
🏁 Script executed:
#!/bin/bash
# Let's examine the full context around where these are used in cluster-restore.sh
echo "=== Lines 75-90 in cluster-restore.sh (around ETCD_CLIENT assignment) ==="
sed -n '75,90p' bindata/etcd/cluster-restore.sh
echo ""
echo "=== Lines around ETCD_CLIENT usage in cluster-restore.sh ==="
sed -n '125,140p' bindata/etcd/cluster-restore.sh
echo ""
echo "=== Lines around ETCD_CLIENT usage in cluster-restore-tnf.sh ==="
sed -n '155,170p' bindata/etcd/cluster-restore-tnf.shRepository: openshift/cluster-etcd-operator
Length of output: 2403
🏁 Script executed:
#!/bin/bash
# Let's verify the actual bash behavior by testing the parameter expansion
bash <<'EOF'
# Test the actual behavior of the parameter expansion
ETCD_ETCDCTL_BIN="etcdctl"
ETCD_ETCDUTL_BIN="etcdutl"
echo "Testing parameter expansion syntax:"
echo "ETCD_ETCDCTL_BIN=${ETCD_ETCDCTL_BIN}"
echo "ETCD_ETCDUTL_BIN=${ETCD_ETCDUTL_BIN}"
# Test 1: The syntax used in cluster-restore.sh
ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}"
echo "Result with \${ETCD_ETCDCTL_BIN+etcdctl}: ETCD_CLIENT='${ETCD_CLIENT}'"
# Test 2: Direct assignment (used in check_snapshot_status)
ETCD_CLIENT="${ETCD_ETCDCTL_BIN}"
echo "Result with \${ETCD_ETCDCTL_BIN}: ETCD_CLIENT='${ETCD_CLIENT}'"
# Test 3: If ETCD_ETCDUTL_BIN is set
if [ -n "${ETCD_ETCDUTL_BIN}" ]; then
ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"
fi
echo "After override with ETCD_ETCDUTL_BIN: ETCD_CLIENT='${ETCD_CLIENT}'"
EOFRepository: openshift/cluster-etcd-operator
Length of output: 338
Align variable assignment with check_snapshot_status pattern above.
The parameter expansion ${ETCD_ETCDCTL_BIN+etcdctl} is inconsistent with the direct assignment used in the check_snapshot_status function (lines 57-59 of etcd-common-tools). Both produce the same result since these variables are hardcoded to literal strings, but using direct assignment would be clearer:
-ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}"
+ETCD_CLIENT="${ETCD_ETCDCTL_BIN}"
if [ -n "${ETCD_ETCDUTL_BIN}" ]; then
ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}" | |
| if [ -n "${ETCD_ETCDUTL_BIN}" ]; then | |
| ETCD_CLIENT="${ETCD_ETCDUTL_BIN}" | |
| fi | |
| ETCD_CLIENT="${ETCD_ETCDCTL_BIN}" | |
| if [ -n "${ETCD_ETCDUTL_BIN}" ]; then | |
| ETCD_CLIENT="${ETCD_ETCDUTL_BIN}" | |
| fi |
🤖 Prompt for AI Agents
In @bindata/etcd/cluster-restore-tnf.sh around lines 130 - 133, ETCD_CLIENT is
set using the parameter expansion `${ETCD_ETCDCTL_BIN+etcdctl}`, which is
inconsistent with the simpler direct-assignment pattern used by the
check_snapshot_status function; change the assignment to use a direct default
value (e.g., set ETCD_CLIENT="etcdctl" first) and then override it when
ETCD_ETCDUTL_BIN is present (keep the existing if [ -n "${ETCD_ETCDUTL_BIN}" ]
branch), referencing the ETCD_CLIENT variable and the ETCD_ETCDCTL_BIN /
ETCD_ETCDUTL_BIN symbols so the logic matches check_snapshot_status.
There was a problem hiding this comment.
To maintain consistency with the original script, this functionality was carried over to the TNF version
There was a problem hiding this comment.
@clobrano Understood. Maintaining consistency with the original cluster-restore.sh implementation is a valid approach for the TNF version. Thank you for the clarification.
Add dedicated etcd management scripts for Two Node with Fencing (TNF) deployments that use Pacemaker-managed Podman containers instead of Kubernetes static pods. Changes: - Add cluster-restore-tnf.sh with Pacemaker resource management - setup_pacemaker_restore() stops etcd resource and prepares restore flags - wait_for_podman_etcd_start() verifies etcd resource starts via pcs status - get_etcd_advertise_ip() detects advertise IP from etcd.env - Uses crm_attribute to set force_new_cluster on restart - Add disable-etcd-tnf.sh for disabling Pacemaker-managed etcd - Uses 'pcs resource disable etcd' instead of moving static pod manifests - Extend etcd-common-tools with shared TNF utilities - wait_for_podman_etcd_to_stop() waits for crm_resource to confirm stop - print_restore_completion_message() outputs rollout monitoring instructions These scripts will be conditionally deployed by the cluster-etcd-operator based on control plane topology detection.
Implement conditional script deployment in the script controller to automatically deploy TNF-specific or standard etcd scripts based on the detected control plane topology. This maintains backward compatibility while enabling automatic support for TNF deployments without manual script selection.
207ce68 to
1664b42
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @bindata/etcd/cluster-restore-tnf.sh:
- Around line 51-68: The wait_for_podman_etcd_start() function currently assumes
two started etcd resources by checking wc -l -eq 2, which fails for single-node
restores initialized in setup_pacemaker_restore(); update the logic to accept
one or more started etcd instances (e.g., check that the line count is >= 1) or
make the expected count configurable based on the restore scenario, so the
function returns success for a single-node start during restore while still
validating multi-node cases when appropriate.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (7)
bindata/etcd/cluster-restore-tnf.shbindata/etcd/cluster-restore.shbindata/etcd/disable-etcd-tnf.shbindata/etcd/etcd-common-toolspkg/operator/scriptcontroller/scriptcontroller.gopkg/operator/scriptcontroller/scriptcontroller_test.gopkg/operator/starter.go
🚧 Files skipped from review as they are similar to previous changes (3)
- bindata/etcd/disable-etcd-tnf.sh
- bindata/etcd/cluster-restore.sh
- bindata/etcd/etcd-common-tools
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
pkg/operator/scriptcontroller/scriptcontroller_test.gopkg/operator/scriptcontroller/scriptcontroller.gopkg/operator/starter.gobindata/etcd/cluster-restore-tnf.sh
🧬 Code graph analysis (2)
pkg/operator/scriptcontroller/scriptcontroller_test.go (4)
pkg/operator/scriptcontroller/scriptcontroller.go (1)
ScriptController(34-40)pkg/testutils/testutils.go (1)
FakeInfrastructureLister(563-571)pkg/etcdenvvar/fake_envvar.go (1)
FakeEnvVar(3-6)pkg/operator/operatorclient/interfaces.go (1)
TargetNamespace(7-7)
bindata/etcd/cluster-restore-tnf.sh (1)
bindata/etcd/cluster-restore.sh (1)
usage(34-40)
🔇 Additional comments (12)
pkg/operator/starter.go (1)
471-478: LGTM - Infrastructure lister wiring follows existing patterns.The infrastructure lister is already available from
configInformersand is used consistently by other controllers in this file. This change correctly threads topology awareness to enable TNF-specific script deployment.pkg/operator/scriptcontroller/scriptcontroller_test.go (3)
20-35: LGTM - Test helper is well-structured.The helper correctly sets up the minimal dependencies needed to test
manageScriptConfigMap, using the existingFakeInfrastructureListerutility for topology simulation.
37-117: LGTM - Test coverage for topology-aware script deployment is comprehensive.The test cases effectively validate:
- TNF topology deploys Pacemaker-specific scripts (
pcs resource enable/disable etcd)- Standard topologies deploy static-pod scripts (
wait_for_containers_to_stop,mv_static_pods)- Common scripts are always present regardless of topology
The negative assertions (
NotContains) provide good protection against script content leaking across topology boundaries.
119-131: LGTM - Error handling test validates graceful failure.Good coverage of the error path when required environment variables are missing.
bindata/etcd/cluster-restore-tnf.sh (5)
1-17: LGTM - Script setup follows best practices.Appropriate use of
errexit,pipefail,errtracefor robust error handling, and root check is necessary for Pacemaker and etcd operations.
19-31: LGTM - Dependency sourcing with proper validation.
32-49: LGTM - Usage and IP detection are appropriate for TNF context.The usage correctly omits the static pod resources reference (per standard script) since TNF uses Pacemaker. The IP detection properly leverages the
NODE_<nodename>_IPpattern frometcd.env.
70-110: LGTM - Pacemaker restore setup is well-structured.The function properly:
- Validates required variables (NODENAME, IP) with helpful error messages
- Disables etcd resource before manipulation
- Waits for clean stop before proceeding
- Handles missing config files gracefully
The
|| echo "Warning: ..."pattern at line 108 is acceptable since these are non-critical config files being moved to backup.
160-183: LGTM - Restore execution and completion handling.The restore flow correctly:
- Exits on snapshot restore failure (critical)
- Sets
force_new_clusterattribute for single-member bootstrap- Treats wait timeout as non-fatal with helpful messaging (allows admin to troubleshoot)
- Always prints completion message for admin guidance
pkg/operator/scriptcontroller/scriptcontroller.go (3)
28-32: LGTM - Clean interface extraction for testability.The
EnvVarGetterinterface properly abstracts the environment variable controller, enabling mock implementations in tests while preserving the listener registration pattern for re-queuing on changes.
34-55: LGTM - Controller struct and constructor updates.The changes properly introduce infrastructure awareness while maintaining backward compatibility with existing controller patterns.
129-155: LGTM - Topology-aware script selection is well-implemented.The logic correctly:
- Detects TNF topology via
ceohelpers.IsExternalEtcdCluster- Selects appropriate script assets based on topology
- Deploys scripts with consistent names (
cluster-restore.sh,disable-etcd.sh) regardless of source- Handles topology detection errors explicitly
|
/retest-required |
|
/lgtm /override ci/prow/e2e-aws-ovn-serial-2of2 |
|
@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-serial-2of2 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: clobrano, tjungblu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@clobrano: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@clobrano: This pull request references Jira Issue OCPBUGS-60588, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@clobrano: This pull request references Jira Issue OCPBUGS-60588, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/verified by @clobrano |
|
@clobrano: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@clobrano: Jira Issue Verification Checks: Jira Issue OCPBUGS-60588 Jira Issue OCPBUGS-60588 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Fix included in accepted release 4.22.0-0.nightly-2026-01-19-085729 |
Extend cluster-restore.sh to support restoring etcd data in Two Node with Fencing clusters, where etcd is controlled by a Pacemaker resource-agent and runs as a Podman container instead of a Kubernetes static pod.
The script preserves existing static pod restore behavior when Pacemaker is not detected.
Restore process for Pacemaker-managed etcd:
New environment variables: