OCPBUGS-60588: [TNF] support restore for pacemaker-managed etcd by clobrano · Pull Request #1521 · openshift/cluster-etcd-operator

clobrano · 2025-12-10T13:14:53Z

Extend cluster-restore.sh to support restoring etcd data in Two Node with Fencing clusters, where etcd is controlled by a Pacemaker resource-agent and runs as a Podman container instead of a Kubernetes static pod.

The script preserves existing static pod restore behavior when Pacemaker is not detected.

Restore process for Pacemaker-managed etcd:

Auto-detect Pacemaker management or use PACEMAKER_MANAGED_ETCD env var [new]
Disable Pacemaker podman-etcd resource agent to stop container [new - Pacemaker alternative to static pod stop]
Backup existing etcd data directory [existing]
Move podman-etcd configuration files to backup location [new]
Restore snapshot using etcdctl with cluster initialization flags [existing, with new Pacemaker-specific flags added]
Set force_new_cluster attribute for single-member bootstrap [new]
Re-enable Pacemaker resource to start etcd with restored data [new - Pacemaker alternative to static pod restart]

New environment variables:

PACEMAKER_MANAGED_ETCD: Force Pacemaker restore mode
ETCD_CONTAINER_NAME: Podman container name (default: "etcd")
ETCD_ADVERTISE_IP: Override advertise IP (auto-detected from etcd.env)

openshift-ci · 2025-12-10T13:14:57Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2025-12-10T13:15:00Z

Walkthrough

Adds TNF-specific etcd restore/disable scripts and helpers, extends etcd-common-tools with stop/wait and completion messaging, makes ScriptController deploy topology-aware scripts using an Infrastructure lister and EnvVarGetter, and adds unit tests for the script deployment behavior.

Changes

Cohort / File(s)	Summary
TNF-specific etcd scripts `bindata/etcd/cluster-restore-tnf.sh`, `bindata/etcd/disable-etcd-tnf.sh`	New Bash scripts for Pacemaker/TNF environments: `cluster-restore-tnf.sh` orchestrates snapshot restore (root check, deps, advertise IP, etcdctl, snapshot restore, pacemaker interactions, backups, error handling); `disable-etcd-tnf.sh` disables the etcd Pacemaker resource and waits for podman-etcd to stop.
Etcd common tools & restore message `bindata/etcd/etcd-common-tools`, `bindata/etcd/cluster-restore.sh`	Added `wait_for_podman_etcd_to_stop()` (120s polling via `crm_resource --locate`) and `print_restore_completion_message()`; appended call to `print_restore_completion_message` in `cluster-restore.sh`.
ScriptController topology-aware deployment `pkg/operator/scriptcontroller/scriptcontroller.go`	Introduced `EnvVarGetter` interface, replaced concrete env-var controller field, added `infraLister` (InfrastructureLister) and constructor arg; logic now detects topology (TNF vs standard) and deploys appropriate `cluster-restore.sh`/`disable-etcd.sh` contents while still deploying common scripts.
Unit tests for script controller `pkg/operator/scriptcontroller/scriptcontroller_test.go`	Added `getTestController` helper and tests: `TestManageScriptConfigMap` (DualReplica, HighlyAvailable, SingleReplica) verifying created ConfigMap contents and topology-specific script selection; `TestManageScriptConfigMap_MissingEnvVars` verifies error on missing env vars.
Operator startup wiring `pkg/operator/starter.go`	Passes `configInformers.Config().V1().Infrastructures().Lister()` into `NewScriptControllerController` to thread the Infrastructure lister through startup.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

✨ Finishing touches

📝 Generate docstrings

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

tjungblu · 2025-12-10T14:46:57Z

+    return 0
+  fi
+
+  if systemctl status pacemaker >/dev/null 2>&1 ; then


this is smart, until somebody else comes around and uses pacemaker on the control plane 🤣
Can we make this more specific? there is likely the infra CRD somewhere on /etc/kubernetes/manifest.

Otherwise, this is also a advantage of having your own restore script installed by the operator only on TNF - that gives you the guarantee that you definitely run on TNF, allowing you to remove this altogether

tjungblu · 2025-12-10T14:47:31Z

+function get_etcd_advertise_ip() {
+  # Try to detect IP from etcd.env (NODE_<nodename>_IP variable)
+  NODENAME_UNDERSCORE=$(echo "${NODENAME}" | tr '-' '_')
+  NODE_IP_VAR="NODE_${NODENAME_UNDERSCORE}_IP"
+  IP="${!NODE_IP_VAR}"
+  echo "$IP"
+}


wouldn't it always be localhost:2379?

tjungblu · 2025-12-10T14:49:50Z

+  done
+
+  echo "timed out waiting for etcd resources to start (timeout: $timeout seconds)"
+  return 1


looks like you would want to exit 1 here directly, right?

you also don't exit on the return of a 1 below

tjungblu · 2025-12-10T14:51:01Z

+function ensure_etcd_container_stopped() {
+  echo "ensure ${ETCD_CONTAINER_NAME} container is stopped"
+  local start=$SECONDS
+  local timeout=120
+  local disable_etcd_called=0
+  while [ $((SECONDS - start)) -lt "$timeout" ]; do
+    if podman container exists "${ETCD_CONTAINER_NAME}" && \
+      [[ "$(podman inspect -f '{{.State.Status}}' "${ETCD_CONTAINER_NAME}")" != "exited" ]]; then
+
+      if [ "$disable_etcd_called" -eq 1 ]; then
+        sleep 1
+        continue
+      fi
+
+      if ! pcs resource disable etcd; then
+        echo "could not disable podman-etcd. Please ensure the ${ETCD_CONTAINER_NAME} container is stopped"
+        exit 1
+      fi
+
+      disable_etcd_called=1
+      continue
+    fi
+
+    echo "${ETCD_CONTAINER_NAME} container is stopped"
+    return 0
+  done
+
+  echo "timed out waiting for ${ETCD_CONTAINER_NAME} to stop"
+  return 1
+}


probably a good candidate for ./disable-etcd.sh as a TNF specific script

tjungblu · 2025-12-10T14:53:28Z

+
+  # Extend restore flags with Pacemaker-specific arguments
+  ETCDCTL_RESTORE_FLAGS+=(
+    --skip-hash-check=true


this is definitely something you should keep and not skip

podman-etcd also makes copies (actually using cp) of the etcd db when a learner starts, which seemed to be incompatible with the hash check.
I was indeed considering to give some "special" name to such backups and let the script decide whether to use or not the hash check

which seemed to be incompatible with the hash check.

oof, that's probably something for us to take a deeper look at - I assume those are corrupted?

It seems you were right about the snapshot being corrupted, as I haven't reproduced the error since

tjungblu · 2025-12-10T14:54:45Z

+    --name="$NODENAME"
+    --initial-cluster="$NODENAME=https://$IP:2380"
+    --initial-advertise-peer-urls="https://$IP:2380"


I wonder if you really need this, given that you just go and force-new-cluster in the end

yes, I tried first without and failed with some "mismatch" error, IIRC

tjungblu · 2025-12-10T14:56:20Z

+      echo "could not wait for etcd resources to restart"
+    fi
+
+    cat << 'EOF'


this is actually a really awesome idea, thanks for adding it

tjungblu · 2025-12-10T15:07:42Z

+    $ oc get co/etcd -w
+    (Wait to see if the etcd operator completes the rollout successfully)


there's also oc adm wait-for-stable-cluster

tjungblu · 2025-12-10T15:09:30Z

+    Then monitor again:
+    $ oc get co/etcd -w
+
+3. IF revision.json is still not being re-created on this node after the patch, manually restart the static pods:


pkill cluster-etcd-operator might also do the trick, even though this is slightly less nice than what you're doing below ;)

lance5890 · 2025-12-11T01:24:21Z

+    $ oc get co/etcd -w
+    (Wait to see if the etcd operator completes the rollout successfully)
+
+2. IF the rollout does NOT complete, apply the forceRedeploymentReason patch:


should we also apply the oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'

TNF clusters already have this set to allow external Etcd management

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

pkg/operator/scriptcontroller/scriptcontroller.go (1)
132-134: Use %w for proper error wrapping.

The error at line 133 uses %v which loses the original error type. Using %w enables callers to use errors.Is() and errors.Unwrap().
-	if isTNF, err := ceohelpers.IsExternalEtcdCluster(ctx, c.infraLister); err != nil {
-		return nil, false, fmt.Errorf("failed to detect cluster topology: %v", err)
+	if isTNF, err := ceohelpers.IsExternalEtcdCluster(ctx, c.infraLister); err != nil {
+		return nil, false, fmt.Errorf("failed to detect cluster topology: %w", err)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 2e46bfd and ec62339.

📒 Files selected for processing (7)

bindata/etcd/cluster-restore-tnf.sh (1 hunks)
bindata/etcd/cluster-restore.sh (1 hunks)
bindata/etcd/disable-etcd-tnf.sh (1 hunks)
bindata/etcd/etcd-common-tools (1 hunks)
pkg/operator/scriptcontroller/scriptcontroller.go (3 hunks)
pkg/operator/scriptcontroller/scriptcontroller_test.go (1 hunks)
pkg/operator/starter.go (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

bindata/etcd/etcd-common-tools
pkg/operator/scriptcontroller/scriptcontroller_test.go
bindata/etcd/disable-etcd-tnf.sh
bindata/etcd/cluster-restore.sh
pkg/operator/starter.go
bindata/etcd/cluster-restore-tnf.sh
pkg/operator/scriptcontroller/scriptcontroller.go

🧬 Code graph analysis (2)

bindata/etcd/cluster-restore-tnf.sh (2)

bindata/etcd/cluster-restore.sh (2)

source_required_dependency (21-29)

usage (34-40)

bindata/etcd/disable-etcd-tnf.sh (1)

source_required_dependency (17-25)

pkg/operator/scriptcontroller/scriptcontroller.go (3)

pkg/etcdenvvar/envvarcontroller.go (1)

Enqueueable (65-67)

pkg/operator/ceohelpers/external_etcd_status.go (1)

IsExternalEtcdCluster (23-35)

bindata/assets.go (1)

MustAsset (15-21)

🔇 Additional comments (16)

bindata/etcd/cluster-restore.sh (1)

143-143: LGTM!

Clean addition of the post-restore completion message hook. The function is properly sourced from etcd-common-tools earlier in the script.

bindata/etcd/disable-etcd-tnf.sh (1)

1-28: LGTM!

Script structure follows the established patterns from other etcd scripts. Proper root check, strict mode settings, and dependency sourcing.

bindata/etcd/etcd-common-tools (2)

96-115: Robust implementation with clear timeout handling.

The function properly handles the detection flow and timeout. One consideration: the crm_resource --locate check may return an error when the resource doesn't exist vs. when it's stopped. The current logic handles this by continuing to retry, which is reasonable.

117-144: LGTM!

Good use of single-quoted 'EOF' to prevent variable expansion in the heredoc, keeping the $() constructs literal for user execution. The step-by-step remediation instructions are helpful.

pkg/operator/starter.go (1)

471-478: LGTM!

Clean integration of the infrastructure lister into the script controller. This enables topology-aware script deployment (TNF vs standard) while reusing the existing informer infrastructure already initialized in RunOperator.

pkg/operator/scriptcontroller/scriptcontroller_test.go (3)

20-35: LGTM!

Clean helper function that properly constructs test dependencies with fakes. Good separation of test setup from assertions.

37-117: Well-structured table-driven tests covering topology variants.

Good coverage of the three topology modes with both positive (expected content) and negative (unexpected content) assertions. The verification of common scripts presence ensures shared functionality is not accidentally removed.

119-131: LGTM!

Good error case coverage. Verifying the specific error message ensures the error is from the expected code path.

bindata/etcd/cluster-restore-tnf.sh (5)

43-49: Indirect variable expansion is valid but verify NODENAME is set.

The ${!NODE_IP_VAR} indirect expansion works correctly in Bash. The function assumes NODENAME is set before being called; this is handled in setup_pacemaker_restore at line 73-77 which sets it with a default from hostname.

51-68: Consider making the expected node count configurable.

The check [ "$(echo "$output" | wc -l)" -eq 2 ] assumes exactly 2 nodes for TNF. This is correct for dual-replica topology but hardcodes the assumption. Since this is a TNF-specific script, this is acceptable.

85-92: Verify --skip-hash-check=true is intentional.

Past review discussion indicates this was added due to podman-etcd making copies that are "incompatible with the hash check." If this is due to backup corruption, that's a deeper issue to investigate. If the copies are valid but have different hashes due to the copy mechanism, this is acceptable.

Based on past review comments, this appears to be a known workaround. Please confirm the backups are not corrupted and this is expected behavior for podman-etcd copies.

1-42: LGTM!

Script header, strict mode settings, root check, and dependency sourcing follow the established patterns from other etcd scripts. The simplified usage message correctly reflects that TNF restore only needs the snapshot (not static pod resources).

113-165: LGTM!

The main restore flow is well-structured with proper validation and error handling. Good safeguard at line 114-117 preventing BACKUP_DIR from being the same as ETCD_DATA_DIR.

pkg/operator/scriptcontroller/scriptcontroller.go (3)

28-32: Good abstraction for testability.

The EnvVarGetter interface cleanly abstracts the environment variable controller, enabling easier unit testing through mocking.

129-140: LGTM - topology-aware script selection is well structured.

The conditional logic correctly selects TNF-specific or standard scripts based on the cluster topology, and the scripts are deployed under consistent names for downstream consumers.

152-154: All required script assets are present in bindata/etcd/: both TNF-specific (cluster-restore-tnf.sh, disable-etcd-tnf.sh) and standard (cluster-restore.sh, disable-etcd.sh) scripts exist and are properly referenced by the conditional topology detection logic.

clobrano · 2025-12-16T15:47:24Z

/retest-required

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

bindata/etcd/cluster-restore-tnf.sh (1)
131-134: Simplify ETCD_CLIENT initialization.

The expression ${ETCD_ETCDCTL_BIN+etcdctl} always evaluates to "etcdctl" when ETCD_ETCDCTL_BIN is set (which it always is from etcd-common-tools). This seems like it should reference the variable value instead.
🔎 Apply this diff to fix the initialization:
-ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}"
+ETCD_CLIENT="${ETCD_ETCDCTL_BIN:-etcdctl}"
 if [ -n "${ETCD_ETCDUTL_BIN}" ]; then
   ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"
 fi
This uses the variable's value if set, or defaults to "etcdctl" if unset.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between e46a48e and 4873375.

📒 Files selected for processing (7)

bindata/etcd/cluster-restore-tnf.sh (1 hunks)
bindata/etcd/cluster-restore.sh (1 hunks)
bindata/etcd/disable-etcd-tnf.sh (1 hunks)
bindata/etcd/etcd-common-tools (1 hunks)
pkg/operator/scriptcontroller/scriptcontroller.go (3 hunks)
pkg/operator/scriptcontroller/scriptcontroller_test.go (1 hunks)
pkg/operator/starter.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

bindata/etcd/cluster-restore.sh
bindata/etcd/disable-etcd-tnf.sh

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

bindata/etcd/etcd-common-tools
pkg/operator/scriptcontroller/scriptcontroller.go
bindata/etcd/cluster-restore-tnf.sh
pkg/operator/scriptcontroller/scriptcontroller_test.go
pkg/operator/starter.go

🧬 Code graph analysis (2)

pkg/operator/scriptcontroller/scriptcontroller.go (3)

pkg/etcdenvvar/envvarcontroller.go (1)

Enqueueable (65-67)

pkg/operator/ceohelpers/external_etcd_status.go (1)

IsExternalEtcdCluster (23-35)

bindata/assets.go (1)

MustAsset (15-21)

bindata/etcd/cluster-restore-tnf.sh (2)

bindata/etcd/cluster-restore.sh (2)

source_required_dependency (21-29)

usage (34-40)

bindata/etcd/disable-etcd-tnf.sh (1)

source_required_dependency (17-25)

🔇 Additional comments (11)

bindata/etcd/etcd-common-tools (2)

96-115: LGTM! The wait function implementation is solid.

The polling logic correctly handles three states: command failure (retry), non-empty location (etcd still running), and empty location (stopped). The 120-second timeout is reasonable for Pacemaker-managed etcd stop operations.

117-143: Clear and actionable completion message.

The step-by-step instructions with concrete commands will help operators verify and troubleshoot the restore process. Good use of a heredoc for multi-line output.

pkg/operator/starter.go (1)

471-478: LGTM! Correct wiring of infrastructure lister and env var controller.

The infrastructure lister is properly obtained from the config informers and passed alongside the existing envVarController to enable topology-aware script deployment.

pkg/operator/scriptcontroller/scriptcontroller_test.go (3)

20-35: LGTM! Clean test helper setup.

The helper correctly initializes the controller with fake dependencies and a no-op enqueue function for testing.

37-117: Good topology coverage with bidirectional assertions.

The test cases effectively verify that each topology gets its specific scripts while excluding content from other topologies. Testing both expectedContent and unexpectedContent is a solid approach to catch incorrect script selection.

119-131: LGTM! Error handling test validates graceful failure.

Good coverage for the missing environment variables scenario.

bindata/etcd/cluster-restore-tnf.sh (3)

143-155: LGTM! Data directory backup and restore sequence is well-structured.

The script correctly backs up the existing member directory before removal, and creates a fresh data directory for the restore. The sequence prevents accidental data loss.

167-182: LGTM! Pacemaker integration for cluster bootstrap.

The use of crm_attribute for force_new_cluster and subsequent pcs resource enable correctly implements the Pacemaker-based etcd restart flow. The decision to continue to print_restore_completion_message even on timeout (per past review discussion) provides operators with guidance regardless of wait outcome.

51-68: The hardcoded check for exactly 2 etcd instances may not align with single-node initialization and lacks documentation.

Line 58 checks [ "$(echo "$output" | wc -l)" -eq 2 ], but the script initializes with a single-node cluster (--initial-cluster="$NODENAME=..."). This mismatch is unexplained. Either document why exactly 2 lines are expected (e.g., "TNF always has 2 nodes"), or make this check more robust by verifying >= 1 if the goal is confirming at least one instance started.

pkg/operator/scriptcontroller/scriptcontroller.go (2)

28-32: LGTM! Clean interface design.

The EnvVarGetter interface properly abstracts the environment variable functionality needed by the controller, following Go interface conventions.

129-155: LGTM! Topology-aware script selection is well-implemented.

The logic correctly:

Detects TNF topology via IsExternalEtcdCluster

Handles the error case appropriately

Selects the correct script variants based on topology

Deploys them under standard names (cluster-restore.sh, disable-etcd.sh)

This allows consumers to use consistent script names regardless of the underlying topology.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

bindata/etcd/cluster-restore-tnf.sh (1)

19-27: Consider consolidating duplicate helper functions.

The source_required_dependency function is duplicated across cluster-restore-tnf.sh, disable-etcd-tnf.sh (lines 17-25), and cluster-restore.sh (lines 20-28). While this duplication ensures self-contained scripts, consider moving such common helpers to etcd-common-tools if these functions evolve or grow in complexity.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4873375 and 5804bac.

📒 Files selected for processing (7)

bindata/etcd/cluster-restore-tnf.sh (1 hunks)
bindata/etcd/cluster-restore.sh (1 hunks)
bindata/etcd/disable-etcd-tnf.sh (1 hunks)
bindata/etcd/etcd-common-tools (1 hunks)
pkg/operator/scriptcontroller/scriptcontroller.go (3 hunks)
pkg/operator/scriptcontroller/scriptcontroller_test.go (1 hunks)
pkg/operator/starter.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

bindata/etcd/etcd-common-tools
pkg/operator/scriptcontroller/scriptcontroller_test.go
bindata/etcd/cluster-restore.sh

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

bindata/etcd/disable-etcd-tnf.sh
pkg/operator/scriptcontroller/scriptcontroller.go
bindata/etcd/cluster-restore-tnf.sh
pkg/operator/starter.go

🧬 Code graph analysis (2)

pkg/operator/scriptcontroller/scriptcontroller.go (3)

pkg/etcdenvvar/envvarcontroller.go (1)

Enqueueable (65-67)

pkg/operator/ceohelpers/external_etcd_status.go (1)

IsExternalEtcdCluster (23-35)

bindata/assets.go (1)

MustAsset (15-21)

bindata/etcd/cluster-restore-tnf.sh (2)

bindata/etcd/cluster-restore.sh (2)

source_required_dependency (21-29)

usage (34-40)

bindata/etcd/disable-etcd-tnf.sh (1)

source_required_dependency (17-25)

🔇 Additional comments (10)

bindata/etcd/cluster-restore-tnf.sh (4)

43-49: LGTM!

The variable indirection pattern with ${!NODE_IP_VAR} correctly retrieves the IP address from dynamically constructed environment variable names.

51-68: LGTM!

The polling logic with a 5-minute timeout and XML parsing to detect etcd startup is appropriate. The check for exactly 2 lines on line 58 appears to be a topology-specific validation.

94-102: LGTM!

The resource disable flow with pcs resource disable followed by wait_for_podman_etcd_to_stop (from etcd-common-tools) properly ensures a clean shutdown before proceeding with the restore.

178-182: LGTM!

The approach of printing the completion message even on timeout (lines 178-180) is intentional per previous discussion. This ensures operators always receive verification instructions, which is appropriate since timeout doesn't necessarily indicate restore failure.

pkg/operator/starter.go (1)

471-478: LGTM!

The addition of the infrastructure lister at line 475 correctly wires topology information into the ScriptController, enabling topology-aware script deployment. This follows the established pattern of passing listers to controllers.

bindata/etcd/disable-etcd-tnf.sh (2)

1-15: LGTM!

The script setup with strict shell options (errexit, pipefail, errtrace) and root check follows best practices for operational scripts.

27-34: LGTM!

The script correctly sources required dependencies and calls wait_for_podman_etcd_to_stop from etcd-common-tools with appropriate error handling. The past review concern about the undefined function has been resolved.

pkg/operator/scriptcontroller/scriptcontroller.go (3)

28-32: LGTM!

The EnvVarGetter interface provides a clean abstraction over environment variable access and listener management, improving testability and decoupling from concrete implementations.

34-40: LGTM!

The refactoring to use InfraLister and EnvVarGetter interface (lines 37-38, 46-47, 53-54) properly enables topology detection while maintaining clean separation of concerns. The initialization is correct and follows established patterns.

Also applies to: 42-55

129-154: LGTM!

The topology-aware script selection logic is well-implemented:

Line 132: Proper error handling for topology detection failure

Lines 135-140: Clear conditional selection of TNF vs standard scripts

Lines 142-150: Common scripts (quorum-restore.sh, cluster-backup.sh, etcd-common-tools) deployed regardless of topology

Lines 153-154: Topology-specific scripts deployed under standardized names (cluster-restore.sh, disable-etcd.sh), ensuring consistent paths for consuming code

This approach maintains backward compatibility while supporting TNF deployments.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

bindata/etcd/disable-etcd-tnf.sh (1)

17-25: Consider extracting shared helper to reduce duplication.

The source_required_dependency function is duplicated across cluster-restore.sh, cluster-restore-tnf.sh, and this file. While consistency is valuable, this duplication could lead to maintenance issues if the error handling logic needs to change.

🔎 Optional: Move to etcd-common-tools

Consider moving this helper to etcd-common-tools so it can be shared across all scripts. However, this creates a chicken-and-egg problem (you need the helper to source the file that contains it), so the current duplication may be acceptable.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 5804bac and 62a5b4c.

📒 Files selected for processing (7)

bindata/etcd/cluster-restore-tnf.sh (1 hunks)
bindata/etcd/cluster-restore.sh (1 hunks)
bindata/etcd/disable-etcd-tnf.sh (1 hunks)
bindata/etcd/etcd-common-tools (1 hunks)
pkg/operator/scriptcontroller/scriptcontroller.go (3 hunks)
pkg/operator/scriptcontroller/scriptcontroller_test.go (1 hunks)
pkg/operator/starter.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

bindata/etcd/etcd-common-tools
pkg/operator/scriptcontroller/scriptcontroller_test.go

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

pkg/operator/starter.go
bindata/etcd/cluster-restore.sh
pkg/operator/scriptcontroller/scriptcontroller.go
bindata/etcd/disable-etcd-tnf.sh
bindata/etcd/cluster-restore-tnf.sh

🧬 Code graph analysis (2)

bindata/etcd/disable-etcd-tnf.sh (2)

bindata/etcd/cluster-restore-tnf.sh (1)

source_required_dependency (19-27)

bindata/etcd/cluster-restore.sh (1)

source_required_dependency (21-29)

bindata/etcd/cluster-restore-tnf.sh (2)

bindata/etcd/cluster-restore.sh (2)

source_required_dependency (21-29)

usage (34-40)

bindata/etcd/disable-etcd-tnf.sh (1)

source_required_dependency (17-25)

🔇 Additional comments (10)

bindata/etcd/cluster-restore.sh (1)

143-143: LGTM!

The addition of print_restore_completion_message at the end of the restore flow appropriately provides operators with post-restore guidance and verification instructions. The function is properly sourced from etcd-common-tools (line 32).

pkg/operator/starter.go (1)

471-478: LGTM!

Threading the Infrastructure lister into ScriptController initialization properly enables topology-aware script deployment. The lister is sourced from the already-initialized configInformers and aligns with the pattern used by other controllers in this file.

bindata/etcd/disable-etcd-tnf.sh (1)

30-34: LGTM!

The script correctly calls wait_for_podman_etcd_to_stop (from etcd-common-tools) and handles failures appropriately with a clear error message and non-zero exit code.

bindata/etcd/cluster-restore-tnf.sh (4)

70-110: LGTM!

The setup_pacemaker_restore function properly orchestrates the TNF-specific restore preparation:

Validates required environment (hostname, IP address)

Constructs appropriate etcdctl restore flags for single-member bootstrap

Disables the Pacemaker etcd resource and waits for clean shutdown

Safely moves configuration files with proper error handling

The array initialization at line 86 uses assignment (=), not append, so no pre-initialization is needed.

142-164: LGTM!

The backup and restore logic properly:

Backs up existing data-dir with cleanup of old backups

Completely removes and recreates data-dir for a clean restore

Uses the appropriate etcd client binary (etcdctl or etcdutl)

Includes a clear comment explaining why revision bumping is skipped

The error handling and success messaging provide good operator feedback.

166-181: LGTM!

The restart and completion logic appropriately:

Sets the force_new_cluster attribute for single-member bootstrap

Re-enables the Pacemaker etcd resource

Shows completion message regardless of wait timeout to provide verification instructions

The intentional non-exit on wait_for_podman_etcd_start failure (line 177-179) was confirmed in past reviews as correct behavior for providing operators with verification guidance even on timeout.

51-68: No action required. The 2-line check is appropriate for TNF's constrained topology.

TNF clusters are validated to contain exactly 2 nodes by the configuration layer (pkg/tnf/pkg/config/cluster.go enforces len(nodes.Items) == 2). The pacemaker cluster setup is also hardcoded for 2 nodes. The check for exactly 2 lines in the pcs status xml output correctly verifies that both nodes have their etcd resource started. This is not a fragile assumption but an appropriate implementation for a topology that is constrained by design, not convention.

pkg/operator/scriptcontroller/scriptcontroller.go (3)

28-32: LGTM!

The EnvVarGetter interface properly abstracts environment variable access and listener management, enabling clean dependency injection and testability. The interface methods align with their usage at lines 56 and 119.

42-56: LGTM!

The constructor signature and initialization properly incorporate the infrastructure lister and environment variable getter. The listener registration (line 56) ensures the controller receives updates when environment variables change.

129-154: LGTM!

The topology-aware script deployment logic properly:

Detects cluster topology using IsExternalEtcdCluster with appropriate error handling

Selects TNF-specific scripts (cluster-restore-tnf.sh, disable-etcd-tnf.sh) for external etcd clusters

Falls back to standard scripts for traditional deployments

Deploys common scripts (quorum-restore, cluster-backup, etcd-common-tools) for both topologies

Maps topology-specific scripts to standard names (cluster-restore.sh, disable-etcd.sh) so consumers don't need topology awareness

This design cleanly separates topology concerns from script consumers.

clobrano · 2025-12-18T12:36:38Z

/retest-required

clobrano · 2026-01-09T18:33:55Z

There are some typos, found by CodeRabbit. I'll fix them next week

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In @bindata/etcd/cluster-restore-tnf.sh:
- Around line 51-68: The wait_for_podman_etcd_start function currently requires
exactly two "Started" lines which fails for single-node restores; update the
check in wait_for_podman_etcd_start (the branch that inspects output from pcs
status xml | grep 'id="etcd".*role="Started"') to accept at least one started
instance (e.g., test that the line count is >= 1) instead of == 2 so the
function succeeds when only the local node has etcd enabled during restore.
- Around line 130-133: ETCD_CLIENT is set using the parameter expansion
`${ETCD_ETCDCTL_BIN+etcdctl}`, which is inconsistent with the simpler
direct-assignment pattern used by the check_snapshot_status function; change the
assignment to use a direct default value (e.g., set ETCD_CLIENT="etcdctl" first)
and then override it when ETCD_ETCDUTL_BIN is present (keep the existing if [ -n
"${ETCD_ETCDUTL_BIN}" ] branch), referencing the ETCD_CLIENT variable and the
ETCD_ETCDCTL_BIN / ETCD_ETCDUTL_BIN symbols so the logic matches
check_snapshot_status.

In @bindata/etcd/disable-etcd-tnf.sh:
- Around line 30-34: The script calls wait_for_podman_etcd_to_stop without first
disabling the Pacemaker resource; insert a pcs resource disable etcd call before
invoking wait_for_podman_etcd_to_stop so the etcd resource is instructed to stop
(match the pattern used in cluster-restore-tnf.sh), then proceed to echo and
exit on failure as currently done; update the block that contains
wait_for_podman_etcd_to_stop to run pcs resource disable etcd first.

🧹 Nitpick comments (2)

bindata/etcd/cluster-restore-tnf.sh (1)
118-124: Minor: Improve error message when snapshot not found.

If the ls command finds no matching files, SNAPSHOT_FILE will be empty, and the error message at line 122 will show an empty path, making it harder for operators to debug.
♻️ Suggested improvement
 # shellcheck disable=SC2012
 SNAPSHOT_FILE=$(ls -vd "${BACKUP_DIR}"/snapshot*.db | tail -1) || true
 
 if [ ! -f "${SNAPSHOT_FILE}" ]; then
-  echo "etcd snapshot ${SNAPSHOT_FILE} does not exist"
+  echo "etcd snapshot not found in ${BACKUP_DIR} (expected snapshot*.db)"
   exit 1
 fi
pkg/operator/scriptcontroller/scriptcontroller_test.go (1)

119-131: Consider adding a test for topology detection failure.

The current tests cover missing env vars but not the case where ceohelpers.IsExternalEtcdCluster returns an error (e.g., infrastructure resource not found). This would improve test coverage for the error path at lines 132-133 in scriptcontroller.go.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 6b1d510 and 207ce68.

📒 Files selected for processing (7)

bindata/etcd/cluster-restore-tnf.sh
bindata/etcd/cluster-restore.sh
bindata/etcd/disable-etcd-tnf.sh
bindata/etcd/etcd-common-tools
pkg/operator/scriptcontroller/scriptcontroller.go
pkg/operator/scriptcontroller/scriptcontroller_test.go
pkg/operator/starter.go

🚧 Files skipped from review as they are similar to previous changes (2)

bindata/etcd/cluster-restore.sh
bindata/etcd/etcd-common-tools

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

bindata/etcd/cluster-restore-tnf.sh
pkg/operator/scriptcontroller/scriptcontroller.go
pkg/operator/starter.go
bindata/etcd/disable-etcd-tnf.sh
pkg/operator/scriptcontroller/scriptcontroller_test.go

🧬 Code graph analysis (4)

bindata/etcd/cluster-restore-tnf.sh (2)

bindata/etcd/cluster-restore.sh (2)

source_required_dependency (21-29)

usage (34-40)

bindata/etcd/disable-etcd-tnf.sh (1)

source_required_dependency (17-25)

pkg/operator/scriptcontroller/scriptcontroller.go (3)

pkg/etcdenvvar/envvarcontroller.go (1)

Enqueueable (65-67)

pkg/operator/ceohelpers/external_etcd_status.go (1)

IsExternalEtcdCluster (23-35)

bindata/assets.go (1)

MustAsset (15-21)

bindata/etcd/disable-etcd-tnf.sh (2)

bindata/etcd/cluster-restore-tnf.sh (1)

source_required_dependency (19-27)

bindata/etcd/cluster-restore.sh (1)

source_required_dependency (21-29)

pkg/operator/scriptcontroller/scriptcontroller_test.go (5)

pkg/operator/scriptcontroller/scriptcontroller.go (1)

ScriptController (34-40)

pkg/testutils/testutils.go (1)

FakeInfrastructureLister (563-571)

pkg/etcdenvvar/fake_envvar.go (1)

FakeEnvVar (3-6)

pkg/operator/operatorclient/interfaces.go (1)

TargetNamespace (7-7)

pkg/version/version.go (1)

Get (29-41)

🔇 Additional comments (4)

pkg/operator/starter.go (1)

471-478: LGTM!

The infrastructure lister is correctly passed to enable topology-aware script deployment. The lister is already instantiated and used by other controllers in this file, so this addition is consistent with the existing patterns.

pkg/operator/scriptcontroller/scriptcontroller_test.go (1)

37-117: LGTM!

Well-structured table-driven tests covering the three topology modes. The content assertions verify that topology-specific scripts are correctly selected based on infrastructure mode.

pkg/operator/scriptcontroller/scriptcontroller.go (2)

28-40: LGTM!

Clean interface extraction for EnvVarGetter that improves testability. The struct fields are properly typed and the interface methods align with the actual usage.

129-154: LGTM!

Topology-aware script deployment is cleanly implemented. TNF clusters get Pacemaker-specific scripts while standard topologies get the original scripts, both deployed under consistent names (cluster-restore.sh, disable-etcd.sh).

coderabbitai · 2026-01-12T09:39:54Z

+ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}"
+if [ -n "${ETCD_ETCDUTL_BIN}" ]; then
+  ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"
+fi


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # First, let's locate the file and examine its context fd "cluster-restore" --type f # Then search for ETCD_CLIENT assignments to understand the pattern rg -n "ETCD_CLIENT=" --type sh -A 2 -B 2 # Also search for ETCD_ETCDCTL_BIN usage rg -n "ETCD_ETCDCTL_BIN" --type sh -C 2

Repository: openshift/cluster-etcd-operator

Length of output: 1688

🏁 Script executed:

#!/bin/bash # Now let's understand how ETCD_CLIENT is used and where these variables are set echo "=== How ETCD_CLIENT is used ===" rg -n "ETCD_CLIENT" --type sh -A 1 -B 1 echo "=== Where ETCD_ETCDCTL_BIN is set ===" rg -n "ETCD_ETCDCTL_BIN=" --type sh -A 1 -B 1 echo "=== Where ETCD_ETCDUTL_BIN is set ===" rg -n "ETCD_ETCDUTL_BIN=" --type sh -A 1 -B 1

Repository: openshift/cluster-etcd-operator

Length of output: 1560

🏁 Script executed:

#!/bin/bash # Let's look at the entire files to understand context and see where these variables come from echo "=== First 50 lines of cluster-restore.sh ===" head -50 bindata/etcd/cluster-restore.sh echo "" echo "=== First 50 lines of cluster-restore-tnf.sh ===" head -50 bindata/etcd/cluster-restore-tnf.sh echo "" echo "=== Search for where ETCD_ETCDCTL_BIN and ETCD_ETCDUTL_BIN might be exported or defined ===" rg -n "export.*ETCD_ETCD" --type sh rg -n "ETCD_ETCD.*BIN" --type sh | head -20

Repository: openshift/cluster-etcd-operator

Length of output: 4040

🏁 Script executed:

#!/bin/bash # Let's find and examine the etcd-common-tools file that defines these variables fd "etcd-common-tools" --type f # Also look for any files in the repo that define these binaries rg -n "ETCD_ETCDCTL_BIN" --type sh -B 2 -A 2 rg -n "ETCD_ETCDUTL_BIN" --type sh -B 2 -A 2

Repository: openshift/cluster-etcd-operator

Length of output: 1494

🏁 Script executed:

#!/bin/bash # Let's examine the etcd-common-tools file to understand where these variables are defined cat -n bindata/etcd/etcd-common-tools | head -200

Repository: openshift/cluster-etcd-operator

Length of output: 5877

🏁 Script executed:

#!/bin/bash # Let's examine the full context around where these are used in cluster-restore.sh echo "=== Lines 75-90 in cluster-restore.sh (around ETCD_CLIENT assignment) ===" sed -n '75,90p' bindata/etcd/cluster-restore.sh echo "" echo "=== Lines around ETCD_CLIENT usage in cluster-restore.sh ===" sed -n '125,140p' bindata/etcd/cluster-restore.sh echo "" echo "=== Lines around ETCD_CLIENT usage in cluster-restore-tnf.sh ===" sed -n '155,170p' bindata/etcd/cluster-restore-tnf.sh

Repository: openshift/cluster-etcd-operator

Length of output: 2403

🏁 Script executed:

#!/bin/bash # Let's verify the actual bash behavior by testing the parameter expansion bash <<'EOF' # Test the actual behavior of the parameter expansion ETCD_ETCDCTL_BIN="etcdctl" ETCD_ETCDUTL_BIN="etcdutl" echo "Testing parameter expansion syntax:" echo "ETCD_ETCDCTL_BIN=${ETCD_ETCDCTL_BIN}" echo "ETCD_ETCDUTL_BIN=${ETCD_ETCDUTL_BIN}" # Test 1: The syntax used in cluster-restore.sh ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}" echo "Result with \${ETCD_ETCDCTL_BIN+etcdctl}: ETCD_CLIENT='${ETCD_CLIENT}'" # Test 2: Direct assignment (used in check_snapshot_status) ETCD_CLIENT="${ETCD_ETCDCTL_BIN}" echo "Result with \${ETCD_ETCDCTL_BIN}: ETCD_CLIENT='${ETCD_CLIENT}'" # Test 3: If ETCD_ETCDUTL_BIN is set if [ -n "${ETCD_ETCDUTL_BIN}" ]; then ETCD_CLIENT="${ETCD_ETCDUTL_BIN}" fi echo "After override with ETCD_ETCDUTL_BIN: ETCD_CLIENT='${ETCD_CLIENT}'" EOF

Repository: openshift/cluster-etcd-operator

Length of output: 338

Align variable assignment with check_snapshot_status pattern above.

The parameter expansion ${ETCD_ETCDCTL_BIN+etcdctl} is inconsistent with the direct assignment used in the check_snapshot_status function (lines 57-59 of etcd-common-tools). Both produce the same result since these variables are hardcoded to literal strings, but using direct assignment would be clearer:

-ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}" +ETCD_CLIENT="${ETCD_ETCDCTL_BIN}" if [ -n "${ETCD_ETCDUTL_BIN}" ]; then ETCD_CLIENT="${ETCD_ETCDUTL_BIN}" fi

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}"

if [ -n "${ETCD_ETCDUTL_BIN}" ]; then

ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"

fi

ETCD_CLIENT="${ETCD_ETCDCTL_BIN}"

if [ -n "${ETCD_ETCDUTL_BIN}" ]; then

ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"

fi

🤖 Prompt for AI Agents

In @bindata/etcd/cluster-restore-tnf.sh around lines 130 - 133, ETCD_CLIENT is set using the parameter expansion `${ETCD_ETCDCTL_BIN+etcdctl}`, which is inconsistent with the simpler direct-assignment pattern used by the check_snapshot_status function; change the assignment to use a direct default value (e.g., set ETCD_CLIENT="etcdctl" first) and then override it when ETCD_ETCDUTL_BIN is present (keep the existing if [ -n "${ETCD_ETCDUTL_BIN}" ] branch), referencing the ETCD_CLIENT variable and the ETCD_ETCDCTL_BIN / ETCD_ETCDUTL_BIN symbols so the logic matches check_snapshot_status.

To maintain consistency with the original script, this functionality was carried over to the TNF version

@clobrano Understood. Maintaining consistency with the original cluster-restore.sh implementation is a valid approach for the TNF version. Thank you for the clarification.

Add dedicated etcd management scripts for Two Node with Fencing (TNF) deployments that use Pacemaker-managed Podman containers instead of Kubernetes static pods. Changes: - Add cluster-restore-tnf.sh with Pacemaker resource management - setup_pacemaker_restore() stops etcd resource and prepares restore flags - wait_for_podman_etcd_start() verifies etcd resource starts via pcs status - get_etcd_advertise_ip() detects advertise IP from etcd.env - Uses crm_attribute to set force_new_cluster on restart - Add disable-etcd-tnf.sh for disabling Pacemaker-managed etcd - Uses 'pcs resource disable etcd' instead of moving static pod manifests - Extend etcd-common-tools with shared TNF utilities - wait_for_podman_etcd_to_stop() waits for crm_resource to confirm stop - print_restore_completion_message() outputs rollout monitoring instructions These scripts will be conditionally deployed by the cluster-etcd-operator based on control plane topology detection.

Implement conditional script deployment in the script controller to automatically deploy TNF-specific or standard etcd scripts based on the detected control plane topology. This maintains backward compatibility while enabling automatic support for TNF deployments without manual script selection.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In @bindata/etcd/cluster-restore-tnf.sh:
- Around line 51-68: The wait_for_podman_etcd_start() function currently assumes
two started etcd resources by checking wc -l -eq 2, which fails for single-node
restores initialized in setup_pacemaker_restore(); update the logic to accept
one or more started etcd instances (e.g., check that the line count is >= 1) or
make the expected count configurable based on the restore scenario, so the
function returns success for a single-node start during restore while still
validating multi-node cases when appropriate.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 207ce68 and 1664b42.

📒 Files selected for processing (7)

bindata/etcd/cluster-restore-tnf.sh
bindata/etcd/cluster-restore.sh
bindata/etcd/disable-etcd-tnf.sh
bindata/etcd/etcd-common-tools
pkg/operator/scriptcontroller/scriptcontroller.go
pkg/operator/scriptcontroller/scriptcontroller_test.go
pkg/operator/starter.go

🚧 Files skipped from review as they are similar to previous changes (3)

bindata/etcd/disable-etcd-tnf.sh
bindata/etcd/cluster-restore.sh
bindata/etcd/etcd-common-tools

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

pkg/operator/scriptcontroller/scriptcontroller_test.go
pkg/operator/scriptcontroller/scriptcontroller.go
pkg/operator/starter.go
bindata/etcd/cluster-restore-tnf.sh

🧬 Code graph analysis (2)

pkg/operator/scriptcontroller/scriptcontroller_test.go (4)

pkg/operator/scriptcontroller/scriptcontroller.go (1)

ScriptController (34-40)

pkg/testutils/testutils.go (1)

FakeInfrastructureLister (563-571)

pkg/etcdenvvar/fake_envvar.go (1)

FakeEnvVar (3-6)

pkg/operator/operatorclient/interfaces.go (1)

TargetNamespace (7-7)

bindata/etcd/cluster-restore-tnf.sh (1)

bindata/etcd/cluster-restore.sh (1)

usage (34-40)

🔇 Additional comments (12)

pkg/operator/starter.go (1)

471-478: LGTM - Infrastructure lister wiring follows existing patterns.

The infrastructure lister is already available from configInformers and is used consistently by other controllers in this file. This change correctly threads topology awareness to enable TNF-specific script deployment.

pkg/operator/scriptcontroller/scriptcontroller_test.go (3)

20-35: LGTM - Test helper is well-structured.

The helper correctly sets up the minimal dependencies needed to test manageScriptConfigMap, using the existing FakeInfrastructureLister utility for topology simulation.

37-117: LGTM - Test coverage for topology-aware script deployment is comprehensive.

The test cases effectively validate:

TNF topology deploys Pacemaker-specific scripts (pcs resource enable/disable etcd)

Standard topologies deploy static-pod scripts (wait_for_containers_to_stop, mv_static_pods)

Common scripts are always present regardless of topology

The negative assertions (NotContains) provide good protection against script content leaking across topology boundaries.

119-131: LGTM - Error handling test validates graceful failure.

Good coverage of the error path when required environment variables are missing.

bindata/etcd/cluster-restore-tnf.sh (5)

1-17: LGTM - Script setup follows best practices.

Appropriate use of errexit, pipefail, errtrace for robust error handling, and root check is necessary for Pacemaker and etcd operations.

19-31: LGTM - Dependency sourcing with proper validation.

32-49: LGTM - Usage and IP detection are appropriate for TNF context.

The usage correctly omits the static pod resources reference (per standard script) since TNF uses Pacemaker. The IP detection properly leverages the NODE_<nodename>_IP pattern from etcd.env.

70-110: LGTM - Pacemaker restore setup is well-structured.

The function properly:

Validates required variables (NODENAME, IP) with helpful error messages

Disables etcd resource before manipulation

Waits for clean stop before proceeding

Handles missing config files gracefully

The || echo "Warning: ..." pattern at line 108 is acceptable since these are non-critical config files being moved to backup.

160-183: LGTM - Restore execution and completion handling.

The restore flow correctly:

Exits on snapshot restore failure (critical)

Sets force_new_cluster attribute for single-member bootstrap

Treats wait timeout as non-fatal with helpful messaging (allows admin to troubleshoot)

Always prints completion message for admin guidance

pkg/operator/scriptcontroller/scriptcontroller.go (3)

28-32: LGTM - Clean interface extraction for testability.

The EnvVarGetter interface properly abstracts the environment variable controller, enabling mock implementations in tests while preserving the listener registration pattern for re-queuing on changes.

34-55: LGTM - Controller struct and constructor updates.

The changes properly introduce infrastructure awareness while maintaining backward compatibility with existing controller patterns.

129-155: LGTM - Topology-aware script selection is well-implemented.

The logic correctly:

Detects TNF topology via ceohelpers.IsExternalEtcdCluster

Selects appropriate script assets based on topology

Deploys scripts with consistent names (cluster-restore.sh, disable-etcd.sh) regardless of source

Handles topology detection errors explicitly

clobrano · 2026-01-13T08:46:01Z

/retest-required

tjungblu · 2026-01-13T13:28:47Z

/lgtm

/override ci/prow/e2e-aws-ovn-serial-2of2

openshift-ci · 2026-01-13T13:29:41Z

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-serial-2of2

Details

In response to this:

/lgtm

/override ci/prow/e2e-aws-ovn-serial-2of2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2026-01-13T13:29:45Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [tjungblu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2026-01-13T13:29:46Z

@clobrano: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2026-01-13T14:35:14Z

@clobrano: This pull request references Jira Issue OCPBUGS-60588, which is invalid:

expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Extend cluster-restore.sh to support restoring etcd data in Two Node with Fencing clusters, where etcd is controlled by a Pacemaker resource-agent and runs as a Podman container instead of a Kubernetes static pod.

The script preserves existing static pod restore behavior when Pacemaker is not detected.

Restore process for Pacemaker-managed etcd:

Auto-detect Pacemaker management or use PACEMAKER_MANAGED_ETCD env var [new]

Disable Pacemaker podman-etcd resource agent to stop container [new - Pacemaker alternative to static pod stop]

Backup existing etcd data directory [existing]

Move podman-etcd configuration files to backup location [new]

Restore snapshot using etcdctl with cluster initialization flags [existing, with new Pacemaker-specific flags added]

Set force_new_cluster attribute for single-member bootstrap [new]

Re-enable Pacemaker resource to start etcd with restored data [new - Pacemaker alternative to static pod restart]

New environment variables:

PACEMAKER_MANAGED_ETCD: Force Pacemaker restore mode

ETCD_CONTAINER_NAME: Podman container name (default: "etcd")

ETCD_ADVERTISE_IP: Override advertise IP (auto-detected from etcd.env)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

clobrano · 2026-01-13T14:41:10Z

/jira refresh

openshift-ci-robot · 2026-01-13T14:41:21Z

@clobrano: This pull request references Jira Issue OCPBUGS-60588, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

clobrano · 2026-01-13T15:03:04Z

/verified by @clobrano

openshift-ci-robot · 2026-01-13T15:03:17Z

@clobrano: This PR has been marked as verified by @clobrano.

Details

In response to this:

/verified by @clobrano

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2026-01-13T15:07:23Z

@clobrano: Jira Issue Verification Checks: Jira Issue OCPBUGS-60588
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-60588 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

Extend cluster-restore.sh to support restoring etcd data in Two Node with Fencing clusters, where etcd is controlled by a Pacemaker resource-agent and runs as a Podman container instead of a Kubernetes static pod.

The script preserves existing static pod restore behavior when Pacemaker is not detected.

Restore process for Pacemaker-managed etcd:

Auto-detect Pacemaker management or use PACEMAKER_MANAGED_ETCD env var [new]

Disable Pacemaker podman-etcd resource agent to stop container [new - Pacemaker alternative to static pod stop]

Backup existing etcd data directory [existing]

Move podman-etcd configuration files to backup location [new]

Restore snapshot using etcdctl with cluster initialization flags [existing, with new Pacemaker-specific flags added]

Set force_new_cluster attribute for single-member bootstrap [new]

Re-enable Pacemaker resource to start etcd with restored data [new - Pacemaker alternative to static pod restart]

New environment variables:

PACEMAKER_MANAGED_ETCD: Force Pacemaker restore mode

ETCD_CONTAINER_NAME: Podman container name (default: "etcd")

ETCD_ADVERTISE_IP: Override advertise IP (auto-detected from etcd.env)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-merge-robot · 2026-01-19T16:23:55Z

Fix included in accepted release 4.22.0-0.nightly-2026-01-19-085729

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 10, 2025

tjungblu reviewed Dec 10, 2025

View reviewed changes

lance5890 reviewed Dec 11, 2025

View reviewed changes

clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch 3 times, most recently from 17397af to ec62339 Compare December 11, 2025 17:08

clobrano marked this pull request as ready for review December 16, 2025 09:48

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 16, 2025

openshift-ci Bot requested review from ironcladlou and tjungblu December 16, 2025 09:49

coderabbitai Bot reviewed Dec 16, 2025

View reviewed changes

Comment thread bindata/etcd/cluster-restore-tnf.sh

clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from ec62339 to e46a48e Compare December 16, 2025 13:21

clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from e46a48e to 4873375 Compare December 18, 2025 08:39

coderabbitai Bot reviewed Dec 18, 2025

View reviewed changes

clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from 4873375 to 5804bac Compare December 18, 2025 08:51

coderabbitai Bot reviewed Dec 18, 2025

View reviewed changes

Comment thread bindata/etcd/cluster-restore-tnf.sh Outdated

clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from 5804bac to 62a5b4c Compare December 18, 2025 10:30

coderabbitai Bot reviewed Dec 18, 2025

View reviewed changes

clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from 6b1d510 to 207ce68 Compare January 12, 2026 09:35

openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 12, 2026

coderabbitai Bot reviewed Jan 12, 2026

View reviewed changes

clobrano commented Jan 12, 2026

View reviewed changes

Comment thread bindata/etcd/cluster-restore-tnf.sh

clobrano added 2 commits January 12, 2026 15:58

clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from 207ce68 to 1664b42 Compare January 12, 2026 14:59

coderabbitai Bot reviewed Jan 12, 2026

View reviewed changes

Comment thread bindata/etcd/cluster-restore-tnf.sh

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jan 13, 2026

clobrano changed the title ~~[TNF] support restore for pacemaker-managed etcd~~ OCPBUGS-60588: [TNF] support restore for pacemaker-managed etcd Jan 13, 2026

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 13, 2026

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 13, 2026

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jan 13, 2026

openshift-merge-bot Bot merged commit d93728d into openshift:main Jan 13, 2026
15 checks passed

clobrano deleted the enhancement/tnf-support-for-cluster-restore-script branch January 13, 2026 16:19

		$ oc get co/etcd -w
		(Wait to see if the etcd operator completes the rollout successfully)

Conversation

clobrano commented Dec 10, 2025

Uh oh!

openshift-ci Bot commented Dec 10, 2025

Uh oh!

coderabbitai Bot commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

clobrano Jan 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

clobrano commented Dec 16, 2025

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

clobrano commented Dec 18, 2025

Uh oh!

clobrano commented Jan 9, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jan 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jan 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented Dec 10, 2025 •

edited

Loading

clobrano Jan 12, 2026 •

edited

Loading

coderabbitai Bot Jan 12, 2026 •

edited

Loading