Skip to content

OCPBUGS-60588: [TNF] support restore for pacemaker-managed etcd#1521

Merged
openshift-merge-bot[bot] merged 2 commits intoopenshift:mainfrom
clobrano:enhancement/tnf-support-for-cluster-restore-script
Jan 13, 2026
Merged

OCPBUGS-60588: [TNF] support restore for pacemaker-managed etcd#1521
openshift-merge-bot[bot] merged 2 commits intoopenshift:mainfrom
clobrano:enhancement/tnf-support-for-cluster-restore-script

Conversation

@clobrano
Copy link
Copy Markdown
Contributor

Extend cluster-restore.sh to support restoring etcd data in Two Node with Fencing clusters, where etcd is controlled by a Pacemaker resource-agent and runs as a Podman container instead of a Kubernetes static pod.

The script preserves existing static pod restore behavior when Pacemaker is not detected.

Restore process for Pacemaker-managed etcd:

  1. Auto-detect Pacemaker management or use PACEMAKER_MANAGED_ETCD env var [new]
  2. Disable Pacemaker podman-etcd resource agent to stop container [new - Pacemaker alternative to static pod stop]
  3. Backup existing etcd data directory [existing]
  4. Move podman-etcd configuration files to backup location [new]
  5. Restore snapshot using etcdctl with cluster initialization flags [existing, with new Pacemaker-specific flags added]
  6. Set force_new_cluster attribute for single-member bootstrap [new]
  7. Re-enable Pacemaker resource to start etcd with restored data [new - Pacemaker alternative to static pod restart]

New environment variables:

  • PACEMAKER_MANAGED_ETCD: Force Pacemaker restore mode
  • ETCD_CONTAINER_NAME: Podman container name (default: "etcd")
  • ETCD_ADVERTISE_IP: Override advertise IP (auto-detected from etcd.env)

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 10, 2025
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Dec 10, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Dec 10, 2025

Walkthrough

Adds TNF-specific etcd restore/disable scripts and helpers, extends etcd-common-tools with stop/wait and completion messaging, makes ScriptController deploy topology-aware scripts using an Infrastructure lister and EnvVarGetter, and adds unit tests for the script deployment behavior.

Changes

Cohort / File(s) Summary
TNF-specific etcd scripts
bindata/etcd/cluster-restore-tnf.sh, bindata/etcd/disable-etcd-tnf.sh
New Bash scripts for Pacemaker/TNF environments: cluster-restore-tnf.sh orchestrates snapshot restore (root check, deps, advertise IP, etcdctl, snapshot restore, pacemaker interactions, backups, error handling); disable-etcd-tnf.sh disables the etcd Pacemaker resource and waits for podman-etcd to stop.
Etcd common tools & restore message
bindata/etcd/etcd-common-tools, bindata/etcd/cluster-restore.sh
Added wait_for_podman_etcd_to_stop() (120s polling via crm_resource --locate) and print_restore_completion_message(); appended call to print_restore_completion_message in cluster-restore.sh.
ScriptController topology-aware deployment
pkg/operator/scriptcontroller/scriptcontroller.go
Introduced EnvVarGetter interface, replaced concrete env-var controller field, added infraLister (InfrastructureLister) and constructor arg; logic now detects topology (TNF vs standard) and deploys appropriate cluster-restore.sh/disable-etcd.sh contents while still deploying common scripts.
Unit tests for script controller
pkg/operator/scriptcontroller/scriptcontroller_test.go
Added getTestController helper and tests: TestManageScriptConfigMap (DualReplica, HighlyAvailable, SingleReplica) verifying created ConfigMap contents and topology-specific script selection; TestManageScriptConfigMap_MissingEnvVars verifies error on missing env vars.
Operator startup wiring
pkg/operator/starter.go
Passes configInformers.Config().V1().Infrastructures().Lister() into NewScriptControllerController to thread the Infrastructure lister through startup.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

✨ Finishing touches
  • 📝 Generate docstrings

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread bindata/etcd/cluster-restore.sh Outdated
return 0
fi

if systemctl status pacemaker >/dev/null 2>&1 ; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is smart, until somebody else comes around and uses pacemaker on the control plane 🤣
Can we make this more specific? there is likely the infra CRD somewhere on /etc/kubernetes/manifest.

Otherwise, this is also a advantage of having your own restore script installed by the operator only on TNF - that gives you the guarantee that you definitely run on TNF, allowing you to remove this altogether

Comment thread bindata/etcd/cluster-restore.sh Outdated
Comment on lines +82 to +88
function get_etcd_advertise_ip() {
# Try to detect IP from etcd.env (NODE_<nodename>_IP variable)
NODENAME_UNDERSCORE=$(echo "${NODENAME}" | tr '-' '_')
NODE_IP_VAR="NODE_${NODENAME_UNDERSCORE}_IP"
IP="${!NODE_IP_VAR}"
echo "$IP"
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it always be localhost:2379?

Comment thread bindata/etcd/cluster-restore.sh Outdated
done

echo "timed out waiting for etcd resources to start (timeout: $timeout seconds)"
return 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like you would want to exit 1 here directly, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you also don't exit on the return of a 1 below

Comment thread bindata/etcd/cluster-restore.sh Outdated
Comment on lines +104 to +133
function ensure_etcd_container_stopped() {
echo "ensure ${ETCD_CONTAINER_NAME} container is stopped"
local start=$SECONDS
local timeout=120
local disable_etcd_called=0
while [ $((SECONDS - start)) -lt "$timeout" ]; do
if podman container exists "${ETCD_CONTAINER_NAME}" && \
[[ "$(podman inspect -f '{{.State.Status}}' "${ETCD_CONTAINER_NAME}")" != "exited" ]]; then

if [ "$disable_etcd_called" -eq 1 ]; then
sleep 1
continue
fi

if ! pcs resource disable etcd; then
echo "could not disable podman-etcd. Please ensure the ${ETCD_CONTAINER_NAME} container is stopped"
exit 1
fi

disable_etcd_called=1
continue
fi

echo "${ETCD_CONTAINER_NAME} container is stopped"
return 0
done

echo "timed out waiting for ${ETCD_CONTAINER_NAME} to stop"
return 1
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably a good candidate for ./disable-etcd.sh as a TNF specific script

Comment thread bindata/etcd/cluster-restore.sh Outdated

# Extend restore flags with Pacemaker-specific arguments
ETCDCTL_RESTORE_FLAGS+=(
--skip-hash-check=true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is definitely something you should keep and not skip

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

podman-etcd also makes copies (actually using cp) of the etcd db when a learner starts, which seemed to be incompatible with the hash check.
I was indeed considering to give some "special" name to such backups and let the script decide whether to use or not the hash check

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which seemed to be incompatible with the hash check.

oof, that's probably something for us to take a deeper look at - I assume those are corrupted?

Copy link
Copy Markdown
Contributor Author

@clobrano clobrano Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you were right about the snapshot being corrupted, as I haven't reproduced the error since

Comment thread bindata/etcd/cluster-restore.sh Outdated
Comment on lines +155 to +157
--name="$NODENAME"
--initial-cluster="$NODENAME=https://$IP:2380"
--initial-advertise-peer-urls="https://$IP:2380"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you really need this, given that you just go and force-new-cluster in the end

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I tried first without and failed with some "mismatch" error, IIRC

Comment thread bindata/etcd/cluster-restore.sh Outdated
echo "could not wait for etcd resources to restart"
fi

cat << 'EOF'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually a really awesome idea, thanks for adding it

Comment thread bindata/etcd/cluster-restore.sh Outdated
Comment on lines +291 to +292
$ oc get co/etcd -w
(Wait to see if the etcd operator completes the rollout successfully)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's also oc adm wait-for-stable-cluster

Comment thread bindata/etcd/cluster-restore.sh Outdated
Then monitor again:
$ oc get co/etcd -w

3. IF revision.json is still not being re-created on this node after the patch, manually restart the static pods:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkill cluster-etcd-operator might also do the trick, even though this is slightly less nice than what you're doing below ;)

Comment thread bindata/etcd/cluster-restore.sh Outdated
$ oc get co/etcd -w
(Wait to see if the etcd operator completes the rollout successfully)

2. IF the rollout does NOT complete, apply the forceRedeploymentReason patch:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also apply the oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TNF clusters already have this set to allow external Etcd management

@clobrano clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch 3 times, most recently from 17397af to ec62339 Compare December 11, 2025 17:08
@clobrano clobrano marked this pull request as ready for review December 16, 2025 09:48
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 16, 2025
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/operator/scriptcontroller/scriptcontroller.go (1)

132-134: Use %w for proper error wrapping.

The error at line 133 uses %v which loses the original error type. Using %w enables callers to use errors.Is() and errors.Unwrap().

-	if isTNF, err := ceohelpers.IsExternalEtcdCluster(ctx, c.infraLister); err != nil {
-		return nil, false, fmt.Errorf("failed to detect cluster topology: %v", err)
+	if isTNF, err := ceohelpers.IsExternalEtcdCluster(ctx, c.infraLister); err != nil {
+		return nil, false, fmt.Errorf("failed to detect cluster topology: %w", err)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 2e46bfd and ec62339.

📒 Files selected for processing (7)
  • bindata/etcd/cluster-restore-tnf.sh (1 hunks)
  • bindata/etcd/cluster-restore.sh (1 hunks)
  • bindata/etcd/disable-etcd-tnf.sh (1 hunks)
  • bindata/etcd/etcd-common-tools (1 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller.go (3 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller_test.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • bindata/etcd/etcd-common-tools
  • pkg/operator/scriptcontroller/scriptcontroller_test.go
  • bindata/etcd/disable-etcd-tnf.sh
  • bindata/etcd/cluster-restore.sh
  • pkg/operator/starter.go
  • bindata/etcd/cluster-restore-tnf.sh
  • pkg/operator/scriptcontroller/scriptcontroller.go
🧬 Code graph analysis (2)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
  • source_required_dependency (21-29)
  • usage (34-40)
bindata/etcd/disable-etcd-tnf.sh (1)
  • source_required_dependency (17-25)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
  • Enqueueable (65-67)
pkg/operator/ceohelpers/external_etcd_status.go (1)
  • IsExternalEtcdCluster (23-35)
bindata/assets.go (1)
  • MustAsset (15-21)
🔇 Additional comments (16)
bindata/etcd/cluster-restore.sh (1)

143-143: LGTM!

Clean addition of the post-restore completion message hook. The function is properly sourced from etcd-common-tools earlier in the script.

bindata/etcd/disable-etcd-tnf.sh (1)

1-28: LGTM!

Script structure follows the established patterns from other etcd scripts. Proper root check, strict mode settings, and dependency sourcing.

bindata/etcd/etcd-common-tools (2)

96-115: Robust implementation with clear timeout handling.

The function properly handles the detection flow and timeout. One consideration: the crm_resource --locate check may return an error when the resource doesn't exist vs. when it's stopped. The current logic handles this by continuing to retry, which is reasonable.


117-144: LGTM!

Good use of single-quoted 'EOF' to prevent variable expansion in the heredoc, keeping the $() constructs literal for user execution. The step-by-step remediation instructions are helpful.

pkg/operator/starter.go (1)

471-478: LGTM!

Clean integration of the infrastructure lister into the script controller. This enables topology-aware script deployment (TNF vs standard) while reusing the existing informer infrastructure already initialized in RunOperator.

pkg/operator/scriptcontroller/scriptcontroller_test.go (3)

20-35: LGTM!

Clean helper function that properly constructs test dependencies with fakes. Good separation of test setup from assertions.


37-117: Well-structured table-driven tests covering topology variants.

Good coverage of the three topology modes with both positive (expected content) and negative (unexpected content) assertions. The verification of common scripts presence ensures shared functionality is not accidentally removed.


119-131: LGTM!

Good error case coverage. Verifying the specific error message ensures the error is from the expected code path.

bindata/etcd/cluster-restore-tnf.sh (5)

43-49: Indirect variable expansion is valid but verify NODENAME is set.

The ${!NODE_IP_VAR} indirect expansion works correctly in Bash. The function assumes NODENAME is set before being called; this is handled in setup_pacemaker_restore at line 73-77 which sets it with a default from hostname.


51-68: Consider making the expected node count configurable.

The check [ "$(echo "$output" | wc -l)" -eq 2 ] assumes exactly 2 nodes for TNF. This is correct for dual-replica topology but hardcodes the assumption. Since this is a TNF-specific script, this is acceptable.


85-92: Verify --skip-hash-check=true is intentional.

Past review discussion indicates this was added due to podman-etcd making copies that are "incompatible with the hash check." If this is due to backup corruption, that's a deeper issue to investigate. If the copies are valid but have different hashes due to the copy mechanism, this is acceptable.

Based on past review comments, this appears to be a known workaround. Please confirm the backups are not corrupted and this is expected behavior for podman-etcd copies.


1-42: LGTM!

Script header, strict mode settings, root check, and dependency sourcing follow the established patterns from other etcd scripts. The simplified usage message correctly reflects that TNF restore only needs the snapshot (not static pod resources).


113-165: LGTM!

The main restore flow is well-structured with proper validation and error handling. Good safeguard at line 114-117 preventing BACKUP_DIR from being the same as ETCD_DATA_DIR.

pkg/operator/scriptcontroller/scriptcontroller.go (3)

28-32: Good abstraction for testability.

The EnvVarGetter interface cleanly abstracts the environment variable controller, enabling easier unit testing through mocking.


129-140: LGTM - topology-aware script selection is well structured.

The conditional logic correctly selects TNF-specific or standard scripts based on the cluster topology, and the scripts are deployed under consistent names for downstream consumers.


152-154: All required script assets are present in bindata/etcd/: both TNF-specific (cluster-restore-tnf.sh, disable-etcd-tnf.sh) and standard (cluster-restore.sh, disable-etcd.sh) scripts exist and are properly referenced by the conditional topology detection logic.

Comment thread bindata/etcd/cluster-restore-tnf.sh
@clobrano clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from ec62339 to e46a48e Compare December 16, 2025 13:21
@clobrano
Copy link
Copy Markdown
Contributor Author

/retest-required

@clobrano clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from e46a48e to 4873375 Compare December 18, 2025 08:39
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
bindata/etcd/cluster-restore-tnf.sh (1)

131-134: Simplify ETCD_CLIENT initialization.

The expression ${ETCD_ETCDCTL_BIN+etcdctl} always evaluates to "etcdctl" when ETCD_ETCDCTL_BIN is set (which it always is from etcd-common-tools). This seems like it should reference the variable value instead.

🔎 Apply this diff to fix the initialization:
-ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}"
+ETCD_CLIENT="${ETCD_ETCDCTL_BIN:-etcdctl}"
 if [ -n "${ETCD_ETCDUTL_BIN}" ]; then
   ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"
 fi

This uses the variable's value if set, or defaults to "etcdctl" if unset.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between e46a48e and 4873375.

📒 Files selected for processing (7)
  • bindata/etcd/cluster-restore-tnf.sh (1 hunks)
  • bindata/etcd/cluster-restore.sh (1 hunks)
  • bindata/etcd/disable-etcd-tnf.sh (1 hunks)
  • bindata/etcd/etcd-common-tools (1 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller.go (3 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller_test.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • bindata/etcd/cluster-restore.sh
  • bindata/etcd/disable-etcd-tnf.sh
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • bindata/etcd/etcd-common-tools
  • pkg/operator/scriptcontroller/scriptcontroller.go
  • bindata/etcd/cluster-restore-tnf.sh
  • pkg/operator/scriptcontroller/scriptcontroller_test.go
  • pkg/operator/starter.go
🧬 Code graph analysis (2)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
  • Enqueueable (65-67)
pkg/operator/ceohelpers/external_etcd_status.go (1)
  • IsExternalEtcdCluster (23-35)
bindata/assets.go (1)
  • MustAsset (15-21)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
  • source_required_dependency (21-29)
  • usage (34-40)
bindata/etcd/disable-etcd-tnf.sh (1)
  • source_required_dependency (17-25)
🔇 Additional comments (11)
bindata/etcd/etcd-common-tools (2)

96-115: LGTM! The wait function implementation is solid.

The polling logic correctly handles three states: command failure (retry), non-empty location (etcd still running), and empty location (stopped). The 120-second timeout is reasonable for Pacemaker-managed etcd stop operations.


117-143: Clear and actionable completion message.

The step-by-step instructions with concrete commands will help operators verify and troubleshoot the restore process. Good use of a heredoc for multi-line output.

pkg/operator/starter.go (1)

471-478: LGTM! Correct wiring of infrastructure lister and env var controller.

The infrastructure lister is properly obtained from the config informers and passed alongside the existing envVarController to enable topology-aware script deployment.

pkg/operator/scriptcontroller/scriptcontroller_test.go (3)

20-35: LGTM! Clean test helper setup.

The helper correctly initializes the controller with fake dependencies and a no-op enqueue function for testing.


37-117: Good topology coverage with bidirectional assertions.

The test cases effectively verify that each topology gets its specific scripts while excluding content from other topologies. Testing both expectedContent and unexpectedContent is a solid approach to catch incorrect script selection.


119-131: LGTM! Error handling test validates graceful failure.

Good coverage for the missing environment variables scenario.

bindata/etcd/cluster-restore-tnf.sh (3)

143-155: LGTM! Data directory backup and restore sequence is well-structured.

The script correctly backs up the existing member directory before removal, and creates a fresh data directory for the restore. The sequence prevents accidental data loss.


167-182: LGTM! Pacemaker integration for cluster bootstrap.

The use of crm_attribute for force_new_cluster and subsequent pcs resource enable correctly implements the Pacemaker-based etcd restart flow. The decision to continue to print_restore_completion_message even on timeout (per past review discussion) provides operators with guidance regardless of wait outcome.


51-68: The hardcoded check for exactly 2 etcd instances may not align with single-node initialization and lacks documentation.

Line 58 checks [ "$(echo "$output" | wc -l)" -eq 2 ], but the script initializes with a single-node cluster (--initial-cluster="$NODENAME=..."). This mismatch is unexplained. Either document why exactly 2 lines are expected (e.g., "TNF always has 2 nodes"), or make this check more robust by verifying >= 1 if the goal is confirming at least one instance started.

pkg/operator/scriptcontroller/scriptcontroller.go (2)

28-32: LGTM! Clean interface design.

The EnvVarGetter interface properly abstracts the environment variable functionality needed by the controller, following Go interface conventions.


129-155: LGTM! Topology-aware script selection is well-implemented.

The logic correctly:

  1. Detects TNF topology via IsExternalEtcdCluster
  2. Handles the error case appropriately
  3. Selects the correct script variants based on topology
  4. Deploys them under standard names (cluster-restore.sh, disable-etcd.sh)

This allows consumers to use consistent script names regardless of the underlying topology.

@clobrano clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from 4873375 to 5804bac Compare December 18, 2025 08:51
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
bindata/etcd/cluster-restore-tnf.sh (1)

19-27: Consider consolidating duplicate helper functions.

The source_required_dependency function is duplicated across cluster-restore-tnf.sh, disable-etcd-tnf.sh (lines 17-25), and cluster-restore.sh (lines 20-28). While this duplication ensures self-contained scripts, consider moving such common helpers to etcd-common-tools if these functions evolve or grow in complexity.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4873375 and 5804bac.

📒 Files selected for processing (7)
  • bindata/etcd/cluster-restore-tnf.sh (1 hunks)
  • bindata/etcd/cluster-restore.sh (1 hunks)
  • bindata/etcd/disable-etcd-tnf.sh (1 hunks)
  • bindata/etcd/etcd-common-tools (1 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller.go (3 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller_test.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • bindata/etcd/etcd-common-tools
  • pkg/operator/scriptcontroller/scriptcontroller_test.go
  • bindata/etcd/cluster-restore.sh
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • bindata/etcd/disable-etcd-tnf.sh
  • pkg/operator/scriptcontroller/scriptcontroller.go
  • bindata/etcd/cluster-restore-tnf.sh
  • pkg/operator/starter.go
🧬 Code graph analysis (2)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
  • Enqueueable (65-67)
pkg/operator/ceohelpers/external_etcd_status.go (1)
  • IsExternalEtcdCluster (23-35)
bindata/assets.go (1)
  • MustAsset (15-21)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
  • source_required_dependency (21-29)
  • usage (34-40)
bindata/etcd/disable-etcd-tnf.sh (1)
  • source_required_dependency (17-25)
🔇 Additional comments (10)
bindata/etcd/cluster-restore-tnf.sh (4)

43-49: LGTM!

The variable indirection pattern with ${!NODE_IP_VAR} correctly retrieves the IP address from dynamically constructed environment variable names.


51-68: LGTM!

The polling logic with a 5-minute timeout and XML parsing to detect etcd startup is appropriate. The check for exactly 2 lines on line 58 appears to be a topology-specific validation.


94-102: LGTM!

The resource disable flow with pcs resource disable followed by wait_for_podman_etcd_to_stop (from etcd-common-tools) properly ensures a clean shutdown before proceeding with the restore.


178-182: LGTM!

The approach of printing the completion message even on timeout (lines 178-180) is intentional per previous discussion. This ensures operators always receive verification instructions, which is appropriate since timeout doesn't necessarily indicate restore failure.

pkg/operator/starter.go (1)

471-478: LGTM!

The addition of the infrastructure lister at line 475 correctly wires topology information into the ScriptController, enabling topology-aware script deployment. This follows the established pattern of passing listers to controllers.

bindata/etcd/disable-etcd-tnf.sh (2)

1-15: LGTM!

The script setup with strict shell options (errexit, pipefail, errtrace) and root check follows best practices for operational scripts.


27-34: LGTM!

The script correctly sources required dependencies and calls wait_for_podman_etcd_to_stop from etcd-common-tools with appropriate error handling. The past review concern about the undefined function has been resolved.

pkg/operator/scriptcontroller/scriptcontroller.go (3)

28-32: LGTM!

The EnvVarGetter interface provides a clean abstraction over environment variable access and listener management, improving testability and decoupling from concrete implementations.


34-40: LGTM!

The refactoring to use InfraLister and EnvVarGetter interface (lines 37-38, 46-47, 53-54) properly enables topology detection while maintaining clean separation of concerns. The initialization is correct and follows established patterns.

Also applies to: 42-55


129-154: LGTM!

The topology-aware script selection logic is well-implemented:

  • Line 132: Proper error handling for topology detection failure
  • Lines 135-140: Clear conditional selection of TNF vs standard scripts
  • Lines 142-150: Common scripts (quorum-restore.sh, cluster-backup.sh, etcd-common-tools) deployed regardless of topology
  • Lines 153-154: Topology-specific scripts deployed under standardized names (cluster-restore.sh, disable-etcd.sh), ensuring consistent paths for consuming code

This approach maintains backward compatibility while supporting TNF deployments.

Comment thread bindata/etcd/cluster-restore-tnf.sh Outdated
@clobrano clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from 5804bac to 62a5b4c Compare December 18, 2025 10:30
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
bindata/etcd/disable-etcd-tnf.sh (1)

17-25: Consider extracting shared helper to reduce duplication.

The source_required_dependency function is duplicated across cluster-restore.sh, cluster-restore-tnf.sh, and this file. While consistency is valuable, this duplication could lead to maintenance issues if the error handling logic needs to change.

🔎 Optional: Move to etcd-common-tools

Consider moving this helper to etcd-common-tools so it can be shared across all scripts. However, this creates a chicken-and-egg problem (you need the helper to source the file that contains it), so the current duplication may be acceptable.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 5804bac and 62a5b4c.

📒 Files selected for processing (7)
  • bindata/etcd/cluster-restore-tnf.sh (1 hunks)
  • bindata/etcd/cluster-restore.sh (1 hunks)
  • bindata/etcd/disable-etcd-tnf.sh (1 hunks)
  • bindata/etcd/etcd-common-tools (1 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller.go (3 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller_test.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • bindata/etcd/etcd-common-tools
  • pkg/operator/scriptcontroller/scriptcontroller_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • pkg/operator/starter.go
  • bindata/etcd/cluster-restore.sh
  • pkg/operator/scriptcontroller/scriptcontroller.go
  • bindata/etcd/disable-etcd-tnf.sh
  • bindata/etcd/cluster-restore-tnf.sh
🧬 Code graph analysis (2)
bindata/etcd/disable-etcd-tnf.sh (2)
bindata/etcd/cluster-restore-tnf.sh (1)
  • source_required_dependency (19-27)
bindata/etcd/cluster-restore.sh (1)
  • source_required_dependency (21-29)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
  • source_required_dependency (21-29)
  • usage (34-40)
bindata/etcd/disable-etcd-tnf.sh (1)
  • source_required_dependency (17-25)
🔇 Additional comments (10)
bindata/etcd/cluster-restore.sh (1)

143-143: LGTM!

The addition of print_restore_completion_message at the end of the restore flow appropriately provides operators with post-restore guidance and verification instructions. The function is properly sourced from etcd-common-tools (line 32).

pkg/operator/starter.go (1)

471-478: LGTM!

Threading the Infrastructure lister into ScriptController initialization properly enables topology-aware script deployment. The lister is sourced from the already-initialized configInformers and aligns with the pattern used by other controllers in this file.

bindata/etcd/disable-etcd-tnf.sh (1)

30-34: LGTM!

The script correctly calls wait_for_podman_etcd_to_stop (from etcd-common-tools) and handles failures appropriately with a clear error message and non-zero exit code.

bindata/etcd/cluster-restore-tnf.sh (4)

70-110: LGTM!

The setup_pacemaker_restore function properly orchestrates the TNF-specific restore preparation:

  • Validates required environment (hostname, IP address)
  • Constructs appropriate etcdctl restore flags for single-member bootstrap
  • Disables the Pacemaker etcd resource and waits for clean shutdown
  • Safely moves configuration files with proper error handling

The array initialization at line 86 uses assignment (=), not append, so no pre-initialization is needed.


142-164: LGTM!

The backup and restore logic properly:

  • Backs up existing data-dir with cleanup of old backups
  • Completely removes and recreates data-dir for a clean restore
  • Uses the appropriate etcd client binary (etcdctl or etcdutl)
  • Includes a clear comment explaining why revision bumping is skipped

The error handling and success messaging provide good operator feedback.


166-181: LGTM!

The restart and completion logic appropriately:

  • Sets the force_new_cluster attribute for single-member bootstrap
  • Re-enables the Pacemaker etcd resource
  • Shows completion message regardless of wait timeout to provide verification instructions

The intentional non-exit on wait_for_podman_etcd_start failure (line 177-179) was confirmed in past reviews as correct behavior for providing operators with verification guidance even on timeout.


51-68: No action required. The 2-line check is appropriate for TNF's constrained topology.

TNF clusters are validated to contain exactly 2 nodes by the configuration layer (pkg/tnf/pkg/config/cluster.go enforces len(nodes.Items) == 2). The pacemaker cluster setup is also hardcoded for 2 nodes. The check for exactly 2 lines in the pcs status xml output correctly verifies that both nodes have their etcd resource started. This is not a fragile assumption but an appropriate implementation for a topology that is constrained by design, not convention.

pkg/operator/scriptcontroller/scriptcontroller.go (3)

28-32: LGTM!

The EnvVarGetter interface properly abstracts environment variable access and listener management, enabling clean dependency injection and testability. The interface methods align with their usage at lines 56 and 119.


42-56: LGTM!

The constructor signature and initialization properly incorporate the infrastructure lister and environment variable getter. The listener registration (line 56) ensures the controller receives updates when environment variables change.


129-154: LGTM!

The topology-aware script deployment logic properly:

  • Detects cluster topology using IsExternalEtcdCluster with appropriate error handling
  • Selects TNF-specific scripts (cluster-restore-tnf.sh, disable-etcd-tnf.sh) for external etcd clusters
  • Falls back to standard scripts for traditional deployments
  • Deploys common scripts (quorum-restore, cluster-backup, etcd-common-tools) for both topologies
  • Maps topology-specific scripts to standard names (cluster-restore.sh, disable-etcd.sh) so consumers don't need topology awareness

This design cleanly separates topology concerns from script consumers.

@clobrano
Copy link
Copy Markdown
Contributor Author

/retest-required

@clobrano
Copy link
Copy Markdown
Contributor Author

clobrano commented Jan 9, 2026

There are some typos, found by CodeRabbit. I'll fix them next week

@clobrano clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from 6b1d510 to 207ce68 Compare January 12, 2026 09:35
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 12, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @bindata/etcd/cluster-restore-tnf.sh:
- Around line 51-68: The wait_for_podman_etcd_start function currently requires
exactly two "Started" lines which fails for single-node restores; update the
check in wait_for_podman_etcd_start (the branch that inspects output from pcs
status xml | grep 'id="etcd".*role="Started"') to accept at least one started
instance (e.g., test that the line count is >= 1) instead of == 2 so the
function succeeds when only the local node has etcd enabled during restore.
- Around line 130-133: ETCD_CLIENT is set using the parameter expansion
`${ETCD_ETCDCTL_BIN+etcdctl}`, which is inconsistent with the simpler
direct-assignment pattern used by the check_snapshot_status function; change the
assignment to use a direct default value (e.g., set ETCD_CLIENT="etcdctl" first)
and then override it when ETCD_ETCDUTL_BIN is present (keep the existing if [ -n
"${ETCD_ETCDUTL_BIN}" ] branch), referencing the ETCD_CLIENT variable and the
ETCD_ETCDCTL_BIN / ETCD_ETCDUTL_BIN symbols so the logic matches
check_snapshot_status.

In @bindata/etcd/disable-etcd-tnf.sh:
- Around line 30-34: The script calls wait_for_podman_etcd_to_stop without first
disabling the Pacemaker resource; insert a pcs resource disable etcd call before
invoking wait_for_podman_etcd_to_stop so the etcd resource is instructed to stop
(match the pattern used in cluster-restore-tnf.sh), then proceed to echo and
exit on failure as currently done; update the block that contains
wait_for_podman_etcd_to_stop to run pcs resource disable etcd first.
🧹 Nitpick comments (2)
bindata/etcd/cluster-restore-tnf.sh (1)

118-124: Minor: Improve error message when snapshot not found.

If the ls command finds no matching files, SNAPSHOT_FILE will be empty, and the error message at line 122 will show an empty path, making it harder for operators to debug.

♻️ Suggested improvement
 # shellcheck disable=SC2012
 SNAPSHOT_FILE=$(ls -vd "${BACKUP_DIR}"/snapshot*.db | tail -1) || true
 
 if [ ! -f "${SNAPSHOT_FILE}" ]; then
-  echo "etcd snapshot ${SNAPSHOT_FILE} does not exist"
+  echo "etcd snapshot not found in ${BACKUP_DIR} (expected snapshot*.db)"
   exit 1
 fi
pkg/operator/scriptcontroller/scriptcontroller_test.go (1)

119-131: Consider adding a test for topology detection failure.

The current tests cover missing env vars but not the case where ceohelpers.IsExternalEtcdCluster returns an error (e.g., infrastructure resource not found). This would improve test coverage for the error path at lines 132-133 in scriptcontroller.go.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 6b1d510 and 207ce68.

📒 Files selected for processing (7)
  • bindata/etcd/cluster-restore-tnf.sh
  • bindata/etcd/cluster-restore.sh
  • bindata/etcd/disable-etcd-tnf.sh
  • bindata/etcd/etcd-common-tools
  • pkg/operator/scriptcontroller/scriptcontroller.go
  • pkg/operator/scriptcontroller/scriptcontroller_test.go
  • pkg/operator/starter.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • bindata/etcd/cluster-restore.sh
  • bindata/etcd/etcd-common-tools
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • bindata/etcd/cluster-restore-tnf.sh
  • pkg/operator/scriptcontroller/scriptcontroller.go
  • pkg/operator/starter.go
  • bindata/etcd/disable-etcd-tnf.sh
  • pkg/operator/scriptcontroller/scriptcontroller_test.go
🧬 Code graph analysis (4)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
  • source_required_dependency (21-29)
  • usage (34-40)
bindata/etcd/disable-etcd-tnf.sh (1)
  • source_required_dependency (17-25)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
  • Enqueueable (65-67)
pkg/operator/ceohelpers/external_etcd_status.go (1)
  • IsExternalEtcdCluster (23-35)
bindata/assets.go (1)
  • MustAsset (15-21)
bindata/etcd/disable-etcd-tnf.sh (2)
bindata/etcd/cluster-restore-tnf.sh (1)
  • source_required_dependency (19-27)
bindata/etcd/cluster-restore.sh (1)
  • source_required_dependency (21-29)
pkg/operator/scriptcontroller/scriptcontroller_test.go (5)
pkg/operator/scriptcontroller/scriptcontroller.go (1)
  • ScriptController (34-40)
pkg/testutils/testutils.go (1)
  • FakeInfrastructureLister (563-571)
pkg/etcdenvvar/fake_envvar.go (1)
  • FakeEnvVar (3-6)
pkg/operator/operatorclient/interfaces.go (1)
  • TargetNamespace (7-7)
pkg/version/version.go (1)
  • Get (29-41)
🔇 Additional comments (4)
pkg/operator/starter.go (1)

471-478: LGTM!

The infrastructure lister is correctly passed to enable topology-aware script deployment. The lister is already instantiated and used by other controllers in this file, so this addition is consistent with the existing patterns.

pkg/operator/scriptcontroller/scriptcontroller_test.go (1)

37-117: LGTM!

Well-structured table-driven tests covering the three topology modes. The content assertions verify that topology-specific scripts are correctly selected based on infrastructure mode.

pkg/operator/scriptcontroller/scriptcontroller.go (2)

28-40: LGTM!

Clean interface extraction for EnvVarGetter that improves testability. The struct fields are properly typed and the interface methods align with the actual usage.


129-154: LGTM!

Topology-aware script deployment is cleanly implemented. TNF clusters get Pacemaker-specific scripts while standard topologies get the original scripts, both deployed under consistent names (cluster-restore.sh, disable-etcd.sh).

Comment thread bindata/etcd/cluster-restore-tnf.sh
Comment on lines +130 to +133
ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}"
if [ -n "${ETCD_ETCDUTL_BIN}" ]; then
ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"
fi
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's locate the file and examine its context
fd "cluster-restore" --type f

# Then search for ETCD_CLIENT assignments to understand the pattern
rg -n "ETCD_CLIENT=" --type sh -A 2 -B 2

# Also search for ETCD_ETCDCTL_BIN usage
rg -n "ETCD_ETCDCTL_BIN" --type sh -C 2

Repository: openshift/cluster-etcd-operator

Length of output: 1688


🏁 Script executed:

#!/bin/bash
# Now let's understand how ETCD_CLIENT is used and where these variables are set
echo "=== How ETCD_CLIENT is used ==="
rg -n "ETCD_CLIENT" --type sh -A 1 -B 1

echo "=== Where ETCD_ETCDCTL_BIN is set ==="
rg -n "ETCD_ETCDCTL_BIN=" --type sh -A 1 -B 1

echo "=== Where ETCD_ETCDUTL_BIN is set ==="
rg -n "ETCD_ETCDUTL_BIN=" --type sh -A 1 -B 1

Repository: openshift/cluster-etcd-operator

Length of output: 1560


🏁 Script executed:

#!/bin/bash
# Let's look at the entire files to understand context and see where these variables come from
echo "=== First 50 lines of cluster-restore.sh ==="
head -50 bindata/etcd/cluster-restore.sh

echo ""
echo "=== First 50 lines of cluster-restore-tnf.sh ==="
head -50 bindata/etcd/cluster-restore-tnf.sh

echo ""
echo "=== Search for where ETCD_ETCDCTL_BIN and ETCD_ETCDUTL_BIN might be exported or defined ==="
rg -n "export.*ETCD_ETCD" --type sh
rg -n "ETCD_ETCD.*BIN" --type sh | head -20

Repository: openshift/cluster-etcd-operator

Length of output: 4040


🏁 Script executed:

#!/bin/bash
# Let's find and examine the etcd-common-tools file that defines these variables
fd "etcd-common-tools" --type f

# Also look for any files in the repo that define these binaries
rg -n "ETCD_ETCDCTL_BIN" --type sh -B 2 -A 2
rg -n "ETCD_ETCDUTL_BIN" --type sh -B 2 -A 2

Repository: openshift/cluster-etcd-operator

Length of output: 1494


🏁 Script executed:

#!/bin/bash
# Let's examine the etcd-common-tools file to understand where these variables are defined
cat -n bindata/etcd/etcd-common-tools | head -200

Repository: openshift/cluster-etcd-operator

Length of output: 5877


🏁 Script executed:

#!/bin/bash
# Let's examine the full context around where these are used in cluster-restore.sh
echo "=== Lines 75-90 in cluster-restore.sh (around ETCD_CLIENT assignment) ==="
sed -n '75,90p' bindata/etcd/cluster-restore.sh

echo ""
echo "=== Lines around ETCD_CLIENT usage in cluster-restore.sh ==="
sed -n '125,140p' bindata/etcd/cluster-restore.sh

echo ""
echo "=== Lines around ETCD_CLIENT usage in cluster-restore-tnf.sh ==="
sed -n '155,170p' bindata/etcd/cluster-restore-tnf.sh

Repository: openshift/cluster-etcd-operator

Length of output: 2403


🏁 Script executed:

#!/bin/bash
# Let's verify the actual bash behavior by testing the parameter expansion
bash <<'EOF'
# Test the actual behavior of the parameter expansion
ETCD_ETCDCTL_BIN="etcdctl"
ETCD_ETCDUTL_BIN="etcdutl"

echo "Testing parameter expansion syntax:"
echo "ETCD_ETCDCTL_BIN=${ETCD_ETCDCTL_BIN}"
echo "ETCD_ETCDUTL_BIN=${ETCD_ETCDUTL_BIN}"

# Test 1: The syntax used in cluster-restore.sh
ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}"
echo "Result with \${ETCD_ETCDCTL_BIN+etcdctl}: ETCD_CLIENT='${ETCD_CLIENT}'"

# Test 2: Direct assignment (used in check_snapshot_status)
ETCD_CLIENT="${ETCD_ETCDCTL_BIN}"
echo "Result with \${ETCD_ETCDCTL_BIN}: ETCD_CLIENT='${ETCD_CLIENT}'"

# Test 3: If ETCD_ETCDUTL_BIN is set
if [ -n "${ETCD_ETCDUTL_BIN}" ]; then
  ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"
fi
echo "After override with ETCD_ETCDUTL_BIN: ETCD_CLIENT='${ETCD_CLIENT}'"
EOF

Repository: openshift/cluster-etcd-operator

Length of output: 338


Align variable assignment with check_snapshot_status pattern above.

The parameter expansion ${ETCD_ETCDCTL_BIN+etcdctl} is inconsistent with the direct assignment used in the check_snapshot_status function (lines 57-59 of etcd-common-tools). Both produce the same result since these variables are hardcoded to literal strings, but using direct assignment would be clearer:

-ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}"
+ETCD_CLIENT="${ETCD_ETCDCTL_BIN}"
 if [ -n "${ETCD_ETCDUTL_BIN}" ]; then
   ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"
 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}"
if [ -n "${ETCD_ETCDUTL_BIN}" ]; then
ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"
fi
ETCD_CLIENT="${ETCD_ETCDCTL_BIN}"
if [ -n "${ETCD_ETCDUTL_BIN}" ]; then
ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"
fi
🤖 Prompt for AI Agents
In @bindata/etcd/cluster-restore-tnf.sh around lines 130 - 133, ETCD_CLIENT is
set using the parameter expansion `${ETCD_ETCDCTL_BIN+etcdctl}`, which is
inconsistent with the simpler direct-assignment pattern used by the
check_snapshot_status function; change the assignment to use a direct default
value (e.g., set ETCD_CLIENT="etcdctl" first) and then override it when
ETCD_ETCDUTL_BIN is present (keep the existing if [ -n "${ETCD_ETCDUTL_BIN}" ]
branch), referencing the ETCD_CLIENT variable and the ETCD_ETCDCTL_BIN /
ETCD_ETCDUTL_BIN symbols so the logic matches check_snapshot_status.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To maintain consistency with the original script, this functionality was carried over to the TNF version

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clobrano Understood. Maintaining consistency with the original cluster-restore.sh implementation is a valid approach for the TNF version. Thank you for the clarification.

Comment thread bindata/etcd/disable-etcd-tnf.sh
Comment thread bindata/etcd/cluster-restore-tnf.sh
Add dedicated etcd management scripts for Two Node with Fencing (TNF)
deployments that use Pacemaker-managed Podman containers instead of
Kubernetes static pods.

Changes:
- Add cluster-restore-tnf.sh with Pacemaker resource management
  - setup_pacemaker_restore() stops etcd resource and prepares restore flags
  - wait_for_podman_etcd_start() verifies etcd resource starts via pcs status
  - get_etcd_advertise_ip() detects advertise IP from etcd.env
  - Uses crm_attribute to set force_new_cluster on restart
- Add disable-etcd-tnf.sh for disabling Pacemaker-managed etcd
  - Uses 'pcs resource disable etcd' instead of moving static pod manifests
- Extend etcd-common-tools with shared TNF utilities
  - wait_for_podman_etcd_to_stop() waits for crm_resource to confirm stop
  - print_restore_completion_message() outputs rollout monitoring instructions

These scripts will be conditionally deployed by the cluster-etcd-operator
based on control plane topology detection.
Implement conditional script deployment in the script controller to
automatically deploy TNF-specific or standard etcd scripts based on the
detected control plane topology.

This maintains backward compatibility while enabling automatic support
for TNF deployments without manual script selection.
@clobrano clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from 207ce68 to 1664b42 Compare January 12, 2026 14:59
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @bindata/etcd/cluster-restore-tnf.sh:
- Around line 51-68: The wait_for_podman_etcd_start() function currently assumes
two started etcd resources by checking wc -l -eq 2, which fails for single-node
restores initialized in setup_pacemaker_restore(); update the logic to accept
one or more started etcd instances (e.g., check that the line count is >= 1) or
make the expected count configurable based on the restore scenario, so the
function returns success for a single-node start during restore while still
validating multi-node cases when appropriate.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 207ce68 and 1664b42.

📒 Files selected for processing (7)
  • bindata/etcd/cluster-restore-tnf.sh
  • bindata/etcd/cluster-restore.sh
  • bindata/etcd/disable-etcd-tnf.sh
  • bindata/etcd/etcd-common-tools
  • pkg/operator/scriptcontroller/scriptcontroller.go
  • pkg/operator/scriptcontroller/scriptcontroller_test.go
  • pkg/operator/starter.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • bindata/etcd/disable-etcd-tnf.sh
  • bindata/etcd/cluster-restore.sh
  • bindata/etcd/etcd-common-tools
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • pkg/operator/scriptcontroller/scriptcontroller_test.go
  • pkg/operator/scriptcontroller/scriptcontroller.go
  • pkg/operator/starter.go
  • bindata/etcd/cluster-restore-tnf.sh
🧬 Code graph analysis (2)
pkg/operator/scriptcontroller/scriptcontroller_test.go (4)
pkg/operator/scriptcontroller/scriptcontroller.go (1)
  • ScriptController (34-40)
pkg/testutils/testutils.go (1)
  • FakeInfrastructureLister (563-571)
pkg/etcdenvvar/fake_envvar.go (1)
  • FakeEnvVar (3-6)
pkg/operator/operatorclient/interfaces.go (1)
  • TargetNamespace (7-7)
bindata/etcd/cluster-restore-tnf.sh (1)
bindata/etcd/cluster-restore.sh (1)
  • usage (34-40)
🔇 Additional comments (12)
pkg/operator/starter.go (1)

471-478: LGTM - Infrastructure lister wiring follows existing patterns.

The infrastructure lister is already available from configInformers and is used consistently by other controllers in this file. This change correctly threads topology awareness to enable TNF-specific script deployment.

pkg/operator/scriptcontroller/scriptcontroller_test.go (3)

20-35: LGTM - Test helper is well-structured.

The helper correctly sets up the minimal dependencies needed to test manageScriptConfigMap, using the existing FakeInfrastructureLister utility for topology simulation.


37-117: LGTM - Test coverage for topology-aware script deployment is comprehensive.

The test cases effectively validate:

  • TNF topology deploys Pacemaker-specific scripts (pcs resource enable/disable etcd)
  • Standard topologies deploy static-pod scripts (wait_for_containers_to_stop, mv_static_pods)
  • Common scripts are always present regardless of topology

The negative assertions (NotContains) provide good protection against script content leaking across topology boundaries.


119-131: LGTM - Error handling test validates graceful failure.

Good coverage of the error path when required environment variables are missing.

bindata/etcd/cluster-restore-tnf.sh (5)

1-17: LGTM - Script setup follows best practices.

Appropriate use of errexit, pipefail, errtrace for robust error handling, and root check is necessary for Pacemaker and etcd operations.


19-31: LGTM - Dependency sourcing with proper validation.


32-49: LGTM - Usage and IP detection are appropriate for TNF context.

The usage correctly omits the static pod resources reference (per standard script) since TNF uses Pacemaker. The IP detection properly leverages the NODE_<nodename>_IP pattern from etcd.env.


70-110: LGTM - Pacemaker restore setup is well-structured.

The function properly:

  • Validates required variables (NODENAME, IP) with helpful error messages
  • Disables etcd resource before manipulation
  • Waits for clean stop before proceeding
  • Handles missing config files gracefully

The || echo "Warning: ..." pattern at line 108 is acceptable since these are non-critical config files being moved to backup.


160-183: LGTM - Restore execution and completion handling.

The restore flow correctly:

  • Exits on snapshot restore failure (critical)
  • Sets force_new_cluster attribute for single-member bootstrap
  • Treats wait timeout as non-fatal with helpful messaging (allows admin to troubleshoot)
  • Always prints completion message for admin guidance
pkg/operator/scriptcontroller/scriptcontroller.go (3)

28-32: LGTM - Clean interface extraction for testability.

The EnvVarGetter interface properly abstracts the environment variable controller, enabling mock implementations in tests while preserving the listener registration pattern for re-queuing on changes.


34-55: LGTM - Controller struct and constructor updates.

The changes properly introduce infrastructure awareness while maintaining backward compatibility with existing controller patterns.


129-155: LGTM - Topology-aware script selection is well-implemented.

The logic correctly:

  • Detects TNF topology via ceohelpers.IsExternalEtcdCluster
  • Selects appropriate script assets based on topology
  • Deploys scripts with consistent names (cluster-restore.sh, disable-etcd.sh) regardless of source
  • Handles topology detection errors explicitly

Comment thread bindata/etcd/cluster-restore-tnf.sh
@clobrano
Copy link
Copy Markdown
Contributor Author

/retest-required

@tjungblu
Copy link
Copy Markdown
Contributor

/lgtm

/override ci/prow/e2e-aws-ovn-serial-2of2

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jan 13, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jan 13, 2026

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-serial-2of2

Details

In response to this:

/lgtm

/override ci/prow/e2e-aws-ovn-serial-2of2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jan 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jan 13, 2026

@clobrano: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@clobrano clobrano changed the title [TNF] support restore for pacemaker-managed etcd OCPBUGS-60588: [TNF] support restore for pacemaker-managed etcd Jan 13, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 13, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@clobrano: This pull request references Jira Issue OCPBUGS-60588, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Extend cluster-restore.sh to support restoring etcd data in Two Node with Fencing clusters, where etcd is controlled by a Pacemaker resource-agent and runs as a Podman container instead of a Kubernetes static pod.

The script preserves existing static pod restore behavior when Pacemaker is not detected.

Restore process for Pacemaker-managed etcd:

  1. Auto-detect Pacemaker management or use PACEMAKER_MANAGED_ETCD env var [new]
  2. Disable Pacemaker podman-etcd resource agent to stop container [new - Pacemaker alternative to static pod stop]
  3. Backup existing etcd data directory [existing]
  4. Move podman-etcd configuration files to backup location [new]
  5. Restore snapshot using etcdctl with cluster initialization flags [existing, with new Pacemaker-specific flags added]
  6. Set force_new_cluster attribute for single-member bootstrap [new]
  7. Re-enable Pacemaker resource to start etcd with restored data [new - Pacemaker alternative to static pod restart]

New environment variables:

  • PACEMAKER_MANAGED_ETCD: Force Pacemaker restore mode
  • ETCD_CONTAINER_NAME: Podman container name (default: "etcd")
  • ETCD_ADVERTISE_IP: Override advertise IP (auto-detected from etcd.env)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@clobrano
Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 13, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@clobrano: This pull request references Jira Issue OCPBUGS-60588, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@clobrano
Copy link
Copy Markdown
Contributor Author

/verified by @clobrano

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jan 13, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@clobrano: This PR has been marked as verified by @clobrano.

Details

In response to this:

/verified by @clobrano

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot openshift-merge-bot Bot merged commit d93728d into openshift:main Jan 13, 2026
15 checks passed
@openshift-ci-robot
Copy link
Copy Markdown

@clobrano: Jira Issue Verification Checks: Jira Issue OCPBUGS-60588
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-60588 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

Extend cluster-restore.sh to support restoring etcd data in Two Node with Fencing clusters, where etcd is controlled by a Pacemaker resource-agent and runs as a Podman container instead of a Kubernetes static pod.

The script preserves existing static pod restore behavior when Pacemaker is not detected.

Restore process for Pacemaker-managed etcd:

  1. Auto-detect Pacemaker management or use PACEMAKER_MANAGED_ETCD env var [new]
  2. Disable Pacemaker podman-etcd resource agent to stop container [new - Pacemaker alternative to static pod stop]
  3. Backup existing etcd data directory [existing]
  4. Move podman-etcd configuration files to backup location [new]
  5. Restore snapshot using etcdctl with cluster initialization flags [existing, with new Pacemaker-specific flags added]
  6. Set force_new_cluster attribute for single-member bootstrap [new]
  7. Re-enable Pacemaker resource to start etcd with restored data [new - Pacemaker alternative to static pod restart]

New environment variables:

  • PACEMAKER_MANAGED_ETCD: Force Pacemaker restore mode
  • ETCD_CONTAINER_NAME: Podman container name (default: "etcd")
  • ETCD_ADVERTISE_IP: Override advertise IP (auto-detected from etcd.env)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@clobrano clobrano deleted the enhancement/tnf-support-for-cluster-restore-script branch January 13, 2026 16:19
@openshift-merge-robot
Copy link
Copy Markdown
Contributor

Fix included in accepted release 4.22.0-0.nightly-2026-01-19-085729

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants