
Commit 83ecb66

Disable restart controller (#751)
* add warning to format-namenode script
* wip
* adapted changelog
* remove pr ref for restart enable
* Update docs/modules/hdfs/pages/reference/troubleshooting.adoc

  Co-authored-by: Sebastian Bernauer <sebastian.bernauer@stackable.de>
* precommit
* added todo for restart controller

---------

Co-authored-by: Sebastian Bernauer <sebastian.bernauer@stackable.de>
1 parent 2952224 commit 83ecb66


6 files changed: +45 -18 lines changed

CHANGELOG.md

Lines changed: 2 additions & 2 deletions
@@ -9,20 +9,20 @@ All notable changes to this project will be documented in this file.
 - Add conversion webhook ([#753]).
 - Support objectOverrides using `.spec.objectOverrides`.
   See [objectOverrides concepts page](https://docs.stackable.tech/home/nightly/concepts/overrides/#object-overrides) for details ([#741]).
-- Enable the [restart-controller](https://docs.stackable.tech/home/nightly/commons-operator/restarter/), so that the Pods are automatically restarted on config changes ([#743]).
 
 ### Changed
 
 - Gracefully shutdown all concurrent tasks by forwarding the SIGTERM signal ([#747]).
+- Added warning and exit condition to format-namenodes container script to check for corrupted data after formatting ([#751]).
 
 ### Fixed
 
 - Previously, some shell output of init-containers was not logged properly and therefore not aggregated, which is fixed now ([#746]).
 
 [#741]: https://github.com/stackabletech/hdfs-operator/pull/741
-[#743]: https://github.com/stackabletech/hdfs-operator/pull/743
 [#746]: https://github.com/stackabletech/hdfs-operator/pull/746
 [#747]: https://github.com/stackabletech/hdfs-operator/pull/747
+[#751]: https://github.com/stackabletech/hdfs-operator/pull/751
 [#753]: https://github.com/stackabletech/hdfs-operator/pull/753
 
 ## [25.11.0] - 2025-11-07

docs/modules/hdfs/pages/reference/troubleshooting.adoc

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+= Troubleshooting
+
+[#init-container-format-namenode-fails]
+== Init container format-namenodes fails
+
+When creating fresh HDFS clusters, unexpected pod restarts might corrupt the initial namenode formatting.
+This leaves the namenode data PVC in a dangling state, where e.g. the `../current/VERSION` file is created, but `../current/fsimage_xxx` files are missing.
+
+After a restart corrupted the namenode formatting, reformatting again fails due to directories and files existing.
+We do not want to force (override) the formatting process to avoid data loss and other implications.
+
+[source]
+----
+Running in non-interactive mode, and data appears to exist in Storage Directory root= /stackable/data/namenode; location= null. Not formatting.
+----
+
+Another error message indicating a corrupt formatting state appears in the namenode main container during startup.
+
+[source]
+----
+java.io.FileNotFoundException: No valid image files found
+----
+
+WARNING: The following fix should only be applied to fresh clusters. For existing clusters please consider support.
+
+1. Remove the PVC called `data-<cluster-name>-namenode-<rolegroup>-0` for a failed namenode 0.
+2. Restart the namenode afterwards.
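
The two manual steps in the troubleshooting page above could be scripted roughly as follows. This is a minimal sketch and not part of the commit: the function name `print_namenode_recovery_cmds` and its arguments are hypothetical, the Pod name is assumed to follow the usual StatefulSet `<statefulset-name>-<ordinal>` convention, and "restart" is assumed to mean deleting the Pod so that the StatefulSet recreates it. The commands are only printed, so they can be reviewed before being run against a fresh cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not execute) the kubectl commands for the
# two recovery steps. Only intended for fresh clusters, as the WARNING states.
print_namenode_recovery_cmds() {
  local cluster="$1"
  local rolegroup="$2"
  local namespace="${3:-default}"

  # Assumption: Pod name is <cluster-name>-namenode-<rolegroup>-0 and the PVC
  # is named data-<pod-name>, matching the PVC name given in the docs.
  local pod="${cluster}-namenode-${rolegroup}-0"
  local pvc="data-${pod}"

  # Step 1: remove the PVC of the failed namenode 0.
  # --wait=false returns immediately; a PVC that is still mounted is only
  # removed once the Pod using it is gone.
  echo "kubectl --namespace ${namespace} delete pvc ${pvc} --wait=false"

  # Step 2: restart the namenode by deleting its Pod.
  echo "kubectl --namespace ${namespace} delete pod ${pod}"
}

# Example: cluster "hdfs", role group "default", namespace "default".
print_namenode_recovery_cmds hdfs default
```

Printing instead of executing is a deliberate choice here, since deleting the wrong PVC on an existing cluster would destroy namenode metadata.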

docs/modules/hdfs/partials/nav.adoc

Lines changed: 1 addition & 0 deletions
@@ -23,3 +23,4 @@
 ** xref:hdfs:reference/discovery.adoc[]
 ** xref:hdfs:reference/commandline-parameters.adoc[]
 ** xref:hdfs:reference/environment-variables.adoc[]
+* xref:hdfs:reference/troubleshooting.adoc[]

rust/operator-binary/src/container.rs

Lines changed: 10 additions & 0 deletions
@@ -718,6 +718,16 @@ impl ContainerConfig {
     exclude_from_capture {hadoop_home}/bin/hdfs namenode -bootstrapStandby -nonInteractive
   fi
 else
+  # Sanity check for initial format data corruption: VERSION file exists but no fsimage files were created.
+  FSIMAGE_COUNT=$(find "{NAMENODE_ROOT_DATA_DIR}/current" -maxdepth 1 -regextype posix-egrep -regex ".*/fsimage_[0-9]+" | wc -l)
+
+  if [ "${{FSIMAGE_COUNT}}" -eq 0 ]
+  then
+    echo "WARNING: {NAMENODE_ROOT_DATA_DIR}/current/VERSION file exists but no fsimage files were found."
+    echo "This indicates an incomplete and corrupted namenode formatting. Please check the troubleshooting guide."
+    exit 1
+  fi
+
   cat "{NAMENODE_ROOT_DATA_DIR}/current/VERSION"
   echo "Pod $POD_NAME already formatted. Skipping..."
 fi
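
To try the sanity check from this hunk outside the operator, it can be approximated as a standalone function. This is a sketch under assumptions, not the operator's code: the function name `check_namenode_format` is hypothetical, the directory is passed as an argument instead of the `{NAMENODE_ROOT_DATA_DIR}` template placeholder, and the GNU-specific `find -regextype posix-egrep` is swapped for `find -name` piped through `grep -E` so the same pattern also works with non-GNU `find`.

```shell
#!/usr/bin/env bash
# Hypothetical standalone version of the sanity check.
# $1 is the namenode data root, i.e. the directory that contains "current/".
# Returns 1 if VERSION exists but no fsimage_<digits> file is present.
check_namenode_format() {
  local data_dir="$1"

  # Without a VERSION file the namenode has not been formatted at all,
  # which is a different state from a corrupted format.
  if [ ! -f "${data_dir}/current/VERSION" ]; then
    echo "No VERSION file found: namenode is not formatted yet."
    return 0
  fi

  # Count files whose name is exactly fsimage_<digits>. Names with a suffix,
  # such as fsimage_<digits>.md5, do not match the pattern.
  local fsimage_count
  fsimage_count=$(find "${data_dir}/current" -maxdepth 1 -type f -name 'fsimage_*' \
    | grep -cE '/fsimage_[0-9]+$' || true)

  if [ "${fsimage_count}" -eq 0 ]; then
    echo "WARNING: ${data_dir}/current/VERSION file exists but no fsimage files were found."
    return 1
  fi

  echo "Found ${fsimage_count} fsimage file(s): formatting looks complete."
}
```

Unlike the init container script, the function returns a status instead of calling `exit 1`, so it can be sourced and reused without terminating the calling shell.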

rust/operator-binary/src/hdfs_controller.rs

Lines changed: 5 additions & 7 deletions
@@ -22,7 +22,6 @@ use stackable_operator::{
         product_image_selection::{self, ResolvedProductImage},
         rbac::build_rbac_resources,
     },
-    constants::RESTART_CONTROLLER_ENABLED_LABEL,
     iter::reverse_if,
     k8s_openapi::{
         DeepMerge,
@@ -901,13 +900,12 @@ fn rolegroup_statefulset(
         ..StatefulSetSpec::default()
     };
 
-    let sts_metadata = metadata
-        .clone()
-        .with_label(RESTART_CONTROLLER_ENABLED_LABEL.to_owned())
-        .build();
-
+    // TODO: The restart-controller is currently not enabled via the label RESTART_CONTROLLER_ENABLED_LABEL.
+    // This is due to problems that might appear when restarting pods during the initial formatting of namenodes.
+    // See: https://github.com/stackabletech/hdfs-operator/issues/750 (disable restart-controller)
+    // https://github.com/stackabletech/issues/issues/816 (enable restart-controller)
     Ok(StatefulSet {
-        metadata: sts_metadata,
+        metadata: metadata.build(),
         spec: Some(statefulset_spec),
         status: None,
     })

tests/templates/kuttl/smoke/30-assert.yaml.j2

Lines changed: 0 additions & 9 deletions
@@ -7,9 +7,6 @@ apiVersion: apps/v1
 kind: StatefulSet
 metadata:
   name: hdfs-namenode-default
-  generation: 1 # There should be no unneeded Pod restarts
-  labels:
-    restarter.stackable.tech/enabled: "true"
 spec:
   template:
     spec:
@@ -35,9 +32,6 @@ apiVersion: apps/v1
 kind: StatefulSet
 metadata:
   name: hdfs-journalnode-default
-  generation: 1 # There should be no unneeded Pod restarts
-  labels:
-    restarter.stackable.tech/enabled: "true"
 spec:
   template:
     spec:
@@ -62,9 +56,6 @@ apiVersion: apps/v1
 kind: StatefulSet
 metadata:
   name: hdfs-datanode-default
-  generation: 1 # There should be no unneeded Pod restarts
-  labels:
-    restarter.stackable.tech/enabled: "true"
 spec:
   template:
     spec:
