29 changes: 26 additions & 3 deletions README.md
@@ -79,8 +79,6 @@ Testkit-specific env variables:
- `CSI_CEPH_OSD_STORAGE_CLASS` — pre-existing block-mode StorageClass used to
back Rook OSD PVCs. When empty, a `sds-local-volume` Thick SC is
auto-provisioned via `EnsureDefaultStorageClass`.
- `CSI_CEPH_MODULE_PULL_OVERRIDE` — image tag for `csi-ceph`'s
ModulePullOverride (dev registries only, e.g. when testing a PR build).

#### `modulePullOverride` in `cluster_config.yml`

@@ -95,7 +93,32 @@ dkpParameters:
modulePullOverride: pr131
```

`${VAR}` placeholders in `modulePullOverride` are rejected at config load time.
`${VAR}` placeholders **inside** `modulePullOverride` are rejected at config
load time — the static file stays literal and readable.

##### Per-module env override (for CI)

To point the module-under-test at a CI image without editing the committed
YAML, set the per-module env var `<MODULE>_MODULE_PULL_OVERRIDE`, where
`<MODULE>` is the module name upper-cased with every character outside
`[A-Z0-9]` (such as `-` or `.`) replaced by `_`. A non-empty value replaces the
static tag at load time; modules whose variable is unset or empty keep their
YAML value. Examples:

- `csi-ceph` → `CSI_CEPH_MODULE_PULL_OVERRIDE`
- `sds-elastic` → `SDS_ELASTIC_MODULE_PULL_OVERRIDE`

```bash
SDS_ELASTIC_MODULE_PULL_OVERRIDE=pr123 go test ./tests/
```
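
The naming rule above can be sketched in Go roughly as follows. This is a minimal, self-contained illustration rather than the library code: the helper name `envKey` and the sample module name `my.module` are assumptions made for this example, while the suffix and the character mapping follow the rule described in this section.

```go
package main

import (
	"fmt"
	"strings"
)

// suffix is the documented ending of every per-module override variable.
const suffix = "_MODULE_PULL_OVERRIDE"

// envKey is a hypothetical helper: it upper-cases ASCII letters, keeps
// A-Z and 0-9, and maps every other character to '_'.
func envKey(module string) string {
	norm := strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z':
			return r - 'a' + 'A'
		case (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9'):
			return r
		default:
			return '_'
		}
	}, module)
	return norm + suffix
}

func main() {
	// "my.module" is an invented name that exercises the '.' case.
	for _, m := range []string{"csi-ceph", "sds-elastic", "my.module"} {
		fmt.Printf("%s -> %s\n", m, envKey(m))
	}
}
```

Because the mapping is lossy (`a-b` and `a.b` yield the same key), two modules whose names differ only in punctuation would share one variable under this rule.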

Each applied override is logged at load time, naming both the static tag and
the env var that takes precedence, e.g.:

```
modulePullOverride[sds-elastic]: cluster_config.yml pins tag "main", but SDS_ELASTIC_MODULE_PULL_OVERRIDE="pr123" is set — using tag "pr123" for this test run
```

A single global tag (e.g. `MODULE_IMAGE_TAG`) is intentionally **not** used:
configs with several dev modules would be ambiguous. Use one var per module.
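
To illustrate why one variable per module removes the ambiguity, here is a small sketch that resolves tags for a config with several dev modules. The `module` type, `resolveTags`, the injected `getenv` function and the tag `pr456` are all assumptions for this example, and the name normalization is simplified to `-` and `.` only.

```go
package main

import (
	"fmt"
	"strings"
)

// module is a hypothetical, trimmed-down view of one module entry.
type module struct {
	Name string
	Tag  string // static tag from cluster_config.yml
}

// resolveTags returns the effective tag per module. getenv is injected so
// the lookup can be exercised without touching the real process environment.
func resolveTags(mods []module, getenv func(string) string) map[string]string {
	repl := strings.NewReplacer("-", "_", ".", "_")
	out := make(map[string]string, len(mods))
	for _, m := range mods {
		key := strings.ToUpper(repl.Replace(m.Name)) + "_MODULE_PULL_OVERRIDE"
		if v := strings.TrimSpace(getenv(key)); v != "" {
			out[m.Name] = v // the module's own variable wins
			continue
		}
		out[m.Name] = m.Tag // no variable set: keep the YAML tag
	}
	return out
}

func main() {
	env := map[string]string{
		"SDS_ELASTIC_MODULE_PULL_OVERRIDE": "pr123",
		"CSI_CEPH_MODULE_PULL_OVERRIDE":    "pr456",
	}
	mods := []module{
		{"sds-elastic", "main"},
		{"csi-ceph", "main"},
		{"snapshot-controller", "main"},
	}
	tags := resolveTags(mods, func(k string) string { return env[k] })
	for _, m := range mods {
		fmt.Printf("%s: %s\n", m.Name, tags[m.Name])
	}
}
```

With a single global variable there would be no way to give `sds-elastic` and `csi-ceph` different tags in the same run while leaving `snapshot-controller` on its YAML value.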

### csi-all-stress-tests

2 changes: 2 additions & 0 deletions docs/ARCHITECTURE.md
@@ -341,6 +341,7 @@ config/
├── config.go # Main configuration operations
├── env.go # Environment variable definitions and validation
├── types.go # Configuration type definitions
├── overrides.go # Per-module modulePullOverride env overrides
└── images.go # OS image URL definitions
```

@@ -780,6 +781,7 @@ logger.Error("Failed to create resource: %v", err)
| `TEST_CLUSTER_CLEANUP` | `false` | Cleanup cluster after tests |
| `LOG_LEVEL` | `debug` | Log level (debug/info/warn/error) |
| `KUBE_CONFIG_PATH` | - | Explicit kubeconfig path. Used when SSH retrieval of `/etc/kubernetes/{super-admin,admin}.conf` from the master fails. If unset and SSH also fails, `GetKubeconfig` returns an error (no silent fallback to `~/.kube/config`). |
| `<MODULE>_MODULE_PULL_OVERRIDE` | - | Per-module override of a module's `modulePullOverride` at config load (module name upper-cased, non-`[A-Z0-9]` → `_`; e.g. `SDS_ELASTIC_MODULE_PULL_OVERRIDE`, `CSI_CEPH_MODULE_PULL_OVERRIDE`). Replaces the static `cluster_config.yml` tag for CI image builds (`pr<N>`/`mr<N>`); each applied override is logged at INFO. The static YAML stays literal — `${VAR}` inside `modulePullOverride` is still rejected. See `internal/config/overrides.go`. |
### Commander Variables (only when `TEST_CLUSTER_CREATE_MODE=commander`)

| Variable | Default | Description |
90 changes: 34 additions & 56 deletions docs/FUNCTIONS_GLOSSARY.md

Large diffs are not rendered by default.

19 changes: 19 additions & 0 deletions docs/WORKLOG.md
@@ -4,6 +4,25 @@ All notable changes to this repository are documented here. New entries are appe

---

## 2026-06-15

- **Update** `pkg/kubernetes/nodegroup.go::CreateStaticNodeGroup`: wrap the existence-check + create in `retry.DoVoid` (backoff 2s→15s, ×1.5, 30 attempts, bounded by the caller's `NodeGroupTimeout` context). Right after `dhctl bootstrap` the node-manager validating webhook (`node-controller-webhook` in `d8-cloud-instance-manager`) is frequently still unreachable, so the apiserver rejects the create with a transient `InternalError` (`failed calling webhook ... connect: operation not permitted`). `retry.IsRetryable` already classifies both `InternalError` and `failed calling webhook` as transient; the loop re-reads the NodeGroup each attempt so it stays idempotent even if a prior attempt created it without us seeing the success response.
- **Why**: the suite previously failed deterministically on freshly bootstrapped clusters because the single-shot create raced the webhook's readiness. Retrying converges instead of aborting the whole run.
- **Update** `internal/config/config.go`: `NodeGroupTimeout` 2m → 4m (now a retry budget, not a single attempt) and `SecretsWaitTimeout` 2m → 10m. Bootstrap secret materialization and webhook convergence routinely exceed 2m on slower/nested clusters, so the old values produced spurious bootstrap failures.

---

## 2026-06-14

- **Add** `internal/config/overrides.go` (`ApplyModulePullOverrideEnv`, `EnvKeyForModulePullOverride`, `ModulePullOverrideChange`): per-module env override for `modulePullOverride`, keyed by module name (`sds-elastic` → `SDS_ELASTIC_MODULE_PULL_OVERRIDE`, `csi-ceph` → `CSI_CEPH_MODULE_PULL_OVERRIDE`; name upper-cased, non-`[A-Z0-9]` → `_`). When the var is set it replaces the module's static `modulePullOverride` at config-load time; the YAML keeps a literal default (`main`).
- **Why**: a dozen-plus module e2e suites need to pin the module-under-test to a CI image tag (`pr<N>`/`mr<N>`) without editing the committed `cluster_config.yml`. Per-suite Makefile rendering (envsubst) does not scale, drifts across repos, and breaks plain `go test ./tests/`. Centralizing the substitution in the shared library gives every suite one contract.
- **Why per-module, not a single global `${VAR}`**: directly addresses the 2026-05-20 review objection — a global `MODULE_IMAGE_TAG` is ambiguous when several modules in one config need different tags. A per-module key is explicit and matches the pre-existing `CSI_CEPH_MODULE_PULL_OVERRIDE` README precedent. In-YAML `${...}` stays rejected by `ValidateModulePullOverrides`; the env override is a separate, explicit channel applied right before validation.
- **Update** `pkg/cluster/cluster.go::loadClusterConfigFromPath` and `internal/cluster/cluster.go::LoadClusterConfig`: call `ApplyModulePullOverrideEnv` after `yaml.Unmarshal` / before `ValidateModulePullOverrides`, logging each applied override at INFO and naming BOTH the static `cluster_config.yml` tag and the env var/tag that wins, e.g. `modulePullOverride[sds-elastic]: cluster_config.yml pins tag "main", but SDS_ELASTIC_MODULE_PULL_OVERRIDE="pr123" is set — using tag "pr123" for this test run`.
- **Add** `internal/config/overrides_test.go`: env-key normalization plus override / no-env / equal-value / empty-YAML-default cases.
- **Update** `README.md`: documented the `<MODULE>_MODULE_PULL_OVERRIDE` per-module override, the load-time log line, and why a single global tag is intentionally avoided.
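
The load order described in this entry (parse, then apply env overrides, then validate) can be sketched as follows. All names here (`module`, `applyEnv`, `validate`, `load`) are hypothetical, and the validation rule is an assumption: it only mimics the documented behaviour that `${...}` placeholders are rejected, since the body of `ValidateModulePullOverrides` is not shown in this change.

```go
package main

import (
	"fmt"
	"strings"
)

// module is a hypothetical, reduced form of one parsed module entry.
type module struct {
	Name               string
	ModulePullOverride string
}

// applyEnv replaces a module's tag when its per-module variable is non-empty.
// Name normalization is simplified to '-' only for brevity.
func applyEnv(mods []*module, getenv func(string) string) {
	for _, m := range mods {
		key := strings.ToUpper(strings.ReplaceAll(m.Name, "-", "_")) + "_MODULE_PULL_OVERRIDE"
		if v := strings.TrimSpace(getenv(key)); v != "" {
			m.ModulePullOverride = v
		}
	}
}

// validate is an assumed stand-in for the real validator: it rejects any
// tag that still contains a ${...} placeholder.
func validate(mods []*module) error {
	for _, m := range mods {
		if strings.Contains(m.ModulePullOverride, "${") {
			return fmt.Errorf("module %q: placeholder in modulePullOverride %q is not allowed", m.Name, m.ModulePullOverride)
		}
	}
	return nil
}

// load takes already-parsed modules (the result of the YAML step), applies
// the env overrides, and only then validates the effective values.
func load(mods []*module, getenv func(string) string) error {
	applyEnv(mods, getenv)
	return validate(mods)
}

func main() {
	ok := []*module{{Name: "sds-elastic", ModulePullOverride: "main"}}
	err := load(ok, func(string) string { return "pr123" })
	fmt.Println(ok[0].ModulePullOverride, err)

	bad := []*module{{Name: "sds-elastic", ModulePullOverride: "${TAG}"}}
	fmt.Println(load(bad, func(string) string { return "" }))
}
```

Running validation last means the value that is checked is the one the test run will actually use, whichever source it came from.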

---

## 2026-05-06

- **Add** `UploadPrivate` on `ssh.SSHClient` (`internal/infrastructure/ssh`): SFTP `Chmod` immediately after `Create`, before payload copy; `uploadOverSFTPOnce`, `uploadWithSFTPRetries`, `jumpUploadWithSFTPRetries`; passphrase `BootstrapCluster` uses it with `install -d -m 0700` staging (`pkg/cluster/setup.go`); ARCHITECTURE mentions ssh uploads
7 changes: 7 additions & 0 deletions internal/cluster/cluster.go
@@ -76,6 +76,13 @@ func LoadClusterConfig(configFilename string) (*config.ClusterDefinition, error)
		return nil, fmt.Errorf("failed to parse YAML config: %w", err)
	}

	// Apply per-module modulePullOverride env overrides (e.g.
	// SDS_ELASTIC_MODULE_PULL_OVERRIDE) before validation, logging each one so
	// the running image tag's source is unambiguous.
	for _, ch := range config.ApplyModulePullOverrideEnv(&clusterDef) {
		logger.Info("%s", ch.LogLine())
	}

	if err := config.ValidateModulePullOverrides(&clusterDef); err != nil {
		return nil, err
	}
4 changes: 2 additions & 2 deletions internal/config/config.go
@@ -56,8 +56,8 @@ const (
	// Kubernetes operations
	ModuleCheckTimeout = 10 * time.Second // Timeout for checking module status
	NamespaceTimeout = 30 * time.Second // Timeout for creating namespace
	NodeGroupTimeout = 2 * time.Minute // Timeout for creating NodeGroup (API can be slow right after bootstrap)
	SecretsWaitTimeout = 2 * time.Minute // Timeout for waiting for bootstrap secrets to appear
	NodeGroupTimeout = 4 * time.Minute // Timeout (with retries) for creating NodeGroup; the node-manager validating webhook is often unreachable for a while right after bootstrap
	SecretsWaitTimeout = 10 * time.Minute // Timeout for waiting for bootstrap secrets to appear
	ClusterHealthTimeout = 15 * time.Minute // Timeout for cluster health check
	ModuleDeployTimeout = 15 * time.Minute // Timeout for waiting for ONE module to be ready

113 changes: 113 additions & 0 deletions internal/config/overrides.go
@@ -0,0 +1,113 @@
/*
Copyright 2026 Flant JSC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package config

import (
	"fmt"
	"os"
	"strings"
)

// ModulePullOverrideEnvSuffix is appended to the normalized module name to form
// the per-module env var that overrides modulePullOverride. For example module
// "sds-elastic" maps to "SDS_ELASTIC_MODULE_PULL_OVERRIDE".
const ModulePullOverrideEnvSuffix = "_MODULE_PULL_OVERRIDE"

// ModulePullOverrideDefaultTag is the image tag storage-e2e applies for dev
// registries when a module declares no modulePullOverride. It is surfaced here
// only so logs can name the effective default when the YAML value was empty.
const ModulePullOverrideDefaultTag = "main"

// EnvKeyForModulePullOverride returns the per-module env var name that overrides
// a module's modulePullOverride. The module name is upper-cased and every
// character invalid in a shell env var (anything outside [A-Z0-9]) is replaced
// with '_', so "sds-elastic" -> "SDS_ELASTIC_MODULE_PULL_OVERRIDE" and
// "csi-ceph" -> "CSI_CEPH_MODULE_PULL_OVERRIDE".
func EnvKeyForModulePullOverride(moduleName string) string {
	norm := strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z':
			return r - ('a' - 'A')
		case r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
			return r
		default:
			return '_'
		}
	}, moduleName)
	return norm + ModulePullOverrideEnvSuffix
}

// ModulePullOverrideChange records a single env-driven override of a module's
// modulePullOverride so the caller can log it explicitly.
type ModulePullOverrideChange struct {
	Module   string // module name
	EnvVar   string // env var that triggered the override
	FromYAML string // value declared in cluster_config.yml ("" when unset)
	ToEnv    string // value taken from the env var (the effective tag)
}

// LogLine renders a human-readable explanation of the override, naming BOTH the
// static cluster_config.yml value and the env var/tag that takes precedence, so
// the test output makes the source of the running image tag unambiguous.
func (c ModulePullOverrideChange) LogLine() string {
	if c.FromYAML == "" {
		return fmt.Sprintf(
			"modulePullOverride[%s]: cluster_config.yml sets no tag (effective default %q), but %s=%q is set — using tag %q for this test run",
			c.Module, ModulePullOverrideDefaultTag, c.EnvVar, c.ToEnv, c.ToEnv,
		)
	}
	return fmt.Sprintf(
		"modulePullOverride[%s]: cluster_config.yml pins tag %q, but %s=%q is set — using tag %q for this test run",
		c.Module, c.FromYAML, c.EnvVar, c.ToEnv, c.ToEnv,
	)
}

// ApplyModulePullOverrideEnv overrides each module's ModulePullOverride from its
// per-module env var (see EnvKeyForModulePullOverride) when that var is set and
// differs from the static value. This is the sanctioned, per-module channel for
// pointing the module-under-test at a CI image tag (pr<N>/mr<N>/main) without
// editing the committed cluster_config.yml — chosen over a single global
// MODULE_IMAGE_TAG so configs with several dev modules stay unambiguous.
//
// In-YAML ${VAR} templating remains unsupported (ValidateModulePullOverrides
// rejects it): the static file keeps literal, readable defaults and this env
// channel is applied right before validation. Returns the applied changes so
// the caller (which owns the logger) can report them; mutates def in place.
func ApplyModulePullOverrideEnv(def *ClusterDefinition) []ModulePullOverrideChange {
	if def == nil {
		return nil
	}
	var changes []ModulePullOverrideChange
	for _, m := range def.DKPParameters.Modules {
		if m == nil {
			continue
		}
		key := EnvKeyForModulePullOverride(m.Name)
		val := strings.TrimSpace(os.Getenv(key))
		if val == "" || val == m.ModulePullOverride {
			continue
		}
		changes = append(changes, ModulePullOverrideChange{
			Module:   m.Name,
			EnvVar:   key,
			FromYAML: m.ModulePullOverride,
			ToEnv:    val,
		})
		m.ModulePullOverride = val
	}
	return changes
}
107 changes: 107 additions & 0 deletions internal/config/overrides_test.go
@@ -0,0 +1,107 @@
/*
Copyright 2026 Flant JSC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package config

import (
	"strings"
	"testing"
)

func TestEnvKeyForModulePullOverride(t *testing.T) {
	cases := map[string]string{
		"sds-elastic":           "SDS_ELASTIC_MODULE_PULL_OVERRIDE",
		"csi-ceph":              "CSI_CEPH_MODULE_PULL_OVERRIDE",
		"sds-node-configurator": "SDS_NODE_CONFIGURATOR_MODULE_PULL_OVERRIDE",
		"snapshot-controller":   "SNAPSHOT_CONTROLLER_MODULE_PULL_OVERRIDE",
	}
	for module, want := range cases {
		if got := EnvKeyForModulePullOverride(module); got != want {
			t.Errorf("EnvKeyForModulePullOverride(%q) = %q, want %q", module, got, want)
		}
	}
}

func newDef(modules ...*ModuleConfig) *ClusterDefinition {
	return &ClusterDefinition{DKPParameters: DKPParameters{Modules: modules}}
}

func TestApplyModulePullOverrideEnv_OverridesAndRecords(t *testing.T) {
	t.Setenv("SDS_ELASTIC_MODULE_PULL_OVERRIDE", "pr123")

	def := newDef(
		&ModuleConfig{Name: "sds-elastic", ModulePullOverride: "main"},
		&ModuleConfig{Name: "csi-ceph", ModulePullOverride: "main"},
	)

	changes := ApplyModulePullOverrideEnv(def)
	if len(changes) != 1 {
		t.Fatalf("expected 1 change, got %d: %+v", len(changes), changes)
	}
	if got := def.DKPParameters.Modules[0].ModulePullOverride; got != "pr123" {
		t.Errorf("sds-elastic ModulePullOverride = %q, want pr123", got)
	}
	if got := def.DKPParameters.Modules[1].ModulePullOverride; got != "main" {
		t.Errorf("csi-ceph ModulePullOverride = %q, want main (untouched)", got)
	}

	ch := changes[0]
	if ch.Module != "sds-elastic" || ch.EnvVar != "SDS_ELASTIC_MODULE_PULL_OVERRIDE" ||
		ch.FromYAML != "main" || ch.ToEnv != "pr123" {
		t.Errorf("unexpected change: %+v", ch)
	}
	line := ch.LogLine()
	for _, want := range []string{"sds-elastic", `"main"`, "SDS_ELASTIC_MODULE_PULL_OVERRIDE", `"pr123"`} {
		if !strings.Contains(line, want) {
			t.Errorf("LogLine() = %q, missing %q", line, want)
		}
	}
}

func TestApplyModulePullOverrideEnv_NoEnvIsNoop(t *testing.T) {
	def := newDef(&ModuleConfig{Name: "sds-elastic", ModulePullOverride: "main"})
	if changes := ApplyModulePullOverrideEnv(def); len(changes) != 0 {
		t.Fatalf("expected no changes without env, got %+v", changes)
	}
	if got := def.DKPParameters.Modules[0].ModulePullOverride; got != "main" {
		t.Errorf("ModulePullOverride = %q, want main (untouched)", got)
	}
}

func TestApplyModulePullOverrideEnv_EqualValueIsNoop(t *testing.T) {
	t.Setenv("SDS_ELASTIC_MODULE_PULL_OVERRIDE", "main")
	def := newDef(&ModuleConfig{Name: "sds-elastic", ModulePullOverride: "main"})
	if changes := ApplyModulePullOverrideEnv(def); len(changes) != 0 {
		t.Fatalf("expected no changes when env equals YAML, got %+v", changes)
	}
}

func TestApplyModulePullOverrideEnv_EmptyYAMLDefaultLogged(t *testing.T) {
	t.Setenv("SDS_ELASTIC_MODULE_PULL_OVERRIDE", "pr7")
	def := newDef(&ModuleConfig{Name: "sds-elastic"})

	changes := ApplyModulePullOverrideEnv(def)
	if len(changes) != 1 {
		t.Fatalf("expected 1 change, got %d", len(changes))
	}
	if got := def.DKPParameters.Modules[0].ModulePullOverride; got != "pr7" {
		t.Errorf("ModulePullOverride = %q, want pr7", got)
	}
	// With no static value the log should name the effective default tag.
	if line := changes[0].LogLine(); !strings.Contains(line, ModulePullOverrideDefaultTag) {
		t.Errorf("LogLine() = %q, expected to mention default %q", line, ModulePullOverrideDefaultTag)
	}
}
7 changes: 7 additions & 0 deletions pkg/cluster/cluster.go
@@ -149,6 +149,13 @@ func loadClusterConfigFromPath(configPath string) (*config.ClusterDefinition, er
		return nil, fmt.Errorf("failed to parse YAML config: %w", err)
	}

	// Apply per-module modulePullOverride env overrides (e.g.
	// SDS_ELASTIC_MODULE_PULL_OVERRIDE) before validation, logging each one so
	// the running image tag's source is unambiguous.
	for _, ch := range config.ApplyModulePullOverrideEnv(&clusterDef) {
		logger.Info("%s", ch.LogLine())
	}

	if err := config.ValidateModulePullOverrides(&clusterDef); err != nil {
		return nil, err
	}