[feature] Reusable E2E CI pipeline by NikolayDemchuk · Pull Request #21 · deckhouse/storage-e2e

NikolayDemchuk · 2026-06-04T13:33:32Z

Signed-off-by: Nikolay Demchuk nikolay.demchuk@flant.com

Description

Add a reusable GitHub Actions workflow and cmd/e2e CLI that run the same three-stage E2E pipeline (create-cluster → run-tests → teardown-cluster) through the ClusterProvider SDK, with JUnit publishing and secret handling via temp files.

Why do we need it, and what problem does it solve?

Module repos (starting with sds-node-configurator) need a shared, low-overhead CI pattern for storage E2E without duplicating bash cluster setup and diverging from local make e2e / Ginkgo runs.

What is the expected result?

Callers use deckhouse/storage-e2e/.github/workflows/e2e-reusable.yml with secrets: inherit; jobs provision the cluster via SDK, run tests with optional --label-filter, upload JUnit to PR Checks, and always tear down the cluster in the final job.

Checklist

The code is covered by unit tests.
e2e tests passed.
Documentation updated according to the changes.
Changes were tested in the Kubernetes cluster manually.

mrwesadex-flant · 2026-06-17T15:31:30Z

Claude code review - need to examine properly!

Note on importance: This is an automated multi-angle review, not a verdict. The items below are ranked by my estimate of severity, but each needs a human to confirm against the real deployment. #1 and #2 are the ones to look at first — both are silent wrong-target bugs that only fire on an unhappy path (SSH fetch failure / an optional secret being unset), so CI will stay green until the day they don't. #5 is a PR-level design question worth discussing before merge. The "lower priority" items at the bottom are cleanup, not blockers.

A large PR (~3.6k lines: new cmd/e2e CLI, pkg/provider SDK, reusable workflow, cluster-resume rewrite). Findings ranked most-severe first.

Correctness

1. GetKubeconfig fallback now prefers the BASE cluster over KUBE_CONFIG_PATH — internal/cluster/cluster.go:259
On SSH kubeconfig-fetch failure the new fallback resolves EffectiveBaseKubeConfigPath() (base/virtualization cluster) before config.KubeConfigPath. Every test-cluster ConnectToCluster call leaves KubeconfigFallbackPath empty, so a transient SSH failure (or a sudo -n prompt) to the test master silently loads the base cluster's kubeconfig — tests then acquire locks / deploy modules against the wrong stand. Pre-PR the fallback was the test cluster's KUBE_CONFIG_PATH. Wrong-cluster regression.

2. ReloadFromEnv() discards the defaults ValidateEnvironment applies — internal/config/env.go:383
ReloadFromEnv() resets package config to raw env, dropping VMSSHUser→"cloud", ImagePullPolicy, LogLevel, YAMLConfigFilename defaults. It is now called standalone mid-flow inside ConnectToCluster/connectExistingCluster. When SSH_VM_USER is unset (it's required: false), VMSSHUser is reset to "" after ConnectToCluster, and bootstrap (pkg/cluster/setup.go:500) then fails with VMSSHUser cannot be empty in config — provisioning that previously succeeded on the default now breaks.

3. Rewritten resume path no longer early-returns — pkg/cluster/cluster.go:368
Old resume returned right after reconnecting ("assume should-create-test-cluster already completed"). New resume falls through to steps 7+ (setup-node SSH, BootstrapCluster, AddNodesToCluster, EnableAndConfigureModules) against an already-bootstrapped control plane. Unless every one of those steps is idempotent, a resume re-runs bootstrap/node-join on a live cluster and can fail or corrupt it. TEST_CLUSTER_RESUME semantics changed from "reconnect-only" to "continue-provisioning" — confirm downstream steps tolerate re-execution.

4. Unguarded SSH private-key write produces a bogus key file — .github/workflows/e2e-reusable.yml:332
The inline "Setup test environment" step writes the private key with no emptiness guard (unlike the public key, kubeconfig, and e2e-prepare-env.sh). With E2E_SSH_PRIVATE_KEY unset, it writes a newline-only file, chmod 600 succeeds, and SSH_PRIVATE_KEY points at it. The tunnel/cleanup [ -f "$E2E_SSH_KEY_PATH" ] gates then pass, ssh launches with a garbage identity, the backgrounded tunnel dies, and tests fail with opaque connection errors instead of the clear "tunnel params missing".

6. nil-pointer panic in resume branch — pkg/cluster/cluster.go:369
state, _ := loadClusterState() discards the error, then state.Namespace dereferences. If cluster-state.json is removed/unreadable between the earlier load and this one (concurrent ClearStaleArtifacts, /tmp cleanup, parallel process), this panics and crashes create-cluster instead of erroring cleanly.

7. env-file parsers don't reverse %q escaping — cmd/e2e/main.go:940 (and pkg/provider/envfile.go)
WriteRunEnv emits export KEY=%q, but applyRunEnv/ApplyEnvFileKeys strip only surrounding quotes via strings.Trim(v, '"'). A value containing a backslash or quote (unusual path, storage class, namespace) is restored corrupted into SSH_HOST/KUBE_CONFIG_PATH/TEST_CLUSTER_NAMESPACE.

Altitude / design

5. The SDK the PR adds is never exercised in CI — .github/workflows/e2e-reusable.yml:342
create-cluster and teardown-cluster are mocked no-op echo jobs, and run-tests calls raw go test while re-implementing kubeconfig-server parsing, SSH ProxyJump tunneling and timeout normalization in bash. The PR's premise is "no bash/YAML cluster provisioning, via the ClusterProvider SDK," yet cmd/e2e (EnsureCluster/Teardown) is never invoked. The bash tunnel hardcodes 127.0.0.1:6445 while the Go path derives the port, and the bash duration regex diverges from Go's time.ParseDuration — two implementations that will drift, with the SDK centerpiece uncovered by CI.

8. needsBastionJump hardcodes a lab-network prefix — internal/config/env.go:326
strings.HasPrefix(host, "10.10.10.") bakes a network-topology fact into shared config. Any environment whose internal test network isn't exactly 10.10.10.x gets needsBastionJump=false, skips the jump host, and SSHes directly to an unroutable IP until ConnectTimeout — and the CI guard that would emit "SSH_JUMP_HOST required" never fires.

9. Cleanup deletes namespace before vd/vm, and may scope to the wrong namespace — .github/workflows/e2e-reusable.yml:380
Deleting the namespace first cascades vd/vm, so the later --all deletes are pointless; for finalizer-stuck VirtualDisks/VMs the vd/vm deletes must run first to unblock namespace termination. And under if: always(), if TEST_NAMESPACE was never written to $GITHUB_ENV, kubectl delete vd -n "" --all scopes deletes to the wrong/default namespace on a shared base cluster.

10. Docs claim JUnit/PR-Checks publishing that doesn't exist — docs/CI.md:58
Docs say run-tests uploads junit.xml and publishes PR Checks via dorny/test-reporter, but the workflow has neither — only upload-artifact of the raw log. go test | tee emits no junit, and the checks: write permission is unused. Either wire up the reporter or correct the doc.

Lower priority (cleanup, not blockers)

Byte-identical expandPath/expandPathLocal in pkg/provider/reconnect.go:130 and pkg/cluster/lifecycle.go:114 (the repo already has 3 other variants).
Dead code: closeResources and copyKubeconfigArtifact in cmd/e2e/main.go:445,450 are never called (golangci-lint will flag).
prune_dir and the secret-to-tempfile logic are duplicated between .github/scripts/*.sh and inline workflow steps, and have already begun to diverge.
Efficiency: a single create-cluster --skip-if-ready triggers ~5–6 independent base-cluster SSH+tunnel+kubeconfig cycles; the WaitVMNamesRunning/WaitNamespaceHasActiveCluster poll loops re-establish the full connection every 10s instead of holding one open.

Automated review (Claude Code, multi-angle + verification pass). Severity estimates are mine; please verify against the live setup before acting.

Three-job pipeline (create-cluster → run-tests → teardown-cluster). create-cluster and teardown-cluster are mocked no-ops; run-tests mirrors the build_dev smoke flow: SSH tunnel, go mod replace, go test with Ginkgo label filter and 90m minimum suite timeout. Adds .github/workflows/e2e-reusable.yml and helper scripts. Co-authored-by: Cursor <cursoragent@cursor.com>

github-code-quality · 2026-06-22T13:52:15Z

Code Coverage Overview

Languages: Go

Go / code-coverage/go

The overall coverage remains at 9%, unchanged from the main branch.

_{Updated June 24, 2026 15:21 UTC
Code Coverage is in Public Preview. Learn more and provide us with your feedback.}

- Add noop pipeline_mode to e2e-reusable.yml: create-cluster and run-tests both run with echo-only steps; real steps are gated by pipeline_mode != 'noop'. Allows storage-e2e to self-test the CI pipeline structure without a real cluster. - Parameterise test suite name via test_suite input (default: TestSdsNodeConfigurator) — backward-compatible for existing callers. - Add e2e-self-test.yml: calls reusable workflow in noop mode on PRs touching .github/workflows/e2e-reusable.yml or go.mod/go.sum. Co-authored-by: Cursor <cursoragent@cursor.com>

Signed-off-by: Nikolay Demchuk <nikolay.demchuk@flant.com>

Add skip_storage_e2e_replace boolean input (default: false) to e2e-reusable.yml. When true: - Checkout storage-e2e step is skipped - Setup Go uses only module_path/go.sum (no storage-e2e/go.sum) - go mod edit -replace is skipped in Run E2E tests Update e2e-self-test.yml to pass skip_storage_e2e_replace: true and set correct test_package/test_suite for storage-e2e's own tests. Co-authored-by: Cursor <cursoragent@cursor.com>

Signed-off-by: Nikolay Demchuk <nikolay.demchuk@flant.com>

fastrapier · 2026-06-24T14:00:27Z

@copilot resolve the merge conflicts in this pull request

Copilot · 2026-06-24T14:12:21Z

@copilot resolve the merge conflicts in this pull request

Done. I resolved the merge conflict and merged origin/main into this branch in commit fec215f.

github-actions · 2026-06-24T15:21:30Z

Package	Line Rate	Health
github.com/deckhouse/storage-e2e/internal/cluster	0%	❌
github.com/deckhouse/storage-e2e/internal/config	45%	❌
github.com/deckhouse/storage-e2e/internal/infrastructure/ssh	0%	❌
github.com/deckhouse/storage-e2e/internal/kubernetes/commander	48%	❌
github.com/deckhouse/storage-e2e/internal/kubernetes/deckhouse	0%	❌
github.com/deckhouse/storage-e2e/internal/kubernetes/storage	0%	❌
github.com/deckhouse/storage-e2e/internal/kubernetes/virtualization	0%	❌
github.com/deckhouse/storage-e2e/internal/logger	51%	❌
github.com/deckhouse/storage-e2e/internal/provisioning/dvp	7%	❌
github.com/deckhouse/storage-e2e/pkg/cluster	4%	❌
github.com/deckhouse/storage-e2e/pkg/clusterprovider	89%	✔
github.com/deckhouse/storage-e2e/pkg/clusterprovider/registry	100%	✔
github.com/deckhouse/storage-e2e/pkg/kubernetes	5%	❌
github.com/deckhouse/storage-e2e/pkg/retry	94%	✔
github.com/deckhouse/storage-e2e/pkg/storage-e2e	0%	❌
github.com/deckhouse/storage-e2e/pkg/testkit	4%	❌
Summary	9% (1168 / 12416)	❌

NikolayDemchuk force-pushed the ndemchuk-reusable-ci branch from 1069667 to 398effa Compare June 22, 2026 13:51

NikolayDemchuk and others added 4 commits June 23, 2026 00:35

added new vars

21ec974

Signed-off-by: Nikolay Demchuk <nikolay.demchuk@flant.com>

fix

b82aab6

Signed-off-by: Nikolay Demchuk <nikolay.demchuk@flant.com>

fastrapier previously approved these changes Jun 24, 2026

View reviewed changes

fastrapier marked this pull request as ready for review June 24, 2026 14:00

Copilot started work on behalf of fastrapier June 24, 2026 14:00 View session

Merge origin/main into ndemchuk-reusable-ci

fec215f

Copilot AI dismissed fastrapier’s stale review via fec215f June 24, 2026 14:12

Copilot finished work on behalf of fastrapier June 24, 2026 14:12

Copilot AI requested a review from fastrapier June 24, 2026 14:12

fastrapier approved these changes Jun 24, 2026

View reviewed changes

fastrapier merged commit 229345b into main Jun 24, 2026
8 checks passed

fastrapier deleted the ndemchuk-reusable-ci branch June 24, 2026 15:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feature] Reusable E2E CI pipeline#21

[feature] Reusable E2E CI pipeline#21
fastrapier merged 6 commits into
mainfrom
ndemchuk-reusable-ci

NikolayDemchuk commented Jun 4, 2026

Uh oh!

mrwesadex-flant commented Jun 17, 2026

Uh oh!

github-code-quality Bot commented Jun 22, 2026 •

edited

Loading

Uh oh!

fastrapier commented Jun 24, 2026

Uh oh!

Copilot AI commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

NikolayDemchuk commented Jun 4, 2026

Description

Why do we need it, and what problem does it solve?

What is the expected result?

Checklist

Uh oh!

mrwesadex-flant commented Jun 17, 2026

Claude code review - need to examine properly!

Correctness

Altitude / design

Lower priority (cleanup, not blockers)

Uh oh!

github-code-quality Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Coverage Overview

Go / code-coverage/go

Uh oh!

fastrapier commented Jun 24, 2026

Uh oh!

Copilot AI commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-code-quality Bot commented Jun 22, 2026 •

edited

Loading