Skip to content

[feature] Reusable E2E CI pipeline#21

Merged
fastrapier merged 6 commits into
mainfrom
ndemchuk-reusable-ci
Jun 24, 2026
Merged

[feature] Reusable E2E CI pipeline#21
fastrapier merged 6 commits into
mainfrom
ndemchuk-reusable-ci

Conversation

@NikolayDemchuk

Copy link
Copy Markdown
Contributor

Signed-off-by: Nikolay Demchuk nikolay.demchuk@flant.com

Description

Add a reusable GitHub Actions workflow and cmd/e2e CLI that run the same three-stage E2E pipeline (create-clusterrun-teststeardown-cluster) through the ClusterProvider SDK, with JUnit publishing and secret handling via temp files.

Why do we need it, and what problem does it solve?

Module repos (starting with sds-node-configurator) need a shared, low-overhead CI pattern for storage E2E without duplicating bash cluster setup and diverging from local make e2e / Ginkgo runs.

What is the expected result?

Callers use deckhouse/storage-e2e/.github/workflows/e2e-reusable.yml with secrets: inherit; jobs provision the cluster via SDK, run tests with optional --label-filter, upload JUnit to PR Checks, and always tear down the cluster in the final job.

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.

@mrwesadex-flant

Copy link
Copy Markdown
Collaborator

Claude code review - need to examine properly!

Note on importance: This is an automated multi-angle review, not a verdict. The items below are ranked by my estimate of severity, but each needs a human to confirm against the real deployment. #1 and #2 are the ones to look at first — both are silent wrong-target bugs that only fire on an unhappy path (SSH fetch failure / an optional secret being unset), so CI will stay green until the day they don't. #5 is a PR-level design question worth discussing before merge. The "lower priority" items at the bottom are cleanup, not blockers.

A large PR (~3.6k lines: new cmd/e2e CLI, pkg/provider SDK, reusable workflow, cluster-resume rewrite). Findings ranked most-severe first.

Correctness

1. GetKubeconfig fallback now prefers the BASE cluster over KUBE_CONFIG_PATHinternal/cluster/cluster.go:259
On SSH kubeconfig-fetch failure the new fallback resolves EffectiveBaseKubeConfigPath() (base/virtualization cluster) before config.KubeConfigPath. Every test-cluster ConnectToCluster call leaves KubeconfigFallbackPath empty, so a transient SSH failure (or a sudo -n prompt) to the test master silently loads the base cluster's kubeconfig — tests then acquire locks / deploy modules against the wrong stand. Pre-PR the fallback was the test cluster's KUBE_CONFIG_PATH. Wrong-cluster regression.

2. ReloadFromEnv() discards the defaults ValidateEnvironment appliesinternal/config/env.go:383
ReloadFromEnv() resets package config to raw env, dropping VMSSHUser→"cloud", ImagePullPolicy, LogLevel, YAMLConfigFilename defaults. It is now called standalone mid-flow inside ConnectToCluster/connectExistingCluster. When SSH_VM_USER is unset (it's required: false), VMSSHUser is reset to "" after ConnectToCluster, and bootstrap (pkg/cluster/setup.go:500) then fails with VMSSHUser cannot be empty in config — provisioning that previously succeeded on the default now breaks.

3. Rewritten resume path no longer early-returnspkg/cluster/cluster.go:368
Old resume returned right after reconnecting ("assume should-create-test-cluster already completed"). New resume falls through to steps 7+ (setup-node SSH, BootstrapCluster, AddNodesToCluster, EnableAndConfigureModules) against an already-bootstrapped control plane. Unless every one of those steps is idempotent, a resume re-runs bootstrap/node-join on a live cluster and can fail or corrupt it. TEST_CLUSTER_RESUME semantics changed from "reconnect-only" to "continue-provisioning" — confirm downstream steps tolerate re-execution.

4. Unguarded SSH private-key write produces a bogus key file.github/workflows/e2e-reusable.yml:332
The inline "Setup test environment" step writes the private key with no emptiness guard (unlike the public key, kubeconfig, and e2e-prepare-env.sh). With E2E_SSH_PRIVATE_KEY unset, it writes a newline-only file, chmod 600 succeeds, and SSH_PRIVATE_KEY points at it. The tunnel/cleanup [ -f "$E2E_SSH_KEY_PATH" ] gates then pass, ssh launches with a garbage identity, the backgrounded tunnel dies, and tests fail with opaque connection errors instead of the clear "tunnel params missing".

6. nil-pointer panic in resume branchpkg/cluster/cluster.go:369
state, _ := loadClusterState() discards the error, then state.Namespace dereferences. If cluster-state.json is removed/unreadable between the earlier load and this one (concurrent ClearStaleArtifacts, /tmp cleanup, parallel process), this panics and crashes create-cluster instead of erroring cleanly.

7. env-file parsers don't reverse %q escapingcmd/e2e/main.go:940 (and pkg/provider/envfile.go)
WriteRunEnv emits export KEY=%q, but applyRunEnv/ApplyEnvFileKeys strip only surrounding quotes via strings.Trim(v, '"'). A value containing a backslash or quote (unusual path, storage class, namespace) is restored corrupted into SSH_HOST/KUBE_CONFIG_PATH/TEST_CLUSTER_NAMESPACE.

Altitude / design

5. The SDK the PR adds is never exercised in CI.github/workflows/e2e-reusable.yml:342
create-cluster and teardown-cluster are mocked no-op echo jobs, and run-tests calls raw go test while re-implementing kubeconfig-server parsing, SSH ProxyJump tunneling and timeout normalization in bash. The PR's premise is "no bash/YAML cluster provisioning, via the ClusterProvider SDK," yet cmd/e2e (EnsureCluster/Teardown) is never invoked. The bash tunnel hardcodes 127.0.0.1:6445 while the Go path derives the port, and the bash duration regex diverges from Go's time.ParseDuration — two implementations that will drift, with the SDK centerpiece uncovered by CI.

8. needsBastionJump hardcodes a lab-network prefixinternal/config/env.go:326
strings.HasPrefix(host, "10.10.10.") bakes a network-topology fact into shared config. Any environment whose internal test network isn't exactly 10.10.10.x gets needsBastionJump=false, skips the jump host, and SSHes directly to an unroutable IP until ConnectTimeout — and the CI guard that would emit "SSH_JUMP_HOST required" never fires.

9. Cleanup deletes namespace before vd/vm, and may scope to the wrong namespace.github/workflows/e2e-reusable.yml:380
Deleting the namespace first cascades vd/vm, so the later --all deletes are pointless; for finalizer-stuck VirtualDisks/VMs the vd/vm deletes must run first to unblock namespace termination. And under if: always(), if TEST_NAMESPACE was never written to $GITHUB_ENV, kubectl delete vd -n "" --all scopes deletes to the wrong/default namespace on a shared base cluster.

10. Docs claim JUnit/PR-Checks publishing that doesn't existdocs/CI.md:58
Docs say run-tests uploads junit.xml and publishes PR Checks via dorny/test-reporter, but the workflow has neither — only upload-artifact of the raw log. go test | tee emits no junit, and the checks: write permission is unused. Either wire up the reporter or correct the doc.

Lower priority (cleanup, not blockers)

  • Byte-identical expandPath/expandPathLocal in pkg/provider/reconnect.go:130 and pkg/cluster/lifecycle.go:114 (the repo already has 3 other variants).
  • Dead code: closeResources and copyKubeconfigArtifact in cmd/e2e/main.go:445,450 are never called (golangci-lint will flag).
  • prune_dir and the secret-to-tempfile logic are duplicated between .github/scripts/*.sh and inline workflow steps, and have already begun to diverge.
  • Efficiency: a single create-cluster --skip-if-ready triggers ~5–6 independent base-cluster SSH+tunnel+kubeconfig cycles; the WaitVMNamesRunning/WaitNamespaceHasActiveCluster poll loops re-establish the full connection every 10s instead of holding one open.

Automated review (Claude Code, multi-angle + verification pass). Severity estimates are mine; please verify against the live setup before acting.

Three-job pipeline (create-cluster → run-tests → teardown-cluster).
create-cluster and teardown-cluster are mocked no-ops; run-tests mirrors
the build_dev smoke flow: SSH tunnel, go mod replace, go test with
Ginkgo label filter and 90m minimum suite timeout.

Adds .github/workflows/e2e-reusable.yml and helper scripts.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-code-quality

github-code-quality Bot commented Jun 22, 2026

Copy link
Copy Markdown

Code Coverage Overview

Languages: Go

Go / code-coverage/go

The overall coverage remains at 9%, unchanged from the branch.


Updated June 24, 2026 15:21 UTC
Code Coverage is in Public Preview. Learn more and provide us with your feedback.

NikolayDemchuk and others added 4 commits June 23, 2026 00:35
- Add noop pipeline_mode to e2e-reusable.yml: create-cluster and
  run-tests both run with echo-only steps; real steps are gated by
  pipeline_mode != 'noop'. Allows storage-e2e to self-test the CI
  pipeline structure without a real cluster.
- Parameterise test suite name via test_suite input (default:
  TestSdsNodeConfigurator) — backward-compatible for existing callers.
- Add e2e-self-test.yml: calls reusable workflow in noop mode on PRs
  touching .github/workflows/e2e-reusable.yml or go.mod/go.sum.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nikolay Demchuk <nikolay.demchuk@flant.com>
Add skip_storage_e2e_replace boolean input (default: false) to
e2e-reusable.yml. When true:
- Checkout storage-e2e step is skipped
- Setup Go uses only module_path/go.sum (no storage-e2e/go.sum)
- go mod edit -replace is skipped in Run E2E tests

Update e2e-self-test.yml to pass skip_storage_e2e_replace: true
and set correct test_package/test_suite for storage-e2e's own tests.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nikolay Demchuk <nikolay.demchuk@flant.com>
fastrapier
fastrapier previously approved these changes Jun 24, 2026
@fastrapier fastrapier marked this pull request as ready for review June 24, 2026 14:00
@fastrapier

Copy link
Copy Markdown
Member

@copilot resolve the merge conflicts in this pull request

Copilot AI commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

@copilot resolve the merge conflicts in this pull request

Done. I resolved the merge conflict and merged origin/main into this branch in commit fec215f.

Copilot AI requested a review from fastrapier June 24, 2026 14:12
@github-actions

Copy link
Copy Markdown

Code Coverage

Package Line Rate Health
github.com/deckhouse/storage-e2e/internal/cluster 0%
github.com/deckhouse/storage-e2e/internal/config 45%
github.com/deckhouse/storage-e2e/internal/infrastructure/ssh 0%
github.com/deckhouse/storage-e2e/internal/kubernetes/commander 48%
github.com/deckhouse/storage-e2e/internal/kubernetes/deckhouse 0%
github.com/deckhouse/storage-e2e/internal/kubernetes/storage 0%
github.com/deckhouse/storage-e2e/internal/kubernetes/virtualization 0%
github.com/deckhouse/storage-e2e/internal/logger 51%
github.com/deckhouse/storage-e2e/internal/provisioning/dvp 7%
github.com/deckhouse/storage-e2e/pkg/cluster 4%
github.com/deckhouse/storage-e2e/pkg/clusterprovider 89%
github.com/deckhouse/storage-e2e/pkg/clusterprovider/registry 100%
github.com/deckhouse/storage-e2e/pkg/kubernetes 5%
github.com/deckhouse/storage-e2e/pkg/retry 94%
github.com/deckhouse/storage-e2e/pkg/storage-e2e 0%
github.com/deckhouse/storage-e2e/pkg/testkit 4%
Summary 9% (1168 / 12416)

@fastrapier fastrapier merged commit 229345b into main Jun 24, 2026
8 checks passed
@fastrapier fastrapier deleted the ndemchuk-reusable-ci branch June 24, 2026 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants