PRODENG-3446: add airgapped multi-hop upgrade smoke test (customer scenario)#630
james-nesbitt wants to merge 5 commits into
This is waiting for input from @trifo13
4df8031 to 03683d2
03683d2 to 6740adc
    // Disable the container-tools module stream before MCR install. RHEL8
    // AppStream pulls in system runc as a container-selinux dependency; that
    // package conflicts with Mirantis's containerd.io-runc at install time.
    UserData: "sudo dnf module disable container-tools -y; sudo firewall-cmd --permanent --add-port=2377/tcp --add-port=7946/tcp --add-port=7946/udp --add-port=4789/udp --add-port=10250/tcp; sudo firewall-cmd --reload",
There should be a way to pass in custom userdata in the nodegroup variables instead of changing the default. Does it make sense to always use this override for all tests? If so, does it need to be applied to the other RHEL versions as well, rather than just RHEL 8?
Pull request overview
Adds a new smoke-test scenario to exercise a customer-style “airgapped” multi-hop upgrade path, plus supporting Terraform/CI/doc updates so the scenario can be provisioned and run in automation.
Changes:
- Added TestAirgappedMultiHopUpgrade smoke test that provisions MSR behind a non-standard external port and performs multi-step MKE/MCR upgrades using imageRepo overrides.
- Extended the shared AWS Terraform smoke module with an msr_port input and updated MSR ingress/--dtr-external-url generation.
- Updated CI workflow and developer/agent documentation for smoke-test authoring and execution.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| test/smoke/airgapped_multi_hop_upgrade_test.go | New multi-hop “airgapped” upgrade smoke test (custom registry port + sequential upgrade chain). |
| test/platforms.go | Adjusts RHEL8 userdata to avoid MCR install conflicts by disabling container-tools module stream. |
| Makefile | Adds smoke-airgapped-multi-hop make target with extended timeout. |
| examples/terraform/aws-simple/variables.tf | Introduces msr_port variable (default 443). |
| examples/terraform/aws-simple/launchpad.tf | Wires msr_port into MSR NLB ingress and --dtr-external-url. |
| examples/terraform/aws-simple/.terraform.lock.hcl | Updates provider constraint/hashes after Terraform init/upgrade. |
| docs/specifications/architecture.md | Refreshes architecture/spec details (schema, phases, flows). |
| docs/development/workflow.md | Expands testing guidance and smoke-test workflow documentation. |
| docs/development/smoke-tests.md | New smoke-test authoring guide and CI wiring checklist. |
| CLAUDE.md | Removed (superseded by AGENTS.md standard). |
| AI_AGENTS.md | Removed (superseded by AGENTS.md standard). |
| AGENTS.md | New consolidated agent instructions (AGENTS.md open standard). |
| .github/workflows/smoke-tests.yaml | Adds smoke-airgapped-multi-hop CI job with long timeout and label gate. |
Files not reviewed (1)
- examples/terraform/aws-simple/.terraform.lock.hcl: Language not supported
## Smoke test

Added TestUpgradeLegacyToModern (test/smoke/upgrade_test.go):
- Provisions RHEL8/Rocky8/Ubuntu22, installs MCR stable-25.0 / MKE 3.8.8, then upgrades in place to MCR stable-29.2 / MKE 3.9.2 via a second Apply().
- runUpgradeTest() helper mirrors runSmokeTest() structure (defer destroy, resource tagging, temp SSH dir).
- bumpVersions() unmarshals Terraform-generated launchpad_yaml, updates spec.mcr.channel and spec.mke.version, and re-marshals, preserving host addresses, SANs, LB names, and install flags verbatim.
- make smoke-upgrade target (90m timeout).
- smoke-upgrade CI job in .github/workflows/smoke-tests.yaml, gated by smoke-upgrade or smoke-test PR label.

CI result: PASS (run 25721416884, 1320s).

## Documentation

AGENTS.md (replaces CLAUDE.md + AI_AGENTS.md):
- Consolidated into the AGENTS.md open standard (Linux Foundation, supported by Claude Code, Cursor, Windsurf, Codex, Gemini CLI, Aider, and others).
- Covers: project overview, non-negotiable rules, build/test commands, phase manager architecture, config schema (v1.6 with example), smoke test reference table, contributing guidelines, multi-engineer workflow guidance, and documentation index.

docs/development/smoke-tests.md (new):
- Complete authoring guide for new smoke tests: framework mechanics, all 14 available platforms, runSmokeTest / runUpgradeTest / bumpVersions usage with annotated examples, Windows-specific requirements, CI wiring (Makefile + workflow + PR label), timeout guidance, Reset best-effort rationale, and a pre-submission checklist.

docs/development/workflow.md:
- Replace stale smoke-small/smoke-full with actual four targets.
- Add --tags testing note for unit tests.
- Remove non-existent make build-release / make sign-release.
- Add current apiVersion to schema safety guideline.

docs/specifications/architecture.md:
- Fix package path: pkg/product/mke/api -> pkg/product/mke/config.
- Add current apiVersion and full spec structure.
- Add abridged Apply and Reset phase sequences.
- Document UninstallMKE swarm dissolution fallback (PRODENG-3442).

Signed-off-by: James Nesbitt <jnesbitt@mirantis.com>
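The bumpVersions() approach described in this commit (edit two fields, leave everything else untouched) can be sketched roughly as below. This is not the PR's code: it assumes the YAML has already been decoded into a generic map by whatever YAML library the test uses, and setNested and bumpVersionsInDoc are hypothetical names. The sample host and flag values are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// setNested walks doc along path and overwrites the final key with value.
// It refuses to create missing intermediate mappings, so a typo in the path
// cannot silently add a new section to the config.
func setNested(doc map[string]any, value any, path ...string) error {
	if len(path) == 0 {
		return errors.New("empty path")
	}
	cur := doc
	for _, key := range path[:len(path)-1] {
		next, ok := cur[key].(map[string]any)
		if !ok {
			return fmt.Errorf("%q is missing or not a mapping", key)
		}
		cur = next
	}
	cur[path[len(path)-1]] = value
	return nil
}

// bumpVersionsInDoc (hypothetical) updates only spec.mcr.channel and
// spec.mke.version; every other key in the decoded document is left as-is.
func bumpVersionsInDoc(doc map[string]any, mcrChannel, mkeVersion string) error {
	if err := setNested(doc, mcrChannel, "spec", "mcr", "channel"); err != nil {
		return err
	}
	return setNested(doc, mkeVersion, "spec", "mke", "version")
}

func main() {
	// Stand-in for the decoded launchpad_yaml; values are illustrative only.
	doc := map[string]any{
		"spec": map[string]any{
			"hosts": []any{map[string]any{"role": "manager"}},
			"mcr":   map[string]any{"channel": "stable-25.0"},
			"mke": map[string]any{
				"version":      "3.8.8",
				"installFlags": []any{"--san=lb.example.com"},
			},
		},
	}
	if err := bumpVersionsInDoc(doc, "stable-29.2", "3.9.2"); err != nil {
		panic(err)
	}
	spec := doc["spec"].(map[string]any)
	fmt.Println(spec["mcr"], spec["mke"])
}
```

Working on a generic map rather than a typed struct is what keeps unknown or future fields intact across the round trip.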
…enario)

Adds TestAirgappedMultiHopUpgrade, a smoke test that exercises the full upgrade chain, similar to what CSO EMEA observed in a specific customer scenario: install with MCR 25.0 / MKE 3.8.8 / MSR 2.9.27, then upgrade through 3.8.11 → 3.8.12 (MCR 29.2) → 3.9.2 → latest MKE 3.x / MCR 29.x. All post-install upgrade steps pull images from an internal DTR exposed on a non-standard port (4443) via an NLB, simulating an airgapped registry configuration.

Key design decisions
--------------------
- Image preload strategy: rather than pushing images to DTR (which requires namespace provisioning and hits DTR auth edge cases), all upgrade images are pulled from docker.io/mirantis on every node and tagged with the DTR registry address. Launchpad's "Pull MKE images" phase runs docker image inspect before docker pull; on finding the image locally, it skips the pull entirely. This exercises Launchpad's imageRepo feature without requiring actual DTR push/pull.
- SSH key compatibility: Terraform's tls_private_key emits OpenSSH-format ed25519 keys that golang.org/x/crypto/ssh.ParsePrivateKey rejects. All remote commands use the system ssh binary (sshRun/sshRunScript) to avoid Go-side key parsing.
- DTR image listing: UCP bootstrapper uses "images --list"; DTR 2.x uses "images" (the --list flag is unrecognised and causes help text on stdout with exit 0). The preload script filters output for valid image-reference patterns and falls back to the plain "images" subcommand automatically.
- Dynamic latest-version step: fetchLatestMKEVersion queries Docker Hub tags for mirantis/ucp; fetchLatestMCRChannel probes the Mirantis apt repository (channels are non-sequential, so the probe scans the full range rather than stopping at the first 404). The dynamic step is appended only when it differs from the last fixed step.
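The "scan the full range" point for the MCR channel probe can be illustrated with a small sketch. It is not the PR's fetchLatestMCRChannel: latestChannel, the minor-version range, and the injected exists callback (standing in for the HTTP probe) are all assumptions made for illustration.

```go
package main

import "fmt"

// latestChannel (hypothetical) returns the highest "stable-<major>.<minor>"
// channel for which exists reports true, scanning every minor in
// [minMinor, maxMinor]. The second result is false when no channel exists.
func latestChannel(major, minMinor, maxMinor int, exists func(channel string) bool) (string, bool) {
	latest, found := "", false
	for minor := minMinor; minor <= maxMinor; minor++ {
		channel := fmt.Sprintf("stable-%d.%d", major, minor)
		if exists(channel) {
			latest, found = channel, true
		}
		// Deliberately no break on a miss: channels are non-sequential,
		// so a gap says nothing about whether later channels exist.
	}
	return latest, found
}

func main() {
	// Illustrative published set with gaps; in the real test the probe
	// would be an HTTP request against the package repository.
	published := map[string]bool{"stable-29.2": true, "stable-29.4": true}
	probe := func(channel string) bool { return published[channel] }

	latest, ok := latestChannel(29, 0, 9, probe)
	fmt.Println(latest, ok) // prints "stable-29.4 true"
}
```

A loop that stopped at the first miss would find nothing here, because stable-29.0 is absent from the sample set.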
Supporting changes
------------------
- examples/terraform/aws-simple: add msr_port variable (default 443) so the NLB can expose DTR on a non-standard port; all other smoke tests are unaffected. --dtr-external-url appends :PORT only when port != 443.
- test/platforms.go: fix RHEL8 MCR install by disabling the container-tools module stream before installing MCR, which prevents the system runc from conflicting with Mirantis's containerd.io-runc.
- Makefile: add smoke-airgapped-multi-hop target (-timeout 200m).
- .github/workflows/smoke-tests.yaml: add smoke-airgapped-multi-hop CI job triggered by the smoke-test or smoke-airgapped-multi-hop PR labels.

Tested: two passing runs (2537s with 3 fixed steps; 3329s with 4 steps including the dynamic latest-version step MKE 3.9.3 / MCR stable-29.4).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
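The port rule for --dtr-external-url lives in Terraform in this PR; the Go sketch below only restates the same rule for readers who prefer code. externalHost and the hostname are hypothetical, and the URL scheme is left out because the commit message does not specify it.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// externalHost (hypothetical) appends ":PORT" only for non-default ports,
// so configurations that keep the default 443 produce the same value as before.
func externalHost(host string, port int) string {
	if port == 443 {
		return host
	}
	return net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	fmt.Println(externalHost("msr.example.com", 443))  // prints "msr.example.com"
	fmt.Println(externalHost("msr.example.com", 4443)) // prints "msr.example.com:4443"
}
```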
- Remove smoke-test generic label from airgapped-multi-hop CI trigger (james-nesbitt): job should only run when explicitly labeled, not with the catch-all smoke-test label that triggers all other jobs
- Apply container-tools module fix to Rocky8 platform as well as RHEL8 (james-nesbitt): Rocky8 has the same AppStream runc conflict with containerd.io-runc; the fix belongs in the shared platform defaults
- Fix inaccurate header comment: test does not push/pull against DTR; it tags images locally and Launchpad resolves imageRepo to local tags (Copilot)
- Add HTTP timeouts (30s/15s) and status-code check to fetchLatestMKEVersion and fetchLatestMCRChannel (Copilot)
- Update lastProduct before checking Apply() error so Reset() uses the config from the attempted step, not the last successful step (Copilot)
- Rename smokeConfig.Name to airgapped-multi-hop to align AWS resource tags with the Makefile target and CI label (Copilot)
- Add smoke-airgapped-multi-hop to AGENTS.md, workflow.md, and smoke-tests.md; update timeout guidance for multi-hop scenarios (Copilot)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
smokeConfig.Name drives the Terraform stack name as
"smoke-{Name}-{5-char-random}"; Terraform then appends
suffixes like "-mke-kube" for target groups, capping the
total at 32 chars. "airgapped-multi-hop" (19 chars) overflows
that limit, so revert to "airgap-mhop" (11 chars, same length as
the original "airgappedup"). Add a comment explaining why the
short name is necessary despite the longer CI label.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
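The length budget in this commit message can be checked with a few lines of arithmetic. The sketch assumes "-mke-kube" is the longest suffix Terraform appends (it is the only one named above) and that the pattern is exactly "smoke-{Name}-{5-char-random}"; resourceNameLen is a hypothetical helper.

```go
package main

import "fmt"

const (
	maxNameLen    = 32          // cap stated in the commit message
	randomLen     = 5           // length of the random stack-name component
	longestSuffix = "-mke-kube" // assumed to be the longest appended suffix
)

// resourceNameLen (hypothetical) computes the length of
// "smoke-" + name + "-" + random + suffix for a given smokeConfig.Name.
func resourceNameLen(name string) int {
	return len("smoke-") + len(name) + len("-") + randomLen + len(longestSuffix)
}

func main() {
	for _, name := range []string{"airgapped-multi-hop", "airgap-mhop", "airgappedup"} {
		n := resourceNameLen(name)
		fmt.Printf("%s: %d chars, fits=%t\n", name, n, n <= maxNameLen)
	}
}
```

Under these assumptions "airgapped-multi-hop" yields 40 characters, while both 11-character names land exactly on the 32-character cap.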
- Remove push trigger from smoke-airgapped-multi-hop CI job; it is now label-gated only (prevents the 200m job from running on every main push)
- Fix inaccurate file-level comment: images are pre-tagged locally on each node, not pushed into DTR
- Gate dynamic 'upgrade to latest' step behind SMOKE_UPGRADE_TO_LATEST env var so PR smoke runs are reproducible; set the var in scheduled jobs
- Clean up redundant Terraform provider constraint (>= 6.28.0, >= 6.33.0) to the effective lower bound (>= 6.33.0)

Signed-off-by: James Nesbitt <jnesbitt@mirantis.com>
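One way the SMOKE_UPGRADE_TO_LATEST gate could look is sketched below. The step type and withLatestStep are hypothetical, and treating any non-empty value as "enabled" is an assumption; the PR may parse the variable differently. The lookup function is injected so the behaviour can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
)

// step is a hypothetical pairing of an MKE version with an MCR channel.
type step struct {
	mke string
	mcr string
}

// withLatestStep (hypothetical) appends latest to the fixed steps only when
// SMOKE_UPGRADE_TO_LATEST is set to a non-empty value (an assumption) and
// latest differs from the last fixed step. The input slice is never modified.
func withLatestStep(fixed []step, latest step, lookupEnv func(string) (string, bool)) []step {
	value, ok := lookupEnv("SMOKE_UPGRADE_TO_LATEST")
	if !ok || value == "" {
		return fixed
	}
	if n := len(fixed); n > 0 && fixed[n-1] == latest {
		return fixed
	}
	out := append([]step(nil), fixed...)
	return append(out, latest)
}

func main() {
	fixed := []step{
		{mke: "3.8.12", mcr: "stable-29.2"},
		{mke: "3.9.2", mcr: "stable-29.2"},
	}
	latest := step{mke: "3.9.3", mcr: "stable-29.4"}
	steps := withLatestStep(fixed, latest, os.LookupEnv)
	fmt.Println(len(steps), steps)
}
```

Keeping the dynamic step opt-in means an unlabeled PR run always exercises the same fixed chain, while a scheduled job that sets the variable also covers whatever version is newest that day.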
c32367b to 42ae520
Addressed the open review feedback in commit 42ae520:
Branch has been rebased onto current main. Written by AI: claude-sonnet-4-5
Jira: https://mirantis.jira.com/browse/PRODENG-3446
Summary
Extends the smoke test suite to cover additional customer-specific deployment scenarios. Two engineers are collaborating on this branch; each engineer adds tests in a dedicated file under test/smoke/ to avoid merge conflicts.

What is here so far
- test/smoke/upgrade_test.go: TestUpgradeLegacyToModern installs MCR stable-25.0 / MKE 3.8.8 on RHEL8/Rocky8/Ubuntu22, then upgrades in place to MCR stable-29.2 / MKE 3.9.2. CI verified passing (run 25721416884, 1320s).
- make smoke-upgrade Makefile target (90m timeout).
- smoke-upgrade CI job in .github/workflows/smoke-tests.yaml.
- AGENTS.md: consolidated from CLAUDE.md + AI_AGENTS.md into the cross-agent open standard; updated with multi-engineer workflow guidance.
- docs/development/smoke-tests.md: new authoring guide covering the full framework, platform registry, helper usage, CI wiring, and a pre-submission checklist.
- docs/development/workflow.md, docs/specifications/architecture.md: corrected stale references and added current schema/phase sequence detail.

Adding tests to this PR
Read docs/development/smoke-tests.md for the complete guide. In short:
- … test/smoke/<scenario>_test.go in package smoke_test.
- … runSmokeTest(t, smokeConfig{...}) or runUpgradeTest(t, upgradeConfig{...}).
- … Makefile target and a .github/workflows/smoke-tests.yaml job.
- … gh label create smoke-<scenario>).

Constraints
- … stable-29.2), not explicit version.
- … SSHKeyAlgorithm: "rsa" and a Linux manager.
- … smoke_test.go or upgrade_test.go.