
PRODENG-3446: add airgapped multi-hop upgrade smoke test (customer scenario) #630

Open
james-nesbitt wants to merge 5 commits into main from PRODENG-3446-smoke-test-vocalink

Conversation

@james-nesbitt
Collaborator

Jira: https://mirantis.jira.com/browse/PRODENG-3446

Summary

Extends the smoke test suite to cover additional customer-specific deployment scenarios. Two engineers are collaborating on this branch; each engineer adds tests in a dedicated file under test/smoke/ to avoid merge conflicts.

What is here so far

  • test/smoke/upgrade_test.go: TestUpgradeLegacyToModern installs MCR stable-25.0 / MKE 3.8.8 on RHEL8/Rocky8/Ubuntu22, then upgrades in place to MCR stable-29.2 / MKE 3.9.2. CI verified passing (run 25721416884, 1320s).
  • make smoke-upgrade Makefile target (90m timeout).
  • smoke-upgrade CI job in .github/workflows/smoke-tests.yaml.
  • AGENTS.md — consolidated from CLAUDE.md + AI_AGENTS.md into the cross-agent open standard; updated with multi-engineer workflow guidance.
  • docs/development/smoke-tests.md — new authoring guide covering the full framework, platform registry, helper usage, CI wiring, and a pre-submission checklist.
  • docs/development/workflow.md, docs/specifications/architecture.md — corrected stale references and added current schema/phase sequence detail.

Adding tests to this PR

Read docs/development/smoke-tests.md for the complete guide. In short:

  1. Create test/smoke/<scenario>_test.go in package smoke_test.
  2. Call runSmokeTest(t, smokeConfig{...}) or runUpgradeTest(t, upgradeConfig{...}).
  3. Add a Makefile target and a .github/workflows/smoke-tests.yaml job.
  4. Create a PR label (gh label create smoke-<scenario>).
  5. Push and add your label to trigger only your test.

Constraints

  • No customer names in any code, comment, commit message, or resource tag.
  • MCR must be specified by channel (stable-29.2), not explicit version.
  • Windows clusters require SSHKeyAlgorithm: "rsa" and a Linux manager.
  • One file per engineer / scenario — do not modify smoke_test.go or upgrade_test.go.

@james-nesbitt
Collaborator Author

This is waiting for input from @trifo13

@trifo13 force-pushed the PRODENG-3446-smoke-test-vocalink branch from 4df8031 to 03683d2 on May 27, 2026 10:59
@trifo13 changed the title from "PRODENG-3446: extend smoke test suite with additional customer scenario tests" to "PRODENG-3446: add airgapped multi-hop upgrade smoke test (customer scenario)" on May 27, 2026
@trifo13 force-pushed the PRODENG-3446-smoke-test-vocalink branch from 03683d2 to 6740adc on May 27, 2026 11:09
Comment thread .github/workflows/smoke-tests.yaml Outdated
Comment thread test/platforms.go
// Disable the container-tools module stream before MCR install. RHEL8
// AppStream pulls in system runc as a container-selinux dependency; that
// package conflicts with Mirantis's containerd.io-runc at install time.
UserData: "sudo dnf module disable container-tools -y; sudo firewall-cmd --permanent --add-port=2377/tcp --add-port=7946/tcp --add-port=7946/udp --add-port=4789/udp --add-port=10250/tcp; sudo firewall-cmd --reload",
Collaborator Author


There should be a way to pass custom userdata through the nodegroup variables instead of changing the default. Does it make sense to always use this override for all tests? If so, should it be applied to the other RHEL versions as well, not just RHEL 8?


Copilot AI left a comment


Pull request overview

Adds a new smoke-test scenario to exercise a customer-style “airgapped” multi-hop upgrade path, plus supporting Terraform/CI/doc updates so the scenario can be provisioned and run in automation.

Changes:

  • Added TestAirgappedMultiHopUpgrade smoke test that provisions MSR behind a non-standard external port and performs multi-step MKE/MCR upgrades using imageRepo overrides.
  • Extended the shared AWS Terraform smoke module with an msr_port input and updated MSR ingress/--dtr-external-url generation.
  • Updated CI workflow and developer/agent documentation for smoke-test authoring and execution.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 12 comments.

File | Description
--- | ---
test/smoke/airgapped_multi_hop_upgrade_test.go | New multi-hop “airgapped” upgrade smoke test (custom registry port + sequential upgrade chain).
test/platforms.go | Adjusts RHEL8 userdata to avoid MCR install conflicts by disabling container-tools module stream.
Makefile | Adds smoke-airgapped-multi-hop make target with extended timeout.
examples/terraform/aws-simple/variables.tf | Introduces msr_port variable (default 443).
examples/terraform/aws-simple/launchpad.tf | Wires msr_port into MSR NLB ingress and --dtr-external-url.
examples/terraform/aws-simple/.terraform.lock.hcl | Updates provider constraint/hashes after Terraform init/upgrade.
docs/specifications/architecture.md | Refreshes architecture/spec details (schema, phases, flows).
docs/development/workflow.md | Expands testing guidance and smoke-test workflow documentation.
docs/development/smoke-tests.md | New smoke-test authoring guide and CI wiring checklist.
CLAUDE.md | Removed (superseded by AGENTS.md standard).
AI_AGENTS.md | Removed (superseded by AGENTS.md standard).
AGENTS.md | New consolidated agent instructions (AGENTS.md open standard).
.github/workflows/smoke-tests.yaml | Adds smoke-airgapped-multi-hop CI job with long timeout and label gate.
Files not reviewed (1)
  • examples/terraform/aws-simple/.terraform.lock.hcl: Language not supported


Comment thread test/smoke/airgapped_multi_hop_upgrade_test.go Outdated
Comment thread test/smoke/airgapped_multi_hop_upgrade_test.go Outdated
Comment thread test/smoke/airgapped_multi_hop_upgrade_test.go Outdated
Comment thread test/smoke/airgapped_multi_hop_upgrade_test.go
Comment thread test/smoke/airgapped_multi_hop_upgrade_test.go
Comment thread examples/terraform/aws-simple/.terraform.lock.hcl
Comment thread docs/development/workflow.md Outdated
Comment thread AGENTS.md Outdated
Comment thread docs/development/smoke-tests.md
Comment thread test/smoke/airgapped_multi_hop_upgrade_test.go
james-nesbitt and others added 5 commits June 2, 2026 16:43
## Smoke test

Added TestUpgradeLegacyToModern (test/smoke/upgrade_test.go):
- Provisions RHEL8/Rocky8/Ubuntu22, installs MCR stable-25.0 / MKE 3.8.8,
  then upgrades in place to MCR stable-29.2 / MKE 3.9.2 via a second Apply().
- runUpgradeTest() helper mirrors runSmokeTest() structure (defer destroy,
  resource tagging, temp SSH dir).
- bumpVersions() unmarshals Terraform-generated launchpad_yaml, updates
  spec.mcr.channel and spec.mke.version, re-marshals — preserving host
  addresses, SANs, LB names, and install flags verbatim.
- make smoke-upgrade target (90m timeout).
- smoke-upgrade CI job in .github/workflows/smoke-tests.yaml, gated by
  smoke-upgrade or smoke-test PR label.

CI result: PASS (run 25721416884, 1320s).
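
A minimal sketch of the unmarshal / update / re-marshal idea behind bumpVersions(), not the helper's actual code. Assumptions: JSON stands in for YAML so the example needs only the standard library, and bumpVersionsSketch, section, the host address, and the install flag are hypothetical names and values chosen for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// section returns parent[key] as a nested map, or an error if it is absent
// or not an object.
func section(parent map[string]any, key string) (map[string]any, error) {
	child, ok := parent[key].(map[string]any)
	if !ok {
		return nil, fmt.Errorf("missing or malformed %q section", key)
	}
	return child, nil
}

// bumpVersionsSketch decodes the document into a generic map, overwrites
// spec.mcr.channel and spec.mke.version, and re-encodes it. Every other key
// (hosts, flags, and so on) is carried through with its value unchanged.
// ASSUMPTION: JSON is used here in place of the real YAML document.
func bumpVersionsSketch(doc []byte, mcrChannel, mkeVersion string) ([]byte, error) {
	var cfg map[string]any
	if err := json.Unmarshal(doc, &cfg); err != nil {
		return nil, fmt.Errorf("unmarshal: %w", err)
	}
	spec, err := section(cfg, "spec")
	if err != nil {
		return nil, err
	}
	mcr, err := section(spec, "mcr")
	if err != nil {
		return nil, err
	}
	mke, err := section(spec, "mke")
	if err != nil {
		return nil, err
	}
	mcr["channel"] = mcrChannel
	mke["version"] = mkeVersion
	return json.Marshal(cfg)
}

func main() {
	// Illustrative input only; the address and flag are made up.
	in := []byte(`{"spec":{"hosts":[{"address":"10.0.0.1"}],` +
		`"mcr":{"channel":"stable-25.0"},` +
		`"mke":{"version":"3.8.8","installFlags":["--san=example"]}}}`)
	out, err := bumpVersionsSketch(in, "stable-29.2", "3.9.2")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Editing a generic map rather than a typed struct is what lets fields the test does not know about survive the round trip.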

## Documentation

AGENTS.md (replaces CLAUDE.md + AI_AGENTS.md):
- Consolidated into the AGENTS.md open standard (Linux Foundation, supported
  by Claude Code, Cursor, Windsurf, Codex, Gemini CLI, Aider, and others).
- Covers: project overview, non-negotiable rules, build/test commands,
  phase manager architecture, config schema (v1.6 with example), smoke test
  reference table, contributing guidelines, multi-engineer workflow guidance,
  and documentation index.

docs/development/smoke-tests.md (new):
- Complete authoring guide for new smoke tests: framework mechanics,
  all 14 available platforms, runSmokeTest / runUpgradeTest / bumpVersions
  usage with annotated examples, Windows-specific requirements, CI wiring
  (Makefile + workflow + PR label), timeout guidance, Reset best-effort
  rationale, and a pre-submission checklist.

docs/development/workflow.md:
- Replace stale smoke-small/smoke-full with actual four targets.
- Add --tags testing note for unit tests.
- Remove non-existent make build-release / make sign-release.
- Add current apiVersion to schema safety guideline.

docs/specifications/architecture.md:
- Fix package path: pkg/product/mke/api -> pkg/product/mke/config.
- Add current apiVersion and full spec structure.
- Add abridged Apply and Reset phase sequences.
- Document UninstallMKE swarm dissolution fallback (PRODENG-3442).

Signed-off-by: James Nesbitt <jnesbitt@mirantis.com>
…enario)

Adds TestAirgappedMultiHopUpgrade, a smoke test that exercises the full
upgrade chain, similar to what CSO EMEA observed in a specific customer
scenario: install with MCR 25.0 / MKE 3.8.8 / MSR 2.9.27, then upgrade
through 3.8.11 → 3.8.12 (MCR 29.2) → 3.9.2 → latest MKE 3.x / MCR 29.x.
All post-install upgrade steps pull images from an internal DTR exposed on
a non-standard port (4443) via an NLB, simulating an airgapped registry
configuration.

Key design decisions
--------------------
- Image preload strategy: rather than pushing images to DTR (which requires
  namespace provisioning and hits DTR auth edge cases), all upgrade images
  are pulled from docker.io/mirantis on every node and tagged with the DTR
  registry address. Launchpad's "Pull MKE images" phase runs docker image
  inspect before docker pull; when it finds the image locally, it skips the
  pull entirely. This exercises Launchpad's imageRepo feature without requiring
  actual DTR push/pull.

- SSH key compatibility: Terraform's tls_private_key emits OpenSSH-format
  ed25519 keys that golang.org/x/crypto/ssh.ParsePrivateKey rejects. All
  remote commands use the system ssh binary (sshRun/sshRunScript) to avoid
  Go-side key parsing.

- DTR image listing: UCP bootstrapper uses "images --list"; DTR 2.x uses
  "images" (the --list flag is unrecognised and causes help text on stdout
  with exit 0). The preload script filters output for valid image-reference
  patterns and falls back to the plain "images" subcommand automatically.

- Dynamic latest-version step: fetchLatestMKEVersion queries Docker Hub
  tags for mirantis/ucp; fetchLatestMCRChannel probes the Mirantis apt
  repository (channels are non-sequential so the probe scans the full
  range rather than stopping at the first 404). The dynamic step is
  appended only when it differs from the last fixed step.
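
A sketch of the full-range channel scan described for fetchLatestMCRChannel, under stated assumptions: latestChannel and its exists callback are hypothetical, the callback stands in for the real repository probe, and the published set below (with 29.3 absent) is illustrative only.

```go
package main

import "fmt"

// latestChannel checks every minor in [minMinor, maxMinor] for the given
// major and returns the highest channel for which exists reports true.
// It deliberately keeps scanning after a miss, because published channel
// numbers may have gaps.
func latestChannel(major, minMinor, maxMinor int, exists func(channel string) bool) (string, bool) {
	latest, found := "", false
	for minor := minMinor; minor <= maxMinor; minor++ {
		ch := fmt.Sprintf("stable-%d.%d", major, minor)
		if exists(ch) {
			latest, found = ch, true
		}
	}
	return latest, found
}

func main() {
	// ASSUMPTION: a made-up set of published channels with a gap at 29.3.
	published := map[string]bool{"stable-29.2": true, "stable-29.4": true}
	ch, ok := latestChannel(29, 0, 9, func(c string) bool { return published[c] })
	fmt.Println(ch, ok) // prints "stable-29.4 true"
}
```

A loop that stopped at the first miss would return nothing here, since stable-29.0 is not in the set.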

Supporting changes
------------------
- examples/terraform/aws-simple: add msr_port variable (default 443) so
  the NLB can expose DTR on a non-standard port; all other smoke tests are
  unaffected. --dtr-external-url appends :PORT only when port != 443.
- test/platforms.go: fix RHEL8 MCR install — disable the container-tools
  module stream before installing MCR to prevent the system runc from
  conflicting with Mirantis's containerd.io-runc.
- Makefile: add smoke-airgapped-multi-hop target (-timeout 200m).
- .github/workflows/smoke-tests.yaml: add smoke-airgapped-multi-hop CI job
  triggered by the smoke-test or smoke-airgapped-multi-hop PR labels.
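
The port rule for --dtr-external-url can be sketched as follows. This is a Go illustration of the logic only, not the Terraform expression in launchpad.tf; externalHost and the hostname msr.example.com are hypothetical, and the URL scheme is left out because the source does not show it.

```go
package main

import "fmt"

// externalHost returns host unchanged when port is the default 443 and
// host:port otherwise, mirroring the "append :PORT only when port != 443"
// rule described above.
func externalHost(host string, port int) string {
	if port == 443 {
		return host
	}
	return fmt.Sprintf("%s:%d", host, port)
}

func main() {
	fmt.Println(externalHost("msr.example.com", 443))  // msr.example.com
	fmt.Println(externalHost("msr.example.com", 4443)) // msr.example.com:4443
}
```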

Tested: two passing runs (2537s with 3 fixed steps; 3329s with 4 steps
including the dynamic latest-version step MKE 3.9.3 / MCR stable-29.4).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove smoke-test generic label from airgapped-multi-hop CI trigger
  (james-nesbitt): job should only run when explicitly labeled, not
  with the catch-all smoke-test label that triggers all other jobs
- Apply container-tools module fix to Rocky8 platform as well as RHEL8
  (james-nesbitt): Rocky8 has the same AppStream runc conflict with
  containerd.io-runc; the fix belongs in the shared platform defaults
- Fix inaccurate header comment: test does not push/pull against DTR;
  it tags images locally and Launchpad resolves imageRepo to local tags
  (Copilot)
- Add HTTP timeouts (30s/15s) and status-code check to
  fetchLatestMKEVersion and fetchLatestMCRChannel (Copilot)
- Update lastProduct before checking Apply() error so Reset() uses the
  config from the attempted step, not the last successful step (Copilot)
- Rename smokeConfig.Name to airgapped-multi-hop to align AWS resource
  tags with the Makefile target and CI label (Copilot)
- Add smoke-airgapped-multi-hop to AGENTS.md, workflow.md, and
  smoke-tests.md; update timeout guidance for multi-hop scenarios
  (Copilot)
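
The lastProduct ordering fix can be illustrated with a small sketch. The names step, runChain, and errSimulated are hypothetical; the real test tracks a Launchpad product value rather than a version string.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a stand-in for the per-hop configuration.
type step struct{ MKEVersion string }

// errSimulated is used below to fake a failing Apply().
var errSimulated = errors.New("simulated failure")

// runChain applies steps in order and returns the last step attempted,
// whether or not it succeeded, so that cleanup can use the config that
// matches the cluster's most recent attempted state.
func runChain(steps []step, apply func(step) error) (step, error) {
	var last step
	for _, s := range steps {
		err := apply(s)
		last = s // recorded before the error check
		if err != nil {
			return last, fmt.Errorf("apply MKE %s: %w", s.MKEVersion, err)
		}
	}
	return last, nil
}

func main() {
	steps := []step{{MKEVersion: "3.8.11"}, {MKEVersion: "3.8.12"}, {MKEVersion: "3.9.2"}}
	last, err := runChain(steps, func(s step) error {
		if s.MKEVersion == "3.8.12" {
			return errSimulated
		}
		return nil
	})
	fmt.Println(last.MKEVersion, err)
}
```

If the assignment came after the error check, a failure on the second hop would leave last pointing at 3.8.11, which is the situation the fix avoids.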

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
smokeConfig.Name drives the Terraform stack name as
"smoke-{Name}-{5-char-random}"; Terraform then appends
suffixes like "-mke-kube" for target groups, capping the
total at 32 chars. "airgapped-multi-hop" (19 chars) overflows
that limit, so revert to "airgap-mhop" (11 chars, the same length as
the original "airgappedup"). Add a comment explaining why the
short name is necessary despite the longer CI label.
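
The length arithmetic can be checked with a short sketch. Assumptions: "-mke-kube" is taken as the longest appended suffix (the message only says "suffixes like"), and resourceNameLen and fits are hypothetical helpers.

```go
package main

import "fmt"

const (
	maxNameLen     = 32          // the 32-character cap cited above
	randomLen      = 5           // length of the random stack suffix
	resourceSuffix = "-mke-kube" // ASSUMPTION: longest suffix Terraform appends
)

// resourceNameLen returns the length of
// "smoke-{name}-{5-char-random}" + resourceSuffix.
func resourceNameLen(name string) int {
	return len("smoke-") + len(name) + len("-") + randomLen + len(resourceSuffix)
}

// fits reports whether the derived resource name stays within the cap.
func fits(name string) bool {
	return resourceNameLen(name) <= maxNameLen
}

func main() {
	for _, n := range []string{"airgapped-multi-hop", "airgap-mhop"} {
		fmt.Printf("%-20s len=%d fits=%v\n", n, resourceNameLen(n), fits(n))
	}
}
```

Under these assumptions the long name yields 40 characters and the short one exactly 32, so 11 characters is the longest Name that still fits.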

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove push trigger from smoke-airgapped-multi-hop CI job; it is now
  label-gated only (prevents the 200m job from running on every main push)
- Fix inaccurate file-level comment: images are pre-tagged locally on each
  node, not pushed into DTR
- Gate dynamic 'upgrade to latest' step behind SMOKE_UPGRADE_TO_LATEST env
  var so PR smoke runs are reproducible; set the var in scheduled jobs
- Clean up redundant Terraform provider constraint (>= 6.28.0, >= 6.33.0)
  to the effective lower bound (>= 6.33.0)

Signed-off-by: James Nesbitt <jnesbitt@mirantis.com>
@james-nesbitt force-pushed the PRODENG-3446-smoke-test-vocalink branch from c32367b to 42ae520 on June 2, 2026 13:44
@james-nesbitt
Collaborator Author

Addressed the open review feedback in commit 42ae520:

  • Removed github.event_name == 'push' from the smoke-airgapped-multi-hop CI job — it is now label-gated only, so the 200m job no longer fires on every merge to main.
  • Fixed inaccurate file-level comment: images are pre-tagged locally on each node with the DTR registry prefix, not pushed into DTR.
  • Gated the dynamic "upgrade to latest" step (step 4) behind a SMOKE_UPGRADE_TO_LATEST env var. PR smoke runs now execute the 3 fixed steps deterministically; set the var in scheduled/nightly jobs to opt in to version discovery.
  • Cleaned up redundant Terraform provider constraint (>= 6.28.0, >= 6.33.0 → >= 6.33.0).

Branch has been rebased onto current main.
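
A rough sketch of how the SMOKE_UPGRADE_TO_LATEST gate could combine with the "append only when it differs" rule. Assumptions: any non-empty value enables the step (the real test may parse the variable differently), and planSteps with its getenv parameter is a hypothetical helper, injected so the logic can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
)

// planSteps returns a copy of the fixed steps, plus latest when the opt-in
// variable is set and latest differs from the last fixed step.
func planSteps(fixed []string, latest string, getenv func(string) string) []string {
	steps := append([]string(nil), fixed...)
	// ASSUMPTION: any non-empty value opts in.
	if getenv("SMOKE_UPGRADE_TO_LATEST") == "" {
		return steps
	}
	if latest == "" || (len(steps) > 0 && steps[len(steps)-1] == latest) {
		return steps
	}
	return append(steps, latest)
}

func main() {
	fixed := []string{"3.8.11", "3.8.12", "3.9.2"}
	// The result depends on whether the variable is set in this environment.
	fmt.Println(planSteps(fixed, "3.9.3", os.Getenv))
}
```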

Written by AI: claude-sonnet-4-5

3 participants