
Add Jenkins pipeline for airgap infrastructure deployment and testing #498

Open
floatingman wants to merge 42 commits into rancher:main from floatingman:feature/convert-airgap-tfp-tests

Conversation

@floatingman (Contributor) commented Feb 4, 2026

Implement a Jenkins pipeline to facilitate airgap infrastructure deployment and testing. This includes enhancements to the Dockerfile for Go testing, improvements in the Jenkinsfile for better logging and test handling, and the addition of stages for admin token injection and verification. The pipeline also integrates a retry mechanism for Ansible playbook executions and incorporates Qase reporting for test results.

rancher/qa-tasks#2125
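For reference, a minimal sketch of the kind of gotestsum invocation such a test stage can use to emit both JUnit XML for Jenkins and JSON for Qase reporting; the package path, -run filter, and timeout below are illustrative placeholders, not the exact values used by this pipeline:

stage('Run Go Tests') {
    // Hypothetical example; the actual package path and -run filter differ per job
    sh """
        gotestsum --format standard-verbose \
            --junitfile results.xml \
            --jsonfile gotestsum.json \
            -- -timeout 60m -run TestAirgap ./validation/...
    """
}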

Copilot AI left a comment

Pull request overview

This PR adds a new Jenkins pipeline to provision airgapped RKE2/Rancher infrastructure, inject an admin token, run Go-based validation tests, and publish results (including to Qase), plus a dedicated Docker image for these Go test runs.

Changes:

  • Introduces Jenkinsfile.airgap.go-tests to clone the tests and qa-infra repos, build an infra tools image with Go, deploy airgapped infra via Tofu/Ansible, inject an admin token, run Go tests with gotestsum, and optionally report to Qase.
  • Adds Dockerfile.airgap-go-tests to build an Alpine-based infra tools image that includes OpenTofu, Ansible, AWS CLI, Go, and gotestsum for use in the pipeline.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

  • validation/pipeline/Jenkinsfile.airgap.go-tests — Defines the end-to-end Jenkins pipeline for airgapped infra setup, admin token generation, Go test execution, artifact publishing, and optional Qase reporting.
  • validation/pipeline/Dockerfile.airgap-go-tests — Builds the infra tools Docker image with the Go toolchain and gotestsum used by the new pipeline stages.

Comment on lines 17 to 20
def testsBranch = env.GO_REPO_BRANCH ?: 'main'
def testsRepo = env.GO_REPO_URL ?: 'https://github.com/rancher/tests'
def qaInfraBranch = env.QA_INFRA_REPO_BRANCH ?: 'main'
def qaInfraRepo = env.QA_INFRA_REPO_URL ?: 'https://github.com/rancher/qa-infra-automation'
Copilot AI Feb 4, 2026

This pipeline introduces GO_REPO_BRANCH/GO_REPO_URL for the tests repo, which diverges from the established RANCHER_TEST_REPO_BRANCH/RANCHER_TEST_REPO_URL naming used by other airgap pipelines (for example validation/pipeline/Jenkinsfile.setup.airgap.rke2:13-14 and validation/pipeline/Jenkinsfile.destroy.airgap.rke2:11-12). Reusing the existing env var names (or at least falling back to them) would keep job configuration consistent and avoid confusion when wiring Jenkins jobs to different airgap pipelines.
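A fallback chain that accepts both the established names and the new ones would keep older job configs working; roughly (sketch):

// Prefer the established names, fall back to the new ones, then to defaults
def testsBranch = env.RANCHER_TEST_REPO_BRANCH ?: env.GO_REPO_BRANCH ?: 'main'
def testsRepo = env.RANCHER_TEST_REPO_URL ?: env.GO_REPO_URL ?: 'https://github.com/rancher/tests'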

Contributor Author

This is a good point.

Comment on lines 428 to 448
if ((env.DESTROY_ON_FAILURE ?: 'true').toBoolean() && workspaceName) {
echo 'DESTROY_ON_FAILURE is enabled. Cleaning up infrastructure...'
try {
stage('Cleanup on Failure') {
tofu.selectWorkspace(dir: tofuModulePath, name: workspaceName)
tofu.destroy(dir: tofuModulePath, varFile: 'terraform.tfvars', autoApprove: true)
tofu.deleteWorkspace(dir: tofuModulePath, name: workspaceName)
}
} catch (cleanupErr) {
echo "Cleanup failed: ${cleanupErr.message}"
}
}
throw err
} finally {
if (destroyAfterTests && workspaceName) {
echo 'Destroying infrastructure after tests (configured)'
try {
stage('Destroy After Tests') {
tofu.selectWorkspace(dir: tofuModulePath, name: workspaceName)
tofu.destroy(dir: tofuModulePath, varFile: 'terraform.tfvars', autoApprove: true)
tofu.deleteWorkspace(dir: tofuModulePath, name: workspaceName)
Copilot AI Feb 4, 2026

On failure the catch block already performs a best‑effort destroy/deleteWorkspace when DESTROY_ON_FAILURE is enabled, and then the finally block can run the same destroy logic again if DESTROY_AFTER_TESTS is also true. This double attempt to clean up the same workspace can create noisy errors in logs and makes it harder to reason about which flag controls teardown; consider centralizing the destroy logic in one place (or tracking whether cleanup has already succeeded) to keep the control flow clearer.
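One possible shape for centralizing this, reusing the tofu helper calls quoted above plus a flag that records that teardown already happened (sketch only, not the pipeline's actual code):

def infrastructureCleaned = false

// Single place that knows how to tear the workspace down
def destroyInfra = { String reason ->
    echo "Destroying infrastructure (${reason})"
    tofu.selectWorkspace(dir: tofuModulePath, name: workspaceName)
    tofu.destroy(dir: tofuModulePath, varFile: 'terraform.tfvars', autoApprove: true)
    tofu.deleteWorkspace(dir: tofuModulePath, name: workspaceName)
    infrastructureCleaned = true
}

// catch block: call destroyInfra('failure') when DESTROY_ON_FAILURE is enabled
// finally block: only run again if nothing has cleaned up yet
if (destroyAfterTests && workspaceName && !infrastructureCleaned) {
    destroyInfra('configured after tests')
}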

Contributor Author

This is a good point.


# Install gotestsum for JUnit reporting
ENV GOBIN=/usr/local/bin
RUN go install gotest.tools/gotestsum@latest
Copilot AI Feb 4, 2026

Installing gotestsum with go install gotest.tools/gotestsum@latest pulls executable code from a mutable, third-party module reference at build time, which introduces a supply-chain risk. If the gotest.tools/gotestsum module or its distribution channel is ever compromised or a malicious version is published, the resulting binary will run inside this image (and thus in Jenkins jobs) with access to AWS credentials and other secrets used by the pipeline. Pin this dependency to a specific, vetted version (for example, a fixed tag or commit) and, if possible, enforce integrity verification via checksums or vendored binaries to prevent untrusted code from being pulled implicitly.
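A pinned install, for example driven by the GOTESTUM_VERSION build arg the Dockerfile already declares, might look like this (Dockerfile sketch; the exact tag should be whatever vetted release the team settles on):

# Pin gotestsum instead of pulling @latest at build time
# (the ARG must be declared after FROM to be in scope for this RUN)
ARG GOTESTUM_VERSION=1.13.0
ENV GOBIN=/usr/local/bin
RUN go install gotest.tools/gotestsum@v${GOTESTUM_VERSION}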

Contributor Author

I'll look into this.

@floatingman self-assigned this Feb 12, 2026
@floatingman added the team/pit-crew (slack notifier for pit crew) label Feb 12, 2026
… test command and improve test result handling
…ization with qaConfig and dockerPlatform, and update test command execution to use dynamic container name and infraToolsImage
…ialization and directly use config for dockerPlatform and infraToolsImage
…mage before building to ensure a clean build
@floatingman force-pushed the feature/convert-airgap-tfp-tests branch from 294090e to 0457967 on February 13, 2026 16:37
ARG GO_VERSION=1.25.5
ARG GOTESTUM_VERSION=1.13.0

FROM --platform=linux/amd64 alpine:3.22
Collaborator

We typically try to use SUSE-based images.

Contributor

I think this still needs to be addressed

Contributor

In fact, I think instead of using this file we probably either:
1- Use Dockerfile.infra and make it FROM registry.suse.com/bci/golang:1.25
2- Use Dockerfile.e2e
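For option 1, the change could be as small as swapping the base image; a rough sketch, assuming the BCI Go image provides the toolchain and the remaining tooling is layered on top as today:

# Option 1: base the infra tools image on the SUSE BCI Go image
FROM registry.suse.com/bci/golang:1.25

# gotestsum (and the other infra tooling) would still be installed on top
ENV GOBIN=/usr/local/bin
RUN go install gotest.tools/gotestsum@v1.13.0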

Comment on lines 77 to 84
stage('Verify Infra Tools Tooling') {
echo 'Verifying gotestsum availability inside infra tools image'
sh """
docker run --rm --platform ${dockerPlatform} \
${infraToolsImage} \
sh -c 'set -e; echo \"PATH=$PATH\"; which gotestsum || true; ls -al /root/go/bin || true; ls -al /usr/local/bin/gotestsum || true'
"""
}
Collaborator

This seems unnecessary; in what use case would it not be available (unless the build failed outright)?

Comment on lines 86 to 128
stage('Configure SSH Key') {
infrastructure.writeSshKey(
keyContent: env.AWS_SSH_PEM_KEY,
keyName: env.AWS_SSH_PEM_KEY_NAME,
dir: '.ssh'
)
}

stage('Configure Tofu Variables') {
echo 'Writing Terraform configuration'

def terraformConfig = infrastructure.parseAndSubstituteVars(
content: env.TERRAFORM_CONFIG,
envVars: [
'AWS_ACCESS_KEY_ID': env.AWS_ACCESS_KEY_ID,
'AWS_SECRET_ACCESS_KEY': env.AWS_SECRET_ACCESS_KEY,
'HOSTNAME_PREFIX': env.HOSTNAME_PREFIX,
'AWS_SSH_PEM_KEY_NAME': env.AWS_SSH_PEM_KEY_NAME
]
)

infrastructure.writeConfig(
path: "${tofuModulePath}/terraform.tfvars",
content: terraformConfig
)
}

stage('Initialize Tofu Backend') {
tofu.initBackend(
dir: tofuModulePath,
bucket: env.S3_BUCKET_NAME,
key: env.S3_KEY_PREFIX,
region: env.S3_BUCKET_REGION,
backendInitScript: tofuBackendInitScript
)
}

stage('Create Workspace') {
workspaceName = infrastructure.generateWorkspaceName(
prefix: 'jenkins_airgap_ansible_workspace',
suffix: env.HOSTNAME_PREFIX,
includeTimestamp: false
)
Collaborator

I get wanting more granular logs, but now there are too many stages to view on a standard screen (when looking at the Jenkins job). Since each of these takes less than 10 seconds, can you consolidate them into one stage?
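For instance, the four short setup steps could become a single stage while the individual log lines stay; a sketch built from the calls quoted above:

stage('Prepare Tofu Environment') {
    infrastructure.writeSshKey(
        keyContent: env.AWS_SSH_PEM_KEY,
        keyName: env.AWS_SSH_PEM_KEY_NAME,
        dir: '.ssh'
    )

    echo 'Writing Terraform configuration'
    infrastructure.writeConfig(
        path: "${tofuModulePath}/terraform.tfvars",
        content: infrastructure.parseAndSubstituteVars(
            content: env.TERRAFORM_CONFIG,
            envVars: [
                'AWS_ACCESS_KEY_ID': env.AWS_ACCESS_KEY_ID,
                'AWS_SECRET_ACCESS_KEY': env.AWS_SECRET_ACCESS_KEY,
                'HOSTNAME_PREFIX': env.HOSTNAME_PREFIX,
                'AWS_SSH_PEM_KEY_NAME': env.AWS_SSH_PEM_KEY_NAME
            ]
        )
    )

    tofu.initBackend(
        dir: tofuModulePath,
        bucket: env.S3_BUCKET_NAME,
        key: env.S3_KEY_PREFIX,
        region: env.S3_BUCKET_REGION,
        backendInitScript: tofuBackendInitScript
    )

    workspaceName = infrastructure.generateWorkspaceName(
        prefix: 'jenkins_airgap_ansible_workspace',
        suffix: env.HOSTNAME_PREFIX,
        includeTimestamp: false
    )
}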

Comment on lines 198 to 217
stage('Setup SSH Keys on Nodes') {
retry(3) {
ansible.runPlaybook(
dir: ansiblePath,
inventory: 'inventory/inventory.yml',
playbook: 'playbooks/setup/setup-ssh-keys.yml'
)
}
}

stage('Deploy RKE2 Cluster') {
retry(3) {
ansible.runPlaybook(
dir: ansiblePath,
inventory: 'inventory/inventory.yml',
playbook: 'playbooks/deploy/rke2-tarball-playbook.yml'
)
}
}

Collaborator

If you consolidate these two with the previous stage, I think it would fit on one screen again.

}
}

stage('Deploy Rancher (Optional)') {
Collaborator

IIRC, optional stages break the view in Jenkins if the stage is actually non-existent when it's not specified. In this case, it might be OK since there's the def + echo?

}

stage('Deploy RKE2 Cluster') {
retry(3) {
Collaborator

Clarifying question on error handling: if it fails on the 3rd attempt, does it stop the Jenkins job too? (Assuming so, but just double-checking.)

Comment on lines 280 to 285
// Run the Ansible playbook to generate and inject the admin token
// The playbook uses the rancher_token role which:
// - Reads external_lb_hostname from inventory (or uses explicit rancher_url)
// - Authenticates with Rancher API
// - Creates an API token with configurable TTL and description
// - Updates cattle-config.yaml with the generated token
Collaborator

Since this is in the qa-infra-automation docs already and basically no one looks at Jenkinsfiles, I think you should remove these comments.

resultsJSON: 'gotestsum.json'
])

if (testArgs && testArgs[-1]?.endsWith(';')) {
Collaborator

why is this necessary?

Comment on lines 435 to 450
stage('Cleanup on Failure') {
tofu.selectWorkspace(dir: tofuModulePath, name: workspaceName)
tofu.destroy(dir: tofuModulePath, varFile: 'terraform.tfvars', autoApprove: true)
tofu.deleteWorkspace(dir: tofuModulePath, name: workspaceName)
infrastructureCleaned = true
}
} catch (cleanupErr) {
echo "Cleanup failed: ${cleanupErr.message}"
}
}
throw err
} finally {
if (destroyAfterTests && workspaceName && !infrastructureCleaned) {
echo 'Destroying infrastructure after tests (configured)'
try {
stage('Destroy After Tests') {
Collaborator

Having conditional stages like this with different names will cause the job history to not show up properly in Jenkins. It looks like 'Cleanup on Failure' and 'Destroy After Tests' are doing the same thing? In that case, if you use the same stage name in both places you can avoid this problem.

If they do different things, can you update them so each stage persists, even if there's nothing for it to do on a particular run? See the rancher comment for a potential solution.

Contributor Author

Fixed to use the same name. The two stages were doing different things. One destroys the infrastructure on failure; the other destroys it at the end of the build in case you want to investigate something manually.
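A rough sketch of what the shared-name stage could look like; buildFailed here is a hypothetical flag set in the catch block, not an existing variable in the pipeline:

stage('Destroy Infrastructure') {
    // buildFailed is a hypothetical flag recorded when the catch block runs
    echo buildFailed ? 'Cleaning up infrastructure after failure' :
                       'Destroying infrastructure after tests (configured)'
    tofu.selectWorkspace(dir: tofuModulePath, name: workspaceName)
    tofu.destroy(dir: tofuModulePath, varFile: 'terraform.tfvars', autoApprove: true)
    tofu.deleteWorkspace(dir: tofuModulePath, name: workspaceName)
    infrastructureCleaned = true
}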


property.useWithProperties(['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_SSH_PEM_KEY', 'AWS_SSH_PEM_KEY_NAME']) {
try {
stage('Checkout') {
deleteDir()
Contributor

Surely this is me being dumb with Jenkins, but why is this here?

Comment on lines 94 to 95
stage('Configure Tofu Variables') {
echo 'Writing Terraform configuration'
Contributor

do we need both of these?

def s3Path = "env:/${workspaceName}/terraform.tfvars"
def tfvarsPath = "${tofuModulePath}/terraform.tfvars"
def workspace = pwd()
sh """
Contributor

Again possibly a dumb question, but why sh """?

Comment on lines 232 to 244
def deployRancher = env.ANSIBLE_VARIABLES?.contains('deploy_rancher: true')
if (deployRancher) {
echo 'Deploying Rancher...'
retry(3) {
ansible.runPlaybook(
dir: ansiblePath,
inventory: 'inventory/inventory.yml',
playbook: 'playbooks/deploy/rancher-helm-deploy-playbook.yml'
)
}
} else {
echo 'Skipping Rancher deployment (not enabled in ANSIBLE_VARIABLES)'
}
Contributor

I will use this opportunity to comment on something I have been thinking about regarding the Ansible stuff: we have a bunch of variables like deploy_rancher that basically make the playbook that uses them a no-op when not set, so instead of setting deploy_rancher: false, why not just not run the playbook at all?

Again, no implications for this PR, I have just been thinking about this for a while.

Comment on lines +298 to +311
sh """
docker run --rm --platform ${dockerPlatform} \
--name generate-token \
-v ${workspace}:/workspace \
-w /workspace/${ansiblePath} \
${infraToolsImage} \
ansible-playbook -i inventory/inventory.yml /workspace/qa-infra-automation/ansible/rancher/token/generate-admin-token.yml \
-e rancher_token_password=${adminPassword} \
-e rancher_cattle_config_file=${cattleConfigPath} \
-e rancher_token_ttl=${tokenTtl} \
-e rancher_token_description=${tokenDescription} \
-e rancher_token_output_format=json \
-e rancher_token_output_file=/workspace/rancher-token.json
"""
Contributor

Is it not possible to use runPlaybook here?
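If the shared library's runPlaybook helper can take extra vars, the docker run could collapse to something like the sketch below. Note that the extraVars parameter (and the relative playbook path) is an assumption about the helper's signature; the calls elsewhere in this file only pass dir, inventory, and playbook, so the helper may need extending first:

// extraVars is assumed, not a documented parameter of ansible.runPlaybook
ansible.runPlaybook(
    dir: ansiblePath,
    inventory: 'inventory/inventory.yml',
    playbook: 'rancher/token/generate-admin-token.yml',
    extraVars: [
        rancher_token_password: adminPassword,
        rancher_cattle_config_file: cattleConfigPath,
        rancher_token_ttl: tokenTtl,
        rancher_token_description: tokenDescription,
        rancher_token_output_format: 'json',
        rancher_token_output_file: "${workspace}/rancher-token.json"
    ]
)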

]
}
// If Ansible variables defined a bootstrap password, default commonly used value
lines += ["RANCHER_BOOTSTRAP_PASSWORD=${env.RANCHER_BOOTSTRAP_PASSWORD ?: 'rancherrocks'}"]
Contributor

Doesn't ansible already use this if nothing is set?

…dencies, streamline installation steps, and enhance error handling