feat(installer): offline air-gapped deployment bundle #40
Merged
KerwinTsaiii merged 25 commits into AMDResearch:develop on Mar 10, 2026
Conversation
Add 'pack' command to create self-contained offline deployment bundles. Supports all 4 image source × deploy target combinations:
- `install`: build locally + deploy (existing)
- `install --pull`: pull from GHCR + deploy (new)
- `pack`: pull from GHCR + save to bundle (new)
- `pack --local`: build locally + save to bundle (new)

Offline bundles include the K3s binary/images, Helm, K9s, the ROCm device plugin manifest, all container images, and the Helm chart + values. A bundle is auto-detected via manifest.json when running from the bundle directory. Adds a pack-bundle.yml CI workflow for manual bundle creation.
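The 2×2 matrix above can be sketched as a small decision table; the function names below are hypothetical, only the command/flag combinations come from the PR description:

```shell
# Hypothetical decision table for the four modes described above.
image_source() {
  # $1 = subcommand, $2 = optional flag
  case "$1 ${2:-}" in
    "install --pull"|"pack ")  echo "ghcr" ;;        # pull prebuilt images from GHCR
    "install "|"pack --local") echo "local-build" ;; # build images locally with docker
  esac
}

deploy_target() {
  case "$1" in
    install) echo "deploy" ;;  # install onto this host
    pack)    echo "bundle" ;;  # save into an offline bundle directory
  esac
}
```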
- Add IMAGE_REGISTRY env var (default: ghcr.io/amdresearch) for a configurable image source in pack and install --pull
- Pack now exits with an error if any custom or external image fails to pull, preventing incomplete bundles
- Add image_registry input to the pack-bundle CI workflow
- Read IMAGE_REGISTRY from the bundle manifest for offline installs
Support pulling images with non-default tag prefixes (e.g. develop-gfx1151 instead of latest-gfx1151). The IMAGE_TAG is stored in the bundle manifest and restored on offline install. Default remains "latest".
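A minimal sketch of how such a ref might be composed, assuming the GPU suffix is appended after IMAGE_TAG as in the develop-gfx1151 example (the `image_ref` helper itself is hypothetical):

```shell
# Hypothetical helper composing an image ref from the env-configurable
# IMAGE_REGISTRY / IMAGE_TAG plus a GPU suffix, per the tag examples above.
image_ref() {
  local registry="${IMAGE_REGISTRY:-ghcr.io/amdresearch}"  # default from the PR
  local tag="${IMAGE_TAG:-latest}"                         # default remains "latest"
  echo "${registry}/$1:${tag}-$2"
}
```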
…stry/tag restore out of gpu_target guard
… in K3s airgap bundle
…erlay hub.image was incorrectly nested inside custom.resources.images block, causing metadata to be misinterpreted as hub.image property and triggering Helm schema validation failure.
… for consistency
Both pull and local-build modes now save hub/default images with :latest and :${IMAGE_TAG} tags, matching GPU image behavior. This ensures values.local.yaml references always resolve regardless of which IMAGE_TAG was used during pack.
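A sketch of the dual-tag behavior; `refs_for_image` is a hypothetical helper, since the commit only states that both :latest and :${IMAGE_TAG} are saved:

```shell
# Hypothetical helper listing the refs to save for a hub/default image:
# always :latest, plus :${IMAGE_TAG} when it differs, so values.local.yaml
# resolves regardless of the tag used during pack.
refs_for_image() {
  local name="$1" tag="${IMAGE_TAG:-latest}"
  echo "${name}:latest"
  if [ "$tag" != "latest" ]; then
    echo "${name}:${tag}"
  fi
}
```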
A silent warning on import failure could leave the cluster with missing images that cause pod failures at runtime. The installer now exits immediately, so the user sees a clear error instead of a mysteriously broken install.
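A fail-fast sketch of the import step; the `k3s ctr images import` command is from the PR, the surrounding loop and function name are assumptions:

```shell
# Import every saved image tarball; abort on the first failure instead of
# warning and continuing, so a broken bundle is caught at install time.
import_images() {
  local tarball
  for tarball in "$@"; do
    if ! k3s ctr images import "$tarball"; then
      echo "ERROR: failed to import ${tarball}; aborting install" >&2
      return 1
    fi
  done
}
```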
- Remove redundant CUSTOM_IMAGES/IMAGES arrays; GPU_CUSTOM_NAMES and PLAIN_CUSTOM_NAMES are the single source of truth for image lists
- Fix typo: deply_aup_learning_cloud_runtime -> deploy_aup_learning_cloud_runtime
- Remove duplicate generate_values_overlay call in the deploy function (orchestration is now handled exclusively by callers)
- Remove unused check_root function; inline the root check at the entry points of deploy_all_components and pack_bundle
- Add missing section headers for the Runtime Management group
- rt install/reinstall and legacy install-runtime now correctly call detect/get_paths/generate_overlay before deploy
- Merge gpu_target + gpu_type into a single gpu_type choice; the installer derives GPU_TARGET internally via resolve_gpu_config
- Add rdna4 option (gfx120x) to match upstream installer support
- image_tag now defaults to the current branch name (github.ref_name), so develop-branch packs use the 'develop' tag automatically
- Use an env: block instead of inline var prefixes for cleaner CI syntax
- Remove the root check from pack_bundle; pack only needs docker/wget, not root access (install still requires root)
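A sketch of the derivation; only the rdna4 → gfx120x mapping is stated here, the strix-halo → gfx1151 mapping is assumed from the tag examples in this PR, and the phx/strix mappings are omitted:

```shell
# Hedged sketch of resolve_gpu_config: derive GPU_TARGET from the single
# gpu_type choice. Only rdna4 -> gfx120x is stated in the PR; strix-halo ->
# gfx1151 is an assumption; phx/strix mappings are omitted here.
resolve_gpu_config() {
  case "$1" in
    strix-halo) GPU_TARGET="gfx1151" ;;   # assumption
    rdna4)      GPU_TARGET="gfx120x" ;;   # added to match upstream support
    *)          echo "unknown gpu_type: $1" >&2; return 1 ;;
  esac
}
```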
github.ref_name for feature branches contains '/' (e.g. feature/offline-pack) which is invalid in Docker tags. Replace '/' with '-' when using branch name as default IMAGE_TAG.
Branch names like 'feature/offline-pack' are invalid Docker tags. Both the workflow and pack_bundle now auto-replace '/' with '-' so no manual sanitization is needed by the caller.
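In bash the substitution is a one-liner via pattern replacement (the wrapper function name is illustrative):

```shell
# Replace every '/' with '-' so branch names form valid Docker tags
# (bash ${var//pattern/replacement} expansion).
sanitize_tag() {
  echo "${1//\//-}"
}
```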
- Add workflow_run trigger: fires after 'Build Docker Images' completes, ensuring all images (hub, base, courses) are built before packing starts
- pack-release job: matrix over all 4 GPU types, only runs on v* tags pushed to AMDResearch/aup-learning-cloud (main repo guard)
- pack-release attaches bundles to the existing GitHub Release
- pack-manual job: unchanged workflow_dispatch flow for manual testing
- Fix tar SIGPIPE false error in the verify step (2>/dev/null)
… layers

Course images (cv/dl/llm/physim) all share auplc-base layers. Saving them separately caused those layers to be written N times. A single docker save call with all image refs deduplicates shared layers automatically, reducing bundle size significantly.
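The fix reduces to passing all refs to one `docker save` invocation; the wrapper name, image names, and bundle path below are illustrative:

```shell
# One docker save over all course image refs writes shared auplc-base
# layers once; separate saves would rewrite them per image.
save_course_images() {
  local out="$1"; shift
  docker save -o "$out" "$@"
}
# Illustrative usage:
#   save_course_images bundle/images/courses.tar \
#     auplc-course-cv:latest auplc-course-dl:latest \
#     auplc-course-llm:latest auplc-course-physim:latest
```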
Ensures any v* tag (semver or not) gets pushed with the exact tag name. Previously non-semver tags (e.g. v0.1-test) would only get sha-based tags, causing course image builds to fail when looking for the base image by tag. Also removes main repo restriction from pack-release trigger condition.
CI Build Artifacts (fork): built on the fork via an automated run: https://github.com/MioYuuIH/aup-learning-cloud/actions/runs/22838515462
All 4 bundles are ~5.4 GB each (down from ~16 GB before layer deduplication). Artifacts expire 2026-04-08. Verified: the strix-halo bundle installed and ran successfully on an AI 395 (Ubuntu, air-gapped).
Summary
- `auplc-installer pack` command to create a self-contained offline deployment bundle (~5.4 GB) including K3s, Helm, K9s, all container images, and the Helm chart
- `auplc-installer install` offline mode that detects the bundle and installs fully air-gapped (no network access required during install)
- `pack-bundle.yml` CI workflow that automatically packs bundles for all GPU types after a release tag is pushed, with lazy-skip if the bundle already exists in the release
- `type=ref,event=tag` added to docker-build metadata rules to ensure any tag (not just semver) gets pushed with the exact tag name

Changes
auplc-installer
- `pack` command: downloads K3s/Helm/K9s binaries and K3s airgap images, saves all custom and external images, copies the Helm chart, writes `manifest.json`
- `install` command: detects a bundle via `manifest.json`, imports images via `k3s ctr images import`, installs offline, sets `imagePullPolicy: IfNotPresent` in the values overlay
- `IMAGE_TAG` and `IMAGE_REGISTRY` configurable via env vars; `IMAGE_TAG` sanitized (replaces `/` with `-`) for Docker compatibility
- Moved `hub.image` out of the `custom:` block to avoid a Helm schema error
- Fixed the `deploy_aup_learning_cloud_runtime` typo, unified image lists

.github/workflows/pack-bundle.yml
- `pack-release` job: triggered automatically via `workflow_run` after `Build Docker Images` completes on a `v*` tag; matrix over all 4 GPU types (strix-halo, phx, strix, rdna4)
- `IMAGE_REGISTRY` derived from the repository owner (lowercased) for fork compatibility
- `pack-manual` job: `workflow_dispatch` for on-demand testing with configurable GPU type, image tag, and registry

.github/workflows/docker-build.yml
- Added `type=ref,event=tag` to the metadata tag lists for base GPU and course images, ensuring any tag name (not just valid semver) is pushed verbatim

Test plan
- `workflow_run` chain triggers correctly after `Build Docker Images` on a `v*` tag
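The bundle auto-detection described under Changes could look like the following sketch; the manifest key names (`image_tag`, `image_registry`) are assumptions, since the PR only states that `manifest.json` stores IMAGE_TAG and IMAGE_REGISTRY:

```shell
# Hypothetical sketch: detect an offline bundle and restore IMAGE_TAG /
# IMAGE_REGISTRY from manifest.json (key names assumed; jq-free extraction).
detect_bundle() {
  local dir="${1:-.}"
  [ -f "${dir}/manifest.json" ] || return 1   # not running from a bundle
  IMAGE_TAG=$(sed -n 's/.*"image_tag"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "${dir}/manifest.json")
  IMAGE_REGISTRY=$(sed -n 's/.*"image_registry"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "${dir}/manifest.json")
}
```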