
feat(installer): offline air-gapped deployment bundle#40

Merged
KerwinTsaiii merged 25 commits into AMDResearch:develop from MioYuuIH:feature/offline-pack
Mar 10, 2026

Conversation


@MioYuuIH commented Mar 9, 2026

Summary

  • Add auplc-installer pack command to create a self-contained offline deployment bundle (~5.4 GB) including K3s, Helm, K9s, all container images, and the Helm chart
  • Add auplc-installer install offline mode that detects the bundle and installs fully air-gapped (no network access required during install)
  • Add pack-bundle.yml CI workflow that automatically packs bundles for all GPU types after a release tag is pushed, with lazy-skip if bundle already exists in the release
  • Fix Helm schema validation error in generated values overlay
  • Optimize bundle size: save all custom images into a single tar to deduplicate shared layers (16 GB → 5.4 GB)
  • Add type=ref,event=tag to docker-build metadata rules to ensure any tag (not just semver) gets pushed with the exact tag name

Changes

auplc-installer

  • pack command: downloads K3s/Helm/K9s binaries, K3s airgap images, saves all custom and external images, copies Helm chart, writes manifest.json
  • install command: detects bundle via manifest.json, imports images via k3s ctr images import, installs offline, sets imagePullPolicy: IfNotPresent in values overlay
  • All custom images (hub, default, base, cv, dl, llm, physim) saved into one tar — shared layers (auplc-base) deduplicated across course images
  • IMAGE_TAG and IMAGE_REGISTRY configurable via env vars; IMAGE_TAG sanitized (replaces / with -) for Docker compatibility
  • Fail-fast on image import failure; GPU device plugin deployed after images are loaded
  • Fixed: hub image values overlay placed after custom: block to avoid Helm schema error
  • Refactored: removed redundant code, fixed deploy_aup_learning_cloud_runtime typo, unified image lists
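The tag sanitization described above can be sketched in a few lines of bash. The function name `sanitize_tag` is illustrative, not necessarily the installer's actual identifier:

```shell
#!/usr/bin/env bash
# Sketch of IMAGE_TAG sanitization: Docker tags may not contain '/',
# but branch names such as 'feature/offline-pack' do, so every '/'
# is replaced with '-' before the branch name is used as a tag.
# (sanitize_tag is a hypothetical name for illustration.)
sanitize_tag() {
  local tag="$1"
  printf '%s\n' "${tag//\//-}"   # bash parameter expansion: global '/' -> '-'
}

sanitize_tag "feature/offline-pack"   # -> feature-offline-pack
sanitize_tag "develop"                # -> develop
```

Because the replacement happens inside the installer, callers never need to pre-sanitize the tag themselves.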

.github/workflows/pack-bundle.yml

  • pack-release job: triggered automatically via workflow_run after Build Docker Images completes on a v* tag; matrix over all 4 GPU types (strix-halo, phx, strix, rdna4)
  • Lazy execution: skips GPU if bundle already exists in the GitHub Release
  • Attaches bundle to existing release (releases are created manually); skips upload if no release found
  • IMAGE_REGISTRY derived from repository owner (lowercased) for fork compatibility
  • pack-manual job: workflow_dispatch for on-demand testing with configurable GPU type, image tag, and registry
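The lazy-execution check boils down to a set-membership test on the release's existing assets. A minimal sketch, assuming the asset list has already been fetched (in CI it would come from something like `gh release view "$TAG" --json assets`); all names below are illustrative:

```shell
#!/usr/bin/env bash
# Sketch of the lazy-skip decision: given the asset names already attached
# to the GitHub Release (one per line) and this GPU type's bundle name,
# decide whether the pack job can be skipped.
should_skip_pack() {
  local bundle_name="$1" assets="$2"
  if printf '%s\n' "$assets" | grep -Fqx "$bundle_name"; then
    echo "skip"   # bundle already in the release
  else
    echo "pack"   # bundle missing: run the pack job
  fi
}

assets=$'auplc-bundle-strix-halo.tar.gz\nauplc-bundle-phx.tar.gz'
should_skip_pack "auplc-bundle-phx.tar.gz" "$assets"    # -> skip
should_skip_pack "auplc-bundle-rdna4.tar.gz" "$assets"  # -> pack
```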

.github/workflows/docker-build.yml

  • Added type=ref,event=tag to metadata tag lists for base GPU and course images, ensuring any tag name (not just valid semver) is pushed verbatim

Test plan

  • End-to-end offline install verified on AI 395 (Ubuntu, gfx1151/strix-halo): all pods Running after air-gapped install
  • CI pack workflow: workflow_run chain triggers correctly after Build Docker Images on v* tag
  • Lazy skip: re-triggering with existing bundle in release skips pack correctly
  • Bundle size reduced from ~16 GB to ~5.4 GB via layer deduplication
  • Offline install with phx / strix / rdna4 bundles (not yet tested end-to-end)

MioYuuIH added 25 commits March 6, 2026 14:44
Add 'pack' command to create self-contained offline deployment bundles.
Support all 4 image source × deploy target combinations:

  install          build locally + deploy (existing)
  install --pull   pull from GHCR + deploy (new)
  pack             pull from GHCR + save to bundle (new)
  pack --local     build locally + save to bundle (new)

Offline bundles include K3s binary/images, Helm, K9s, ROCm device
plugin manifest, all container images, and Helm chart+values.
Auto-detected via manifest.json when running from bundle directory.
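The manifest-based auto-detection can be sketched as follows. The field names (`image_tag`, `image_registry`) are assumptions for illustration; the real manifest schema may differ, and a production parser would likely use jq rather than sed:

```shell
#!/usr/bin/env bash
# Sketch of offline-bundle auto-detection: if a manifest.json sits in the
# bundle directory, treat it as an offline bundle and restore settings from
# it. Field names are assumed; jq is avoided so the sketch needs only sed.
read_manifest_field() {
  local file="$1" key="$2"
  sed -n "s/.*\"$key\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$file"
}

bundle_dir=$(mktemp -d)
cat > "$bundle_dir/manifest.json" <<'EOF'
{ "image_tag": "develop", "image_registry": "ghcr.io/amdresearch" }
EOF

if [ -f "$bundle_dir/manifest.json" ]; then
  IMAGE_TAG=$(read_manifest_field "$bundle_dir/manifest.json" image_tag)
  IMAGE_REGISTRY=$(read_manifest_field "$bundle_dir/manifest.json" image_registry)
  echo "offline bundle detected: tag=$IMAGE_TAG registry=$IMAGE_REGISTRY"
fi
```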

Add pack-bundle.yml CI workflow for manual bundle creation.
- Add IMAGE_REGISTRY env var (default: ghcr.io/amdresearch) for
  configurable image source in pack and install --pull
- Pack now exits with error if any custom or external image fails
  to pull, preventing incomplete bundles
- Add image_registry input to pack-bundle CI workflow
- Read IMAGE_REGISTRY from bundle manifest for offline installs
Support pulling images with non-default tag prefixes (e.g. develop-gfx1151
instead of latest-gfx1151). The IMAGE_TAG is stored in the bundle manifest
and restored on offline install. Default remains "latest".
…erlay

hub.image was incorrectly nested inside the custom.resources.images block,
causing its metadata to be misinterpreted as a hub.image property and
triggering a Helm schema validation failure.
… for consistency

Both pull and local-build modes now save hub/default images with
:latest and :${IMAGE_TAG} tags, matching GPU image behavior.
This ensures values.local.yaml references always resolve regardless
of which IMAGE_TAG was used during pack.
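The dual-tag scheme can be sketched like this. In the real installer each ref would be applied with `docker tag`; here the refs are only printed, and the registry/image names are illustrative:

```shell
#!/usr/bin/env bash
# Sketch of the dual-tag scheme: every saved image carries both :latest
# and :${IMAGE_TAG}, so values.local.yaml references resolve regardless
# of which tag was active at pack time.
dual_tags() {
  local registry="$1" name="$2" tag="$3"
  printf '%s/%s:latest\n' "$registry" "$name"
  printf '%s/%s:%s\n' "$registry" "$name" "$tag"
}

dual_tags ghcr.io/amdresearch auplc-hub develop
# ghcr.io/amdresearch/auplc-hub:latest
# ghcr.io/amdresearch/auplc-hub:develop
```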
Silent warning on import failure could leave the cluster with missing
images that cause pod failures at runtime. Now exits immediately
so the user sees a clear error instead of a mysteriously broken install.
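The fail-fast behavior amounts to aborting the import loop on the first failure. A minimal sketch, where `import_cmd` stands in for the real `k3s ctr images import` so the example is self-contained:

```shell
#!/usr/bin/env bash
# Sketch of the fail-fast import loop: abort on the first failed image
# import instead of warning and continuing with a partially loaded cluster.
import_all_or_die() {
  local import_cmd="$1"; shift
  local tar
  for tar in "$@"; do
    if ! "$import_cmd" "$tar"; then
      echo "ERROR: failed to import $tar; aborting install" >&2
      return 1
    fi
  done
  echo "all images imported"
}

fake_import() { [ "$1" != "broken.tar" ]; }   # stand-in for k3s ctr images import

import_all_or_die fake_import good-a.tar good-b.tar          # -> all images imported
import_all_or_die fake_import good-a.tar broken.tar || true  # -> ERROR ... aborting
```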
- Remove redundant CUSTOM_IMAGES/IMAGES arrays; GPU_CUSTOM_NAMES and
  PLAIN_CUSTOM_NAMES are the single source of truth for image lists
- Fix typo: deply_aup_learning_cloud_runtime -> deploy_aup_learning_cloud_runtime
- Remove duplicate generate_values_overlay call in deploy function
  (orchestration now handled exclusively by callers)
- Remove unused check_root function; inline root check at entry points
  of deploy_all_components and pack_bundle
- Add missing section headers for Runtime Management group
- rt install/reinstall and legacy install-runtime now correctly call
  detect/get_paths/generate_overlay before deploy
- Merge gpu_target + gpu_type into single gpu_type choice; installer
  derives GPU_TARGET internally via resolve_gpu_config
- Add rdna4 option (gfx120x) to match upstream installer support
- image_tag now defaults to current branch name (github.ref_name)
  so develop branch packs use 'develop' tag automatically
- Use env: block instead of inline var prefix for cleaner CI syntax
- Remove root check from pack_bundle; pack only needs docker/wget,
  not root access (install still requires root)
github.ref_name for feature branches contains '/' (e.g. feature/offline-pack)
which is invalid in Docker tags. Replace '/' with '-' when using branch
name as default IMAGE_TAG.
Branch names like 'feature/offline-pack' are invalid Docker tags.
Both the workflow and pack_bundle now auto-replace '/' with '-'
so no manual sanitization is needed by the caller.
- Add workflow_run trigger: fires after 'Build Docker Images' completes,
  ensuring all images (hub, base, courses) are built before packing starts
- pack-release job: matrix over all 4 GPU types, only runs on v* tags
  pushed to AMDResearch/aup-learning-cloud (main repo guard)
- pack-release attaches bundles to the existing GitHub Release
- pack-manual job: unchanged workflow_dispatch flow for manual testing
- Fix tar SIGPIPE false error in verify step (2>/dev/null)
… layers

Course images (cv/dl/llm/physim) all share auplc-base layers. Saving them
separately caused those layers to be written N times. A single docker save
call with all image refs deduplicates shared layers automatically, reducing
bundle size significantly.
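The single-save optimization can be sketched as building one ref list and handing it to a single `docker save`, which writes each shared layer only once. `DRY_RUN` prints the command instead of running it (so the sketch works without Docker); image names are illustrative:

```shell
#!/usr/bin/env bash
# Sketch: collect every custom image ref and pass them to ONE docker save
# invocation, so shared layers (e.g. the auplc-base layers under the
# course images) are written to the tar only once.
save_custom_images() {
  local tag="$1" out="$2"; shift 2
  local name refs=()
  for name in "$@"; do
    refs+=("ghcr.io/amdresearch/$name:$tag")
  done
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo docker save -o "$out" "${refs[@]}"
  else
    docker save -o "$out" "${refs[@]}"
  fi
}

DRY_RUN=1
save_custom_images develop images.tar \
  auplc-hub auplc-default auplc-base auplc-cv auplc-dl auplc-llm auplc-physim
```

Saving the images separately instead would serialize the shared base layers once per image, which is what inflated the bundle to ~16 GB.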
Ensures any v* tag (semver or not) gets pushed with the exact tag name.
Previously non-semver tags (e.g. v0.1-test) would only get sha-based tags,
causing course image builds to fail when looking for the base image by tag.

Also removes main repo restriction from pack-release trigger condition.
@MioYuuIH MioYuuIH requested a review from KerwinTsaiii as a code owner March 9, 2026 05:13

MioYuuIH commented Mar 9, 2026

CI Build Artifacts (fork)

Built on fork via automated workflow_run trigger after v0.1.0 tag push.

Run: https://github.com/MioYuuIH/aup-learning-cloud/actions/runs/22838515462

GPU Type     Architecture                                        Bundle
strix-halo   gfx1151 (Ryzen AI Max+ 395 / Max 390)               auplc-bundle-strix-halo
phx          gfx110x (Ryzen AI 300 Phoenix)                      auplc-bundle-phx
strix        gfx110x + HSA override (Ryzen AI 300 Strix Point)   auplc-bundle-strix
rdna4        gfx120x (Radeon RX 9000 series)                     auplc-bundle-rdna4

All 4 bundles ~5.4 GB each (down from ~16 GB before layer deduplication). Artifacts expire 2026-04-08.

Verified: strix-halo bundle installed and ran successfully on AI 395 (Ubuntu, air-gapped).

@KerwinTsaiii KerwinTsaiii merged commit 02ae5df into AMDResearch:develop Mar 10, 2026
9 checks passed