
feat(installer): offline air-gapped deployment bundle#40

Merged
KerwinTsaiii merged 25 commits into AMDResearch:develop from MioYuuIH:feature/offline-pack
Mar 10, 2026

Conversation


@MioYuuIH commented Mar 9, 2026

Summary

  • Add auplc-installer pack command to create a self-contained offline deployment bundle (~5.4 GB) including K3s, Helm, K9s, all container images, and the Helm chart
  • Add auplc-installer install offline mode that detects the bundle and installs fully air-gapped (no network access required during install)
  • Add pack-bundle.yml CI workflow that automatically packs bundles for all GPU types after a release tag is pushed, with lazy-skip if bundle already exists in the release
  • Fix Helm schema validation error in generated values overlay
  • Optimize bundle size: save all custom images into a single tar to deduplicate shared layers (16 GB → 5.4 GB)
  • Add type=ref,event=tag to docker-build metadata rules to ensure any tag (not just semver) gets pushed with the exact tag name

Changes

auplc-installer

  • pack command: downloads K3s/Helm/K9s binaries, K3s airgap images, saves all custom and external images, copies Helm chart, writes manifest.json
  • install command: detects bundle via manifest.json, imports images via k3s ctr images import, installs offline, sets imagePullPolicy: IfNotPresent in values overlay
  • All custom images (hub, default, base, cv, dl, llm, physim) saved into one tar — shared layers (auplc-base) deduplicated across course images
  • IMAGE_TAG and IMAGE_REGISTRY configurable via env vars; IMAGE_TAG sanitized (replaces / with -) for Docker compatibility
  • Fail-fast on image import failure; GPU device plugin deployed after images are loaded
  • Fixed: hub image values overlay placed after custom: block to avoid Helm schema error
  • Refactored: removed redundant code, fixed deploy_aup_learning_cloud_runtime typo, unified image lists
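The tag sanitization described above can be sketched in a few lines of bash. The function name `sanitize_tag` is illustrative, not necessarily the installer's actual identifier:

```shell
#!/usr/bin/env bash
# Sketch of IMAGE_TAG sanitization: Docker tags may not contain '/',
# but branch names such as 'feature/offline-pack' do, so every '/'
# is replaced with '-' before the branch name is used as a tag.
# (sanitize_tag is a hypothetical name for illustration.)
sanitize_tag() {
  local tag="$1"
  printf '%s\n' "${tag//\//-}"   # bash parameter expansion: global '/' -> '-'
}

sanitize_tag "feature/offline-pack"   # -> feature-offline-pack
sanitize_tag "develop"                # -> develop
```

Because the replacement happens inside the installer, callers never need to pre-sanitize the tag themselves.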

.github/workflows/pack-bundle.yml

  • pack-release job: triggered automatically via workflow_run after Build Docker Images completes on a v* tag; matrix over all 4 GPU types (strix-halo, phx, strix, rdna4)
  • Lazy execution: skips GPU if bundle already exists in the GitHub Release
  • Attaches bundle to existing release (releases are created manually); skips upload if no release found
  • IMAGE_REGISTRY derived from repository owner (lowercased) for fork compatibility
  • pack-manual job: workflow_dispatch for on-demand testing with configurable GPU type, image tag, and registry
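The lazy-execution check boils down to a set-membership test on the release's existing assets. A minimal sketch, assuming the asset list has already been fetched (in CI it would come from something like `gh release view "$TAG" --json assets`); all names below are illustrative:

```shell
#!/usr/bin/env bash
# Sketch of the lazy-skip decision: given the asset names already attached
# to the GitHub Release (one per line) and this GPU type's bundle name,
# decide whether the pack job can be skipped.
should_skip_pack() {
  local bundle_name="$1" assets="$2"
  if printf '%s\n' "$assets" | grep -Fqx "$bundle_name"; then
    echo "skip"   # bundle already in the release
  else
    echo "pack"   # bundle missing: run the pack job
  fi
}

assets=$'auplc-bundle-strix-halo.tar.gz\nauplc-bundle-phx.tar.gz'
should_skip_pack "auplc-bundle-phx.tar.gz" "$assets"    # -> skip
should_skip_pack "auplc-bundle-rdna4.tar.gz" "$assets"  # -> pack
```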

.github/workflows/docker-build.yml

  • Added type=ref,event=tag to metadata tag lists for base GPU and course images, ensuring any tag name (not just valid semver) is pushed verbatim

Test plan

  • End-to-end offline install verified on AI 395 (Ubuntu, gfx1151/strix-halo): all pods Running after air-gapped install
  • CI pack workflow: workflow_run chain triggers correctly after Build Docker Images on v* tag
  • Lazy skip: re-triggering with existing bundle in release skips pack correctly
  • Bundle size reduced from ~16 GB to ~5.4 GB via layer deduplication
  • Offline install with phx / strix / rdna4 bundles (not yet tested end-to-end)

MioYuuIH added 25 commits March 6, 2026 14:44
Add 'pack' command to create self-contained offline deployment bundles.
Support all 4 image source × deploy target combinations:

  install          build locally + deploy (existing)
  install --pull   pull from GHCR + deploy (new)
  pack             pull from GHCR + save to bundle (new)
  pack --local     build locally + save to bundle (new)

Offline bundles include K3s binary/images, Helm, K9s, ROCm device
plugin manifest, all container images, and Helm chart+values.
Auto-detected via manifest.json when running from bundle directory.
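The manifest-based auto-detection can be sketched as follows. The field names (`image_tag`, `image_registry`) are assumptions for illustration; the real manifest schema may differ, and a production parser would likely use jq rather than sed:

```shell
#!/usr/bin/env bash
# Sketch of offline-bundle auto-detection: if a manifest.json sits in the
# bundle directory, treat it as an offline bundle and restore settings from
# it. Field names are assumed; jq is avoided so the sketch needs only sed.
read_manifest_field() {
  local file="$1" key="$2"
  sed -n "s/.*\"$key\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$file"
}

bundle_dir=$(mktemp -d)
cat > "$bundle_dir/manifest.json" <<'EOF'
{ "image_tag": "develop", "image_registry": "ghcr.io/amdresearch" }
EOF

if [ -f "$bundle_dir/manifest.json" ]; then
  IMAGE_TAG=$(read_manifest_field "$bundle_dir/manifest.json" image_tag)
  IMAGE_REGISTRY=$(read_manifest_field "$bundle_dir/manifest.json" image_registry)
  echo "offline bundle detected: tag=$IMAGE_TAG registry=$IMAGE_REGISTRY"
fi
```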

Add pack-bundle.yml CI workflow for manual bundle creation.
- Add IMAGE_REGISTRY env var (default: ghcr.io/amdresearch) for
  configurable image source in pack and install --pull
- Pack now exits with error if any custom or external image fails
  to pull, preventing incomplete bundles
- Add image_registry input to pack-bundle CI workflow
- Read IMAGE_REGISTRY from bundle manifest for offline installs
Support pulling images with non-default tag prefixes (e.g. develop-gfx1151
instead of latest-gfx1151). The IMAGE_TAG is stored in the bundle manifest
and restored on offline install. Default remains "latest".
…erlay

hub.image was incorrectly nested inside the custom.resources.images block,
causing its metadata to be misinterpreted as a hub.image property and
triggering a Helm schema validation failure.
… for consistency

Both pull and local-build modes now save hub/default images with
:latest and :${IMAGE_TAG} tags, matching GPU image behavior.
This ensures values.local.yaml references always resolve regardless
of which IMAGE_TAG was used during pack.
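The dual-tag scheme can be sketched like this. In the real installer each ref would be applied with `docker tag`; here the refs are only printed, and the registry/image names are illustrative:

```shell
#!/usr/bin/env bash
# Sketch of the dual-tag scheme: every saved image carries both :latest
# and :${IMAGE_TAG}, so values.local.yaml references resolve regardless
# of which tag was active at pack time.
dual_tags() {
  local registry="$1" name="$2" tag="$3"
  printf '%s/%s:latest\n' "$registry" "$name"
  printf '%s/%s:%s\n' "$registry" "$name" "$tag"
}

dual_tags ghcr.io/amdresearch auplc-hub develop
# ghcr.io/amdresearch/auplc-hub:latest
# ghcr.io/amdresearch/auplc-hub:develop
```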
Silent warning on import failure could leave the cluster with missing
images that cause pod failures at runtime. Now exits immediately
so the user sees a clear error instead of a mysteriously broken install.
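The fail-fast behavior amounts to aborting the import loop on the first failure. A minimal sketch, where `import_cmd` stands in for the real `k3s ctr images import` so the example is self-contained:

```shell
#!/usr/bin/env bash
# Sketch of the fail-fast import loop: abort on the first failed image
# import instead of warning and continuing with a partially loaded cluster.
import_all_or_die() {
  local import_cmd="$1"; shift
  local tar
  for tar in "$@"; do
    if ! "$import_cmd" "$tar"; then
      echo "ERROR: failed to import $tar; aborting install" >&2
      return 1
    fi
  done
  echo "all images imported"
}

fake_import() { [ "$1" != "broken.tar" ]; }   # stand-in for k3s ctr images import

import_all_or_die fake_import good-a.tar good-b.tar          # -> all images imported
import_all_or_die fake_import good-a.tar broken.tar || true  # -> ERROR ... aborting
```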
- Remove redundant CUSTOM_IMAGES/IMAGES arrays; GPU_CUSTOM_NAMES and
  PLAIN_CUSTOM_NAMES are the single source of truth for image lists
- Fix typo: deply_aup_learning_cloud_runtime -> deploy_aup_learning_cloud_runtime
- Remove duplicate generate_values_overlay call in deploy function
  (orchestration now handled exclusively by callers)
- Remove unused check_root function; inline root check at entry points
  of deploy_all_components and pack_bundle
- Add missing section headers for Runtime Management group
- rt install/reinstall and legacy install-runtime now correctly call
  detect/get_paths/generate_overlay before deploy
- Merge gpu_target + gpu_type into single gpu_type choice; installer
  derives GPU_TARGET internally via resolve_gpu_config
- Add rdna4 option (gfx120x) to match upstream installer support
- image_tag now defaults to current branch name (github.ref_name)
  so develop branch packs use 'develop' tag automatically
- Use env: block instead of inline var prefix for cleaner CI syntax
- Remove root check from pack_bundle; pack only needs docker/wget,
  not root access (install still requires root)
github.ref_name for feature branches contains '/' (e.g. feature/offline-pack)
which is invalid in Docker tags. Replace '/' with '-' when using branch
name as default IMAGE_TAG.
Branch names like 'feature/offline-pack' are invalid Docker tags.
Both the workflow and pack_bundle now auto-replace '/' with '-'
so no manual sanitization is needed by the caller.
- Add workflow_run trigger: fires after 'Build Docker Images' completes,
  ensuring all images (hub, base, courses) are built before packing starts
- pack-release job: matrix over all 4 GPU types, only runs on v* tags
  pushed to AMDResearch/aup-learning-cloud (main repo guard)
- pack-release attaches bundles to the existing GitHub Release
- pack-manual job: unchanged workflow_dispatch flow for manual testing
- Fix tar SIGPIPE false error in verify step (2>/dev/null)
… layers

Course images (cv/dl/llm/physim) all share auplc-base layers. Saving them
separately caused those layers to be written N times. A single docker save
call with all image refs deduplicates shared layers automatically, reducing
bundle size significantly.
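The single-save optimization can be sketched as building one ref list and handing it to a single `docker save`, which writes each shared layer only once. `DRY_RUN` prints the command instead of running it (so the sketch works without Docker); image names are illustrative:

```shell
#!/usr/bin/env bash
# Sketch: collect every custom image ref and pass them to ONE docker save
# invocation, so shared layers (e.g. the auplc-base layers under the
# course images) are written to the tar only once.
save_custom_images() {
  local tag="$1" out="$2"; shift 2
  local name refs=()
  for name in "$@"; do
    refs+=("ghcr.io/amdresearch/$name:$tag")
  done
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo docker save -o "$out" "${refs[@]}"
  else
    docker save -o "$out" "${refs[@]}"
  fi
}

DRY_RUN=1
save_custom_images develop images.tar \
  auplc-hub auplc-default auplc-base auplc-cv auplc-dl auplc-llm auplc-physim
```

Saving the images separately instead would serialize the shared base layers once per image, which is what inflated the bundle to ~16 GB.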
Ensures any v* tag (semver or not) gets pushed with the exact tag name.
Previously non-semver tags (e.g. v0.1-test) would only get sha-based tags,
causing course image builds to fail when looking for the base image by tag.

Also removes main repo restriction from pack-release trigger condition.
@MioYuuIH MioYuuIH requested a review from KerwinTsaiii as a code owner March 9, 2026 05:13

MioYuuIH commented Mar 9, 2026

CI Build Artifacts (fork)

Built on fork via automated workflow_run trigger after v0.1.0 tag push.

Run: https://github.com/MioYuuIH/aup-learning-cloud/actions/runs/22838515462

GPU Type     Architecture                                        Bundle
strix-halo   gfx1151 (Ryzen AI Max+ 395 / Max 390)               auplc-bundle-strix-halo
phx          gfx110x (Ryzen AI 300 Phoenix)                      auplc-bundle-phx
strix        gfx110x + HSA override (Ryzen AI 300 Strix Point)   auplc-bundle-strix
rdna4        gfx120x (Radeon RX 9000 series)                     auplc-bundle-rdna4

All 4 bundles ~5.4 GB each (down from ~16 GB before layer deduplication). Artifacts expire 2026-04-08.

Verified: strix-halo bundle installed and ran successfully on AI 395 (Ubuntu, air-gapped).

@KerwinTsaiii KerwinTsaiii merged commit 02ae5df into AMDResearch:develop Mar 10, 2026
9 checks passed