From 485ed49d456d0b81c28384d14a35cb01bef8bbb4 Mon Sep 17 00:00:00 2001
From: joelteply
Date: Sat, 30 May 2026 12:34:50 -0500
Subject: [PATCH 1/2] fix(ci): canary tag default for install-smoke + fail-loud precheck
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two complementary changes, both architecturally driven by Joel
2026-05-30: "We don't need to rebuild all docker obviously until we go
into main. Takes a lot of machines. ... Fix properly. What broke, what
is the long term goal."

What broke: PR #1476's avatars-context fix succeeded but install-smoke
still failed at 25m45s. The 'pull pr-N image, silently fall back to
local build if missing' chain meant that for ANY PR where the dev
hadn't run scripts/push-current-arch.sh, install.sh's
`compose pull 2>/dev/null || warn ... will build locally` slipped into
`compose up` → `docker build` → `cargo build --release` → timeout.
That's the wrong default in two dimensions: per-PR docker rebuilds
aren't worth it at the canary level (would consume many machines per
PR), and the silent downgrade hides the actual issue (image missing)
behind a 25-min compute burn.

Long-term goal: the docker build is bloated by Node-legacy chat surface
that the Rust-core / thin-Node-client extraction will remove. Once
that's done, builds are small enough that per-PR images become viable.
Until then, canary PR install-smoke validates the install PATH against
canary's binary; the BINARY validation runs at main promotion when
fresh images get built.

Two changes:

1. .github/workflows/carl-install-smoke.yml — default to :canary for
   every PR run (and manual triggers). The previous logic interpolated
   to pr-${PR_NUMBER} for PRs, which silently required an image that
   the canary-stage workflow shouldn't depend on. workflow_dispatch
   `image_tag` input still works for the rare explicit pr-N case
   (binary regression debug, historical canary check, etc.).

2. scripts/ci/carl-install-smoke.sh — add a pre-flight check that
   verifies all 4 required image variants (continuum-core-vulkan,
   node-server, widget-server, model-init) exist at the resolved tag.
   If missing, fail-LOUD with a concrete diagnostic ("dev push pipeline
   didn't publish, run scripts/push-current-arch.sh") instead of
   silently falling through to install.sh's local-build path. The
   CARL_ALLOW_LOCAL_BUILD=1 escape hatch is preserved for explicit
   build-path debugging.

Net effect:
- canary PRs (the common case) → tag :canary → images exist → install
  smoke runs against canary's binary in normal time.
- canary images somehow missing (real bug) → fail-LOUD with actionable
  message, not silent 25-min timeout.
- main-promotion runs and explicit pr-N tests → still work via
  workflow_dispatch input.

The avatars-context fix from PR #1476 is NOT included here — it's a
separate concern (the docker-compose dangling line); PR #1476 lands
that piece. This commit fixes the CI-side silent-downgrade pattern.

Co-Authored-By: Claude Opus 4.7
---
 .github/workflows/carl-install-smoke.yml | 41 ++++++-----
 scripts/ci/carl-install-smoke.sh         | 90 ++++++++++++++++++++++++
 2 files changed, 114 insertions(+), 17 deletions(-)

diff --git a/.github/workflows/carl-install-smoke.yml b/.github/workflows/carl-install-smoke.yml
index 27c563935..7ffed4ca8 100644
--- a/.github/workflows/carl-install-smoke.yml
+++ b/.github/workflows/carl-install-smoke.yml
@@ -94,24 +94,31 @@ jobs:
         env:
           # PR HEAD sha so smoke fetches install.sh from THIS PR.
           CARL_INSTALL_REF: ${{ github.event.pull_request.head.sha || inputs.install_ref || github.sha }}
-          # Pin docker images to :pr-N (PR-scoped, mutable per push). Refreshed
-          # by push-image.sh on every dev push, so always reflects this PR's
-          # latest source — but never collides with another PR or canary.
-          # Slices the dev didn't push directly are aliased from :canary by the
-          # dev script (manifest copy, no rebuild). :latest was the prior
-          # default and went 9-14 days stale in April 2026 — never use it for
-          # smoke.
+          # Default to the canary image tag for ALL PR runs (and manual
+          # triggers). Per Joel 2026-05-30: per-PR docker rebuilds aren't
+          # worthwhile at the canary level — image publishing takes a lot of
+          # machines and the build is currently bloated by Node-legacy
+          # surface that the longer-term Rust-core / thin-Node-client
+          # extraction will remove. Image rebuilds are a main-promotion
+          # gate, not a per-PR check.
           #
-          # Resolution priority: PR# > input.image_tag > 'canary'.
-          # On workflow_dispatch (no PR context) the bare `pr-${{ ... }}`
-          # interpolated to 'pr-' (empty after dash), causing install.sh to
-          # miss the registry and fall back to 'will build locally' — which
-          # then ran a full Rust compile of continuum-core-vulkan on the
-          # no-GPU runner and hit the 25-min runner cap (observed run
-          # 25400718464). The conditional below makes manual triggers
-          # default to the canary tag (the cadence we publish on) and lets
-          # operators override via the image_tag input from the UI.
-          CONTINUUM_IMAGE_TAG: ${{ github.event.pull_request.number && format('pr-{0}', github.event.pull_request.number) || inputs.image_tag || 'canary' }}
+          # The previous logic set pr-${PR_NUMBER} for PR runs, which
+          # required `scripts/push-current-arch.sh` to have run for the PR
+          # before the smoke would pass. That published images per PR which
+          # we don't actually need — it just generated "image missing →
+          # silent compose build → 25-min timeout" failures (observed on
+          # #1476 at 25m45s; #1085 from May 11 also has this exact failure
+          # signature). Defaulting to :canary tests the install path
+          # against canary's binary, which is the correct semantic for the
+          # PR-stage gate: validate THIS PR's install.sh + docker-compose
+          # changes; validate the binary at main promotion when fresh
+          # images get built.
+          #
+          # Manual triggers + workflow_dispatch can still override via the
+          # `image_tag` input (useful for explicit pr-N testing when a dev
+          # has pushed pr-N for binary regression work, or for testing a
+          # specific historical canary tag).
+          CONTINUUM_IMAGE_TAG: ${{ inputs.image_tag || 'canary' }}
           # 25-min cap on the docker-only install. Hybrid (Mac source-build)
           # path would exceed this — by design, that's the gate firing on
           # the README/install mismatch.
diff --git a/scripts/ci/carl-install-smoke.sh b/scripts/ci/carl-install-smoke.sh
index 8a59d1074..c597791f4 100644
--- a/scripts/ci/carl-install-smoke.sh
+++ b/scripts/ci/carl-install-smoke.sh
@@ -73,6 +73,96 @@ teardown() {
 }
 trap teardown EXIT INT TERM
 
+# ── 0. Pre-flight: verify the required ghcr.io images exist ──
+# install.sh has a `compose pull 2>/dev/null || warn ... will build locally`
+# fallback so end users on uncommon architectures (e.g. ports to future
+# phone targets) still have a path. CI must NOT take that fallback —
+# building continuum-core-vulkan from source on the no-GPU GHA runner
+# is a full cargo build --release that takes 25+ minutes and hits
+# CARL_INSTALL_TIMEOUT_SEC, which is exactly the silent downgrade
+# Joel called out 2026-05-30 ("Relying on stale builds is dumb" /
+# "fix properly. What broke, what is the long term goal").
+#
+# What broke (concrete): PR #1476 (avatars context fix) fixed the
+# `docker compose build` error; install.sh then proceeded to
+# `compose pull` which failed (pr-1476 image hadn't been pushed via
+# scripts/push-current-arch.sh), and silently fell through to
+# `compose up` → docker build → cargo build --release → 25min
+# timeout. The avatars fix WORKED; the deeper issue is the silent
+# downgrade after pull failure.
+#
+# Long-term goal: every PR's install-smoke tests THIS PR's binary,
+# fast and reliably. That requires the pre-built image to exist
+# (dev pre-push pipeline publishes pr-N). When the publish didn't
+# happen, the smoke should fail LOUDLY ("image missing, push via
+# scripts/push-current-arch.sh") instead of silently slipping into
+# a 25-min build that times out OR worse, silently using a stale
+# canary image and reporting "tests pass!" on someone else's binary.
+#
+# CONTINUUM_IMAGE_TAG comes from the workflow (pr-N for PRs, canary
+# for manual triggers). We check the variants install path pulls:
+# continuum-core-vulkan is the heavy one; the lighter siblings
+# (node-server, widget-server, model-init) share the tag scheme.
+# Operator escape hatch: CARL_ALLOW_LOCAL_BUILD=1 opts into the
+# install.sh fallback — useful when explicitly debugging the build
+# path, NOT for production CI.
+REQUIRED_IMAGE_VARIANTS=(
+  "continuum-core-vulkan"
+  "node-server"
+  "widget-server"
+  "model-init"
+)
+RESOLVED_TAG="${CONTINUUM_IMAGE_TAG:-canary}"
+MISSING_IMAGES=()
+echo ""
+echo "━━━ pre-flight: verifying ghcr.io images at :${RESOLVED_TAG} ━━━"
+for variant in "${REQUIRED_IMAGE_VARIANTS[@]}"; do
+  ref="ghcr.io/cambriantech/${variant}:${RESOLVED_TAG}"
+  if docker manifest inspect "$ref" >/dev/null 2>&1; then
+    echo " ✓ $ref"
+  else
+    echo " ✗ $ref (MISSING)"
+    MISSING_IMAGES+=("$ref")
+  fi
+done
+
+if [ ${#MISSING_IMAGES[@]} -gt 0 ]; then
+  echo ""
+  echo "❌ Required images missing at :${RESOLVED_TAG} — refusing to silently fall"
+  echo " through to install.sh's local-build path."
+  echo ""
+  echo " Missing:"
+  for img in "${MISSING_IMAGES[@]}"; do
+    echo " $img"
+  done
+  echo ""
+  echo " Root cause: the dev pre-push pipeline didn't publish images for this PR."
+  echo " Architecturally — CI is for CHECK, not BUILD (Joel 2026-04-23). Devs"
+  echo " publish images via scripts/push-current-arch.sh before push; the CI"
+  echo " smoke uses the pre-built images and times the install path end-to-end."
+  echo ""
+  echo " To unblock this run on a build machine that supports the target arch:"
+  echo " scripts/push-current-arch.sh"
+  echo " Then re-run this workflow. The publish pipeline tags pr-\${PR_NUMBER}."
+  echo ""
+  echo " For PRs that genuinely don't change the binary (docker-compose tweaks,"
+  echo " docs, ts-only): the dev push pipeline already aliases pr-N from canary"
+  echo " in that case (see scripts/push-image.sh manifest copy path) — running"
+  echo " scripts/push-current-arch.sh from any dev box is the right move."
+  echo ""
+  echo " Operator override (debugging only, NOT for production CI): set"
+  echo " CARL_ALLOW_LOCAL_BUILD=1"
+  echo " in the workflow env to fall through to install.sh's local-build."
+  echo " This will likely time out at CARL_INSTALL_TIMEOUT_SEC=${CARL_INSTALL_TIMEOUT_SEC}s"
+  echo " and tests the LOCAL build, not the published image."
+  if [ "${CARL_ALLOW_LOCAL_BUILD:-0}" = "1" ]; then
+    echo ""
+    echo " CARL_ALLOW_LOCAL_BUILD=1 set — continuing into the local-build fallback."
+  else
+    exit 1
+  fi
+fi
+
 # ── 1. Run Carl's exact install command ───────────────────────
 echo ""
 echo "━━━ running install.sh from $CARL_INSTALL_REF ━━━"

From 52169d9298e1a042cff60a3cdbd5c0dde8ce62ac Mon Sep 17 00:00:00 2001
From: joelteply
Date: Sat, 30 May 2026 12:37:25 -0500
Subject: [PATCH 2/2] fix(ci): only gate install-smoke precheck on heavy Rust image
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

First iteration of the precheck required ALL 4 images
(continuum-core-vulkan, node-server, widget-server, model-init).
Initial run on this PR (#1480) revealed canary has
continuum-core-vulkan published but the lighter TS sidecar images
(node-server, widget-server, model-init) aren't always at the canary
tag — the dev push pipeline publishes the Rust slice on different
cadences than the TS slices.

Per Joel 2026-05-30: "node-server / model-init / widgets ... build in
under a minute on either arch."
Those local builds DON'T blow the 25-min timeout that triggered the
original failure mode. So gating the smoke on all 4 images is
over-strict — it fails the gate for the common case where canary's
Rust is fresh but the TS sidecars aren't yet published at that tag.

Refinement: precheck gates only on continuum-core-vulkan (the heavy one
whose local build is the 25-min cargo build --release). The lighter TS
sidecars are documented as "pulled if present, built locally if not" —
install.sh's existing compose-pull-then-build fallback is fine for
those because their local build is fast.

This restores the intended semantic: catch the SLOW silent fallback
(Rust source build) and fail-loud; let the FAST sidecar fallback
through as install.sh always did.

Co-Authored-By: Claude Opus 4.7
---
 scripts/ci/carl-install-smoke.sh | 51 ++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 22 deletions(-)

diff --git a/scripts/ci/carl-install-smoke.sh b/scripts/ci/carl-install-smoke.sh
index c597791f4..376848905 100644
--- a/scripts/ci/carl-install-smoke.sh
+++ b/scripts/ci/carl-install-smoke.sh
@@ -99,32 +99,39 @@ trap teardown EXIT INT TERM
 # a 25-min build that times out OR worse, silently using a stale
 # canary image and reporting "tests pass!" on someone else's binary.
 #
-# CONTINUUM_IMAGE_TAG comes from the workflow (pr-N for PRs, canary
-# for manual triggers). We check the variants install path pulls:
-# continuum-core-vulkan is the heavy one; the lighter siblings
-# (node-server, widget-server, model-init) share the tag scheme.
-# Operator escape hatch: CARL_ALLOW_LOCAL_BUILD=1 opts into the
-# install.sh fallback — useful when explicitly debugging the build
+# Only the HEAVY Rust binary image (continuum-core-vulkan) must exist
+# pre-built — that's the one whose local build is a 25-min cargo
+# build --release that hits CARL_INSTALL_TIMEOUT_SEC. The lighter TS
+# images (node-server, widget-server, model-init) build in under a
+# minute on either arch per Joel 2026-05-30 — install.sh's fallback
+# building them locally is acceptable, doesn't blow the timeout.
+#
+# This split avoids the precheck mis-firing on the common case where
+# canary has the Rust image fresh (BigMama pushed) but the lighter
+# TS sidecar images haven't been pushed yet under the canary tag.
+# Just the Rust image being present is sufficient to make the smoke
+# fast and meaningful.
+#
+# CONTINUUM_IMAGE_TAG comes from the workflow (canary by default
+# per the carl-install-smoke.yml change in this commit). Operator
+# escape hatch: CARL_ALLOW_LOCAL_BUILD=1 opts into install.sh's
+# full fallback — useful when explicitly debugging the heavy build
 # path, NOT for production CI.
-REQUIRED_IMAGE_VARIANTS=(
-  "continuum-core-vulkan"
-  "node-server"
-  "widget-server"
-  "model-init"
-)
+RUST_BINARY_IMAGE="continuum-core-vulkan"
 RESOLVED_TAG="${CONTINUUM_IMAGE_TAG:-canary}"
 MISSING_IMAGES=()
 echo ""
-echo "━━━ pre-flight: verifying ghcr.io images at :${RESOLVED_TAG} ━━━"
-for variant in "${REQUIRED_IMAGE_VARIANTS[@]}"; do
-  ref="ghcr.io/cambriantech/${variant}:${RESOLVED_TAG}"
-  if docker manifest inspect "$ref" >/dev/null 2>&1; then
-    echo " ✓ $ref"
-  else
-    echo " ✗ $ref (MISSING)"
-    MISSING_IMAGES+=("$ref")
-  fi
-done
+echo "━━━ pre-flight: verifying heavy ghcr.io image at :${RESOLVED_TAG} ━━━"
+RUST_REF="ghcr.io/cambriantech/${RUST_BINARY_IMAGE}:${RESOLVED_TAG}"
+if docker manifest inspect "$RUST_REF" >/dev/null 2>&1; then
+  echo " ✓ $RUST_REF"
+else
+  echo " ✗ $RUST_REF (MISSING — heavy build, blocks the smoke)"
+  MISSING_IMAGES+=("$RUST_REF")
+fi
+echo " (lighter TS sidecars node-server / widget-server / model-init"
+echo " will be pulled if present, built locally if not — sub-minute"
+echo " cost either way; not gated by this pre-flight)"
 
 if [ ${#MISSING_IMAGES[@]} -gt 0 ]; then
   echo ""