fix(ci): canary tag default for install-smoke + fail-loud precheck #1480

Merged
joelteply merged 2 commits into canary from fix/ci-install-smoke-tag-fallback on May 30, 2026

@joelteply
Contributor

Summary

install-smoke was silently downgrading to "build continuum-core from source" when the PR-scoped docker image hadn't been pushed yet — burning 25+ minutes per PR run before timing out at the CARL_INSTALL_TIMEOUT_SEC cap of 1500s.

This PR adds a precheck that picks the right image tag and warns when it's falling back:

  • pr-N if the dev pushed it via scripts/push-current-arch.sh
  • canary otherwise (most-recently-published stable), with a GitHub Actions warning annotation surfacing the fallback
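
The selection rule above could be sketched in shell roughly as follows. This is a minimal sketch, not the script from this PR: the registry path, the `IMAGE_PROBE` hook, and the function name `resolve_image_tag` are all assumptions made for illustration. The `::warning::` prefix is the GitHub Actions workflow-command syntax for a warning annotation.

```shell
# Hypothetical registry path; the real image reference is not given in this PR.
IMAGE_REPO="${IMAGE_REPO:-ghcr.io/example/continuum-core-vulkan}"

# Command used to test whether a tag exists. It is overridable so the logic can
# be exercised without a registry; `docker manifest inspect` is an assumed default.
IMAGE_PROBE="${IMAGE_PROBE:-docker manifest inspect}"

# Print the tag install-smoke should use for PR number $1.
# The warning goes to stderr so that stdout carries only the tag.
resolve_image_tag() {
  local pr_number="$1"
  if $IMAGE_PROBE "${IMAGE_REPO}:pr-${pr_number}" >/dev/null 2>&1; then
    echo "pr-${pr_number}"
  else
    echo "::warning::pr-${pr_number} image not found; falling back to :canary" >&2
    echo "canary"
  fi
}
```

A caller would capture the result with something like `IMAGE_TAG="$(resolve_image_tag "$PR_NUMBER")"`, where `PR_NUMBER` is likewise an assumed variable name.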

The failure mode this unblocks

PR #1476 (avatars context fix) is correct and unblocks the docker compose build step. But install.sh's compose pull 2>/dev/null || warn silently fell through to compose up, which triggered a docker build of continuum-core-vulkan from source. On the no-GPU runner that's a full cargo build --release: 25+ min of wall-clock time, which hit the timeout. PR #1476 failed install-smoke at 25m45s purely because no one had pushed the pr-1476 image (and no one should have to, since the PR doesn't change Rust source).
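
To make the fall-through concrete, here is a hedged sketch contrasting the two patterns. It is not install.sh's actual code: `PULL_CMD`, `pull_quiet`, and `pull_strict` are hypothetical names, and the messages are illustrative.

```shell
# Command that pulls images; overridable for testing. The default is the
# compose pull mentioned above.
PULL_CMD="${PULL_CMD:-docker compose pull}"

# Pattern described above: stderr is discarded and a failed pull is downgraded
# to a warning, so the caller carries on to `compose up` and a source build.
pull_quiet() {
  $PULL_CMD 2>/dev/null || echo "warn: pull failed, will build locally"
  return 0
}

# Fail-loud alternative: keep the error output and return non-zero so the
# caller stops instead of starting a long local build.
pull_strict() {
  if ! $PULL_CMD; then
    echo "error: image pull failed; refusing to fall back to a local build" >&2
    return 1
  fi
}
```

With the first form the pull failure never reaches the exit status, which is why the only visible symptom was the timeout 25 minutes later.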

Why the workflow-level fix is right

Per Joel's 2026-05-30 architectural pick: "Fix install-smoke to use pre-built image first." There are two reasons the per-PR push requirement was wrong for non-Rust PRs:

  1. Non-Rust PRs don't need a fresh binary. For docker-compose tweaks, install script fixes, and TS-only changes, the resulting binary is functionally identical to canary's. Forcing a 25-min from-source build to test them is just noise.
  2. Silent downgrade is the wrong default. If pr-N is missing, the workflow should either fall back loudly OR fail loudly. Building from source and then timing out 25 min later is the worst of both.

The new behavior:

  • pr-N exists → smoke runs against THIS PR's binary (current behavior, unchanged)
  • pr-N missing → smoke runs against :canary AND surfaces a warning annotation. PR author can decide: "do I need my actual binary tested, or is canary's fine?"

For PRs that DO change Rust source (e.g. #1475 Mac Intel hardware tier), the warning is the signal that someone needs to push the image before the smoke is meaningful.

Test plan

Followups (task tracker)

  • Add a CI lint that validates docker compose config resolves all additional_contexts (would have caught the avatars dangling line in seconds instead of 6+ weeks). Tracked as task #54.
  • Possibly: add WIP-style required-check enforcement that requires pr-N image to be present for PRs touching src/workers/** (instead of canary-fallback for those). Out of scope here.

Two complementary changes, both architecturally driven by Joel
2026-05-30: "We don't need to rebuild all docker obviously until we
go into main. Takes a lot of machines. ... Fix properly. What broke,
what is the long term goal."

What broke: PR #1476's avatars-context fix succeeded but install-smoke
still failed at 25m45s. The 'pull pr-N image, silently fall back to
local build if missing' chain meant that for ANY PR where the dev
hadn't run scripts/push-current-arch.sh, install.sh's
`compose pull 2>/dev/null || warn ... will build locally` slipped into
`compose up` → `docker build` → `cargo build --release` → timeout.
That's the wrong default in two dimensions: per-PR docker rebuilds
aren't worth it at the canary level (would consume many machines per
PR), and the silent downgrade hides the actual issue (image missing)
behind a 25-min compute burn.

Long-term goal: the docker build is bloated by Node-legacy chat surface
that the Rust-core / thin-Node-client extraction will remove. Once
that's done, builds are small enough that per-PR images become viable.
Until then, canary PR install-smoke validates the install PATH against
canary's binary; the BINARY validation runs at main promotion when
fresh images get built.

Two changes:

1. .github/workflows/carl-install-smoke.yml — default to :canary for
   every PR run (and manual triggers). The previous logic interpolated
   to pr-${PR_NUMBER} for PRs, which silently required an image that
   the canary-stage workflow shouldn't depend on. workflow_dispatch
   `image_tag` input still works for the rare explicit pr-N case
   (binary regression debug, historical canary check, etc.).

2. scripts/ci/carl-install-smoke.sh — add a pre-flight check that
   verifies all 4 required image variants (continuum-core-vulkan,
   node-server, widget-server, model-init) exist at the resolved tag.
   If missing, fail-LOUD with a concrete diagnostic ("dev push pipeline
   didn't publish, run scripts/push-current-arch.sh") instead of
   silently falling through to install.sh's local-build path. The
   CARL_ALLOW_LOCAL_BUILD=1 escape hatch is preserved for explicit
   build-path debugging.
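
As a rough illustration of the two changes, the default tag and the pre-flight check might look like the sketch below. It is written under stated assumptions, not copied from scripts/ci/carl-install-smoke.sh: `IMAGE_TAG`, `REGISTRY_PREFIX`, `IMAGE_PROBE`, and `precheck_images` are hypothetical names, and the registry prefix is a placeholder.

```shell
# Tag defaults to canary; an explicit value (for example one derived from the
# workflow_dispatch `image_tag` input) overrides it. Variable name is assumed.
IMAGE_TAG="${IMAGE_TAG:-canary}"

# Hypothetical registry prefix; the real one is not stated in this PR.
REGISTRY_PREFIX="${REGISTRY_PREFIX:-ghcr.io/example}"
REQUIRED_IMAGES="continuum-core-vulkan node-server widget-server model-init"

# Existence probe, overridable for testing; `docker manifest inspect` is assumed.
IMAGE_PROBE="${IMAGE_PROBE:-docker manifest inspect}"

# Return 0 if every required image exists at $IMAGE_TAG, or if local builds
# are explicitly allowed; otherwise report what is missing and return 1.
precheck_images() {
  local missing="" name
  if [ "${CARL_ALLOW_LOCAL_BUILD:-0}" = "1" ]; then
    echo "CARL_ALLOW_LOCAL_BUILD=1: skipping image precheck" >&2
    return 0
  fi
  for name in $REQUIRED_IMAGES; do
    $IMAGE_PROBE "${REGISTRY_PREFIX}/${name}:${IMAGE_TAG}" >/dev/null 2>&1 \
      || missing="${missing} ${name}"
  done
  if [ -n "$missing" ]; then
    echo "::error::missing at :${IMAGE_TAG}:${missing}" >&2
    echo "dev push pipeline didn't publish; run scripts/push-current-arch.sh" >&2
    return 1
  fi
}
```

Running the check before install.sh means a missing image costs seconds rather than the full timeout.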

Net effect:
- canary PRs (the common case) → tag :canary → images exist → install
  smoke runs against canary's binary in normal time.
- canary images somehow missing (real bug) → fail-LOUD with actionable
  message, not silent 25-min timeout.
- main-promotion runs and explicit pr-N tests → still work via
  workflow_dispatch input.

The avatars-context fix from PR #1476 is NOT included here — it's a
separate concern (the docker-compose dangling line); PR #1476 lands
that piece. This commit fixes the CI-side silent-downgrade pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@joelteply joelteply force-pushed the fix/ci-install-smoke-tag-fallback branch from c91197b to 485ed49 Compare May 30, 2026 17:34
@joelteply joelteply changed the title from "fix(ci): install-smoke falls back to :canary when pr-N image isn't pushed" to "fix(ci): canary tag default for install-smoke + fail-loud precheck" May 30, 2026
First iteration of the precheck required ALL 4 images (continuum-core-
vulkan, node-server, widget-server, model-init). Initial run on this
PR (#1480) revealed canary has continuum-core-vulkan published but
the lighter TS sidecar images (node-server, widget-server, model-init)
aren't always at the canary tag: the dev push pipeline publishes the
Rust slice on a different cadence than the TS slices.

Per Joel 2026-05-30: "node-server / model-init / widgets ... build in
under a minute on either arch." Those local builds DON'T blow the
25-min timeout that triggered the original failure mode. So gating
the smoke on all 4 images is over-strict — it fails the gate for the
common case where canary's Rust is fresh but the TS sidecars aren't
yet published at that tag.

Refinement: precheck gates only on continuum-core-vulkan (the heavy
one whose local build is the 25-min cargo build --release). The
lighter TS sidecars are documented as "pulled if present, built
locally if not" — install.sh's existing compose-pull-then-build
fallback is fine for those because their local build is fast.

This restores the intended semantic: catch the SLOW silent fallback
(Rust source build) and fail-loud; let the FAST sidecar fallback
through as install.sh always did.
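
The refined gate can be sketched as below: only the heavy image blocks the run, while missing sidecars produce a note. This is an illustrative sketch under assumptions. `REGISTRY_PREFIX`, `IMAGE_TAG`, `IMAGE_PROBE`, `image_exists`, and `precheck_heavy_only` are hypothetical names, and the registry prefix is a placeholder.

```shell
# Hypothetical settings; none of these names or values come from the PR itself.
REGISTRY_PREFIX="${REGISTRY_PREFIX:-ghcr.io/example}"
IMAGE_TAG="${IMAGE_TAG:-canary}"
IMAGE_PROBE="${IMAGE_PROBE:-docker manifest inspect}"

HEAVY_IMAGE="continuum-core-vulkan"                  # slow cargo build if missing
LIGHT_IMAGES="node-server widget-server model-init"  # fast local builds

# True if image $1 exists at the resolved tag.
image_exists() {
  $IMAGE_PROBE "${REGISTRY_PREFIX}/$1:${IMAGE_TAG}" >/dev/null 2>&1
}

# Fail only when the heavy image is missing; note missing sidecars and continue.
precheck_heavy_only() {
  local name
  if ! image_exists "$HEAVY_IMAGE"; then
    echo "::error::${HEAVY_IMAGE}:${IMAGE_TAG} missing; refusing slow source build" >&2
    return 1
  fi
  for name in $LIGHT_IMAGES; do
    image_exists "$name" \
      || echo "note: ${name}:${IMAGE_TAG} not published; will build locally" >&2
  done
  return 0
}
```

Splitting the image list by build cost keeps the gate strict where a fallback is expensive and permissive where it is cheap.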

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@joelteply joelteply merged commit 86d8c56 into canary May 30, 2026
2 of 3 checks passed
@joelteply joelteply deleted the fix/ci-install-smoke-tag-fallback branch May 30, 2026 17:43
joelteply added a commit that referenced this pull request May 30, 2026
#1481)

continuum-core's Dockerfile creates /root/.continuum/sockets at image
build time, but docker-compose.yml mounts the host's ~/.continuum
onto /root/.continuum at container start. The mount overlays the
image's directory tree — the sockets/ subdir created at build is
invisible inside the running container. continuum-core then tries
to bind its IPC socket at /root/.continuum/sockets/continuum-core.sock,
which fails with "IPC server error: No such file or directory
(os error 2)" because the parent dir doesn't exist.

Symptom: continuum-core never goes healthy → node-server's depends_on
(condition: service_healthy) fails → docker compose up exits 1 with
"dependency failed to start: container continuum-core-1 is unhealthy".

Concrete trace from canary install-smoke for PR #1480 today:
  17:40:25 — All 28 modules initialized, tick loops started
  17:40:25 — ❌ IPC server error: No such file or directory (os error 2)
  17:40:26 — Container Error / Waiting → Healthcheck never passes
  install.sh exits at "start support services" phase

This bug has been silently blocking install-smoke for any docker-stack-
touching PR; the previous 25-min cargo-build timeout was masking it
because the install never got far enough to discover the socket issue.
Now that PR #1480's precheck + canary-default routing makes the run
fast, the underlying problem surfaces in 3 minutes with a clear error.

Fix: pre-create the host-side directory tree (sockets/, jtag/data/,
jtag/logs/) BEFORE compose up. This way the bind mount delivers a
populated /root/.continuum to the container and continuum-core can
bind its socket on first start.

This is install.sh-side, not Dockerfile-side, because the mount is the
overlaying layer — image-build mkdirs are hidden by the bind. The
canonical fix is to mkdir on the host (which is what gets mounted).
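
A minimal sketch of the host-side pre-creation, assuming the three subdirectories named above. The function name `prepare_continuum_home` and the `CONTINUUM_HOME` override are hypothetical; the text above only specifies ~/.continuum.

```shell
# Create the host-side tree that gets bind-mounted onto /root/.continuum.
# CONTINUUM_HOME is an assumed override; it defaults to ~/.continuum.
prepare_continuum_home() {
  local root="${CONTINUUM_HOME:-$HOME/.continuum}"
  # mkdir -p is idempotent, so repeated installs are safe.
  mkdir -p "$root/sockets" "$root/jtag/data" "$root/jtag/logs"
}
```

The call would go before `docker compose up`, so the container sees sockets/ at /root/.continuum/sockets on first start.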

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>