fix(ci): canary tag default for install-smoke + fail-loud precheck #1480
Merged
Two complementary changes, both architecturally driven by Joel 2026-05-30: "We don't need to rebuild all docker obviously until we go into main. Takes a lot of machines. ... Fix properly. What broke, what is the long term goal."

What broke: PR #1476's avatars-context fix succeeded but install-smoke still failed at 25m45s. The 'pull pr-N image, silently fall back to local build if missing' chain meant that for ANY PR where the dev hadn't run scripts/push-current-arch.sh, install.sh's `compose pull 2>/dev/null || warn ... will build locally` slipped into `compose up` → `docker build` → `cargo build --release` → timeout. That's the wrong default in two dimensions: per-PR docker rebuilds aren't worth it at the canary level (would consume many machines per PR), and the silent downgrade hides the actual issue (image missing) behind a 25-min compute burn.

Long-term goal: the docker build is bloated by Node-legacy chat surface that the Rust-core / thin-Node-client extraction will remove. Once that's done, builds are small enough that per-PR images become viable. Until then, canary PR install-smoke validates the install PATH against canary's binary; the BINARY validation runs at main promotion when fresh images get built.

Two changes:

1. .github/workflows/carl-install-smoke.yml: default to :canary for every PR run (and manual triggers). The previous logic interpolated to pr-${PR_NUMBER} for PRs, which silently required an image that the canary-stage workflow shouldn't depend on. workflow_dispatch `image_tag` input still works for the rare explicit pr-N case (binary regression debug, historical canary check, etc.).
2. scripts/ci/carl-install-smoke.sh: add a pre-flight check that verifies all 4 required image variants (continuum-core-vulkan, node-server, widget-server, model-init) exist at the resolved tag. If missing, fail-LOUD with a concrete diagnostic ("dev push pipeline didn't publish, run scripts/push-current-arch.sh") instead of silently falling through to install.sh's local-build path. The CARL_ALLOW_LOCAL_BUILD=1 escape hatch is preserved for explicit build-path debugging.

Net effect:

- canary PRs (the common case) → tag :canary → images exist → install smoke runs against canary's binary in normal time.
- canary images somehow missing (real bug) → fail-LOUD with actionable message, not silent 25-min timeout.
- main-promotion runs and explicit pr-N tests → still work via workflow_dispatch input.

The avatars-context fix from PR #1476 is NOT included here. It's a separate concern (the docker-compose dangling line); PR #1476 lands that piece. This commit fixes the CI-side silent-downgrade pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
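The two changes in this commit message (default to `:canary` unless a tag is given explicitly, then fail loudly when a required image is missing) could look roughly like the sketch below. It is not the code from scripts/ci/carl-install-smoke.sh: `REGISTRY_PREFIX`, `resolve_image_tag`, `image_exists`, and `precheck_images` are hypothetical names, and the registry path is a placeholder. Only `CARL_ALLOW_LOCAL_BUILD`, the image names, and the diagnostic wording come from the commit message.

```shell
#!/usr/bin/env bash
# Sketch only. REGISTRY_PREFIX and all function names are illustrative
# assumptions, not taken from the repository.
REGISTRY_PREFIX="${REGISTRY_PREFIX:-ghcr.io/example}"

# Default every run to the canary tag; a non-empty argument (for example
# the workflow_dispatch image_tag input) overrides it.
resolve_image_tag() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' "$1"
  else
    printf 'canary\n'
  fi
}

# Returns 0 when the registry serves a manifest for the given reference.
image_exists() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# precheck_images <tag> <image>...
# Returns 1 with a GitHub Actions error annotation when any image is
# missing, unless CARL_ALLOW_LOCAL_BUILD=1 opts in to the local build path.
precheck_images() {
  local tag="$1" name ref missing=""
  shift
  for name in "$@"; do
    ref="${REGISTRY_PREFIX}/${name}:${tag}"
    image_exists "$ref" || missing="${missing} ${ref}"
  done
  if [ -z "$missing" ]; then
    return 0
  fi
  if [ "${CARL_ALLOW_LOCAL_BUILD:-0}" = "1" ]; then
    echo "precheck: missing${missing}; CARL_ALLOW_LOCAL_BUILD=1, allowing local build"
    return 0
  fi
  echo "::error::precheck: missing${missing}; dev push pipeline didn't publish, run scripts/push-current-arch.sh" >&2
  return 1
}
```

A caller would resolve the tag once, for example `tag="$(resolve_image_tag "$IMAGE_TAG_INPUT")"` with a hypothetical `IMAGE_TAG_INPUT` variable, and run `precheck_images "$tag" continuum-core-vulkan node-server widget-server model-init` before invoking install.sh.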
c91197b to
485ed49
First iteration of the precheck required ALL 4 images (continuum-core-vulkan, node-server, widget-server, model-init). Initial run on this PR (#1480) revealed canary has continuum-core-vulkan published but the lighter TS sidecar images (node-server, widget-server, model-init) aren't always at the canary tag: the dev push pipeline publishes the Rust slice on a different cadence than the TS slices.

Per Joel 2026-05-30: "node-server / model-init / widgets ... build in under a minute on either arch." Those local builds DON'T blow the 25-min timeout that triggered the original failure mode. So gating the smoke on all 4 images is over-strict: it fails the gate for the common case where canary's Rust is fresh but the TS sidecars aren't yet published at that tag.

Refinement: precheck gates only on continuum-core-vulkan (the heavy one whose local build is the 25-min cargo build --release). The lighter TS sidecars are documented as "pulled if present, built locally if not". install.sh's existing compose-pull-then-build fallback is fine for those because their local build is fast.

This restores the intended semantic: catch the SLOW silent fallback (Rust source build) and fail-loud; let the FAST sidecar fallback through as install.sh always did.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
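The refined gate (hard-fail only on the heavy Rust image, report the fast sidecars without failing) might be sketched as follows. As before, `REGISTRY_PREFIX`, `image_exists`, and `precheck_refined` are hypothetical names with a placeholder registry path; the actual script may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only. REGISTRY_PREFIX and the function names are illustrative
# assumptions, not taken from the repository.
REGISTRY_PREFIX="${REGISTRY_PREFIX:-ghcr.io/example}"

# Returns 0 when the registry serves a manifest for the given reference.
image_exists() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# precheck_refined <tag>
# Sidecars are informational: a missing one only emits a notice, because
# install.sh can build it locally in about a minute.
# continuum-core-vulkan is the gate: missing means a slow cargo build, so
# return 1 unless CARL_ALLOW_LOCAL_BUILD=1.
precheck_refined() {
  local tag="$1" name ref
  for name in node-server widget-server model-init; do
    ref="${REGISTRY_PREFIX}/${name}:${tag}"
    if ! image_exists "$ref"; then
      echo "::notice::${ref} not published; install.sh will build it locally"
    fi
  done
  ref="${REGISTRY_PREFIX}/continuum-core-vulkan:${tag}"
  if image_exists "$ref" || [ "${CARL_ALLOW_LOCAL_BUILD:-0}" = "1" ]; then
    return 0
  fi
  echo "::error::${ref} missing; dev push pipeline didn't publish, run scripts/push-current-arch.sh" >&2
  return 1
}
```

Keeping the sidecar loop separate from the gate makes the policy explicit: only the image whose fallback is slow can fail the run.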
3 tasks
joelteply added a commit that referenced this pull request May 30, 2026
#1481)

continuum-core's Dockerfile creates /root/.continuum/sockets at image build time, but docker-compose.yml mounts the host's ~/.continuum onto /root/.continuum at container start. The mount overlays the image's directory tree, so the sockets/ subdir created at build is invisible inside the running container. continuum-core then tries to bind its IPC socket at /root/.continuum/sockets/continuum-core.sock, which fails with "IPC server error: No such file or directory (os error 2)" because the parent dir doesn't exist.

Symptom: continuum-core never goes healthy → node-server's depends_on (condition: service_healthy) fails → docker compose up exits 1 with "dependency failed to start: container continuum-core-1 is unhealthy".

Concrete trace from canary install-smoke for PR #1480 today:

  17:40:25 - All 28 modules initialized, tick loops started
  17:40:25 - ❌ IPC server error: No such file or directory (os error 2)
  17:40:26 - Container Error / Waiting → Healthcheck never passes
  install.sh exits at "start support services" phase

This bug has been silently blocking install-smoke for any docker-stack-touching PR; the previous 25-min cargo-build timeout was masking it because the install never got far enough to discover the socket issue. Now that PR #1480's precheck + canary-default routing makes the run fast, the underlying problem surfaces in 3 minutes with a clear error.

Fix: pre-create the host-side directory tree (sockets/, jtag/data/, jtag/logs/) BEFORE compose up. This way the bind mount delivers a populated /root/.continuum to the container and continuum-core can bind its socket on first start.

This is install.sh-side, not Dockerfile-side, because the mount is the overlaying layer: image-build mkdirs are hidden by the bind. The canonical fix is to mkdir on the host (which is what gets mounted).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
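The host-side fix described in this commit message amounts to a `mkdir -p` before `docker compose up`. A minimal sketch, assuming a hypothetical helper named `prepare_continuum_home` (the directory names come from the commit message; the function name and its optional argument are illustrative):

```shell
#!/usr/bin/env bash
# Sketch only. prepare_continuum_home is a hypothetical helper name.

# prepare_continuum_home [dir]
# Creates the directory tree that the bind mount will expose inside the
# container at /root/.continuum. Defaults to ~/.continuum on the host.
# mkdir -p is idempotent, so re-running the installer is safe.
prepare_continuum_home() {
  local root="${1:-$HOME/.continuum}"
  mkdir -p "$root/sockets" "$root/jtag/data" "$root/jtag/logs"
}
```

Calling `prepare_continuum_home` before `docker compose up` means the bind-mounted directory already contains sockets/ when continuum-core tries to bind its IPC socket.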
Summary
install-smoke was silently downgrading to "build continuum-core from source" when the PR-scoped docker image hadn't been pushed yet, burning 25+ minutes per PR run before timing out at the `CARL_INSTALL_TIMEOUT_SEC` cap of 1500s.

This PR adds a precheck that picks the right image tag and warns when it's falling back:

- `pr-N` if the dev pushed it via `scripts/push-current-arch.sh`
- `canary` otherwise (most-recently-published stable), with a GitHub Actions warning annotation surfacing the fallback

The failure mode this unblocks
PR #1476 (avatars context fix) is correct and unblocks the docker compose build step. But install.sh's `compose pull 2>/dev/null || warn` silently fell through to `compose up`, which triggered a `docker build` of continuum-core-vulkan from source. On the no-GPU runner that's a full `cargo build --release`: 25+ min wall, hit the timeout. PR #1476 failed install-smoke at 25m45s purely because no one had pushed the pr-1476 image (and shouldn't have to, since the PR doesn't change Rust source).

Why the workflow-level fix is right
Per Joel 2026-05-30 architectural pick: "Fix install-smoke to use pre-built image first." Two reasons the per-PR push requirement was wrong for non-Rust PRs:

The new behavior:
:canary AND surfaces a warning annotation. PR author can decide: "do I need my actual binary tested, or is canary's fine?"

For PRs that DO change Rust source (e.g. #1475 Mac Intel hardware tier), the warning is the signal that someone needs to push the image before the smoke is meaningful.
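The tag selection described in this summary (use `pr-N` when that image was published, otherwise fall back to `canary` and surface a warning annotation) could be sketched like this. `REGISTRY_PREFIX`, `image_exists`, and `select_image_tag` are hypothetical names with a placeholder registry path, and probing only continuum-core-vulkan is an assumption made for brevity.

```shell
#!/usr/bin/env bash
# Sketch only. REGISTRY_PREFIX and the function names are illustrative
# assumptions, not taken from the repository.
REGISTRY_PREFIX="${REGISTRY_PREFIX:-ghcr.io/example}"

# Returns 0 when the registry serves a manifest for the given reference.
image_exists() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# select_image_tag <pr-number>
# Prints the tag to use on stdout. The warning goes to stderr so that
# command substitution captures only the tag.
select_image_tag() {
  local pr_number="$1" ref
  ref="${REGISTRY_PREFIX}/continuum-core-vulkan:pr-${pr_number}"
  if image_exists "$ref"; then
    printf 'pr-%s\n' "$pr_number"
  else
    echo "::warning::${ref} not published; falling back to :canary" >&2
    printf 'canary\n'
  fi
}
```

In a workflow step this would be used as `tag="$(select_image_tag "$PR_NUMBER")"`, where `PR_NUMBER` is assumed to be exported by the workflow.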
Test plan
.github/workflows/carl-install-smoke.yml)

Followups (task tracker)

- `docker compose config` resolves all `additional_contexts` (would have caught the avatars dangling line in seconds instead of 6+ weeks). Tracked as task #54.
- WIP-style required-check enforcement that requires pr-N image to be present for PRs touching `src/workers/**` (instead of canary-fallback for those). Out of scope here.