fix(ci): canary tag default for install-smoke + fail-loud precheck by joelteply · Pull Request #1480 · CambrianTech/continuum

joelteply · 2026-05-30T17:28:14Z

Summary

install-smoke was silently downgrading to "build continuum-core from source" when the PR-scoped docker image hadn't been pushed yet — burning 25+ minutes per PR run before timing out at the CARL_INSTALL_TIMEOUT_SEC cap of 1500s.

This PR adds a precheck that picks the right image tag and warns when it's falling back:

pr-N if the dev pushed it via scripts/push-current-arch.sh
canary otherwise (most-recently-published stable), with a GitHub Actions warning annotation surfacing the fallback

The failure mode this unblocks

PR #1476 (avatars context fix) is correct and unblocks the docker compose build step. But install.sh's compose pull 2>/dev/null || warn silently fell through to compose up, which triggered a docker build of continuum-core-vulkan from source. On the no-GPU runner that's a full cargo build --release: 25+ min of wall time, enough to hit the timeout. PR #1476 failed install-smoke at 25m45s purely because no one had pushed the pr-1476 image (and no one should have to, since the PR doesn't change Rust source).

Why the workflow-level fix is right

Per Joel 2026-05-30 architectural pick: "Fix install-smoke to use pre-built image first." Two reasons the per-PR push requirement was wrong for non-Rust PRs:

Non-Rust PRs don't need a fresh binary. docker-compose tweaks, install script fixes, ts-only changes are functionally identical to canary's binary. Forcing a 25-min from-source build to test them is just noise.
Silent downgrade is the wrong default. If pr-N is missing, the workflow should either fall back loudly OR fail loud. Building from source then timing out 25 min later is the worst-of-both.

The new behavior:

pr-N exists → smoke runs against THIS PR's binary (current behavior, unchanged)
pr-N missing → smoke runs against :canary AND surfaces a warning annotation. PR author can decide: "do I need my actual binary tested, or is canary's fine?"
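The tag selection described above can be sketched in shell. This is a minimal illustration, not the workflow's actual code: the function names, the RESOLVED_TAG variable, and the registry path used in the usage line are assumptions. It relies only on `docker manifest inspect` exiting non-zero for a missing reference and on the GitHub Actions `::warning::` workflow command.

```shell
# Sketch only: helper names and the image path are illustrative assumptions.

# Hypothetical helper: `docker manifest inspect` exits non-zero when the
# registry has no manifest for the given reference.
image_exists() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# resolve_tag IMAGE PR_NUMBER
# Sets RESOLVED_TAG to pr-N when that image exists, otherwise to canary,
# and prints a warning annotation so the fallback is visible in the run.
resolve_tag() {
  local image="$1" pr_number="${2:-}"
  if [ -n "$pr_number" ] && image_exists "${image}:pr-${pr_number}"; then
    RESOLVED_TAG="pr-${pr_number}"
  else
    RESOLVED_TAG="canary"
    # `::warning::` is the GitHub Actions workflow command for a warning annotation.
    echo "::warning::no pr-${pr_number:-N} image for ${image}; smoke will run against :canary"
  fi
}

# Usage (placeholder registry path):
#   resolve_tag ghcr.io/example/continuum-core-vulkan 1476
#   echo "using tag: $RESOLVED_TAG"
```

Setting a variable rather than printing the tag keeps the annotation on stdout, where the runner is known to parse workflow commands.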

For PRs that DO change Rust source (e.g. #1475 Mac Intel hardware tier), the warning is the signal that someone needs to push the image before the smoke is meaningful.

Test plan

YAML syntax valid (commit verified)
CI green on this PR (install-smoke triggers via paths filter on .github/workflows/carl-install-smoke.yml)
Once merged, re-run install-smoke on #1476 (fix(install): remove dangling avatars build context from docker-compose); it should fall back to :canary and pass
Then rebase #1475 (feat: Mac Intel hardware tier + cognition perf pass) → install-smoke for #1475 will surface the warning (Rust changes, no image pushed); we admin-merge or push the image at that point

Followups (task tracker)

Add a CI lint that validates docker compose config resolves all additional_contexts (would have caught the avatars dangling line in seconds instead of 6+ weeks). Tracked as task #54.
Possibly: add WIP-style required-check enforcement that requires pr-N image to be present for PRs touching src/workers/** (instead of canary-fallback for those). Out of scope here.

Two complementary changes, both architecturally driven by Joel 2026-05-30: "We don't need to rebuild all docker obviously until we go into main. Takes a lot of machines. ... Fix properly. What broke, what is the long term goal."

What broke: PR #1476's avatars-context fix succeeded but install-smoke still failed at 25m45s. The 'pull pr-N image, silently fall back to local build if missing' chain meant that for ANY PR where the dev hadn't run scripts/push-current-arch.sh, install.sh's `compose pull 2>/dev/null || warn ... will build locally` slipped into `compose up` → `docker build` → `cargo build --release` → timeout. That's the wrong default in two dimensions: per-PR docker rebuilds aren't worth it at the canary level (would consume many machines per PR), and the silent downgrade hides the actual issue (image missing) behind a 25-min compute burn.

Long-term goal: the docker build is bloated by Node-legacy chat surface that the Rust-core / thin-Node-client extraction will remove. Once that's done, builds are small enough that per-PR images become viable. Until then, canary PR install-smoke validates the install PATH against canary's binary; the BINARY validation runs at main promotion when fresh images get built.

Two changes:

1. .github/workflows/carl-install-smoke.yml: default to :canary for every PR run (and manual triggers). The previous logic interpolated to pr-${PR_NUMBER} for PRs, which silently required an image that the canary-stage workflow shouldn't depend on. workflow_dispatch `image_tag` input still works for the rare explicit pr-N case (binary regression debug, historical canary check, etc.).
2. scripts/ci/carl-install-smoke.sh: add a pre-flight check that verifies all 4 required image variants (continuum-core-vulkan, node-server, widget-server, model-init) exist at the resolved tag. If missing, fail-LOUD with a concrete diagnostic ("dev push pipeline didn't publish, run scripts/push-current-arch.sh") instead of silently falling through to install.sh's local-build path. The CARL_ALLOW_LOCAL_BUILD=1 escape hatch is preserved for explicit build-path debugging.

Net effect:

- canary PRs (the common case) → tag :canary → images exist → install smoke runs against canary's binary in normal time.
- canary images somehow missing (real bug) → fail-LOUD with actionable message, not silent 25-min timeout.
- main-promotion runs and explicit pr-N tests → still work via workflow_dispatch input.

The avatars-context fix from PR #1476 is NOT included here; it's a separate concern (the docker-compose dangling line), and PR #1476 lands that piece. This commit fixes the CI-side silent-downgrade pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
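As a rough sketch of the two changes in this commit, the snippet below resolves the tag (an explicit tag wins, otherwise canary) and then checks all four image variants, failing loudly unless CARL_ALLOW_LOCAL_BUILD=1 is set. The function names and the REGISTRY placeholder are assumptions made for illustration; the PR does not show the real script or the registry path.

```shell
# Sketch only: REGISTRY is a placeholder, and the function names are hypothetical.
REGISTRY="${REGISTRY:-ghcr.io/example}"
REQUIRED_IMAGES="continuum-core-vulkan node-server widget-server model-init"

# An explicit tag (for example the workflow_dispatch image_tag input) wins;
# an empty or missing argument resolves to canary.
resolve_image_tag() {
  local explicit="${1:-}"
  echo "${explicit:-canary}"
}

# Hypothetical helper: non-zero exit when the registry has no such manifest.
image_exists() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# Prints the name of every required image missing at TAG, one per line.
missing_images() {
  local tag="$1" name
  for name in $REQUIRED_IMAGES; do
    image_exists "${REGISTRY}/${name}:${tag}" || echo "$name"
  done
}

# Returns 0 when the smoke may proceed, 1 when it should fail loud.
preflight() {
  local tag="$1" missing
  missing="$(missing_images "$tag")"
  if [ -z "$missing" ]; then
    return 0
  fi
  if [ "${CARL_ALLOW_LOCAL_BUILD:-0}" = "1" ]; then
    # Unquoted echo joins the newline-separated names with spaces.
    echo "local build explicitly allowed; missing at :${tag}: $(echo $missing)"
    return 0
  fi
  echo "::error::missing at :${tag}: $(echo $missing). Dev push pipeline didn't publish; run scripts/push-current-arch.sh"
  return 1
}

# Usage:
#   TAG="$(resolve_image_tag "${IMAGE_TAG:-}")"
#   preflight "$TAG" || exit 1
```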

First iteration of the precheck required ALL 4 images (continuum-core-vulkan, node-server, widget-server, model-init). Initial run on this PR (#1480) revealed canary has continuum-core-vulkan published but the lighter TS sidecar images (node-server, widget-server, model-init) aren't always at the canary tag: the dev push pipeline publishes the Rust slice on different cadences than the TS slices.

Per Joel 2026-05-30: "node-server / model-init / widgets ... build in under a minute on either arch." Those local builds DON'T blow the 25-min timeout that triggered the original failure mode. So gating the smoke on all 4 images is over-strict: it fails the gate for the common case where canary's Rust is fresh but the TS sidecars aren't yet published at that tag.

Refinement: precheck gates only on continuum-core-vulkan (the heavy one whose local build is the 25-min cargo build --release). The lighter TS sidecars are documented as "pulled if present, built locally if not"; install.sh's existing compose-pull-then-build fallback is fine for those because their local build is fast.

This restores the intended semantic: catch the SLOW silent fallback (Rust source build) and fail-loud; let the FAST sidecar fallback through as install.sh always did.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
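The refinement can be pictured as splitting the image list into one gating image and three advisory ones. The following is a sketch under assumptions: REGISTRY, `image_exists`, and `precheck` are hypothetical names, and the message wording is illustrative rather than copied from the script.

```shell
# Sketch only: REGISTRY is a placeholder, and the function names are hypothetical.
REGISTRY="${REGISTRY:-ghcr.io/example}"
HEAVY_IMAGE="continuum-core-vulkan"                 # slow local build: gates the run
LIGHT_IMAGES="node-server widget-server model-init" # fast local build: advisory only

# Hypothetical helper: non-zero exit when the registry has no such manifest.
image_exists() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# precheck TAG
# Fails (returns 1) only when the heavy Rust image is missing at TAG.
# Missing sidecars are reported but left to install.sh's pull-then-build path.
precheck() {
  local tag="$1" name
  if ! image_exists "${REGISTRY}/${HEAVY_IMAGE}:${tag}"; then
    echo "::error::${HEAVY_IMAGE}:${tag} is not published; run scripts/push-current-arch.sh"
    return 1
  fi
  for name in $LIGHT_IMAGES; do
    if ! image_exists "${REGISTRY}/${name}:${tag}"; then
      echo "note: ${name}:${tag} not published; it will be built locally"
    fi
  done
  return 0
}
```

Gating on a single image keeps the check cheap and targets only the fallback that costs a full cargo build --release.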

fix(install): pre-create ~/.continuum/sockets before docker compose up (#1481)

continuum-core's Dockerfile creates /root/.continuum/sockets at image build time, but docker-compose.yml mounts the host's ~/.continuum onto /root/.continuum at container start. The mount overlays the image's directory tree, so the sockets/ subdir created at build is invisible inside the running container. continuum-core then tries to bind its IPC socket at /root/.continuum/sockets/continuum-core.sock, which fails with "IPC server error: No such file or directory (os error 2)" because the parent dir doesn't exist.

Symptom: continuum-core never goes healthy → node-server's depends_on (condition: service_healthy) fails → docker compose up exits 1 with "dependency failed to start: container continuum-core-1 is unhealthy".

Concrete trace from canary install-smoke for PR #1480 today:

17:40:25 - All 28 modules initialized, tick loops started
17:40:25 - ❌ IPC server error: No such file or directory (os error 2)
17:40:26 - Container Error / Waiting → Healthcheck never passes
install.sh exits at "start support services" phase

This bug has been silently blocking install-smoke for any docker-stack-touching PR; the previous 25-min cargo-build timeout was masking it because the install never got far enough to discover the socket issue. Now that PR #1480's precheck + canary-default routing makes the run fast, the underlying problem surfaces in 3 minutes with a clear error.

Fix: pre-create the host-side directory tree (sockets/, jtag/data/, jtag/logs/) BEFORE compose up. This way the bind mount delivers a populated /root/.continuum to the container and continuum-core can bind its socket on first start.

This is install.sh-side, not Dockerfile-side, because the mount is the overlaying layer: image-build mkdirs are hidden by the bind. The canonical fix is to mkdir on the host (which is what gets mounted).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
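The host-side fix amounts to a `mkdir -p` over the three subdirectories before compose up. A minimal sketch follows; the function name and its optional base-directory argument are assumptions added so the step can be exercised against a scratch directory, and the actual install.sh change may be shaped differently.

```shell
# Sketch only: prepare_continuum_home is a hypothetical name.
# Creates the directory tree that the bind mount will expose inside the
# container as /root/.continuum. `mkdir -p` is idempotent, so re-running
# the installer against an existing tree is harmless.
prepare_continuum_home() {
  local base="${1:-$HOME/.continuum}"
  mkdir -p "$base/sockets" "$base/jtag/data" "$base/jtag/logs"
}

# Usage, before starting the stack:
#   prepare_continuum_home
#   docker compose up -d
```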

github-actions Bot added the size: M label May 30, 2026

joelteply force-pushed the fix/ci-install-smoke-tag-fallback branch from c91197b to 485ed49 Compare May 30, 2026 17:34

joelteply changed the title ~~fix(ci): install-smoke falls back to :canary when pr-N image isn't pushed~~ fix(ci): canary tag default for install-smoke + fail-loud precheck May 30, 2026

joelteply merged commit 86d8c56 into canary May 30, 2026
2 of 3 checks passed

joelteply deleted the fix/ci-install-smoke-tag-fallback branch May 30, 2026 17:43

joelteply mentioned this pull request May 30, 2026

fix(install): pre-create ~/.continuum/sockets before docker compose up #1481

Merged

3 tasks

joelteply merged 2 commits into canary from fix/ci-install-smoke-tag-fallback
