
fix(install): remove dangling avatars build context from docker-compose #1476

Merged
joelteply merged 1 commit into canary from fix/docker-compose-dangling-avatars-context on May 30, 2026


@joelteply
Contributor

Summary

Remove the dangling avatars: ./src/models/avatars entry under additional_contexts in docker-compose.yml, which has been silently breaking carl-install-smoke for any PR that touches install.sh.

Root cause

The April 2026 commit 9b1f6ca added the build context on the expectation that CC0 avatar VRMs would be baked into the image. That plan was rolled back: docker/continuum-core.Dockerfile lines 131-143 document the rollback (src/models is gitignored, and the Dockerfile uses a RUN mkdir -p /app/avatars placeholder instead).

The compose-side additional_contexts declaration was left behind. No Dockerfile uses --from=avatars (verified by grep), so the declaration referenced nothing. However, docker compose validates that all declared additional_contexts resolve at build time, so the missing local context directory fails the build with stat .../src/models/avatars: no such file or directory.
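The two conditions described above (a context that no Dockerfile references, and a context directory that does not exist) can be checked mechanically. The following is a minimal sketch, not the check used in this PR: the function names `context_is_referenced` and `audit_context` are hypothetical, and it assumes Dockerfiles have "Dockerfile" somewhere in their file name and reference named contexts only through `--from=NAME`.

```shell
# Hypothetical helper: succeeds if any Dockerfile under directory $2
# references the named build context $1 via --from=NAME.
context_is_referenced() {
  grep -rqsE --include='*Dockerfile*' -e "--from=$1([[:space:]]|\$)" "$2"
}

# Hypothetical helper: classify a compose additional_contexts entry.
#   $1 = context name, $2 = local context path, $3 = directory holding Dockerfiles
# Prints "unused"  when no Dockerfile references the context (safe to delete),
#        "missing" when it is referenced but the directory does not exist,
#        "ok"      otherwise.
audit_context() {
  if ! context_is_referenced "$1" "$3"; then
    echo "unused"
  elif [ ! -d "$2" ]; then
    echo "missing"
  else
    echo "ok"
  fi
}
```

From a repository root, an invocation for the entry discussed here might look like `audit_context avatars ./src/models/avatars ./docker`.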

Impact

PR #1475 (Mac Intel hardware tier) currently fails carl-install-smoke: it touches install.sh, which triggers the check, and the docker build then fails on the missing avatars context. PRs that don't touch install.sh (#1471, #1473, #1474) never ran the check, so the break was invisible until now.

Fix

Remove the dangling line and replace it with an explanatory comment, so that a future contributor doesn't re-add it without first restoring the avatar-provisioning story (LFS, model-init download, or CC0-URL curl), per the gap noted in docs/infrastructure/PR891-E2E-VALIDATION.md.

Test plan

joelteply added a commit that referenced this pull request May 30, 2026
Two complementary changes, both architecturally driven by Joel
2026-05-30: "We don't need to rebuild all docker obviously until we
go into main. Takes a lot of machines. ... Fix properly. What broke,
what is the long term goal."

What broke: PR #1476's avatars-context fix succeeded but install-smoke
still failed at 25m45s. The 'pull pr-N image, silently fall back to
local build if missing' chain meant that for ANY PR where the dev
hadn't run scripts/push-current-arch.sh, install.sh's
`compose pull 2>/dev/null || warn ... will build locally` slipped into
`compose up` → `docker build` → `cargo build --release` → timeout.
That's the wrong default in two dimensions: per-PR docker rebuilds
aren't worth it at the canary level (would consume many machines per
PR), and the silent downgrade hides the actual issue (image missing)
behind a 25-min compute burn.

Long-term goal: the docker build is bloated by Node-legacy chat surface
that the Rust-core / thin-Node-client extraction will remove. Once
that's done, builds are small enough that per-PR images become viable.
Until then, canary PR install-smoke validates the install PATH against
canary's binary; the BINARY validation runs at main promotion when
fresh images get built.

Two changes:

1. .github/workflows/carl-install-smoke.yml — default to :canary for
   every PR run (and manual triggers). The previous logic interpolated
   to pr-${PR_NUMBER} for PRs, which silently required an image that
   the canary-stage workflow shouldn't depend on. workflow_dispatch
   `image_tag` input still works for the rare explicit pr-N case
   (binary regression debug, historical canary check, etc.).

2. scripts/ci/carl-install-smoke.sh — add a pre-flight check that
   verifies all 4 required image variants (continuum-core-vulkan,
   node-server, widget-server, model-init) exist at the resolved tag.
   If missing, fail-LOUD with a concrete diagnostic ("dev push pipeline
   didn't publish, run scripts/push-current-arch.sh") instead of
   silently falling through to install.sh's local-build path. The
   CARL_ALLOW_LOCAL_BUILD=1 escape hatch is preserved for explicit
   build-path debugging.

Net effect:
- canary PRs (the common case) → tag :canary → images exist → install
  smoke runs against canary's binary in normal time.
- canary images somehow missing (real bug) → fail-LOUD with actionable
  message, not silent 25-min timeout.
- main-promotion runs and explicit pr-N tests → still work via
  workflow_dispatch input.

The avatars-context fix from PR #1476 is NOT included here — it's a
separate concern (the docker-compose dangling line); PR #1476 lands
that piece. This commit fixes the CI-side silent-downgrade pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
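
The tag-selection rule from change 1 in the commit message above can be sketched in shell. This is an illustration only: the real logic lives in .github/workflows/carl-install-smoke.yml, which is not shown in this thread, and the function name `resolve_image_tag` is hypothetical.

```shell
# Hypothetical sketch of the tag default described in change 1.
#   $1 = the optional workflow_dispatch `image_tag` input (empty on PR runs)
# An explicit input wins; otherwise every run resolves to "canary".
resolve_image_tag() {
  printf '%s\n' "${1:-canary}"
}
```

Called with no input, the function resolves to the canary tag, and an explicit value such as a pr-N tag passes through unchanged. This keeps the pr-N case opt-in rather than the default.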
joelteply added a commit that referenced this pull request May 30, 2026
…1480)

* fix(ci): canary tag default for install-smoke + fail-loud precheck

* fix(ci): only gate install-smoke precheck on heavy Rust image

First iteration of the precheck required ALL 4 images (continuum-core-
vulkan, node-server, widget-server, model-init). Initial run on this
PR (#1480) revealed canary has continuum-core-vulkan published but
the lighter TS sidecar images (node-server, widget-server, model-init)
aren't always at the canary tag — the dev push pipeline publishes the
Rust slice on different cadences than the TS slices.

Per Joel 2026-05-30: "node-server / model-init / widgets ... build in
under a minute on either arch." Those local builds DON'T blow the
25-min timeout that triggered the original failure mode. So gating
the smoke on all 4 images is over-strict — it fails the gate for the
common case where canary's Rust is fresh but the TS sidecars aren't
yet published at that tag.

Refinement: precheck gates only on continuum-core-vulkan (the heavy
one whose local build is the 25-min cargo build --release). The
lighter TS sidecars are documented as "pulled if present, built
locally if not" — install.sh's existing compose-pull-then-build
fallback is fine for those because their local build is fast.

This restores the intended semantic: catch the SLOW silent fallback
(Rust source build) and fail-loud; let the FAST sidecar fallback
through as install.sh always did.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
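
The refined precheck described in the commit above (gate only on the heavy Rust image, fail loudly, keep the CARL_ALLOW_LOCAL_BUILD=1 escape hatch) might look roughly like the sketch below. It is not the contents of scripts/ci/carl-install-smoke.sh. The registry prefix, the function names, and the `IMAGE_PROBE` override are assumptions made for illustration, and `docker manifest inspect` is one possible way to probe for an image without pulling it.

```shell
# Hypothetical registry prefix; the real one is not given in this thread.
REGISTRY_PREFIX="${REGISTRY_PREFIX:-registry.example.com/continuum}"

# Probe for an image without pulling it (assumes registry access).
image_exists() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# Gate only on continuum-core-vulkan, whose local build is the slow one.
# The lighter sidecar images are left to install.sh's pull-then-build fallback.
#   $1 = resolved image tag
# IMAGE_PROBE is a hypothetical override so the probe can be stubbed out.
precheck_images() (
  probe="${IMAGE_PROBE:-image_exists}"
  ref="${REGISTRY_PREFIX}/continuum-core-vulkan:$1"
  if "$probe" "$ref"; then
    echo "ok: $ref found"
  elif [ "${CARL_ALLOW_LOCAL_BUILD:-0}" = "1" ]; then
    echo "warn: $ref missing; CARL_ALLOW_LOCAL_BUILD=1 set, building locally" >&2
  else
    echo "error: $ref missing; run scripts/push-current-arch.sh" >&2
    exit 1
  fi
)
```

The function body runs in a subshell, so the `exit 1` fails only the precheck and leaves the decision to abort with the caller, for example `precheck_images canary || exit 1`.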
The `avatars: ./src/models/avatars` additional_context was added in
9b1f6ca (April 2026) when the plan was to bake CC0 avatar VRMs
into the continuum-core image. That plan never landed end-to-end —
docker/continuum-core.Dockerfile lines 131-143 document the rollback:
src/models is gitignored, the dir doesn't exist in CI checkouts,
and the Dockerfile uses `RUN mkdir -p /app/avatars` as a placeholder
instead of COPYing from the avatars context.

The compose-side context declaration was left behind, dangling. No
Dockerfile uses `--from=avatars` (verified by grep), so the declaration
referenced nothing in build instructions. But docker compose validates
that ALL additional_contexts resolve at build time — a missing local
context dir fails the whole build with "stat /tmp/carl-smoke-NNNN/src/
models/avatars: no such file or directory".

That's the exact failure mode currently blocking carl-install-smoke
on PR #1475 (Mac Intel hardware tier) — any PR that touches install.sh
triggers carl-install-smoke, which has been silently broken by this
dangling context since the rollback. Other PRs (e.g. #1471, #1473,
#1474) didn't touch install.sh so the check never ran on them; the
break was invisible until now.

Removing the line restores the carl-install-smoke happy path while
keeping the Dockerfile's empty-dir placeholder intact. Restore the
build context when the avatar-provisioning story lands (LFS, model-init
download, or curl from a CC0 URL in CI before docker build) per the
gap noted in docs/infrastructure/PR891-E2E-VALIDATION.md.

Inline comment preserves the context-of-removal in the file so a
future contributor doesn't re-add the dangling line.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@joelteply joelteply force-pushed the fix/docker-compose-dangling-avatars-context branch from f7cdcfb to 41d3148 Compare May 30, 2026 18:34
@joelteply joelteply merged commit 6eadbc2 into canary May 30, 2026
3 checks passed
@joelteply joelteply deleted the fix/docker-compose-dangling-avatars-context branch May 30, 2026 18:51