fix(ci): tolerate Docker Hub login failure (issue #82) #83

Merged
konard merged 9 commits into main from issue-82-9bbaad39cc07
Apr 29, 2026

@konard konard commented Apr 28, 2026

Summary

This PR addresses three CI/CD failure modes captured under issue #82 ("Fix all CI/CD bugs"):

  1. An expired DOCKERHUB_TOKEN cascading into a full release blackout — the original incident on 2026-04-28 (run #25073172386): all 52 Docker build jobs failed at the identical Log in to Docker Hub step with unauthorized: personal access token is expired, and the 14 downstream manifest/release jobs were skipped — blocking publication to GHCR as well, even though the GHCR credential (GITHUB_TOKEN) was still perfectly valid.
  2. docker-build-test running out of disk on ubuntu-24.04 — surfaced by this PR's own CI (run #25075335426): the PR-only smoke job builds JS → essentials → 11 language images → full-box sequentially on a single runner, and on the COPY --from=ruby-stage step it fails with no space left on device (0 MB free).
  3. PR CI did not test every Docker image configuration in parallel — a subsequent review request: every image must be tested on its own VM with maximum free disk, and configurations should run in parallel to keep iteration fast.

Fixes #82.

What changed

Track 1 — Tolerate Docker Hub login failure

  • .github/workflows/release.yml — every Log in to Docker Hub step (15 occurrences across build-js-*, build-essentials-*, build-languages-*, build-dind-*, *-manifest, docker-build-push*) now has id: dockerhub-login + continue-on-error: true, and is immediately followed by a Check Docker Hub login (issue #82) step that emits a ::warning annotation pointing at the rotation runbook when the login fails. GHCR pushes proceed on their existing credentials when Docker Hub is unavailable.
  • README.md — new ## Releasing section documenting which secrets the release workflow expects (DOCKERHUB_USERNAME, DOCKERHUB_TOKEN) and a step-by-step Docker Hub PAT rotation runbook.
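The tolerance pattern above relies on a GitHub Actions detail: with continue-on-error: true, a failed step keeps outcome "failure" while its conclusion becomes "success", so a follow-up check step has to look at the outcome. A minimal Python model of that behaviour follows. The function names and the warning text are illustrative assumptions, not copied from release.yml.

```python
from typing import Optional


def step_conclusion(outcome: str, continue_on_error: bool) -> str:
    """Model how a step's conclusion is derived from its raw outcome."""
    # A failed step with continue-on-error no longer fails the job.
    if outcome == "failure" and continue_on_error:
        return "success"
    return outcome


def login_warning(outcome: str) -> Optional[str]:
    """Return a ::warning workflow command when the login did not succeed."""
    if outcome == "success":
        return None
    # Hypothetical message; the real check step's wording may differ.
    return (
        "::warning title=Docker Hub login failed::"
        "DOCKERHUB_TOKEN may be expired; see the rotation runbook in README.md"
    )


if __name__ == "__main__":
    outcome = "failure"  # what an expired PAT would produce
    print(step_conclusion(outcome, continue_on_error=True))
    print(login_warning(outcome))
```

A check that inspected the conclusion instead of the outcome would never fire, because the conclusion is already "success" by the time the next step runs.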

Track 2 — Free disk on every build job

  • .github/workflows/release.yml: added jlumbroso/free-disk-space@main (the same step docker-build-push and docker-build-push-arm64 already used since issue #41, "We have publishing of new images failed") to every build job that lacked it. The set is now 11 jobs (was 3): all pr-test-*, build-{js,essentials,languages,dind}-{amd64,arm64}, and docker-build-push{,-arm64}. Each reclaims ~30 GB before its first build step.

Track 3 — Parallel PR test matrix, isolated per-image VMs

  • .github/workflows/release.yml — replaced the single sequential docker-build-test job with a chain of parallel matrix jobs:

    Job                 Strategy
    pr-test-js          1 VM
    pr-test-essentials  1 VM, needs: pr-test-js
    pr-test-language    matrix × 11 langs in parallel, fail-fast: false
    pr-test-full        1 VM, builds the full chain locally (Dockerfile uses COPY --from=*-stage)
    pr-test-dind        matrix × 14 variants in parallel, fail-fast: false
    docker-build-test   aggregator → preserves the existing branch-protection check name

    Cross-job layer reuse uses docker/build-push-action with cache-from/cache-to: type=gha to avoid re-doing JS / essentials work in every leaf job.
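The aggregator row can be read as a simple rule: the branch-protection check is green only if every upstream pr-test-* job reports "success". The sketch below models that rule in Python, assuming the aggregator inspects each dependency's result (GitHub Actions exposes success, failure, cancelled, or skipped). The function name is hypothetical and the real job may be written differently.

```python
from typing import Dict, List


def failed_dependencies(results: Dict[str, str]) -> List[str]:
    """List every upstream job whose result is anything other than success."""
    # "skipped" and "cancelled" are treated as failures on purpose: a skipped
    # test job has not verified anything, so it must not turn the check green.
    return sorted(job for job, result in results.items() if result != "success")


if __name__ == "__main__":
    results = {
        "pr-test-js": "success",
        "pr-test-essentials": "success",
        "pr-test-language": "failure",
        "pr-test-full": "success",
        "pr-test-dind": "skipped",
    }
    bad = failed_dependencies(results)
    print("docker-build-test:", "fail" if bad else "pass", bad)
```

For this rule to be evaluated at all when a dependency fails, the aggregator job generally needs an if: always() condition. Without it, the aggregator is itself skipped.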

Documentation & tests

  • docs/case-studies/issue-82/CASE-STUDY.md — incident write-up (timeline, root-cause analysis, contributing factor, alternatives considered, prevention) for all three failure modes plus the rationale for the parallel matrix shape.
  • docs/case-studies/issue-82/ci-logs/ — captured failed-step logs.
  • experiments/test-issue82-dockerhub-login-tolerance.sh — static-analysis check that all 15 login steps in release.yml are non-blocking and paired with a check step.
  • experiments/test-issue82-pr-parallel-tests.sh — static-analysis check that every pr-test-* job exists, every build job (15) has a Free disk space step, the language matrix lists all 11 languages, the dind matrix lists all 14 variants, and docker-build-test aggregates them all.
  • .changeset/issue-82-tolerate-dockerhub-login-failure.md — patch-level changeset covering all three fixes.
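To make the first static-analysis check concrete, here is a rough Python sketch of the invariant it is described as enforcing: every Log in to Docker Hub step is non-blocking and is immediately followed by a check step. This is not the contents of the .sh script. The embedded workflow fragment is a hypothetical sample, and the line-based parser assumes each step begins with a "- name:" line.

```python
import re

STEP_START = re.compile(r"^\s*- name: (.+)$")

# Hypothetical fragment for illustration; not copied from release.yml.
SAMPLE = """\
steps:
  - name: Log in to Docker Hub
    id: dockerhub-login
    continue-on-error: true
    uses: docker/login-action@v3
  - name: Check Docker Hub login (issue #82)
    if: steps.dockerhub-login.outcome != 'success'
    run: echo "::warning::Docker Hub login failed"
"""


def split_steps(text):
    """Return (name, body_lines) pairs for each '- name:' step, in file order."""
    steps = []
    for line in text.splitlines():
        match = STEP_START.match(line)
        if match:
            steps.append((match.group(1).strip(), []))
        elif steps:
            steps[-1][1].append(line.strip())
    return steps


def intolerant_logins(text):
    """Report login steps that are blocking or lack a follow-up check step."""
    problems = []
    steps = split_steps(text)
    for index, (name, body) in enumerate(steps):
        if name != "Log in to Docker Hub":
            continue
        if "continue-on-error: true" not in body or "id: dockerhub-login" not in body:
            problems.append((index, "login step is blocking or has no id"))
        next_name = steps[index + 1][0] if index + 1 < len(steps) else ""
        if not next_name.startswith("Check Docker Hub login"):
            problems.append((index, "no check step follows"))
    return problems


if __name__ == "__main__":
    print(intolerant_logins(SAMPLE))
```

A text-level check like this is deliberately crude. It does not understand job boundaries or steps whose first key is not name, so a YAML-aware version would be more robust.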

Reproducing

Track 1 (PAT expiry): generate a Docker Hub PAT with a short expiry, save as DOCKERHUB_TOKEN, wait for expiry, then gh workflow run "Build and Release Docker Image" --ref main. Before this PR: every Docker build job fails at the login step. After: every job emits ::warning Docker Hub login failed and continues; GHCR completes; Docker Hub pushes fail loudly so the run is still red, but the actionable rotation message appears in run annotations.

Track 2 (disk exhaustion): observable on any PR that triggered the old docker-build-test after the dind variants from #80 — see run #25075335426. With the fix, every leaf build runs on a fresh VM with ~30 GB freed up front.

Track 3 (parallel matrix): push a no-op commit; the PR run now spawns 28 test jobs (1 + 1 + 11 + 1 + 14) plus the docker-build-test aggregator instead of 1 sequential job, so the wall-clock time for testing every Docker image configuration drops from the sum of all builds on a single VM to roughly the duration of the slowest single image chain.
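The wall-clock argument for Track 3 can be illustrated with a small calculation: a sequential job costs the sum of all builds, while a parallel layout costs only its longest dependency chain. The durations and job labels below are made-up assumptions for illustration, not measurements from the real workflow.

```python
from functools import lru_cache

# Hypothetical build times in minutes; not measured from the real workflow.
DURATIONS = {"js": 5, "essentials": 8, "lang-ruby": 12, "lang-go": 6, "full": 30}
# Only the dependency stated in the PR description: essentials needs js.
NEEDS = {"essentials": ("js",)}


def sequential_minutes(durations):
    """One runner builds everything back to back."""
    return sum(durations.values())


def critical_path_minutes(durations, needs):
    """Each job runs on its own VM and starts once its dependencies finish."""

    @lru_cache(maxsize=None)
    def finish(job):
        start = max((finish(dep) for dep in needs.get(job, ())), default=0)
        return start + durations[job]

    return max(finish(job) for job in durations)


# Job count from the PR description, excluding the aggregator.
TEST_JOBS = 1 + 1 + 11 + 1 + 14

if __name__ == "__main__":
    print(sequential_minutes(DURATIONS))            # 61
    print(critical_path_minutes(DURATIONS, NEEDS))  # 30
    print(TEST_JOBS)                                # 28
```

With these assumed numbers the parallel run is bounded by the 30-minute full build rather than the 61-minute total.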

Test plan

  • python3 -c "import yaml; yaml.safe_load(open('.github/workflows/release.yml'))" — YAML valid.
  • experiments/test-issue82-dockerhub-login-tolerance.sh — passes.
  • experiments/test-issue82-pr-parallel-tests.sh — passes.
  • PR-event CI on this branch — verify all pr-test-* jobs run on separate VMs, all complete, and docker-build-test aggregator goes green.
  • After merge to main: the next release run should succeed for GHCR even if DOCKERHUB_TOKEN is still expired, and emit a clear warning annotation. This is the operational verification step.

Operational follow-up (cannot be done in code)

Per the runbook in README.md and docs/case-studies/issue-82/CASE-STUDY.md, a repo admin still needs to:

  1. Generate a fresh Docker Hub PAT with Read, Write, Delete scopes.
  2. Update the DOCKERHUB_TOKEN secret at https://github.com/link-foundation/box/settings/secrets/actions
  3. Re-run run 25073172386 (or trigger workflow_dispatch with release_mode=release-only) to re-publish v2.1.0 to Docker Hub.

Adding .gitkeep for PR creation (default mode).
This file will be removed when the task is complete.

Issue: #82
@konard konard self-assigned this Apr 28, 2026
A single expired DOCKERHUB_TOKEN took down the entire release workflow
on 2026-04-28 (run 25073172386): all 52 build jobs failed at the
identical "Log in to Docker Hub" step with
`unauthorized: personal access token is expired`, and the dependent
manifest/release jobs were skipped, blocking publication to GHCR even
though the GHCR credential (GITHUB_TOKEN) was still valid.

Make every "Log in to Docker Hub" step in `.github/workflows/release.yml`
non-blocking by adding `id: dockerhub-login` and
`continue-on-error: true`, then immediately follow it with a
"Check Docker Hub login (issue #82)" step that emits a `::warning`
annotation pointing at the rotation runbook when the login fails.
GHCR pushes proceed on their existing credentials when Docker Hub is
unavailable.

Document the secrets the workflow expects and the Docker Hub PAT
rotation runbook in README.md, and ship a full incident write-up at
docs/case-studies/issue-82/CASE-STUDY.md (with captured CI logs).

experiments/test-issue82-dockerhub-login-tolerance.sh statically asserts
that all 15 login steps in release.yml are non-blocking and paired with
a check step.
@konard konard changed the title from "[WIP] Fix all CI/CD bugs" to "fix(ci): tolerate Docker Hub login failure (issue #82)" Apr 28, 2026
@konard konard marked this pull request as ready for review April 28, 2026 19:37

konard commented Apr 28, 2026

Solution summary

Done. Summary:

Issue #82 root cause (verified from run 25073172386): the Docker Hub Personal Access Token in secrets.DOCKERHUB_TOKEN is expired. Every one of 52 build jobs in .github/workflows/release.yml failed at the same Log in to Docker Hub step with unauthorized: personal access token is expired, cascading into 14 skipped downstream jobs and blocking the v2.1.0 release.

Fix shipped in PR #83:

  • All 15 Log in to Docker Hub steps in release.yml are now id: dockerhub-login + continue-on-error: true, paired with a follow-up Check Docker Hub login step that emits a ::warning annotation pointing at the rotation runbook. GHCR pushes proceed on GITHUB_TOKEN even when Docker Hub is unavailable.
  • New ## Releasing section in README.md documents the secrets and the Docker Hub PAT rotation runbook.
  • Full incident write-up at docs/case-studies/issue-82/CASE-STUDY.md with captured CI logs (ci-logs/failed-25073172386.txt, ci-logs/run-25073172386-summary.json).
  • experiments/test-issue82-dockerhub-login-tolerance.sh static-checks the workflow invariants.
  • Patch-level changeset added.
  • PR CI run 25073668880: success.
  • PR moved out of draft.

Remaining operational step (must be done by a repo admin, not in code): rotate the Docker Hub PAT at https://hub.docker.com/settings/security and update DOCKERHUB_TOKEN at https://github.com/link-foundation/box/settings/secrets/actions, then re-run run 25073172386.

Want me to /schedule a one-time agent in ~80 days to open a follow-up PR rotating the PAT (matching a 90-day expiry policy), so this incident doesn't repeat silently?


This summary was automatically extracted from the AI working session output.


konard commented Apr 28, 2026

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost: $6.767294

📊 Context and tokens usage:

Claude Opus 4.7:

  • 135.7K / 1M (14%) input tokens, 29.7K / 128K (23%) output tokens

Total: (140.1K + 10.3M cached) input tokens, 29.7K output tokens, $6.765515 cost

Claude Haiku 4.5:

  • 107 / 64K (0%) output tokens

Total: 1.2K input tokens, 107 output tokens, $0.001779 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Main model: Claude Opus 4.7 (claude-opus-4-7)
  • Additional models:
    • Claude Haiku 4.5 (claude-haiku-4-5-20251001)

📎 Log file uploaded as Gist (2791KB)


The working session has now ended; feel free to review and add any feedback on the solution draft.


konard commented Apr 28, 2026

✅ Ready to merge

This pull request is now ready to be merged:

  • All CI checks have passed
  • No merge conflicts
  • No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

@konard konard marked this pull request as draft April 29, 2026 07:55

konard commented Apr 29, 2026

🤖 AI Work Session Started

Starting automated work session at 2026-04-29T07:55:45.433Z

The PR has been converted to draft mode while work is in progress.

This comment marks the beginning of an AI work session. Please wait for the session to finish, and provide your feedback.

konard added 2 commits April 29, 2026 07:58
The PR-only docker-build-test smoke job was running out of disk on
ubuntu-24.04 while building JS -> essentials -> 11 language images ->
full-box sequentially on a single runner (PR run 25075335426 failed at
COPY --from=ruby-stage with "no space left on device", 0 MB free).

Add the same jlumbroso/free-disk-space step that docker-build-push
already uses (added in issue #41) to reclaim ~30 GB before building.

Also extend docs/case-studies/issue-82/CASE-STUDY.md with this
follow-up failure mode and capture the failing job log under
docs/case-studies/issue-82/ci-logs/.

konard commented Apr 29, 2026

Solution summary

Pushed the disk-space fix and updated the PR description. Background monitor bxttvnseu is watching CI run 25097532949 and will notify me when it completes (estimated ~30 min based on previous run). Once green, I'll mark the PR ready.


This summary was automatically extracted from the AI working session output.

@konard konard marked this pull request as ready for review April 29, 2026 08:01

konard commented Apr 29, 2026

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost: $2.647189

📊 Context and tokens usage:

  • 81.6K / 1M (8%) input tokens, 16.3K / 128K (13%) output tokens

Total: (92.0K + 3.3M cached) input tokens, 16.3K output tokens, $2.647189 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Model: Claude Opus 4.7 (claude-opus-4-7)

📎 Log file uploaded as Gist (1583KB)


The working session has now ended; feel free to review and add any feedback on the solution draft.


konard commented Apr 29, 2026

🚨 Solution Draft Failed

The automated solution draft encountered an error:

Session interrupted by user (CTRL+C)

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Model: Claude Opus 4.7 (claude-opus-4-7)

📎 Failure log uploaded as Gist (1751KB)


The working session has now ended; feel free to review and add any feedback on the solution draft.

konard added a commit to link-assistant/hive-mind that referenced this pull request Apr 29, 2026

The /merge verbose log printed "PR #N has no CI checks yet - treating as
no_checks" and "PR #N has no CI check-runs yet, but X workflow run(s) were
triggered" during the legitimate 30-120s gap between GitHub registering a
workflow_run and publishing its check_runs. The classification was correct,
but the wording reads as "no CI is configured / nothing is happening" and
the workflow run listing shows only IDs, not URLs. On link-foundation/box#83
the user watched the loop for 22 minutes and Ctrl+C'd because they could
not verify what /merge was waiting on (and CI ultimately passed — see the
case study).

Changes:
- src/github-merge.lib.mjs:
  * getDetailedCIStatus / checkPRCIStatus reword the no_checks log to
    "has no check-runs or commit statuses registered yet (status=no_checks;
    race vs. no-CI distinction is decided downstream)" with the short SHA.
  * getWorkflowRunsForSha verbose listing now includes run.html_url, so the
    user can click through to the GitHub Actions page.
  * Normalized check-run / commit-status entries carry an html_url field
    (falling back to details_url / target_url).
- src/solve.auto-merge-helpers.lib.mjs::getMergeBlockers:
  * no_checks branch: reword verbose message; per-run [status/conclusion]
    + URL is logged for each workflow run; blocker `details` strings are
    "<name> [<status>] — <url>" so the user-facing
    "⏳ Waiting for CI: …" line in solve.auto-merge.lib.mjs (which joins
    details with commas) automatically picks up the URL.
  * pending branch: same enrichment for check-runs that exist but are
    still running/queued.
  * cancelled branch: details now include conclusion + URL.
- tests/test-misleading-merge-logs-1712.mjs:
  13 unit tests covering wording guard, blocker enrichment for the
  no_checks / pending / cancelled paths, regression guard for #1466,
  and the joined user-facing line format. All passing.
- docs/case-studies/issue-1712/README.md + raw-data/:
  Full case study with raw API snapshots (PR, workflow runs, check-runs),
  reconstructed timeline, root cause, fix description, verification on the
  original case (CI ultimately passed).
- .changeset/issue-1712-misleading-merge-logs.md: patch bump.

konard commented Apr 29, 2026

We need to double-check that each Docker image is built on a separate virtual machine in GitHub Actions, and that every machine frees as much disk space as possible before building. We can use existing cleanup actions for that (their latest versions).

I also updated DOCKERHUB_TOKEN. We need to make sure that no CI/CD jobs are skipped and that they all work properly in the pull request's CI/CD, so that every possible configuration of our Docker images is tested, with tests running in parallel to speed up iteration.

We need to ensure all changes are correct, consistent, validated, tested, logged, and fully meet every discussed requirement in the widest possible sense (check the issue description and all comments in the issue and in the pull request, and list each and every requirement before checking whether it was addressed). Ensure all CI/CD checks pass.

@konard konard marked this pull request as draft April 29, 2026 14:58

konard commented Apr 29, 2026

🤖 AI Work Session Started

Starting automated work session at 2026-04-29T14:58:56.013Z

The PR has been converted to draft mode while work is in progress.

This comment marks the beginning of an AI work session. Please wait for the session to finish, and provide your feedback.

…(issue #82)

Replace the single sequential docker-build-test job with a chain of
parallel matrix jobs so every Docker image configuration is tested on
its own VM with maximum free disk space:

  pr-test-js          1 VM
  pr-test-essentials  1 VM (needs pr-test-js)
  pr-test-language    matrix x 11 langs in parallel
  pr-test-full        1 VM, builds full chain locally for COPY --from
  pr-test-dind        matrix x 14 variants in parallel
  docker-build-test   aggregator for branch protection

Add jlumbroso/free-disk-space@main to every build job that lacked it
(11 jobs total now have it, was 3). Cross-job layer reuse uses
docker/build-push-action with cache-from/cache-to: type=gha.

Adds experiments/test-issue82-pr-parallel-tests.sh as a static-analysis
sanity check for the layout.

konard commented Apr 29, 2026

Solution summary

Free disk space is running on the JS test job. The structure is correct.

Status: pushed ac6d58c with the parallel matrix refactor + per-job free-disk-space + updated changeset and case study, PR description updated, both static-analysis test scripts pass, and run 25117130513 is in progress (pr-test / js currently in Free disk space step). The watch is running in the background and will notify when CI finishes.


This summary was automatically extracted from the AI working session output.

@konard konard marked this pull request as ready for review April 29, 2026 15:13

konard commented Apr 29, 2026

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost: $5.486414

📊 Context and tokens usage:

Claude Opus 4.7: (2 session segments)

  1. 116.8K / 1M (12%) input tokens, 41.9K / 128K (33%) output tokens
  2. 58.7K / 1M (6%) input tokens, 10.2K / 128K (8%) output tokens

Total: (170.2K + 6.0M cached) input tokens, 57.6K output tokens, $5.486414 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Model: Claude Opus 4.7 (claude-opus-4-7)

📎 Log file uploaded as Gist (3404KB)


The working session has now ended; feel free to review and add any feedback on the solution draft.


konard commented Apr 29, 2026

🔄 Auto-restart triggered (iteration 1)

Reason: CI failures detected; Uncommitted changes

Starting new session to address the issues.


Auto-restart-until-mergeable mode is active. This run will stop after 5 restart iterations.

konard added 3 commits April 29, 2026 15:29
Previous attempt used docker/build-push-action@v5 with cache-from/to
type=gha for pr-test-js, pr-test-essentials, and pr-test-language.
That action defaults to the docker-container buildx driver, whose
isolated image store does NOT see images in the host Docker daemon —
so when essentials or a language Dockerfile resolves
`FROM box-js:pr`, buildx tries `docker.io/library/box-js:pr` and the
build fails with `pull access denied`.

Fix: switch all pr-test-* build steps to plain `docker build` against
the host daemon (the same pattern the original docker-build-test job
and pr-test-full / pr-test-dind already use). Each VM now rebuilds
its required base chain locally, which is fine because every VM has
30 GB freed up front and the parallel matrix bounds wall-clock by
the slowest single image.

Verified at run 25117130513:
  - pr-test/js: succeeded
  - pr-test/essentials: failed at `FROM box-js:pr` resolution
  - all downstream jobs skipped
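The docker.io/library/box-js:pr name in the failure above comes from how Docker expands short image references when it has to consult a registry. The Python sketch below is a simplified model of that expansion, offered as an illustration only. It ignores digests and the implicit :latest tag, and the function name is hypothetical.

```python
def normalize(ref: str) -> str:
    """Expand a short image reference to its registry-qualified form (simplified)."""
    first, sep, _rest = ref.partition("/")
    # The first path component counts as a registry host only if it looks like
    # one: it contains a dot or a port, or it is literally "localhost".
    has_domain = bool(sep) and ("." in first or ":" in first or first == "localhost")
    if has_domain:
        return ref
    if not sep:
        # A bare name such as "box-js:pr" is assumed to be an official image.
        return "docker.io/library/" + ref
    return "docker.io/" + ref


if __name__ == "__main__":
    print(normalize("box-js:pr"))
    print(normalize("ghcr.io/link-foundation/box:latest"))
```

A plain docker build against the host daemon finds box-js:pr in the local image store first, so the registry lookup never happens. An isolated builder has no such store and falls through to the expanded name.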
…issue #82)

`kotlinc` is a shell wrapper around `java`, so the standalone
`box-kotlin` image (which never had Java installed) failed the new
per-language matrix test `docker run --rm box-kotlin kotlin -version`
with:

    /home/box/.sdkman/candidates/kotlin/current/bin/kotlinc:
    line 102: java: command not found

The full-box image was unaffected because Java was supplied by the
`box-java` build stage; the language-only image was simply never
exercised in isolation before the parallelized PR test matrix landed.

`ubuntu/24.04/kotlin/install.sh` now installs Java 21 LTS (Eclipse
Temurin, falling back to OpenJDK) via SDKMAN before installing
Kotlin, making the standalone `box-kotlin` image self-sufficient and
unblocking pr-test / kotlin.
Adds a section explaining that the new per-language pr-test matrix
caught a real regression in the standalone box-kotlin image (no Java
installed), motivating the install.sh fix.

konard commented Apr 29, 2026

🔄 Auto-restart-until-mergeable Log (iteration 1)

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost: $6.566314

📊 Context and tokens usage:

Claude Opus 4.7: (2 session segments)

  1. 117.1K / 1M (12%) input tokens, 17.0K / 128K (13%) output tokens
  2. 73.0K / 1M (7%) input tokens, 13.6K / 128K (11%) output tokens

Total: (198.3K + 8.9M cached) input tokens, 35.6K output tokens, $6.566314 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Model: Claude Opus 4.7 (claude-opus-4-7)

📎 Log file uploaded as Gist (6450KB)


The working session has now ended; feel free to review and add any feedback on the solution draft.

@konard konard merged commit c7626c0 into main Apr 29, 2026
50 checks passed