ci(nightly): publish + trigger GitLab on test/build failures with UPSTREAM_FAILED gate by saturley-hall · Pull Request #9263 · ai-dynamo/dynamo

saturley-hall · 2026-05-07T17:09:31Z

Overview:

Today any test/build failure in nightly-ci blocks the release job, so the GitLab pipeline never sees a partial nightly. This PR moves the policy decision out of GitHub Actions and into GitLab: always run publish + the GitLab trigger, forward an UPSTREAM_FAILED flag, and let the GitLab side require human approval before promoting (gate to be added on the GitLab side separately).

Details:

nightly-ci.yml (release job): drop !failure() from the if; add vllm-build / sglang-build / trtllm-build to needs so cascading skips surface as direct failures; forward upstream_failed (computed from contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled')) into release.yml.
release.yml: new upstream_failed input on both workflow_dispatch and workflow_call. trigger-gitlab-release-pipeline's if switches to !cancelled() && needs.prepare-release.result == 'success' && inputs.skip_gitlab_pipeline != true so a failed release-publish or stage-wheels-artifactory no longer skips the trigger.
stage-wheels-artifactory: probes the dynamo-runtime-cuda12 ECR tag with crane manifest first; on miss it sets wheels_staged=false and exits 0 instead of failing the run. Upload step is gated on wheels_staged. Job exposes wheels_staged as an output.
release-publish: exposes successful_count / failed_count as job outputs (the inner script already collected them).
The composite HAD_FAILURES flag (input + publish result/failed_count + wheels result/staged) ships as the UPSTREAM_FAILED GitLab variable. ARTIFACTS is now built dynamically — only includes container / wheel / helm for things that actually staged.

Behavior reminder: a test-only failure (builds all green) still publishes containers to NGC and stages wheels to Artifactory; only the UPSTREAM_FAILED flag flips so GitLab can require review.
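
For reviewers, the gate computation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in release.yml: `compute_gate` is a hypothetical helper, its arguments stand in for the upstream_failed input and the publish / wheels job results and outputs, and helm is left out because the description does not state its staging condition.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (not from release.yml): derive the flag sent to GitLab
# and the artifact list from the upstream input and the two job results.
# Args: upstream_failed publish_result publish_failed_count wheels_result wheels_staged
compute_gate() {
  local upstream_failed="$1" publish_result="$2" publish_failed_count="$3"
  local wheels_result="$4" wheels_staged="$5"
  local had_failures="false" artifacts=""

  # Any failure signal flips the composite flag.
  if [ "${upstream_failed}" = "true" ]; then had_failures="true"; fi
  if [ "${publish_result}" != "success" ] || [ "${publish_failed_count:-0}" != "0" ]; then
    had_failures="true"
  fi
  if [ "${wheels_result}" != "success" ] || [ "${wheels_staged}" != "true" ]; then
    had_failures="true"
  fi

  # Only advertise artifacts that actually staged.
  if [ "${publish_result}" = "success" ]; then artifacts="container"; fi
  if [ "${wheels_result}" = "success" ] && [ "${wheels_staged}" = "true" ]; then
    artifacts="${artifacts:+${artifacts},}wheel"
  fi

  echo "UPSTREAM_FAILED=${had_failures}"
  echo "ARTIFACTS=${artifacts}"
}

# Test-only failure: builds green, everything staged, flag still flips.
compute_gate true success 0 success true
```

Under these assumptions a test-only failure keeps both artifacts in the list while the flag is true, and a missing ECR image (wheels_staged=false) drops wheel from the list.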

Where should the reviewer start?

.github/workflows/nightly-ci.yml — the release job's needs / if / with block.
.github/workflows/release.yml — stage-wheels-artifactory extract step (graceful miss) and trigger-gitlab-release-pipeline (composite flag + dynamic ARTIFACTS).

Test plan:

workflow_dispatch of nightly-ci with run_tests=false — verify release runs, UPSTREAM_FAILED=false reaches GitLab, ARTIFACTS=container,wheel.
Re-run a known-failing nightly via workflow_dispatch — verify release runs despite test failure, UPSTREAM_FAILED=true.
Simulate a build failure (or point at a SHA whose dynamo-runtime tag isn't in ECR) — verify stage-wheels-artifactory warns and exits 0, GitLab trigger fires with wheel dropped from ARTIFACTS.

Related Issues:

Closes OPS-5060

Summary by CodeRabbit

Chores
- Enhanced release workflow with improved failure detection and reporting during build and deployment processes.
- Refined wheel staging with better error handling to ensure more reliable release deployments.

…TREAM_FAILED gate

Today any test/build failure in nightly-ci blocks the release job, so the GitLab pipeline never sees a partial nightly. Move the policy decision out of GitHub Actions and into GitLab: always run publish + the GitLab trigger, forward an UPSTREAM_FAILED flag, and let the GitLab side require human approval before promoting.

- Drop !failure() from the release job's if; add the build jobs to needs so cascading skips surface as direct failures, and forward upstream_failed.
- release.yml gains an upstream_failed input; trigger-gitlab-release-pipeline now runs on !cancelled() instead of implicit needs-success, so a failed release-publish or stage-wheels-artifactory no longer skips the trigger.
- stage-wheels-artifactory probes the dynamo-runtime ECR tag with crane manifest first; on miss it sets wheels_staged=false and exits 0 instead of failing the run. Upload step is gated on wheels_staged.
- ARTIFACTS sent to GitLab is now built dynamically (only includes container/wheel/helm for things that actually staged), and the composite HAD_FAILURES flag (input + publish result/failed_count + wheels result/staged) ships as the UPSTREAM_FAILED variable.

dynamo-ops · 2026-05-07T17:15:56Z

+          # failed in the caller), skip wheel staging instead of failing the
+          # whole release. The GitLab trigger drops `wheel` from ARTIFACTS
+          # accordingly and forwards UPSTREAM_FAILED for the human gate.
+          if ! crane manifest "${ECR_IMAGE}" >/dev/null 2>&1; then


The missing-image bypass is unconditional in the shared release workflow, so an RC/manual release can silently skip wheel staging and continue without wheel artifacts instead of failing closed. Fix: only exit successfully here for nightly/upstream-failed partial nightlies and fail for non-nightly releases.
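
One possible shape for the fail-closed behavior requested here, as a sketch only: `decide_wheel_staging` is a hypothetical function, its first argument stands in for the result of the `crane manifest` probe, and its second for the workflow's nightly input. The `::warning::` and `::error::` prefixes are the standard GitHub Actions workflow commands.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: tolerate a missing image only for nightly runs.
# Args: image_present (true|false), is_nightly (true|false)
decide_wheel_staging() {
  local image_present="$1" is_nightly="$2"

  if [ "${image_present}" = "true" ]; then
    echo "wheels_staged=true"
  elif [ "${is_nightly}" = "true" ]; then
    # Partial nightly: skip staging and let the GitLab gate decide.
    echo "::warning::runtime image not found; skipping wheel staging" >&2
    echo "wheels_staged=false"
  else
    # RC / manual release: fail closed instead of shipping without wheels.
    echo "::error::runtime image not found; refusing to skip wheel staging" >&2
    return 1
  fi
}

# Nightly run with a missing image: degrade gracefully.
decide_wheel_staging false true
```

In the workflow the first argument would come from the existing probe, and the printed line would be appended to the step's output file rather than stdout.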

dynamo-ops · 2026-05-07T17:15:56Z

+    # stage-wheels-artifactory partially failed (e.g. missing ECR images
+    # from a broken upstream build); UPSTREAM_FAILED is forwarded so the
+    # GitLab pipeline can require human approval.
+    if: ${{ !cancelled() && needs.prepare-release.result == 'success' && inputs.skip_gitlab_pipeline != true }}


The relaxed trigger condition also applies to RC releases, so GitLab can be triggered after release-publish or stage-wheels-artifactory failed with a partial artifact set. Fix: require both upstream jobs to succeed for non-nightly releases and use the !cancelled() partial-release path only for nightly.
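
The suggested split can be modeled in shell purely to make the truth table concrete. In the workflow this would be an `if:` expression, not a script, and `should_trigger_gitlab` and its argument order are hypothetical.

```shell
#!/bin/sh
set -eu

# Hypothetical model of the proposed trigger condition.
# Args: cancelled prepare_result skip_gitlab nightly publish_result wheels_result
should_trigger_gitlab() {
  local cancelled="$1" prepare_result="$2" skip_gitlab="$3"
  local nightly="$4" publish_result="$5" wheels_result="$6"

  # Preconditions shared by every release type.
  if [ "${cancelled}" = "true" ] || [ "${prepare_result}" != "success" ] \
     || [ "${skip_gitlab}" = "true" ]; then
    echo "false"
    return 0
  fi

  # Nightly: the partial-release path is allowed.
  if [ "${nightly}" = "true" ]; then
    echo "true"
    return 0
  fi

  # Non-nightly: both upstream jobs must have succeeded.
  if [ "${publish_result}" = "success" ] && [ "${wheels_result}" = "success" ]; then
    echo "true"
  else
    echo "false"
  fi
}

# RC release whose publish job failed: the trigger should not fire.
should_trigger_gitlab false success false false failure success
```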

coderabbitai · 2026-05-07T17:18:19Z

Walkthrough

The pull request updates GitHub Actions workflows to propagate upstream build/test failure state through the release pipeline to GitLab. Nightly CI computes failure status and passes it to the release workflow; release.yml adds input/output contracts for failure state, implements conditional wheel staging with ECR checks, and forwards status details to GitLab release triggers.

Changes

Release pipeline failure flow

Layer / File(s)	Summary
Workflow input and output contracts `.github/workflows/release.yml`	New `upstream_failed` boolean input added to `workflow_dispatch` and `workflow_call`. Job outputs added: `release-publish` outputs `successful_count` and `failed_count`; `stage-wheels-artifactory` outputs `wheels_staged`.
Wheel staging conditional logic `.github/workflows/release.yml`	ECR manifest check added to verify `dynamo-runtime` image exists; missing image sets `wheels_staged=false` with warning instead of failing. Artifactory upload step is conditional on `wheels_staged == 'true'`.
GitLab trigger condition and environment `.github/workflows/release.yml`	GitLab trigger job condition relaxes from failure blocking to `!cancelled()` check. Environment variables `UPSTREAM_FAILED`, `PUBLISH_RESULT`, `PUBLISH_FAILED_COUNT`, `WHEELS_RESULT`, and `WHEELS_STAGED` are passed to GitLab. Dynamic `HAD_FAILURES` flag computation replaces fixed artifact-selection logic.
Nightly CI upstream failure propagation `.github/workflows/nightly-ci.yml`	Release job now depends on `vllm-build`, `sglang-build`, and `trtllm-build` build image jobs. Release workflow call condition changes from `!failure()` to `!cancelled() && release==true` with `upstream_failed` input computed from any upstream job failure or cancellation status.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: publishing and GitLab trigger now run on failures with an UPSTREAM_FAILED flag forwarded to GitLab for decision-making.
Description check	✅ Passed	The description follows the template with all required sections (Overview, Details, Where to start, Related Issues) and provides comprehensive details about the changes and test plan.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/nightly-ci.yml:
- Around line 440-446: The upstream_failed boolean is missing transitive
"skipped" cases so cascading infrastructure skips (e.g., when
create-fresh-builder or resolve-source-sha fails) result in
upstream_failed=false and still trigger the release workflow; fix by ensuring
skipped results from essential jobs are treated as failures: either add
create-fresh-builder and resolve-source-sha to the release.needs list so their
failure surfaces directly, or expand the upstream_failed expression to include
'skipped' only for the specific always-run build jobs (vllm-build, sglang-build,
trtllm-build, dynamo-pipeline, resolve-source-sha) instead of globally adding
'skipped' (i.e., update the contains(...) checks to explicitly test those job
names' results or add those job names to needs used by upstream_failed).

In @.github/workflows/release.yml:
- Around line 210-212: The script currently aborts on the first failing copy
because set -euo pipefail makes bare failures in copy_image() fatal, preventing
the copy_images step from writing its summary and leaving
release-publish.outputs.failed_count empty; fix by making copy_image()
non-fatal: remove/avoid returning non-zero on failures and instead record
failures into the existing failure-tracking arrays (e.g., push to
PUBLISH_FAILED_LIST or HAD_FAILURES array) and always return 0, or keep return 1
but change every caller invocation of copy_image (the many bare calls) to invoke
it with "|| true" so the loop continues; ensure the summary-writing logic in the
copy_images step runs unconditionally so outputs.successful_count and
outputs.failed_count are populated even when some copies fail.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: daf4225d-edf7-428a-8134-8dd0c1085b21

📥 Commits

Reviewing files that changed from the base of the PR and between f5a9463 and 2082fe5.

📒 Files selected for processing (2)

.github/workflows/nightly-ci.yml
.github/workflows/release.yml

coderabbitai · 2026-05-07T17:18:22Z

+    if: ${{ !cancelled() && needs.compute-release-mode.outputs.release == 'true' }}
    uses: ./.github/workflows/release.yml
    with:
      commit_sha: ${{ needs.resolve-source-sha.outputs.source_sha }}
      nightly: true
      skip_gitlab_pipeline: ${{ inputs.skip_gitlab_pipeline || false }}
+      upstream_failed: ${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') }}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

upstream_failed misses cascading skipped results, so silent infrastructure failures slip through as UPSTREAM_FAILED=false.

contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') on Line 446 only catches direct failure/cancellation. But several jobs in the release.needs list (vllm-build, sglang-build, trtllm-build, dynamo-pipeline) transitively depend on create-fresh-builder, which is not in release.needs. If create-fresh-builder (or resolve-source-sha) fails, those build jobs cascade-skip and report result == 'skipped', not 'failure'. Combined with if: !cancelled() && release == 'true', the release call still fires with upstream_failed=false, and GitLab will treat the partial nightly as fully validated even though nothing was actually built/tested in this run.

You can't blanket-add 'skipped' because run_tests=false intentionally skips the test jobs. A targeted fix is to flag 'skipped' only for jobs that should always run (the build jobs / dynamo-pipeline / resolve-source-sha), or to add create-fresh-builder and resolve-source-sha to release.needs so their failures surface directly.

🛠️ Sketch — flag skipped builds explicitly

       skip_gitlab_pipeline: ${{ inputs.skip_gitlab_pipeline || false }}
-      upstream_failed: ${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') }}
+      upstream_failed: >-
+        ${{ contains(needs.*.result, 'failure')
+        || contains(needs.*.result, 'cancelled')
+        || needs.vllm-build.result == 'skipped'
+        || needs.sglang-build.result == 'skipped'
+        || needs.trtllm-build.result == 'skipped'
+        || needs.dynamo-pipeline.result == 'skipped'
+        || needs.resolve-source-sha.result == 'skipped' }}

Test-suite jobs are deliberately excluded so run_tests=false dispatches don't spuriously flip the flag.
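
The same rule can be exercised outside Actions with a small shell model. This is an illustration only: `upstream_failed` here is a hypothetical function taking `job=result` pairs, the must-run job names are the ones listed in this comment, and `unit-tests` in the example call is a made-up test-job name.

```shell
#!/bin/sh
set -eu

# Hypothetical model: print "true" if any job failed or was cancelled, or if a
# job that should always run reports "skipped"; otherwise print "false".
upstream_failed() {
  local pair job result
  for pair in "$@"; do
    job="${pair%%=*}"
    result="${pair#*=}"
    case "${result}" in
      failure|cancelled)
        echo "true"
        return 0
        ;;
      skipped)
        # Skips count only for jobs that are never skipped on purpose.
        case "${job}" in
          vllm-build|sglang-build|trtllm-build|dynamo-pipeline|resolve-source-sha)
            echo "true"
            return 0
            ;;
        esac
        ;;
    esac
  done
  echo "false"
}

# Cascading skip of a build job flips the flag; a skipped test job does not.
upstream_failed vllm-build=skipped unit-tests=success
upstream_failed vllm-build=success unit-tests=skipped
```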


coderabbitai · 2026-05-07T17:18:23Z

+    outputs:
+      successful_count: ${{ steps.copy_images.outputs.successful_count }}
+      failed_count: ${{ steps.copy_images.outputs.failed_count }}


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify copy_image is invoked as a bare command (so set -e aborts on return 1)
# vs in an `if`/`||`/`&&` context. Expect: all call sites bare.
rg -nP -C1 '\bcopy_image\s+' .github/workflows/release.yml

Repository: ai-dynamo/dynamo

Length of output: 1916

🏁 Script executed:

# First, verify the set -euo pipefail setting and surrounding context
sed -n '255,270p' .github/workflows/release.yml

Repository: ai-dynamo/dynamo

Length of output: 751

🏁 Script executed:

# Verify the copy_image function definition and its return statements
sed -n '272,290p' .github/workflows/release.yml

Repository: ai-dynamo/dynamo

Length of output: 806

🏁 Script executed:

# Verify the summary output write lines
sed -n '373,378p' .github/workflows/release.yml

Repository: ai-dynamo/dynamo

Length of output: 317

🏁 Script executed:

# Check the trigger-gitlab-release-pipeline section with the failed_count check
sed -n '690,710p' .github/workflows/release.yml

Repository: ai-dynamo/dynamo

Length of output: 1278

failed_count output is never populated; first failed copy aborts publishing under set -e.

The release-publish.outputs.failed_count (Lines 211-212) reads from the copy_images step's summary (Lines 375-376), but because copy_image() returns 1 on failure (Line 283) and is invoked as a bare command in all loops (Lines 296, 312, 330, 335, 340, 347, 354, 361, 368), the set -euo pipefail on Line 258 exits the script immediately on the first failure — before the summary writes execute.

This breaks the PR objective in three ways:

The failed_count output is never set to a non-zero value (only ever empty on failure), making the [ "${PUBLISH_FAILED_COUNT}" != "0" ] check on Line 696 dead code.

The "Warning: ... skipping" message on Line 282 is misleading — the script aborts, not skips.

Partial publishing fails: any missing/broken image (e.g., a stray EFA tag) aborts the whole publish, marks the job failed, excludes all containers from ARTIFACTS_LIST (Lines 703-705), and GitLab never learns about images that copied successfully before the failure.

HAD_FAILURES still flips correctly via PUBLISH_RESULT != "success" (Line 697), so the gate signal isn't lost — but the actual partial-publish behavior described in the PR is unreachable.

Fix: Make copy_image() non-fatal by always returning 0, while tracking failures in the arrays:

Suggested fix

 copy_image() {
   local SRC="$1" DST="$2" LABEL="$3"
   echo "----------------------------------------"
   echo "Copying: ${LABEL}"
   # crane copy preserves multi-arch manifest lists by default (no --platform needed)
   if crane copy "${SRC}" "${DST}"; then
     echo "  Copied: ${LABEL}"
     SUCCESSFUL_COPIES+=("${LABEL}")
-    return 0
   else
     echo "  Warning: Failed to copy ${LABEL}, skipping..."
     FAILED_COPIES+=("${LABEL}")
-    return 1
   fi
+  return 0
 }

If callers need to see non-zero return for logging, keep return 1 in the else branch and invoke as copy_image ... || true at every call site instead.
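
A self-contained way to see the non-fatal pattern in action, as a sketch: `fake_copy` is a made-up stand-in for `crane copy`, plain counters replace the arrays to keep the script portable, and `set -eu` is used because the `-e` part is what matters for this finding.

```shell
#!/bin/sh
set -eu

SUCCESSFUL_COUNT=0
FAILED_COUNT=0

# Made-up stand-in for `crane copy`: fails for any source containing "missing".
fake_copy() {
  case "$1" in
    *missing*) return 1 ;;
    *) return 0 ;;
  esac
}

# Records the outcome and always returns 0, so a bare call under `set -e`
# cannot abort the script before the summary is written.
copy_image() {
  if fake_copy "$1" "$2"; then
    SUCCESSFUL_COUNT=$((SUCCESSFUL_COUNT + 1))
  else
    echo "Warning: failed to copy $3, skipping..." >&2
    FAILED_COUNT=$((FAILED_COUNT + 1))
  fi
  return 0
}

copy_image "src/a" "dst/a" "a"
copy_image "src/missing" "dst/b" "b"
copy_image "src/c" "dst/c" "c"

# In the workflow these lines would be appended to the step's output file.
echo "successful_count=${SUCCESSFUL_COUNT}"
echo "failed_count=${FAILED_COUNT}"
```

With `return 1` restored in the else branch and the calls left bare, the script would stop at the second call and neither summary line would be printed, which is the behavior described in this finding.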


saturley-hall requested a review from a team as a code owner May 7, 2026 17:09

pull-request-size Bot added the size/M label May 7, 2026

github-actions Bot added the ci (Issues/PRs that reference CI build/test) and actions labels May 7, 2026

saturley-hall requested a review from nv-anants May 7, 2026 17:11

dynamo-ops reviewed May 7, 2026

coderabbitai Bot reviewed May 7, 2026

nv-anants approved these changes May 7, 2026

saturley-hall wants to merge 1 commit into main from harrison/nightly-publishes-with-failure
