Skip to content

ci(nightly): publish + trigger GitLab on test/build failures with UPSTREAM_FAILED gate#9263

Open
saturley-hall wants to merge 1 commit intomainfrom
harrison/nightly-publishes-with-failure
Open

ci(nightly): publish + trigger GitLab on test/build failures with UPSTREAM_FAILED gate#9263
saturley-hall wants to merge 1 commit intomainfrom
harrison/nightly-publishes-with-failure

Conversation

@saturley-hall
Copy link
Copy Markdown
Member

@saturley-hall saturley-hall commented May 7, 2026

Overview:

Today any test/build failure in nightly-ci blocks the release job, so the GitLab pipeline never sees a partial nightly. This PR moves the policy decision out of GitHub Actions and into GitLab: always run publish + the GitLab trigger, forward an UPSTREAM_FAILED flag, and let the GitLab side require human approval before promoting (gate to be added on the GitLab side separately).

Details:

  • nightly-ci.yml (release job): drop !failure() from the if; add vllm-build / sglang-build / trtllm-build to needs so cascading skips surface as direct failures; forward upstream_failed (computed from contains(needs.*.result, 'failure' | 'cancelled')) into release.yml.
  • release.yml: new upstream_failed input on both workflow_dispatch and workflow_call. trigger-gitlab-release-pipeline's if switches to !cancelled() && needs.prepare-release.result == 'success' && inputs.skip_gitlab_pipeline != true so a failed release-publish or stage-wheels-artifactory no longer skips the trigger.
  • stage-wheels-artifactory: probes the dynamo-runtime-cuda12 ECR tag with crane manifest first; on miss it sets wheels_staged=false and exits 0 instead of failing the run. Upload step is gated on wheels_staged. Job exposes wheels_staged as an output.
  • release-publish: exposes successful_count / failed_count as job outputs (the inner script already collected them).
  • The composite HAD_FAILURES flag (input + publish result/failed_count + wheels result/staged) ships as the UPSTREAM_FAILED GitLab variable. ARTIFACTS is now built dynamically — only includes container / wheel / helm for things that actually staged.

Behavior reminder: a test-only failure (builds all green) still publishes containers to NGC and stages wheels to Artifactory; only the UPSTREAM_FAILED flag flips so GitLab can require review.

Where should the reviewer start?

  • .github/workflows/nightly-ci.yml — the release job's needs / if / with block.
  • .github/workflows/release.ymlstage-wheels-artifactory extract step (graceful miss) and trigger-gitlab-release-pipeline (composite flag + dynamic ARTIFACTS).

Test plan:

  • workflow_dispatch of nightly-ci with run_tests=false — verify release runs, UPSTREAM_FAILED=false reaches GitLab, ARTIFACTS=container,wheel.
  • Re-run a known-failing nightly via workflow_dispatch — verify release runs despite test failure, UPSTREAM_FAILED=true.
  • Simulate a build failure (or point at a SHA whose dynamo-runtime tag isn't in ECR) — verify stage-wheels-artifactory warns and exits 0, GitLab trigger fires with wheel dropped from ARTIFACTS.

Related Issues:

  • Closes OPS-5060

Open in Devin Review

Summary by CodeRabbit

  • Chores
    • Enhanced release workflow with improved failure detection and reporting during build and deployment processes.
    • Refined wheel staging with better error handling to ensure more reliable release deployments.

…TREAM_FAILED gate

Today any test/build failure in nightly-ci blocks the release job, so the
GitLab pipeline never sees a partial nightly. Move the policy decision out
of GitHub Actions and into GitLab: always run publish + the GitLab trigger,
forward an UPSTREAM_FAILED flag, and let the GitLab side require human
approval before promoting.

- Drop !failure() from the release job's if; add the build jobs to needs so
  cascading skips surface as direct failures, and forward upstream_failed.
- release.yml gains an upstream_failed input; trigger-gitlab-release-pipeline
  now runs on !cancelled() instead of implicit needs-success, so a failed
  release-publish or stage-wheels-artifactory no longer skips the trigger.
- stage-wheels-artifactory probes the dynamo-runtime ECR tag with crane
  manifest first; on miss it sets wheels_staged=false and exits 0 instead
  of failing the run. Upload step is gated on wheels_staged.
- ARTIFACTS sent to GitLab is now built dynamically — only includes
  container/wheel/helm for things that actually staged — and the composite
  HAD_FAILURES flag (input + publish result/failed_count + wheels
  result/staged) ships as the UPSTREAM_FAILED variable.
@saturley-hall saturley-hall requested a review from a team as a code owner May 7, 2026 17:09
@github-actions github-actions Bot added ci Issues/PRs that reference CI build/test actions labels May 7, 2026
@saturley-hall saturley-hall requested a review from nv-anants May 7, 2026 17:11
# failed in the caller), skip wheel staging instead of failing the
# whole release. The GitLab trigger drops `wheel` from ARTIFACTS
# accordingly and forwards UPSTREAM_FAILED for the human gate.
if ! crane manifest "${ECR_IMAGE}" >/dev/null 2>&1; then
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The missing-image bypass is unconditional in the shared release workflow, so an RC/manual release can silently skip wheel staging and continue without wheel artifacts instead of failing closed. Fix: only exit successfully here for nightly/upstream-failed partial nightlies and fail for non-nightly releases.

# stage-wheels-artifactory partially failed (e.g. missing ECR images
# from a broken upstream build); UPSTREAM_FAILED is forwarded so the
# GitLab pipeline can require human approval.
if: ${{ !cancelled() && needs.prepare-release.result == 'success' && inputs.skip_gitlab_pipeline != true }}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The relaxed trigger condition also applies to RC releases, so GitLab can be triggered after release-publish or stage-wheels-artifactory failed with a partial artifact set. Fix: require both upstream jobs to succeed for non-nightly releases and use the !cancelled() partial-release path only for nightly.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 7, 2026

Review Change Stack

Walkthrough

The pull request updates GitHub Actions workflows to propagate upstream build/test failure state through the release pipeline to GitLab. Nightly CI computes failure status and passes it to the release workflow; release.yml adds input/output contracts for failure state, implements conditional wheel staging with ECR checks, and forwards status details to GitLab release triggers.

Changes

Release pipeline failure flow

Layer / File(s) Summary
Workflow input and output contracts
.github/workflows/release.yml
New upstream_failed boolean input added to workflow_dispatch and workflow_call. Job outputs added: release-publish outputs successful_count and failed_count; stage-wheels-artifactory outputs wheels_staged.
Wheel staging conditional logic
.github/workflows/release.yml
ECR manifest check added to verify dynamo-runtime image exists; missing image sets wheels_staged=false with warning instead of failing. Artifactory upload step is conditional on wheels_staged == 'true'.
GitLab trigger condition and environment
.github/workflows/release.yml
GitLab trigger job condition relaxes from failure blocking to !cancelled() check. Environment variables UPSTREAM_FAILED, PUBLISH_RESULT, PUBLISH_FAILED_COUNT, WHEELS_RESULT, and WHEELS_STAGED are passed to GitLab. Dynamic HAD_FAILURES flag computation replaces fixed artifact-selection logic.
Nightly CI upstream failure propagation
.github/workflows/nightly-ci.yml
Release job now depends on vllm-build, sglang-build, and trtllm-build build image jobs. Release workflow call condition changes from !failure() to !cancelled() && release==true with upstream_failed input computed from any upstream job failure or cancellation status.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: publishing and GitLab trigger now run on failures with an UPSTREAM_FAILED flag forwarded to GitLab for decision-making.
Description check ✅ Passed The description follows the template with all required sections (Overview, Details, Where to start, Related Issues) and provides comprehensive details about the changes and test plan.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/nightly-ci.yml:
- Around line 440-446: The upstream_failed boolean is missing transitive
"skipped" cases so cascading infrastructure skips (e.g., when
create-fresh-builder or resolve-source-sha fails) result in
upstream_failed=false and still trigger the release workflow; fix by ensuring
skipped results from essential jobs are treated as failures: either add
create-fresh-builder and resolve-source-sha to the release.needs list so their
failure surfaces directly, or expand the upstream_failed expression to include
'skipped' only for the specific always-run build jobs (vllm-build, sglang-build,
trtllm-build, dynamo-pipeline, resolve-source-sha) instead of globally adding
'skipped' (i.e., update the contains(...) checks to explicitly test those job
names' results or add those job names to needs used by upstream_failed).

In @.github/workflows/release.yml:
- Around line 210-212: The script currently aborts on the first failing copy
because set -euo pipefail makes bare failures in copy_image() fatal, preventing
the copy_images step from writing its summary and leaving
release-publish.outputs.failed_count empty; fix by making copy_image()
non-fatal: remove/avoid returning non-zero on failures and instead record
failures into the existing failure-tracking arrays (e.g., push to
PUBLISH_FAILED_LIST or HAD_FAILURES array) and always return 0, or keep return 1
but change every caller invocation of copy_image (the many bare calls) to invoke
it with "|| true" so the loop continues; ensure the summary-writing logic in the
copy_images step runs unconditionally so outputs.successful_count and
outputs.failed_count are populated even when some copies fail.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: daf4225d-edf7-428a-8134-8dd0c1085b21

📥 Commits

Reviewing files that changed from the base of the PR and between f5a9463 and 2082fe5.

📒 Files selected for processing (2)
  • .github/workflows/nightly-ci.yml
  • .github/workflows/release.yml

Comment on lines +440 to +446
if: ${{ !cancelled() && needs.compute-release-mode.outputs.release == 'true' }}
uses: ./.github/workflows/release.yml
with:
commit_sha: ${{ needs.resolve-source-sha.outputs.source_sha }}
nightly: true
skip_gitlab_pipeline: ${{ inputs.skip_gitlab_pipeline || false }}
upstream_failed: ${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

upstream_failed misses cascading skipped results, so silent infrastructure failures slip through as UPSTREAM_FAILED=false.

contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') on Line 446 only catches direct failure/cancellation. But several jobs in the release.needs list (vllm-build, sglang-build, trtllm-build, dynamo-pipeline) transitively depend on create-fresh-builder, which is not in release.needs. If create-fresh-builder (or resolve-source-sha) fails, those build jobs cascade-skip and report result == 'skipped', not 'failure'. Combined with if: !cancelled() && release == 'true', the release call still fires with upstream_failed=false, and GitLab will treat the partial nightly as fully validated even though nothing was actually built/tested in this run.

You can't blanket-add 'skipped' because run_tests=false intentionally skips the test jobs. A targeted fix is to flag 'skipped' only for jobs that should always run (the build jobs / dynamo-pipeline / resolve-source-sha), or to add create-fresh-builder and resolve-source-sha to release.needs so their failures surface directly.

🛠️ Sketch — flag skipped builds explicitly
       skip_gitlab_pipeline: ${{ inputs.skip_gitlab_pipeline || false }}
-      upstream_failed: ${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') }}
+      upstream_failed: >-
+        ${{ contains(needs.*.result, 'failure')
+            || contains(needs.*.result, 'cancelled')
+            || needs.vllm-build.result == 'skipped'
+            || needs.sglang-build.result == 'skipped'
+            || needs.trtllm-build.result == 'skipped'
+            || needs.dynamo-pipeline.result == 'skipped'
+            || needs.resolve-source-sha.result == 'skipped' }}

Test-suite jobs are deliberately excluded so run_tests=false dispatches don't spuriously flip the flag.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/nightly-ci.yml around lines 440 - 446, The upstream_failed
boolean is missing transitive "skipped" cases so cascading infrastructure skips
(e.g., when create-fresh-builder or resolve-source-sha fails) result in
upstream_failed=false and still trigger the release workflow; fix by ensuring
skipped results from essential jobs are treated as failures: either add
create-fresh-builder and resolve-source-sha to the release.needs list so their
failure surfaces directly, or expand the upstream_failed expression to include
'skipped' only for the specific always-run build jobs (vllm-build, sglang-build,
trtllm-build, dynamo-pipeline, resolve-source-sha) instead of globally adding
'skipped' (i.e., update the contains(...) checks to explicitly test those job
names' results or add those job names to needs used by upstream_failed).

Comment on lines +210 to +212
outputs:
successful_count: ${{ steps.copy_images.outputs.successful_count }}
failed_count: ${{ steps.copy_images.outputs.failed_count }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify copy_image is invoked as a bare command (so set -e aborts on return 1)
# vs in an `if`/`||`/`&&` context. Expect: all call sites bare.
rg -nP -C1 '\bcopy_image\s+' .github/workflows/release.yml

Repository: ai-dynamo/dynamo

Length of output: 1916


🏁 Script executed:

# First, verify the set -euo pipefail setting and surrounding context
sed -n '255,270p' .github/workflows/release.yml

Repository: ai-dynamo/dynamo

Length of output: 751


🏁 Script executed:

# Verify the copy_image function definition and its return statements
sed -n '272,290p' .github/workflows/release.yml

Repository: ai-dynamo/dynamo

Length of output: 806


🏁 Script executed:

# Verify the summary output write lines
sed -n '373,378p' .github/workflows/release.yml

Repository: ai-dynamo/dynamo

Length of output: 317


🏁 Script executed:

# Check the trigger-gitlab-release-pipeline section with the failed_count check
sed -n '690,710p' .github/workflows/release.yml

Repository: ai-dynamo/dynamo

Length of output: 1278


failed_count output is never populated; first failed copy aborts publishing under set -e.

The release-publish.outputs.failed_count (Lines 211-212) reads from the copy_images step's summary (Lines 375-376), but because copy_image() returns 1 on failure (Line 283) and is invoked as a bare command in all loops (Lines 296, 312, 330, 335, 340, 347, 354, 361, 368), the set -euo pipefail on Line 258 exits the script immediately on the first failure — before the summary writes execute.

This breaks the PR objective in three ways:

  • The failed_count output is never set to a non-zero value (only ever empty on failure), making the [ "${PUBLISH_FAILED_COUNT}" != "0" ] check on Line 696 dead code.
  • The "Warning: ... skipping" message on Line 282 is misleading — the script aborts, not skips.
  • Partial publishing fails: any missing/broken image (e.g., a stray EFA tag) aborts the whole publish, marks the job failed, excludes all containers from ARTIFACTS_LIST (Lines 703-705), and GitLab never learns about images that copied successfully before the failure.

HAD_FAILURES still flips correctly via PUBLISH_RESULT != "success" (Line 697), so the gate signal isn't lost — but the actual partial-publish behavior described in the PR is unreachable.

Fix: Make copy_image() non-fatal by always returning 0, while tracking failures in the arrays:

Suggested fix
          copy_image() {
            local SRC="$1" DST="$2" LABEL="$3"
            echo "----------------------------------------"
            echo "Copying: ${LABEL}"
            # crane copy preserves multi-arch manifest lists by default (no --platform needed)
            if crane copy "${SRC}" "${DST}"; then
              echo "  Copied: ${LABEL}"
              SUCCESSFUL_COPIES+=("${LABEL}")
-             return 0
            else
              echo "  Warning: Failed to copy ${LABEL}, skipping..."
              FAILED_COPIES+=("${LABEL}")
-             return 1
            fi
+           return 0
          }

If callers need to see non-zero return for logging, keep return 1 in the else branch and invoke as copy_image ... || true at every call site instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/release.yml around lines 210 - 212, The script currently
aborts on the first failing copy because set -euo pipefail makes bare failures
in copy_image() fatal, preventing the copy_images step from writing its summary
and leaving release-publish.outputs.failed_count empty; fix by making
copy_image() non-fatal: remove/avoid returning non-zero on failures and instead
record failures into the existing failure-tracking arrays (e.g., push to
PUBLISH_FAILED_LIST or HAD_FAILURES array) and always return 0, or keep return 1
but change every caller invocation of copy_image (the many bare calls) to invoke
it with "|| true" so the loop continues; ensure the summary-writing logic in the
copy_images step runs unconditionally so outputs.successful_count and
outputs.failed_count are populated even when some copies fail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

actions ci Issues/PRs that reference CI build/test size/M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants