ci(nightly): publish + trigger GitLab on test/build failures with UPSTREAM_FAILED gate#9263
ci(nightly): publish + trigger GitLab on test/build failures with UPSTREAM_FAILED gate#9263saturley-hall wants to merge 1 commit intomainfrom
Conversation
…TREAM_FAILED gate Today any test/build failure in nightly-ci blocks the release job, so the GitLab pipeline never sees a partial nightly. Move the policy decision out of GitHub Actions and into GitLab: always run publish + the GitLab trigger, forward an UPSTREAM_FAILED flag, and let the GitLab side require human approval before promoting. - Drop !failure() from the release job's if; add the build jobs to needs so cascading skips surface as direct failures, and forward upstream_failed. - release.yml gains an upstream_failed input; trigger-gitlab-release-pipeline now runs on !cancelled() instead of implicit needs-success, so a failed release-publish or stage-wheels-artifactory no longer skips the trigger. - stage-wheels-artifactory probes the dynamo-runtime ECR tag with crane manifest first; on miss it sets wheels_staged=false and exits 0 instead of failing the run. Upload step is gated on wheels_staged. - ARTIFACTS sent to GitLab is now built dynamically — only includes container/wheel/helm for things that actually staged — and the composite HAD_FAILURES flag (input + publish result/failed_count + wheels result/staged) ships as the UPSTREAM_FAILED variable.
| # failed in the caller), skip wheel staging instead of failing the | ||
| # whole release. The GitLab trigger drops `wheel` from ARTIFACTS | ||
| # accordingly and forwards UPSTREAM_FAILED for the human gate. | ||
| if ! crane manifest "${ECR_IMAGE}" >/dev/null 2>&1; then |
There was a problem hiding this comment.
The missing-image bypass is unconditional in the shared release workflow, so an RC/manual release can silently skip wheel staging and continue without wheel artifacts instead of failing closed. Fix: only exit successfully here for nightly/upstream-failed partial nightlies and fail for non-nightly releases.
| # stage-wheels-artifactory partially failed (e.g. missing ECR images | ||
| # from a broken upstream build); UPSTREAM_FAILED is forwarded so the | ||
| # GitLab pipeline can require human approval. | ||
| if: ${{ !cancelled() && needs.prepare-release.result == 'success' && inputs.skip_gitlab_pipeline != true }} |
There was a problem hiding this comment.
The relaxed trigger condition also applies to RC releases, so GitLab can be triggered after release-publish or stage-wheels-artifactory failed with a partial artifact set. Fix: require both upstream jobs to succeed for non-nightly releases and use the !cancelled() partial-release path only for nightly.
WalkthroughThe pull request updates GitHub Actions workflows to propagate upstream build/test failure state through the release pipeline to GitLab. Nightly CI computes failure status and passes it to the release workflow; release.yml adds input/output contracts for failure state, implements conditional wheel staging with ECR checks, and forwards status details to GitLab release triggers. ChangesRelease pipeline failure flow
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/nightly-ci.yml:
- Around line 440-446: The upstream_failed boolean is missing transitive
"skipped" cases so cascading infrastructure skips (e.g., when
create-fresh-builder or resolve-source-sha fails) result in
upstream_failed=false and still trigger the release workflow; fix by ensuring
skipped results from essential jobs are treated as failures: either add
create-fresh-builder and resolve-source-sha to the release.needs list so their
failure surfaces directly, or expand the upstream_failed expression to include
'skipped' only for the specific always-run build jobs (vllm-build, sglang-build,
trtllm-build, dynamo-pipeline, resolve-source-sha) instead of globally adding
'skipped' (i.e., update the contains(...) checks to explicitly test those job
names' results or add those job names to needs used by upstream_failed).
In @.github/workflows/release.yml:
- Around line 210-212: The script currently aborts on the first failing copy
because set -euo pipefail makes bare failures in copy_image() fatal, preventing
the copy_images step from writing its summary and leaving
release-publish.outputs.failed_count empty; fix by making copy_image()
non-fatal: remove/avoid returning non-zero on failures and instead record
failures into the existing failure-tracking arrays (e.g., push to
PUBLISH_FAILED_LIST or HAD_FAILURES array) and always return 0, or keep return 1
but change every caller invocation of copy_image (the many bare calls) to invoke
it with "|| true" so the loop continues; ensure the summary-writing logic in the
copy_images step runs unconditionally so outputs.successful_count and
outputs.failed_count are populated even when some copies fail.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: daf4225d-edf7-428a-8134-8dd0c1085b21
📒 Files selected for processing (2)
.github/workflows/nightly-ci.yml.github/workflows/release.yml
| if: ${{ !cancelled() && needs.compute-release-mode.outputs.release == 'true' }} | ||
| uses: ./.github/workflows/release.yml | ||
| with: | ||
| commit_sha: ${{ needs.resolve-source-sha.outputs.source_sha }} | ||
| nightly: true | ||
| skip_gitlab_pipeline: ${{ inputs.skip_gitlab_pipeline || false }} | ||
| upstream_failed: ${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') }} |
There was a problem hiding this comment.
upstream_failed misses cascading skipped results, so silent infrastructure failures slip through as UPSTREAM_FAILED=false.
contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') on Line 446 only catches direct failure/cancellation. But several jobs in the release.needs list (vllm-build, sglang-build, trtllm-build, dynamo-pipeline) transitively depend on create-fresh-builder, which is not in release.needs. If create-fresh-builder (or resolve-source-sha) fails, those build jobs cascade-skip and report result == 'skipped', not 'failure'. Combined with if: !cancelled() && release == 'true', the release call still fires with upstream_failed=false, and GitLab will treat the partial nightly as fully validated even though nothing was actually built/tested in this run.
You can't blanket-add 'skipped' because run_tests=false intentionally skips the test jobs. A targeted fix is to flag 'skipped' only for jobs that should always run (the build jobs / dynamo-pipeline / resolve-source-sha), or to add create-fresh-builder and resolve-source-sha to release.needs so their failures surface directly.
🛠️ Sketch — flag skipped builds explicitly
skip_gitlab_pipeline: ${{ inputs.skip_gitlab_pipeline || false }}
- upstream_failed: ${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') }}
+ upstream_failed: >-
+ ${{ contains(needs.*.result, 'failure')
+ || contains(needs.*.result, 'cancelled')
+ || needs.vllm-build.result == 'skipped'
+ || needs.sglang-build.result == 'skipped'
+ || needs.trtllm-build.result == 'skipped'
+ || needs.dynamo-pipeline.result == 'skipped'
+ || needs.resolve-source-sha.result == 'skipped' }}Test-suite jobs are deliberately excluded so run_tests=false dispatches don't spuriously flip the flag.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/nightly-ci.yml around lines 440 - 446, The upstream_failed
boolean is missing transitive "skipped" cases so cascading infrastructure skips
(e.g., when create-fresh-builder or resolve-source-sha fails) result in
upstream_failed=false and still trigger the release workflow; fix by ensuring
skipped results from essential jobs are treated as failures: either add
create-fresh-builder and resolve-source-sha to the release.needs list so their
failure surfaces directly, or expand the upstream_failed expression to include
'skipped' only for the specific always-run build jobs (vllm-build, sglang-build,
trtllm-build, dynamo-pipeline, resolve-source-sha) instead of globally adding
'skipped' (i.e., update the contains(...) checks to explicitly test those job
names' results or add those job names to needs used by upstream_failed).
| outputs: | ||
| successful_count: ${{ steps.copy_images.outputs.successful_count }} | ||
| failed_count: ${{ steps.copy_images.outputs.failed_count }} |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify copy_image is invoked as a bare command (so set -e aborts on return 1)
# vs in an `if`/`||`/`&&` context. Expect: all call sites bare.
rg -nP -C1 '\bcopy_image\s+' .github/workflows/release.ymlRepository: ai-dynamo/dynamo
Length of output: 1916
🏁 Script executed:
# First, verify the set -euo pipefail setting and surrounding context
sed -n '255,270p' .github/workflows/release.ymlRepository: ai-dynamo/dynamo
Length of output: 751
🏁 Script executed:
# Verify the copy_image function definition and its return statements
sed -n '272,290p' .github/workflows/release.ymlRepository: ai-dynamo/dynamo
Length of output: 806
🏁 Script executed:
# Verify the summary output write lines
sed -n '373,378p' .github/workflows/release.ymlRepository: ai-dynamo/dynamo
Length of output: 317
🏁 Script executed:
# Check the trigger-gitlab-release-pipeline section with the failed_count check
sed -n '690,710p' .github/workflows/release.ymlRepository: ai-dynamo/dynamo
Length of output: 1278
failed_count output is never populated; first failed copy aborts publishing under set -e.
The release-publish.outputs.failed_count (Lines 211-212) reads from the copy_images step's summary (Lines 375-376), but because copy_image() returns 1 on failure (Line 283) and is invoked as a bare command in all loops (Lines 296, 312, 330, 335, 340, 347, 354, 361, 368), the set -euo pipefail on Line 258 exits the script immediately on the first failure — before the summary writes execute.
This breaks the PR objective in three ways:
- The
failed_countoutput is never set to a non-zero value (only ever empty on failure), making the[ "${PUBLISH_FAILED_COUNT}" != "0" ]check on Line 696 dead code. - The "Warning: ... skipping" message on Line 282 is misleading — the script aborts, not skips.
- Partial publishing fails: any missing/broken image (e.g., a stray EFA tag) aborts the whole publish, marks the job failed, excludes all containers from
ARTIFACTS_LIST(Lines 703-705), and GitLab never learns about images that copied successfully before the failure.
HAD_FAILURES still flips correctly via PUBLISH_RESULT != "success" (Line 697), so the gate signal isn't lost — but the actual partial-publish behavior described in the PR is unreachable.
Fix: Make copy_image() non-fatal by always returning 0, while tracking failures in the arrays:
Suggested fix
copy_image() {
local SRC="$1" DST="$2" LABEL="$3"
echo "----------------------------------------"
echo "Copying: ${LABEL}"
# crane copy preserves multi-arch manifest lists by default (no --platform needed)
if crane copy "${SRC}" "${DST}"; then
echo " Copied: ${LABEL}"
SUCCESSFUL_COPIES+=("${LABEL}")
- return 0
else
echo " Warning: Failed to copy ${LABEL}, skipping..."
FAILED_COPIES+=("${LABEL}")
- return 1
fi
+ return 0
}If callers need to see non-zero return for logging, keep return 1 in the else branch and invoke as copy_image ... || true at every call site instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/release.yml around lines 210 - 212, The script currently
aborts on the first failing copy because set -euo pipefail makes bare failures
in copy_image() fatal, preventing the copy_images step from writing its summary
and leaving release-publish.outputs.failed_count empty; fix by making
copy_image() non-fatal: remove/avoid returning non-zero on failures and instead
record failures into the existing failure-tracking arrays (e.g., push to
PUBLISH_FAILED_LIST or HAD_FAILURES array) and always return 0, or keep return 1
but change every caller invocation of copy_image (the many bare calls) to invoke
it with "|| true" so the loop continues; ensure the summary-writing logic in the
copy_images step runs unconditionally so outputs.successful_count and
outputs.failed_count are populated even when some copies fail.
Overview:
Today any test/build failure in
nightly-ciblocks thereleasejob, so the GitLab pipeline never sees a partial nightly. This PR moves the policy decision out of GitHub Actions and into GitLab: always run publish + the GitLab trigger, forward anUPSTREAM_FAILEDflag, and let the GitLab side require human approval before promoting (gate to be added on the GitLab side separately).Details:
nightly-ci.yml(releasejob): drop!failure()from theif; addvllm-build/sglang-build/trtllm-buildtoneedsso cascading skips surface as direct failures; forwardupstream_failed(computed fromcontains(needs.*.result, 'failure' | 'cancelled')) intorelease.yml.release.yml: newupstream_failedinput on bothworkflow_dispatchandworkflow_call.trigger-gitlab-release-pipeline'sifswitches to!cancelled() && needs.prepare-release.result == 'success' && inputs.skip_gitlab_pipeline != trueso a failedrelease-publishorstage-wheels-artifactoryno longer skips the trigger.stage-wheels-artifactory: probes thedynamo-runtime-cuda12ECR tag withcrane manifestfirst; on miss it setswheels_staged=falseand exits 0 instead of failing the run. Upload step is gated onwheels_staged. Job exposeswheels_stagedas an output.release-publish: exposessuccessful_count/failed_countas job outputs (the inner script already collected them).HAD_FAILURESflag (input + publish result/failed_count+ wheels result/staged) ships as theUPSTREAM_FAILEDGitLab variable.ARTIFACTSis now built dynamically — only includescontainer/wheel/helmfor things that actually staged.Behavior reminder: a test-only failure (builds all green) still publishes containers to NGC and stages wheels to Artifactory; only the
UPSTREAM_FAILEDflag flips so GitLab can require review.Where should the reviewer start?
.github/workflows/nightly-ci.yml— thereleasejob'sneeds/if/withblock..github/workflows/release.yml—stage-wheels-artifactoryextract step (graceful miss) andtrigger-gitlab-release-pipeline(composite flag + dynamicARTIFACTS).Test plan:
workflow_dispatchofnightly-ciwithrun_tests=false— verifyreleaseruns,UPSTREAM_FAILED=falsereaches GitLab,ARTIFACTS=container,wheel.workflow_dispatch— verifyreleaseruns despite test failure,UPSTREAM_FAILED=true.stage-wheels-artifactorywarns and exits 0, GitLab trigger fires withwheeldropped fromARTIFACTS.Related Issues:
Summary by CodeRabbit