ci: harden & speed up workflows (concurrency, caching, timeouts, unit lane, resilient integration tests) #428
Merged
…it lane, resilient integration tests)

Tier 1: reliability & speed
- Add concurrency groups: orchestrate (per PR/branch; never cancels main deploys), promote-to-main (latest-only), destroy (per-env, never cancels).
- Add timeout-minutes to every job (no more 6h hung jobs).
- pip caching in integration-tests; uv caching in the new unit lane.
- New 'Unit & Regression Tests (no Azure)' job runs the 64 mock-based agent-framework regression tests on every PR/push: a fast, deterministic signal that catches API/dependency breakage long before the deploy path.
- Replace the blind 'sleep 30' with a backend readiness poll.
- Upload JUnit results as artifacts (integration + unit lanes).

Tier 2: resilience & least privilege
- Integration tests gain an 'advisory' mode. For tests-only PRs to main the shared env is NOT built from the PR, so a degraded env (e.g. invalid model key → 500) is now reported as a warning instead of blocking unrelated PRs. A liveness gate distinguishes 'env unreachable/degraded' from 'tests failed'.
- Scope per-job permissions to least privilege on all inline jobs.

Tier 3: hygiene
- Standardize action versions off the Node 20 deprecation: actions/checkout@v4 → v6, actions/setup-python@v5 → v6 across all workflows.

Validated: all 9 workflow files parse; the unit-lane command runs locally (64 passed) and emits junit-unit.xml.

Note (follow-up): docker-application/docker-mcp still use raw 'docker build'; converting to buildx + GHA layer cache is a worthwhile next step but was left out here to avoid changing the proven build/push path untested.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Hardens and speeds up the GitHub Actions workflows. Motivated by both PRs #426 and #427 going red on `integration-tests` for an environmental reason: the tests run against a pre-deployed, shared backend (`ca-be-002`) whose model call returns 500, with `build`/`deploy` skipped, so the failure had nothing to do with those PRs.

Tier 1: reliability & speed
- Concurrency groups: `orchestrate` (per PR/branch; never cancels an in-flight `main` deploy), `promote-to-main` (latest-only), `destroy` (per-env, never cancels a destroy mid-apply). Prevents concurrent Terraform state access on the same environment.
- `timeout-minutes` on every job: no more potential 6-hour hung jobs.
- `pip` cache in integration-tests; `uv` cache in the new unit lane.
- New `Unit & Regression Tests (no Azure)` job runs the 64 mock-based `agent-framework` regression tests on every PR/push. It is fast, deterministic, and needs no cloud, so it catches API/dependency breakage (exactly the kind of thing the 1.8.0 upgrade touched) long before the expensive deploy path. (Validated locally: 64 passed.)
- A backend readiness poll replaces the blind `sleep 30`.

Tier 2: resilience & least privilege
- Integration tests gain an advisory mode. For tests-only PRs to `main` (`full_deploy=false`), the shared env isn't built from the PR, so a degraded env is now reported as a warning instead of blocking unrelated PRs. A liveness gate distinguishes "env unreachable/degraded" from "tests failed". This directly fixes the spurious failures on #426 (deps: upgrade agent-framework 1.7.0 -> 1.8.0) and #427 (fix(durable-demo): make fraud-detection workflow reliably runnable on Windows).
- Per-job `permissions` scoped to least privilege on all inline jobs.

Tier 3: hygiene
- Action versions standardized off the Node 20 deprecation: `actions/checkout@v4 → v6`, `actions/setup-python@v5 → v6` across all workflows.

Validation
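- All 9 workflow files parse.
- The unit-lane command runs locally and emits `junit-unit.xml` (64 passed).
- Changes touch `.github/workflows/` only.

Recommended follow-ups (not in this PR)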
- Convert `docker-application`/`docker-mcp` from raw `docker build` to `buildx` + GHA layer cache (left out to avoid changing the proven build/push path untested).
- Make the `unit-tests` job a required status check via branch protection so it gates merges.
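
Illustrative sketch

For readers who want to picture the Tier 1 and Tier 2 changes, here is a minimal sketch of how they could fit together in a single workflow file. It is not the PR's actual diff. The workflow name, job ids, test paths, `BACKEND_URL`, the `/health` endpoint, the timeout budgets, the `ADVISORY` condition, and the `astral-sh/setup-uv` and `actions/upload-artifact` versions are all assumptions made for illustration. Only the concepts (concurrency groups, `timeout-minutes`, `pip`/`uv` caching, a readiness poll instead of `sleep 30`, advisory mode, least-privilege `permissions`, JUnit upload) come from the description above.

```yaml
# Hypothetical workflow; every name, path and URL here is a placeholder.
name: ci-sketch

on:
  pull_request:
  push:
    branches: [main]

# One active run per PR (or per branch for pushes). Newer runs cancel older
# ones, except on main, where an in-flight run is allowed to finish.
concurrency:
  group: ci-sketch-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Read-only by default; a job would widen this only if it needs more.
permissions:
  contents: read

jobs:
  unit-tests:
    name: Unit & Regression Tests (no Azure)
    runs-on: ubuntu-latest
    timeout-minutes: 10                      # assumed budget
    steps:
      - uses: actions/checkout@v6
      - uses: astral-sh/setup-uv@v6          # assumed action and version
        with:
          enable-cache: true
      - name: Run mock-based tests
        # tests/unit is a placeholder; assumes pytest is a project dependency
        run: uv run pytest tests/unit --junitxml=junit-unit.xml
      - name: Upload JUnit results
        if: always()                         # keep results even when tests fail
        uses: actions/upload-artifact@v4     # assumed version
        with:
          name: junit-unit
          path: junit-unit.xml

  integration-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 30                      # assumed budget
    env:
      BACKEND_URL: https://backend.example.invalid   # placeholder URL
      # Stand-in condition; the PR keys advisory mode off full_deploy=false.
      ADVISORY: ${{ github.event_name == 'pull_request' }}
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-python@v6
        with:
          python-version: '3.12'
          cache: pip                         # needs a requirements file in the repo
      - name: Wait for backend readiness
        id: live
        run: |
          # Poll a hypothetical health endpoint for up to ~150s.
          for attempt in $(seq 1 30); do
            if curl --silent --fail --max-time 5 "$BACKEND_URL/health" > /dev/null; then
              echo "ready=true" >> "$GITHUB_OUTPUT"
              exit 0
            fi
            sleep 5
          done
          echo "ready=false" >> "$GITHUB_OUTPUT"
          if [ "$ADVISORY" = "true" ]; then
            echo "::warning::Shared environment unreachable or degraded; skipping integration tests"
          else
            echo "::error::Backend never became ready"
            exit 1
          fi
      - name: Run integration tests
        if: steps.live.outputs.ready == 'true'
        run: |
          pip install -r requirements.txt    # placeholder requirements file
          pytest tests/integration --junitxml=junit-integration.xml
```

In this sketch the liveness step only checks that the backend answers at all. A gate that also catches the "model call returns 500" case described in the summary would likely need to exercise a model-backed endpoint. The `cancel-in-progress` expression mirrors the stated intent that superseded PR runs are cancelled while runs on `main` are left to complete.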