ci: harden & speed up workflows (concurrency, caching, timeouts, unit lane, resilient integration tests)#428

Merged
james-tn merged 1 commit into main from ci/harden-workflows on Jun 9, 2026

@james-tn james-tn commented Jun 8, 2026

Contributor

Summary

Hardens and speeds up the GitHub Actions workflows. The motivation: PRs #426 and #427 both went red on integration-tests for an environmental reason. The tests run against a pre-deployed, shared backend (ca-be-002) whose model call returned 500, and build/deploy was skipped for those runs, so the failures had nothing to do with the PRs themselves.

Tier 1 — reliability & speed

  • Concurrency groups: orchestrate (per PR/branch; never cancels an in-flight main deploy), promote-to-main (latest-only), destroy (per-env, never cancels a destroy mid-apply). Prevents concurrent Terraform state access on the same environment.
  • timeout-minutes on every job — no more potential 6-hour hung jobs.
  • Caching: pip cache in integration-tests; uv cache in the new unit lane.
  • New Unit & Regression Tests (no Azure) job runs the 64 mock-based agent-framework regression tests on every PR/push. Fast, deterministic, no cloud — catches API/dependency breakage (exactly the kind of thing the 1.8.0 upgrade touched) long before the expensive deploy path. (Validated locally: 64 passed.)
  • Readiness poll replaces the blind sleep 30.
  • JUnit artifacts uploaded for both lanes.
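
To make the concurrency and timeout bullets concrete, here is a minimal sketch of how such settings are typically written. It is not the exact configuration from this PR: the group name, job name, and timeout value are assumptions chosen for illustration.

```yaml
# Hypothetical excerpt of an orchestrate-style workflow (names are illustrative).
concurrency:
  # One group per PR number, falling back to the ref for push events.
  group: orchestrate-${{ github.event.pull_request.number || github.ref }}
  # Cancel superseded runs, except on main, where an in-flight deploy must finish.
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

jobs:
  deploy:
    runs-on: ubuntu-latest
    # Assumed value. Without this key a job may run for the 360-minute default.
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v6
```

A destroy-style workflow would presumably use a per-environment group with `cancel-in-progress: false`, so that a second run queues behind the first instead of interrupting a Terraform apply.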

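Tier 2 — resilience & least privilege

  • Integration tests gain an advisory mode. For tests-only PRs to main, the shared environment is not built from the PR, so a degraded environment (e.g. an invalid model key causing a 500) is reported as a warning instead of blocking unrelated PRs. A liveness gate distinguishes "environment unreachable/degraded" from "tests failed".
  • Per-job permissions are scoped to least privilege on all inline jobs.
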
Tier 3 — hygiene

  • Moved all workflows off the action versions affected by the Node 20 deprecation warning flagged in recent runs: actions/checkout@v4 → v6, actions/setup-python@v5 → v6.

Validation

  • All 9 workflow files parse as valid YAML.
  • The new unit-lane command runs locally and emits junit-unit.xml (64 passed).
  • Scope is limited to .github/workflows/ only.
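
For reference, a unit lane of the kind validated above could look roughly like the following. This is a sketch under assumptions, not the job added by this PR: the `tests/unit` path, the artifact name, the timeout, and the use of `astral-sh/setup-uv` for caching are all illustrative choices.

```yaml
jobs:
  unit-tests:
    name: Unit & Regression Tests (no Azure)
    runs-on: ubuntu-latest
    timeout-minutes: 10            # assumed value
    permissions:
      contents: read               # read-only token; no cloud credentials needed
    steps:
      - uses: actions/checkout@v6
      - uses: astral-sh/setup-uv@v6
        with:
          enable-cache: true       # reuse the uv cache between runs
      - name: Run mock-based regression tests
        # Hypothetical test path; assumes pytest is a project dependency.
        run: uv run pytest tests/unit --junitxml=junit-unit.xml
      - name: Upload JUnit results
        if: always()               # keep the report even when tests fail
        uses: actions/upload-artifact@v4
        with:
          name: junit-unit
          path: junit-unit.xml
```

Because the job needs no secrets or deployed environment, it can run on every PR and push, including PRs from forks.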

Recommended follow-ups (not in this PR)

  • Convert docker-application/docker-mcp from raw docker build to buildx + GHA layer cache (left out to avoid changing the proven build/push path untested).
  • Make the new unit-tests job a required status check via branch protection so it gates merges.

…it lane, resilient integration tests)

Tier 1 — reliability & speed:
- Add concurrency groups: orchestrate (per PR/branch; never cancels main
  deploys), promote-to-main (latest-only), destroy (per-env, never cancels).
- Add timeout-minutes to every job (no more 6h hung jobs).
- pip caching in integration-tests; uv caching in the new unit lane.
- New 'Unit & Regression Tests (no Azure)' job runs the 64 mock-based
  agent-framework regression tests on every PR/push — a fast, deterministic
  signal that catches API/dependency breakage long before the deploy path.
- Replace the blind 'sleep 30' with a backend readiness poll.
- Upload JUnit results as artifacts (integration + unit lanes).
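
A readiness poll in place of a fixed sleep might be sketched as follows. The `BACKEND_URL` variable, the `/health` path, and the retry counts are assumptions for illustration; the actual endpoint and limits used by the workflow are not shown in this description.

```yaml
steps:
  - name: Wait for backend readiness
    env:
      BACKEND_URL: ${{ vars.BACKEND_URL }}   # hypothetical repository variable
    run: |
      # Poll up to 30 times, 10 seconds apart, instead of sleeping blindly.
      for attempt in $(seq 1 30); do
        if curl --silent --fail --max-time 5 "$BACKEND_URL/health" > /dev/null; then
          echo "Backend ready after $attempt attempt(s)"
          exit 0
        fi
        sleep 10
      done
      echo "::error::Backend did not become ready in time"
      exit 1
```

The step returns as soon as the backend answers, so a healthy environment costs seconds while a slow one still gets a bounded wait.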

Tier 2 — resilience & least privilege:
- Integration tests gain an 'advisory' mode. For tests-only PRs to main the
  shared env is NOT built from the PR, so a degraded env (e.g. invalid model
  key → 500) is now reported as a warning instead of blocking unrelated PRs.
  A liveness gate distinguishes 'env unreachable/degraded' from 'tests failed'.
- Scope per-job permissions to least privilege on all inline jobs.
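
One way an advisory mode with a liveness gate could be wired up is sketched below. Everything here is an assumption made for illustration: the `ADVISORY` flag is hardcoded, whereas the real workflow presumably derives it from what the PR changes, and the `/health` probe and test path are hypothetical.

```yaml
jobs:
  integration-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 20                       # assumed value
    permissions:
      contents: read                          # least privilege for this job
    env:
      BACKEND_URL: ${{ vars.BACKEND_URL }}    # hypothetical repository variable
      ADVISORY: "true"                        # hypothetical; hardcoded for the sketch
    steps:
      - uses: actions/checkout@v6
      - name: Liveness gate
        id: liveness
        run: |
          # Record whether the shared environment responds at all.
          if curl --silent --fail --max-time 10 "$BACKEND_URL/health" > /dev/null; then
            echo "healthy=true" >> "$GITHUB_OUTPUT"
          else
            echo "healthy=false" >> "$GITHUB_OUTPUT"
          fi
      - name: Integration tests
        if: steps.liveness.outputs.healthy == 'true'
        run: pytest tests/integration --junitxml=junit-integration.xml
      - name: Report degraded environment
        if: steps.liveness.outputs.healthy != 'true'
        run: |
          if [ "$ADVISORY" = "true" ]; then
            # Advisory: surface the problem without failing the job.
            echo "::warning::Shared environment is unreachable or degraded; integration tests skipped"
          else
            echo "::error::Shared environment is unreachable or degraded"
            exit 1
          fi
```

The point of the split is that a red check then means the tests themselves failed, while an unhealthy environment shows up as an annotation on the run.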

Tier 3 — hygiene:
- Standardize action versions off the Node 20 deprecation:
  actions/checkout@v4 → v6, actions/setup-python@v5 → v6 across all workflows.

Validated: all 9 workflow files parse; the unit-lane command runs locally
(64 passed) and emits junit-unit.xml.

Note (follow-up): docker-application/docker-mcp still use raw 'docker build';
converting to buildx + GHA layer cache is a worthwhile next step but was left
out here to avoid changing the proven build/push path untested.
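
For whoever picks up that follow-up, a buildx build with the GitHub Actions layer cache usually takes roughly this shape. It is a sketch only: the context path and image tag are hypothetical, and it has not been tested against the docker-application or docker-mcp build paths.

```yaml
steps:
  - uses: actions/checkout@v6
  - uses: docker/setup-buildx-action@v3
  - uses: docker/build-push-action@v6
    with:
      context: .                   # hypothetical build context
      push: false                  # build only; the push path is left unchanged
      tags: example/app:ci         # hypothetical tag
      cache-from: type=gha         # restore layers from the Actions cache
      cache-to: type=gha,mode=max  # save all intermediate layers
```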

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@james-tn james-tn merged commit fb2acbb into main Jun 9, 2026
18 checks passed