From 8411c5611255b391e233aa3c3ea7841210ba600e Mon Sep 17 00:00:00 2001
From: Wesley Dawson
Date: Wed, 25 Feb 2026 18:08:43 -0800
Subject: [PATCH 1/4] chore(MZCLD-2261): auto tag and deploy prod on merges to main

---
 .github/workflows/build_and_push_to_gar.yml | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_push_to_gar.yml b/.github/workflows/build_and_push_to_gar.yml
index 2db87bd2..96fa0972 100644
--- a/.github/workflows/build_and_push_to_gar.yml
+++ b/.github/workflows/build_and_push_to_gar.yml
@@ -20,7 +20,7 @@ jobs:
     environment: build

     permissions:
-      contents: read
+      contents: write
       id-token: write

     steps:
@@ -77,3 +77,16 @@ jobs:
           push: true
           cache-from: type=gha
           cache-to: type=gha,mode=max
+
+      - name: auto-tag for prod build
+        if: github.ref == 'refs/heads/main'
+        run: |
+          # Get the latest semver tag
+          LATEST=$(git tag --list 'v[0-9]*.[0-9]*.[0-9]*' --sort=-v:refname | head -1)
+          # Increment patch version
+          MAJOR=$(echo "$LATEST" | cut -d. -f1)
+          MINOR=$(echo "$LATEST" | cut -d. -f2)
+          PATCH=$(echo "$LATEST" | cut -d. -f3)
+          NEW_TAG="${MAJOR}.${MINOR}.$((PATCH + 1))"
+          git tag "$NEW_TAG"
+          git push origin "$NEW_TAG"
From 4bb3355f699fa2d7e789bebbd0167d9878f729e1 Mon Sep 17 00:00:00 2001
From: Wesley Dawson
Date: Wed, 25 Feb 2026 18:26:56 -0800
Subject: [PATCH 2/4] chore(MZCLD-2261): update docs

---
 .github/pull_request_template.md | 6 +++---
 docs/SRE_INFO.md | 8 ++++----
 docs/refractr-architecture.md | 4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
index 40e85559..01870f00 100644
--- a/.github/pull_request_template.md
+++ b/.github/pull_request_template.md
@@ -8,6 +8,6 @@ When creating a PR for Refractr, confirm you've done the following steps for a s
 - [ ] Have you checked if there are any TLS cert concerns - e.g. if the domain being redirected already exists, and it is being changed to point at refractr, is a temporary TLS 'outage' while waiting for certification via HTTP challenge okay? If not, add a note to the JIRA ticket.
 - [ ] If desired, have you generated the nginx config manually to confirm updates work as expected?

-After PR merge, next steps include:
-- [ ] A merge to the `main` branch will automatically deploy refractr's stage environment -- deploying the prod environment requires a GitHub release to be created.
-- [ ] Once deployed, refractr's certmap must be updated and DNS entries must be changed -- SRE can help with this. Please pull someone in on the JIRA ticket or ask for help in #sre on Slack.
+After PR merge:
+- [ ] A merge to `main` automatically deploys both stage and prod (for prod, CI auto-creates the next semver tag).
+- [ ] TLS certificates are created automatically by [Spacelift](https://mozilla.app.spacelift.io/stack/refractr-prod). DNS changes may still require SRE or IT to make changes in other systems (e.g. Markmonitor, Route53) -- ask in #mozcloud-support on Slack or in the JIRA ticket.
diff --git a/docs/SRE_INFO.md b/docs/SRE_INFO.md
index 029905d2..3587ad8f 100644
--- a/docs/SRE_INFO.md
+++ b/docs/SRE_INFO.md
@@ -47,15 +47,15 @@ When an update to e.g. `prod-refractr.yaml` was made, you can validate the syste

 #### deployment

-We automatically deploy a stage & prod env for refractr. Stage is deployed whenever a merge to `main` is done, prod is deployed when a GitHub release is created (or a tag has been pushed), that matches the pattern `v\d+\.\d+\.\d+`. In both cases, GitHub actions will build the docker image, Argo CD will ship it to GKE.
+We automatically deploy a stage & prod env for refractr. When a PR merges to `main`, CI builds a stage image (tagged with `git describe` output like `v0.0.225-3-gabcdef1`), then auto-creates the next semver tag (e.g., `v0.0.226`). The tag push triggers a second CI run that builds the prod image. Argo CD picks up both images automatically. No manual tagging or release creation is needed.

 #### update certificates

-Refractr's Loadbalancer is build from Gateway API resources by GCP's Gateway API Controller. The system handles ~180 redirects and forces TLS for every refract, which results in ~360 SSL certificates attached to the Loadbalancer. To be able to make use of such a high number of SSL certificates at once, we're using GCP's certificate manager API to map hostnames to certificates. Certificate manager API defines a certmap (that is attached to the Gateway API Loadbalancer via an annotation), which in turn defines entries that map hostnames to certificates.
+Refractr's Loadbalancer is built from Gateway API resources by GCP's Gateway API Controller. The system handles ~180 redirects and forces TLS for every refract, which results in ~360 SSL certificates attached to the Loadbalancer. To be able to make use of such a high number of SSL certificates at once, we're using GCP's certificate manager API to map hostnames to certificates. Certificate manager API defines a certmap (that is attached to the Gateway API Loadbalancer via an annotation), which in turn defines entries that map hostnames to certificates.

-Certificates and certificate-map-entries are managed with terraform, because no method for building them alongside the Gateway API resources directly from GKE was available at the time of migrating to mozcloud.
+Certificates and certificate-map-entries are managed with terraform, using a [terraform-module](https://github.com/mozilla/terraform-modules/tree/main/google_certificate_manager_certificate_map) that reads refract definitions from `prod-refractr.yml` directly via an HTTP data source.

-To simplify this, we're using a [terraform-module](https://github.com/mozilla/terraform-modules/tree/main/google_certificate_manager_certificate_map), which reads all refract definitions in from a single yaml document. This document can be generated with a `doit` task, `$ poetry run doit certificate_manager_input`. Once generated and put in place, refractr's certmap can be updated with a PR in to the infra repo.
+Certificate updates are automated via Spacelift. When a new prod image is deployed, argocd-image-updater commits the new image tag to `webservices-infra`, which triggers a Spacelift tracked run on the `refractr-prod` stack. A plan policy auto-approves changes to certificate manager resources. Additionally, hourly drift detection catches any certificate changes that may have been missed and auto-reconciles them.

 #### DNS

diff --git a/docs/refractr-architecture.md b/docs/refractr-architecture.md
index 4aee4799..e589ffd3 100644
--- a/docs/refractr-architecture.md
+++ b/docs/refractr-architecture.md
@@ -29,13 +29,13 @@ The refractr.yml spec allows for specifying tests in the form of given-source to

 ### minimal changes
 Due to the nature of redirects and rewrites it is common to add new domains or subtract old ones. This means that the nginx config needs to be told which are the valid list of domains and update them when deploying a new refractr Docker image to GKE. When a new version of the refractr image is pushed to prod, redirects are already live.
-In a second step, certificates must be created and linked to refractr's Loadbalancer -- this step currently requires a second PR to be opened after deployment. All certificates are managed with GCP's certificate manager api and attached to the Loadbalancer by a certmap, we manage all of those resources via terraform in refractr's infrastructure project.
+Certificates are updated automatically. The terraform configuration reads `prod-refractr.yml` directly via an HTTP data source, so certificate changes are detected by Spacelift drift detection and auto-reconciled without a separate PR.

 ## refractr traffic flow
 Traffic flow to refractr starts with DNS. A domain that should be handled by the system must be pointed to it's Loadbalancer, usually by a CNAME, in some cases, by A / AAAA records. Once a request reaches the Loadbalancer, we force HTTPS, then forward to the actual application pods, which then handle individual redirects as configured.

 ## Continuous Integration (CI)
-CI is done with GitHub Actions. Tests run on every push to any branch in the repo. However, only pushes to the **main** branch and **tags** (matching the `/v[0-9]+.[0-9]+.[0-9]+/`) will cause an update of refractr's Docker image. In addition to tests, Pull Requests (PRs) require code reviews before allowing the change to be to the **main** branch.
+CI is done with GitHub Actions. Tests run on every push to any branch in the repo. Pushes to the **main** branch build a stage image (tagged with `git describe` output), then CI auto-creates the next semver tag. The tag push triggers a second workflow run that builds the prod image. In addition to tests, Pull Requests (PRs) require code reviews before allowing the change to be merged to the **main** branch.

 ## The Handoff
 The handoff point between CI and CD is the Docker Repository. In this case we decided to use the Cloud Provider based Docker Repository. For GCP that is Google Artifact Registry (GAR). Images are named refractr and have the output of git describe for the image tag. Image tags are watched and deployed by Argo CD.
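You can exercise the patch-bump logic from the `auto-tag for prod build` step in PATCH 1/4 (removed again in PATCH 3/4) outside CI. The sketch below is not part of the series. `next_patch_tag` and `latest_semver_tag` are hypothetical names, and tags are assumed to look like `vMAJOR.MINOR.PATCH`. The sketch differs from the inline step in one deliberate way. When no tag exists, the `cut`-based version produces `..1`. This sketch instead falls back to `v0.0.0` and yields `v0.0.1`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the auto-tag step from PATCH 1/4 (not part of the series).

# Print the newest tag matching vX.Y.Z, using the same listing as the CI step.
latest_semver_tag() {
  git tag --list 'v[0-9]*.[0-9]*.[0-9]*' --sort=-v:refname | head -1
}

# Print the given tag with its PATCH component incremented.
# An empty or missing argument is treated as v0.0.0 (assumption; the CI step had no fallback).
next_patch_tag() {
  local latest="${1:-v0.0.0}"
  local patch="${latest##*.}"   # text after the last dot, e.g. "225"
  local prefix="${latest%.*}"   # everything before it, e.g. "v0.0"
  echo "${prefix}.$((patch + 1))"
}

# Usage inside a git checkout (not run here):
#   next_patch_tag "$(latest_semver_tag)"
```

Parameter expansion avoids the three `cut` subshells, and it keeps the `v` prefix attached to MAJOR just as the original step did.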
From ea3eb0a7c6ced3f6b751421658eb72333b816818 Mon Sep 17 00:00:00 2001
From: Wesley Dawson
Date: Fri, 27 Feb 2026 12:29:06 -0800
Subject: [PATCH 3/4] chore(MZCLD-2261): update approach

---
 .github/workflows/build_and_push_to_gar.yml | 18 ++++--------------
 docs/SRE_INFO.md | 2 +-
 docs/refractr-architecture.md | 2 +-
 3 files changed, 6 insertions(+), 16 deletions(-)

diff --git a/.github/workflows/build_and_push_to_gar.yml b/.github/workflows/build_and_push_to_gar.yml
index 96fa0972..d6a66956 100644
--- a/.github/workflows/build_and_push_to_gar.yml
+++ b/.github/workflows/build_and_push_to_gar.yml
@@ -5,6 +5,9 @@ on:
     branches:
       - main

+    # tag-based deployments are deprecated but can be reconfigured via
+    # https://github.com/mozilla/global-platform-admin/blob/main/tenants/refractr.yaml
+    # if desired
     tags:
       - v[0-9]+.[0-9]+.[0-9]+

@@ -20,7 +23,7 @@ jobs:
     environment: build

     permissions:
-      contents: write
+      contents: read
       id-token: write

     steps:
@@ -77,16 +80,3 @@ jobs:
           push: true
           cache-from: type=gha
           cache-to: type=gha,mode=max
-
-      - name: auto-tag for prod build
-        if: github.ref == 'refs/heads/main'
-        run: |
-          # Get the latest semver tag
-          LATEST=$(git tag --list 'v[0-9]*.[0-9]*.[0-9]*' --sort=-v:refname | head -1)
-          # Increment patch version
-          MAJOR=$(echo "$LATEST" | cut -d. -f1)
-          MINOR=$(echo "$LATEST" | cut -d. -f2)
-          PATCH=$(echo "$LATEST" | cut -d. -f3)
-          NEW_TAG="${MAJOR}.${MINOR}.$((PATCH + 1))"
-          git tag "$NEW_TAG"
-          git push origin "$NEW_TAG"
diff --git a/docs/SRE_INFO.md b/docs/SRE_INFO.md
index 3587ad8f..f2f7a5ab 100644
--- a/docs/SRE_INFO.md
+++ b/docs/SRE_INFO.md
@@ -47,7 +47,7 @@ When an update to e.g. `prod-refractr.yaml` was made, you can validate the syste

 #### deployment

-We automatically deploy a stage & prod env for refractr. When a PR merges to `main`, CI builds a stage image (tagged with `git describe` output like `v0.0.225-3-gabcdef1`), then auto-creates the next semver tag (e.g., `v0.0.226`). The tag push triggers a second CI run that builds the prod image. Argo CD picks up both images automatically. No manual tagging or release creation is needed.
+We automatically deploy a stage & prod env for refractr. Both environments use the same deployment: when a PR merges to `main`, CI builds a single image (tagged with `git describe` output like `v0.0.225-3-gabcdef1`) and Argo CD deploys it to both stage and prod.

 #### update certificates

diff --git a/docs/refractr-architecture.md b/docs/refractr-architecture.md
index e589ffd3..34e7d6eb 100644
--- a/docs/refractr-architecture.md
+++ b/docs/refractr-architecture.md
@@ -35,7 +35,7 @@ Certificates are updated automatically. The terraform configuration reads `prod-
 Traffic flow to refractr starts with DNS. A domain that should be handled by the system must be pointed to it's Loadbalancer, usually by a CNAME, in some cases, by A / AAAA records. Once a request reaches the Loadbalancer, we force HTTPS, then forward to the actual application pods, which then handle individual redirects as configured.

 ## Continuous Integration (CI)
-CI is done with GitHub Actions. Tests run on every push to any branch in the repo. Pushes to the **main** branch build a stage image (tagged with `git describe` output), then CI auto-creates the next semver tag. The tag push triggers a second workflow run that builds the prod image. In addition to tests, Pull Requests (PRs) require code reviews before allowing the change to be merged to the **main** branch.
+CI is done with GitHub Actions. Tests run on every push to any branch in the repo. Pushes to the **main** branch build a single image (tagged with `git describe` output) that is deployed to both stage and prod by Argo CD. In addition to tests, Pull Requests (PRs) require code reviews before allowing the change to be merged to the **main** branch.

 ## The Handoff
 The handoff point between CI and CD is the Docker Repository. In this case we decided to use the Cloud Provider based Docker Repository. For GCP that is Google Artifact Registry (GAR). Images are named refractr and have the output of git describe for the image tag. Image tags are watched and deployed by Argo CD.
From a0309f8e7bb6f658c09001edac5904ee0ae8d09a Mon Sep 17 00:00:00 2001
From: whd
Date: Fri, 27 Feb 2026 20:52:11 +0000
Subject: [PATCH 4/4] Update .github/pull_request_template.md

Co-authored-by: Jon Buckley
---
 .github/pull_request_template.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
index 01870f00..8f884065 100644
--- a/.github/pull_request_template.md
+++ b/.github/pull_request_template.md
@@ -9,5 +9,5 @@ When creating a PR for Refractr, confirm you've done the following steps for a s
 - [ ] If desired, have you generated the nginx config manually to confirm updates work as expected?

 After PR merge:
-- [ ] A merge to `main` automatically deploys both stage and prod (for prod, CI auto-creates the next semver tag).
+- [ ] A merge to `main` automatically deploys both stage and prod.
 - [ ] TLS certificates are created automatically by [Spacelift](https://mozilla.app.spacelift.io/stack/refractr-prod). DNS changes may still require SRE or IT to make changes in other systems (e.g. Markmonitor, Route53) -- ask in #mozcloud-support on Slack or in the JIRA ticket.
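With this series applied, a single image serves both environments. It is tagged with `git describe` output such as `v0.0.225-3-gabcdef1`, which has three parts: the nearest `vX.Y.Z` tag, the number of commits since that tag, and `g` followed by an abbreviated commit hash. The sketch below splits a tag of that shape back into its parts. `parse_describe_tag` is a hypothetical helper. It assumes the describe output carries no extra suffix such as `-dirty`, because this series does not show the exact `git describe` flags the workflow uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a `git describe`-style image tag into its parts.
# Prints "<release> <commits-since-release> <abbrev-hash>".
# For an exact release tag, commits is 0 and hash is "-".
parse_describe_tag() {
  local tag="$1"
  # Group 1: vX.Y.Z; the optional suffix "-<n>-g<hex>" fills groups 3 and 4.
  local pattern='^(v[0-9]+\.[0-9]+\.[0-9]+)(-([0-9]+)-g([0-9a-f]+))?$'
  if [[ $tag =~ $pattern ]]; then
    local release="${BASH_REMATCH[1]}"
    local commits="${BASH_REMATCH[3]:-0}"
    local hash="${BASH_REMATCH[4]:--}"
    echo "$release $commits $hash"
  else
    echo "unrecognized tag: $tag" >&2
    return 1
  fi
}
```

For example, `parse_describe_tag v0.0.225-3-gabcdef1` prints `v0.0.225 3 abcdef1`, a plain `v0.0.225` prints `v0.0.225 0 -`, and a tag such as `latest` is rejected with a non-zero exit status.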