Add Transit Gateway egress mode#115

Merged
sir-sigurd merged 14 commits into main from tgw-egress-toggle on Jun 9, 2026

@drernie drernie commented Jun 5, 2026

Member

For reviewers: picked up from the original author @drernie. @sir-sigurd is now carrying this PR through review (direct questions here, not to @drernie).

What is changing

This adds a first-class Transit Gateway egress mode to the Terraform quilt/vpc modules. When enable_transit_gateway = true (with transit_gateway_id) while creating a new VPC, Quilt still owns the VPC, subnets, endpoints, and route tables, but it disables NAT gateways and the IPv6 egress-only IGW, creates a TGW VPC attachment in the intra subnets, and points each private route table's IPv4 default route at the TGW. transit_gateway_id may be a value known only after apply (e.g. a TGW created in the same configuration) — the toggle is the separate enable_transit_gateway bool, so count stays resolvable at plan time. IPv6 egress through the TGW is opt-in via transit_gateway_ipv6_egress (default off, since not every TGW carries IPv6); when enabled it points the IPv6 default route at the TGW and enables IPv6 support on the attachment. Set it true only if the TGW actually carries IPv6 egress, since pointing ::/0 at a TGW that can't route IPv6 black-holes the traffic and stalls clients without Happy Eyeballs (e.g. Python). With it off, the new VPC has no IPv6 default route, so an IPv6 attempt fails immediately and the client uses IPv4 with no delay.
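As a rough illustration of the inputs described above, a root configuration might enable the mode as follows. This is a sketch under stated assumptions, not the module's documented example: the module source path, the `aws_ec2_transit_gateway.egress` resource, and the omitted inputs are hypothetical, and it assumes `create_new_vpc` is exposed on the quilt module. Only `enable_transit_gateway`, `transit_gateway_id`, and `transit_gateway_ipv6_egress` are taken directly from this PR.

```hcl
# Hypothetical TGW created in the same configuration. Its id is known only
# after apply, which is the case the separate bool toggle is meant to allow.
resource "aws_ec2_transit_gateway" "egress" {
  description = "Egress TGW (illustrative)"
}

module "quilt" {
  # Assumed source path; adjust to wherever the quilt module lives.
  source = "./modules/quilt"

  # ... other required module inputs omitted for brevity ...

  create_new_vpc         = true # TGW egress is only supported for new VPCs
  enable_transit_gateway = true # plan-time toggle that gates resource counts
  transit_gateway_id     = aws_ec2_transit_gateway.egress.id

  # Leave off unless the TGW really carries IPv6 egress; otherwise ::/0
  # would point at a black hole.
  transit_gateway_ipv6_egress = false
}
```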

Why

A customer needs Quilt-created infrastructure to egress through an existing network path instead of Quilt-created NAT gateways. This keeps the deployment on the supported Terraform path while letting the customer supply the TGW as the egress boundary.

What was considered

  • Customer-owned VPCs were set aside because they recreate the unsupported VPC-module mismatch this work is meant to remove.
  • Out-of-band route table edits were set aside because Quilt updates would not own or preserve that topology.
  • Using transit_gateway_id != null as the sole toggle was set aside: that gates resource count, which must be known at plan time, so a transit_gateway_id created in the same configuration (a computed value) would fail with "Invalid count argument." An explicit enable_transit_gateway bool (the idiomatic enable_*/create_* pattern, mirroring enable_nat_gateway) keeps the count plan-resolvable and lets the id be computed, so the TGW can be provisioned in the same config as the stack.
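The count constraint in the last item can be sketched in isolation. The variable and resource names match those mentioned in this PR, but the surrounding inputs (`vpc_id`, `subnet_ids`) are hypothetical stand-ins for values the real module derives internally.

```hcl
variable "enable_transit_gateway" {
  type    = bool
  default = false
}

variable "transit_gateway_id" {
  type    = string
  default = null
}

# Hypothetical inputs, used only to keep this sketch self-contained.
variable "vpc_id" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

resource "aws_ec2_transit_gateway_vpc_attachment" "egress" {
  # Rejected design: count derived from the id itself. When the id is a
  # computed value, Terraform cannot evaluate this during plan and reports
  # "Invalid count argument".
  # count = var.transit_gateway_id != null ? 1 : 0

  # Adopted design: count depends only on a bool that is known at plan time,
  # so the id below is free to be unknown until apply.
  count = var.enable_transit_gateway ? 1 : 0

  transit_gateway_id = var.transit_gateway_id
  vpc_id             = var.vpc_id
  subnet_ids         = var.subnet_ids
}
```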

Risks and reversibility

The apply depends on the customer sharing the TGW into the deployment account and configuring TGW route tables/return paths correctly. If those prerequisites are wrong, attachment or egress validation can fail. The change is reversible by setting enable_transit_gateway = false, which restores the existing NAT gateway and IPv6 egress-only IGW behavior for newly-created VPCs.

Open questions

  • The customer must confirm the TGW RAM share, attachment acceptance behavior, and TGW-side route tables before production apply.

What approval means

Approval means Quilt supports this as the Terraform-owned way to create a new Quilt VPC whose private-subnet default egress routes through a customer-provided TGW.

Validation

Automated (CI, no AWS credentials, no infrastructure)

  • terraform fmt -check -recursive and terraform validate for all modules.
  • terraform test in modules/vpc — plan-time, mocked-provider tests of the new-vs-existing network input validation, including TGW egress: a new VPC with enable_transit_gateway = true plans and creates the attachment + the IPv4 default egress route (and the IPv6 route only when transit_gateway_ipv6_egress is set); existing-VPC + enable_transit_gateway is rejected.
  • terraform test in modules/quilt/tests/smoke — plan-time smoke test of the public module against a mocked provider, including the TGW path.
  • These run in CI alongside fmt/validate.
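For readers unfamiliar with plan-time `terraform test` runs, the shape of such a test might look roughly like the fragment below. It is an illustrative sketch, not a copy of the tests in this PR: the file name and run names are hypothetical, the module's other required inputs are omitted, and `mock_provider` assumes Terraform 1.7 or later.

```hcl
# modules/vpc/tests/example.tftest.hcl (hypothetical file name)

# Mocked provider: no AWS credentials and no real API calls.
mock_provider "aws" {}

run "tgw_egress_new_vpc" {
  command = plan

  variables {
    # ... other required module inputs omitted ...
    create_new_vpc         = true
    enable_transit_gateway = true
    transit_gateway_id     = "tgw-0123456789abcdef0"
  }

  assert {
    condition     = length(aws_ec2_transit_gateway_vpc_attachment.egress) == 1
    error_message = "Expected exactly one TGW VPC attachment."
  }
}

run "tgw_rejects_malformed_id" {
  command = plan

  variables {
    # ... other required module inputs omitted ...
    create_new_vpc         = true
    enable_transit_gateway = true
    transit_gateway_id     = "not-a-tgw-id"
  }

  # The run passes only if the variable's validation block fails.
  expect_failures = [var.transit_gateway_id]
}
```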

End-to-end (live, on an internal dev stack)

Deployed a full Quilt stack with enable_transit_gateway = true — the TGW created in the same configuration (computed transit_gateway_id), NAT gateways disabled — and confirmed egress actually flows through the TGW:

  • The CloudFormation stack reached CREATE_COMPLETE and the catalog served traffic.
  • Outbound control-plane egress succeeded: license validation, transactional email (a password-reset email was delivered), SSO/OIDC login, and Mixpanel telemetry (the failure-silent path — events were received for the stack).
  • NAT and TGW byte/connection metrics were non-zero, and an in-container check from the registry reported its public egress IP equal to the egress VPC's NAT address — i.e. traffic exits private subnet → TGW → egress VPC NAT → internet.
  • Also confirms the computed-transit_gateway_id path applies cleanly (no "Invalid count argument").
  • Verified teardown: removing the stack disables the mode and restores normal egress.

Links

Greptile Summary

This PR adds a first-class Transit Gateway egress mode to the vpc and quilt modules. When enable_transit_gateway = true (only with create_new_vpc = true), the module disables NAT gateways and the IPv6 egress-only IGW, creates a TGW VPC attachment in the intra subnets, and routes each private route table's default IPv4 (and optionally IPv6) egress through the TGW.

  • Core toggle design: The enable_transit_gateway bool gates all resource count expressions, while transit_gateway_id can be a computed/unknown value — this avoids the "Invalid count argument" planning error that would occur if the count depended on a not-yet-known TGW ID.
  • Validation coverage: Invalid combinations (existing VPC + TGW, enabled-without-id, bad-format ID, id-without-enable) are all covered by separate expect_failures tests in modules/vpc/tests/validation.tftest.hcl, plus positive-path smoke tests threaded through modules/quilt.
  • Documentation: README, CHANGELOG, VARIABLES.md, and examples/main.tf are updated consistently, with clear reversibility and egress-IP-change warnings.

Confidence Score: 5/5

Safe to merge. The toggle design is architecturally sound, validation coverage is comprehensive, and the end-to-end test confirms the computed-ID path works correctly.

The transit_gateway_enabled local correctly ties all resource counts to new_network_valid && var.enable_transit_gateway — both values are known at plan time, preventing Invalid count argument even when transit_gateway_id is computed. The lifecycle precondition catches the null-ID case, the variable validation rejects malformed IDs, and the existing_network_requires map blocks the unsupported existing-VPC combination. All seven new test cases (positive and negative paths) exercise these invariants, and the PR's end-to-end validation confirms the full flow on a live stack.
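The gating described in this summary could be sketched as the fragment below. It is an approximation of the pattern, not the module's actual source: `local.new_network_valid` is the module's existing flag (defined elsewhere), and the `module.vpc` output names are assumptions about the upstream VPC module.

```hcl
locals {
  # Both operands are known at plan time, so every count that uses this
  # local stays plan-resolvable even when transit_gateway_id is computed.
  transit_gateway_enabled = local.new_network_valid && var.enable_transit_gateway
}

resource "aws_ec2_transit_gateway_vpc_attachment" "egress" {
  count = local.transit_gateway_enabled ? 1 : 0

  transit_gateway_id = var.transit_gateway_id
  vpc_id             = module.vpc.vpc_id        # assumed output name
  subnet_ids         = module.vpc.intra_subnets # ENI placement only (assumed output name)

  lifecycle {
    precondition {
      # Fails during plan for a known null; evaluation is deferred to apply
      # while the id is still unknown.
      condition     = var.transit_gateway_id != null
      error_message = "transit_gateway_id is required when enable_transit_gateway is true."
    }
  }
}
```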

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| modules/vpc/main.tf | Core TGW implementation: attachment in intra subnets, IPv4/IPv6 routes in private route tables, NAT/egress-IGW disabled via !var.enable_transit_gateway, transit_gateway_enabled local gates all counts on new_network_valid. Logic is correct and well-commented. |
| modules/vpc/variables.tf | Three new variables with correct defaults, descriptions, and a regex format-validation on transit_gateway_id (null-safe, skipped for unknown computed values). |
| modules/vpc/tests/validation.tftest.hcl | Seven new test cases covering positive paths (IPv4-only, IPv4+IPv6), existing-VPC rejection, enabled-without-id rejection, bad-format-id rejection, and id-without-enable no-op. Full assertion coverage of attachment attributes and route counts. |
| modules/quilt/variables.tf | Mirrors vpc module's three new variables with identical descriptions and the same transit_gateway_id format validation; correctly passes through to the vpc submodule. |
| modules/quilt/tests/smoke/smoke.tftest.hcl | Two new smoke runs (IPv4-only TGW and IPv4+IPv6 TGW) confirm the full quilt module wires end-to-end; negative-path tests are intentionally left to the vpc validation suite. |
| modules/quilt/main.tf | Three new variable pass-throughs to the vpc submodule; straightforward and correct. |
| README.md | New Transit Gateway egress section with HCL example, prerequisites checklist, CIDR uniqueness note, IPv6 opt-in warning, and reversibility/egress-IP-change advisory. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([enable_transit_gateway]) -->|true| B{new_network_valid?}
    A -->|false| NAT[NAT gateways + egress-only IGW]
    B -->|false| ERR[configuration_error precondition fires]
    B -->|true| C[transit_gateway_enabled = true]
    C --> D[aws_ec2_transit_gateway_vpc_attachment.egress count=1]
    D --> PREC{transit_gateway_id != null?}
    PREC -->|false| FAIL[Precondition failure]
    PREC -->|true| E[IPv4 routes: aws_route.private_tgw_ipv4_egress x N private RTs]
    E --> F{transit_gateway_ipv6_egress?}
    F -->|true| G[IPv6 routes: aws_route.private_tgw_ipv6_egress x N private RTs - ipv6_support=enable on attachment]
    F -->|false| H[No IPv6 default route - ipv6_support=disable on attachment]
    NAT --> I[S3 gateway endpoint unchanged in both modes]
    G --> I
    H --> I

Reviews (3): Last reviewed commit: "Address review follow-up: egress-IP doc,..."

Comment thread modules/vpc/main.tf

Copilot AI left a comment

Pull request overview

Adds a Transit Gateway (TGW) egress mode to the Terraform modules/vpc and exposes it through modules/quilt, allowing private-subnet default routes to egress via a customer-provided TGW instead of Quilt-managed NAT gateways / IPv6 egress-only IGW.

Changes:

  • Introduces transit_gateway_id input (quilt + vpc modules) to toggle TGW egress behavior.
  • Disables NAT gateways and IPv6 egress-only IGW when transit_gateway_id is set, creates a TGW VPC attachment, and adds IPv4/IPv6 default routes in private route tables to the TGW.
  • Documents the new option in examples/main.tf.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| modules/vpc/variables.tf | Adds transit_gateway_id input to the VPC module. |
| modules/vpc/main.tf | Implements TGW attachment + private default routes; disables NAT/egress-only IGW when TGW is set. |
| modules/quilt/variables.tf | Exposes transit_gateway_id on the public quilt module. |
| modules/quilt/main.tf | Wires transit_gateway_id through to the VPC submodule. |
| examples/main.tf | Adds example/commented usage for transit_gateway_id. |

Comment thread modules/vpc/variables.tf
Comment thread modules/quilt/variables.tf
Address PR review feedback on the TGW egress mode:
- Reject empty/malformed transit_gateway_id via a format validation on
  both the vpc and quilt module variables.
- Fail fast when transit_gateway_id is set with create_new_vpc == false
  (existing-VPC mode, where it would otherwise be silently ignored) by
  adding it to the vpc module's existing-network configuration_error
  precondition.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sir-sigurd and others added 7 commits June 8, 2026 14:23
These pin the intended behavior of the TGW egress mode and fail against the
current validation bug (the `transit_gateway_id == null` check sits in the
new-VPC requirement set, so a new VPC + transit_gateway_id wrongly trips
configuration_error and the feature is inert):

- vpc: new-VPC + transit_gateway_id must plan and create the TGW attachment +
  IPv4/IPv6 default egress routes; existing-VPC + transit_gateway_id must be
  rejected.
- quilt smoke: new-VPC + transit_gateway_id must plan end-to-end (wrapper now
  threads transit_gateway_id).

The validation fix follows in the next commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…g-VPC path

The transit_gateway_id == null check was in new_network_requires, so creating a
new VPC with a transit_gateway_id (the supported egress mode) failed the
new-network requirement set and tripped configuration_error — the feature could
never be applied. It also left existing-VPC + transit_gateway_id silently
accepted rather than rejected.

Move the check to existing_network_requires, which expresses the actual intent
("TGW egress requires create_new_vpc == true"):
- new VPC + transit_gateway_id now plans and creates the attachment + routes;
- existing VPC + transit_gateway_id is rejected during plan;
- existing VPC without a TGW is unaffected (transit_gateway_id == null holds).

Turns the failing tests from the previous commit green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Routing the private subnets' ::/0 at the TGW unconditionally assumed the TGW
carries IPv6. If it doesn't, IPv6 traffic is black-holed (worse than no route,
which lets IPv4 fall back cleanly) — an assumption the general-purpose module
should not bake in.

Add transit_gateway_ipv6_egress (bool, default false) on the vpc and quilt
modules. When false (default), no ::/0 -> TGW route is created and the
attachment advertises ipv6_support = "disable"; the established IPv4
control-plane egress is unaffected. Operators whose TGW does carry IPv6 opt in.

Tests cover both: IPv6 off (no v6 route, attachment ipv6_support disabled) and
IPv6 on (v6 route present, ipv6_support enabled).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The toggle was `transit_gateway_id != null`, used in `count`. That forbids a
transit_gateway_id known only after apply (e.g. a TGW created in the same
configuration): count must be known at plan, so a computed id raises
"Invalid count argument". The fix follows the idiomatic terraform-aws-modules
pattern (create_*/enable_* bool next to the value param, like enable_nat_gateway).

Add `enable_transit_gateway` (bool, default false) on the vpc and quilt
modules; gate the attachment/route counts and the NAT/EOIGW disable on it.
transit_gateway_id is now the value (required when enabled, may be computed),
guarded by a precondition on the attachment. Existing-VPC + enabled is still
rejected via the configuration_error precondition.

Tests updated to set enable_transit_gateway; verified a computed
transit_gateway_id (same-config TGW) now plans cleanly. Example updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- README: "Transit Gateway egress" section — what it does, how to enable
  (enable_transit_gateway + transit_gateway_id, computed id allowed), the
  operator-side TGW prerequisites (RAM share/accept, egress + return routes),
  CIDR-uniqueness requirement, IPv6 opt-in, and reversibility.
- VARIABLES: rows for enable_transit_gateway, transit_gateway_id,
  transit_gateway_ipv6_egress.
- CHANGELOG: [Unreleased] entry for the feature.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@sir-sigurd sir-sigurd requested a review from Copilot June 9, 2026 10:33
@sir-sigurd

Member

@greptileai please re-review — significant changes since the last pass: the toggle is now an explicit enable_transit_gateway bool (so transit_gateway_id may be computed), plus terraform test coverage and docs.

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comment thread examples/main.tf
Comment thread README.md Outdated
The previous wording implied IPv6 "falls back to IPv4" cleanly. Correct it:
with transit_gateway_ipv6_egress off there is no IPv6 default route, so an
IPv6 connection attempt fails immediately (ENETUNREACH) and the client uses
IPv4 with no delay. The slow case is the opposite — enabling it against a TGW
that does not carry IPv6 points ::/0 at a black hole, and clients without
Happy Eyeballs (e.g. Python requests/urllib3) stall on the connection timeout.
So the warning belongs on enabling it, not on the (safe) default-off path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@sir-sigurd sir-sigurd left a comment

Member

Follow-up round — this is likely the last. The core mechanism checks out: count stays plan-resolvable, the standalone routes don't collide with the upstream module's routes (it creates neither a NAT default route nor an egress-only-IGW route when both are disabled), S3 stays on the gateway endpoint, the IPv6 route and ipv6_support are gated on the same variable so they can't drift apart, and existing non-TGW deployments see no plan diff (the now-explicit create_egress_only_igw = true matches the prior module default in vpc 6.6.1). The transit_gateway_id precondition fails at plan time for a known null and correctly defers only for a computed ID.
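The coupling checked in this review can be pictured with the fragment below. It is a hedged reconstruction, not the module's code: it assumes the upstream module is `terraform-aws-modules/vpc/aws` and that it exposes `private_route_table_ids`, and the `tgw_ipv6_support` local is a hypothetical name used only for illustration.

```hcl
module "vpc" {
  source = "terraform-aws-modules/vpc/aws" # assumed upstream module

  # ... name, cidr, subnets, and other inputs omitted ...

  # With both disabled, the upstream module creates neither a NAT default
  # route nor an egress-only-IGW route, so the standalone TGW routes cannot
  # collide with it. With the toggle off, both stay at their prior values.
  enable_nat_gateway     = !var.enable_transit_gateway
  create_egress_only_igw = !var.enable_transit_gateway
}

locals {
  # Hypothetical helper: the attachment's ipv6_support and the IPv6 route
  # below read the same variable, so they cannot drift apart.
  tgw_ipv6_support = var.transit_gateway_ipv6_egress ? "enable" : "disable"
}

resource "aws_route" "private_tgw_ipv6_egress" {
  # A further narrowing of the shared gate: one route per private route
  # table, and only when IPv6 egress through the TGW is opted in.
  count = (local.transit_gateway_enabled && var.transit_gateway_ipv6_egress) ? length(module.vpc.private_route_table_ids) : 0

  route_table_id              = module.vpc.private_route_table_ids[count.index]
  destination_ipv6_cidr_block = "::/0"
  transit_gateway_id          = var.transit_gateway_id
}
```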

Design: sound on the mechanism — Transit Gateway routing, template safety (count/route-collision), and upgrade-safety for existing deployments all verified. One operational doc gap on the toggle's egress-IP side effect, below. No blockers.

Documentation

P2 — The reversibility note covers NAT/EIP recreation and the egress interruption, but not that the stack's public egress IP changes when the mode is toggled. Destroying the NAT gateways releases their Elastic IPs, and AWS won't hand the same addresses back on reallocation; toggling on routes egress out the customer's NAT instead. Either direction changes the outbound IP, so anything that allowlists Quilt's egress address (a license endpoint, a partner firewall, a SaaS IP allowlist) breaks silently — and it looks unrelated to the change. Worth one explicit sentence.

P3: examples/main.tf:142 still reads "# ... otherwise IPv6 is left on IPv4 fallback". That's the wording the README was already corrected away from (no IPv6 default route → immediate ENETUNREACH → IPv4, rather than a "fallback"). Align the comment with the README mechanism.

Tests

P3 — The added runs cover the happy paths and the existing-VPC rejection, but not the negative paths the module newly guards: the transit_gateway_id = null precondition failure (enable_transit_gateway = true), the transit_gateway_id regex rejection, and the "id set but enable_transit_gateway = false" no-op. These are the guards most likely to silently regress in a future refactor. (The smoke runs asserting only stack_name is fine — the resource-level assertions live in the vpc tests.) Add-only.

Nits

P3 (reshape): the transit_gateway_id description drifts between modules. The quilt wrapper says "when creating a new VPC" but the vpc module, which actually enforces the constraint, doesn't. If anything the scoping clause belongs on the vpc side.

P3 (reshape) — The gate local.new_network_valid && var.enable_transit_gateway is repeated across all three count expressions. A derived local alongside the existing *_valid flags would give it one source of truth; the IPv6 route would then read as a clear further-narrowing of it.

P3 (reshape) — A one-line comment on why the attachment lands in intra_subnets (deliberate — it's only ENI placement; the default routes go in the private route tables) would keep a future maintainer from "fixing" it into the private subnets.

P3: The ^tgw-[0-9a-f]+$ validation is anchored but places no bound on length, so tgw-0 passes; real IDs are tgw- followed by 17 hex characters. It catches gross typos but not a truncated ID. Tightening is optional.
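If the optional tightening were adopted, the validation might look like the sketch below. The description and error text are illustrative, and the fixed length of 17 relies on the ID format stated in this comment; it is not what the PR merged.

```hcl
variable "transit_gateway_id" {
  type        = string
  default     = null
  description = "ID of the Transit Gateway to route egress through."

  validation {
    # Null is allowed (feature off). Otherwise require "tgw-" followed by
    # exactly 17 lowercase hex characters, which also rejects truncated ids.
    condition     = var.transit_gateway_id == null || can(regex("^tgw-[0-9a-f]{17}$", var.transit_gateway_id))
    error_message = "transit_gateway_id must be tgw- followed by 17 hex characters, e.g. tgw-0123456789abcdef0."
  }
}
```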


Reviewed against 91645b6

- README: note that toggling the mode changes the stack's public egress IP
  (NAT EIPs released on disable; egress via the TGW's NAT when enabled), so
  anything allowlisting Quilt's egress address must be updated or it breaks
  silently.
- examples/main.tf: align the IPv6 comment with the corrected mechanism (off =
  no IPv6 default route, clients use IPv4) instead of "IPv4 fallback".
- vpc tests: add negative-path runs — enabled-without-id (attachment
  precondition), malformed id (variable validation), and id-set-without-enable
  (no-op: no attachment/routes).
- vpc main.tf: hoist the repeated `new_network_valid && enable_transit_gateway`
  count gate into a `transit_gateway_enabled` local; add a comment on why the
  attachment lands in intra subnets (ENI placement only).
- quilt: align transit_gateway_id description with the vpc module (the new-VPC
  scoping lives on enable_transit_gateway).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@sir-sigurd sir-sigurd requested a review from Copilot June 9, 2026 12:09
@sir-sigurd

Member

@greptileai please re-review — addressed the follow-up: egress-IP reversibility note, negative-path tests (enabled-without-id, bad-format id, id-without-enable no-op), the transit_gateway_enabled local, and doc/description alignment.

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

@sir-sigurd sir-sigurd requested a review from nl0 June 9, 2026 12:18
@sir-sigurd sir-sigurd self-assigned this Jun 9, 2026
@sir-sigurd

Member

@drernie please check the user-facing part of the PR.

Comment thread README.md Outdated
Comment thread README.md Outdated
@sir-sigurd sir-sigurd mentioned this pull request Jun 9, 2026
sir-sigurd and others added 2 commits June 9, 2026 19:34
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The applied review suggestions left two long unwrapped lines; re-flow them
to match the surrounding wrap convention (wording unchanged).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@sir-sigurd sir-sigurd left a comment

Member

Carried this over from @drernie. TGW egress mode: toggle design is plan-safe (computed transit_gateway_id works), validation + smoke tests cover positive/negative paths, and the feature was validated end-to-end on a live dev stack (egress confirmed through the TGW). Docs reviewed by @drernie. CI green. Approving.

@sir-sigurd sir-sigurd merged commit f48e36b into main Jun 9, 2026
8 checks passed
@sir-sigurd sir-sigurd deleted the tgw-egress-toggle branch June 9, 2026 14:41
@sir-sigurd sir-sigurd mentioned this pull request Jun 9, 2026
sir-sigurd added a commit that referenced this pull request Jun 9, 2026
Cut release 1.8.0 — promotes the Transit Gateway egress mode (#115) from Unreleased into a dated CHANGELOG section.