Skip to content

test(uat): allow NATS ingress on UAT AWS/GCP clusters#1376

Merged
mchmarny merged 1 commit into
mainfrom
test/uat-nats-firewall-rules
Jun 16, 2026
Merged

test(uat): allow NATS ingress on UAT AWS/GCP clusters#1376
mchmarny merged 1 commit into
mainfrom
test/uat-nats-firewall-rules

Conversation

@mchmarny

@mchmarny mchmarny commented Jun 16, 2026

Copy link
Copy Markdown
Member

Summary

Allow NATS (4222/tcp) ingress on the UAT AWS and GCP clusters.

Motivation / Context

Dynamo's NATS clients (in worker pods) could not reach the NATS server on the system nodes in the UAT environments.

Fixes: N/A
Related: #1369

Type of Change

  • Build/CI/tooling

Component(s) Affected

  • Other: UAT cluster infra config (tests/uat/)

Implementation Notes

  • AWS: added an EKS security-group ingress rule for 4222/tcp. It sources both the worker node subnet (10.0.128.0/17) and the pod subnet (100.65.0.0/16) — AWS VPC-CNI does not SNAT in-VPC pod-to-pod traffic, so the system node hosting NATS sees the pod source IP, not the node IP. A node-only rule would never match.
  • AWS: added the HQ CIDR 128.77.49.32/30 to the control-plane endpoint allowedCidrs, for parity with the GKE authorizedNetworks nw3 entry. (This is an HQ-access addition, not the worker CIDR — corrected from the earlier description.)
  • GCP: declared explicit system/worker node subnets. No dedicated NATS firewall rule is added — the existing nccl-internal rule already allows all intra-VPC TCP from 10.0.0.0/8 (covering node primaries and the GKE pod secondary ranges), which includes 4222. The initial draft tightened nccl-internal to node-only CIDRs and added a redundant allow-nats rule; both were reverted after review since they would have dropped pod-range coverage.
  • Port 9090 (Prometheus query, ai-service-metrics) is already enabled by default, so no rule is needed for Prometheus queries from worker nodes.

Testing

# Config-only change to UAT infra YAML; no code paths exercised by make qualify.

Risk Assessment

  • Low — Isolated change to UAT-only infra config, easy to revert

Rollout notes: Applied when the UAT clusters are next provisioned/reconciled. N/A for runtime code.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@mchmarny mchmarny requested a review from a team as a code owner June 16, 2026 00:13
@mchmarny mchmarny added the theme/ci-dx CI pipelines, developer experience, and build tooling label Jun 16, 2026
@mchmarny mchmarny self-assigned this Jun 16, 2026
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4343c62a-4d88-46eb-bb5f-e1894c45ee0f

📥 Commits

Reviewing files that changed from the base of the PR and between fd0d348 and 002bdac.

📒 Files selected for processing (2)
  • tests/uat/aws/cluster-config.yaml
  • tests/uat/gcp/cluster-config.yaml

📝 Walkthrough

Walkthrough

Two UAT cluster configuration files are updated to enable and document NATS traffic on TCP port 4222. In the AWS EKS config, a new CIDR (128.77.49.32/30) is appended to the control plane allowedCidrs, and a new ingress security group rule is added under network.eks.securityGroups.additionalRules permitting TCP port 4222 from the worker subnet (10.0.128.0/17) and pod subnet (100.65.0.0/16). In the GCP GKE config, explicit node subnet entries (system-subnet, worker-subnet) with CIDR blocks are defined, and firewall rules documentation is updated to clarify that the existing 10.0.0.0/8 intra-VPC rule already covers node and pod ranges and permits NATS TCP port 4222.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • NVIDIA/aicr#1369: Both PRs address EKS/GKE network configuration for NATS TCP port 4222, with the retrieved PR documenting required security group rules and this PR adding them to UAT cluster configs.

Suggested labels

size/M

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: allowing NATS ingress on UAT AWS/GCP clusters, which directly matches the changeset modifications to cluster configuration files.
Description check ✅ Passed The description is directly related to the changeset, providing detailed context about the NATS connectivity issue, implementation approach for both AWS and GCP, and risk assessment.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch test/uat-nats-firewall-rules

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 77.1%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-77.1%25-green)

No Go source files changed in this PR.

Comment thread tests/uat/aws/cluster-config.yaml Outdated
Comment thread tests/uat/gcp/cluster-config.yaml Outdated
Comment thread tests/uat/gcp/cluster-config.yaml Outdated
Comment thread tests/uat/aws/cluster-config.yaml

@yuanchen8911 yuanchen8911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first issue is a blocker. Please take a look

AWS: add an EKS security-group ingress rule for NATS (4222/tcp). Source
both the worker node subnet (10.0.128.0/17) and the pod subnet
(100.65.0.0/16) — VPC-CNI does not SNAT in-VPC pod-to-pod traffic, so
the system node hosting NATS sees the pod source IP, not the node IP.
Also add the HQ CIDR (128.77.49.32/30) to the control-plane endpoint
allowlist for parity with the GKE authorizedNetworks.

GCP: declare explicit system/worker node subnets. No dedicated NATS
firewall rule is added — the existing nccl-internal rule already allows
all intra-VPC TCP from 10.0.0.0/8 (covering node and pod secondary
ranges), which includes 4222.
@mchmarny mchmarny force-pushed the test/uat-nats-firewall-rules branch from fd0d348 to 002bdac Compare June 16, 2026 01:02

@yuanchen8911 yuanchen8911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All four findings are addressed:

  1. AWS 4222 rule now sources the pod CIDR 100.65.0.0/16 (alongside the node subnet) — pod-originated NATS will match.
  2. nccl-internal kept at 10.0.0.0/8 (tightening reverted) — pod secondary ranges retained.
  3. allow-nats dropped as redundant — nccl-internal already permits intra-VPC 4222.
  4. Control-plane allowlist (128.77.49.32/30) description corrected (HQ-access, not worker CIDR).

LGTM.

@mchmarny mchmarny enabled auto-merge (squash) June 16, 2026 01:12
@mchmarny mchmarny merged commit 12b86df into main Jun 16, 2026
31 checks passed
@mchmarny mchmarny deleted the test/uat-nats-firewall-rules branch June 16, 2026 01:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/tests size/S theme/ci-dx CI pipelines, developer experience, and build tooling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants