test(uat): allow NATS ingress on UAT AWS/GCP clusters #1376
Walkthrough
Two UAT cluster configuration files are updated to enable and document NATS traffic on TCP port 4222.
Coverage Report ✅
No Go source files changed in this PR.
yuanchen8911
left a comment
The first issue is a blocker. Please take a look.
AWS: add an EKS security-group ingress rule for NATS (4222/tcp). Source both the worker node subnet (10.0.128.0/17) and the pod subnet (100.65.0.0/16): VPC-CNI does not SNAT in-VPC pod-to-pod traffic, so the system node hosting NATS sees the pod source IP, not the node IP. Also add the HQ CIDR (128.77.49.32/30) to the control-plane endpoint allowlist for parity with the GKE authorizedNetworks.

GCP: declare explicit system/worker node subnets. No dedicated NATS firewall rule is added: the existing nccl-internal rule already allows all intra-VPC TCP from 10.0.0.0/8 (covering node and pod secondary ranges), which includes 4222.
fd0d348 to 002bdac
yuanchen8911
left a comment
All four findings are addressed:
- AWS 4222 rule now sources the pod CIDR 100.65.0.0/16 (alongside the node subnet), so pod-originated NATS traffic will match.
- nccl-internal kept at 10.0.0.0/8 (tightening reverted); pod secondary ranges retained.
- allow-nats dropped as redundant; nccl-internal already permits intra-VPC 4222.
- Control-plane allowlist (128.77.49.32/30) description corrected (HQ-access, not worker CIDR).
LGTM.
Summary
Allow NATS (4222/tcp) ingress on the UAT AWS and GCP clusters.
Motivation / Context
Dynamo's NATS clients (in worker pods) could not reach the NATS server on the system nodes in the UAT environments.
Fixes: N/A
Related: #1369
Type of Change
Component(s) Affected
tests/uat/)
Implementation Notes
- AWS: add an EKS security-group ingress rule for NATS (4222/tcp), sourced from both the worker node subnet (10.0.128.0/17) and the pod subnet (100.65.0.0/16). AWS VPC-CNI does not SNAT in-VPC pod-to-pod traffic, so the system node hosting NATS sees the pod source IP, not the node IP. A node-only rule would never match.
- Add 128.77.49.32/30 to the control-plane endpoint allowedCidrs, for parity with the GKE authorizedNetworks nw3 entry. (This is an HQ-access addition, not the worker CIDR; corrected from the earlier description.)
- GCP: declare explicit system/worker node subnets. No dedicated NATS firewall rule is added: the existing nccl-internal rule already allows all intra-VPC TCP from 10.0.0.0/8 (covering node primaries and the GKE pod secondary ranges), which includes 4222. The initial draft tightened nccl-internal to node-only CIDRs and added a redundant allow-nats rule; both were reverted after review since they would have dropped pod-range coverage.
- ai-service-metrics) is already enabled by default, so no rule is needed for Prometheus queries from worker nodes.
Testing
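The source-matching argument in the Implementation Notes can be checked offline. The following is a minimal sketch, not part of this PR, using Python's standard ipaddress module. The two CIDRs are the ones quoted in the notes; the helper rule_matches and the sample pod and node addresses are hypothetical, chosen only to fall inside those ranges.

```python
import ipaddress

# CIDRs quoted in the Implementation Notes.
NODE_SUBNET = ipaddress.ip_network("10.0.128.0/17")  # AWS worker node subnet
POD_SUBNET = ipaddress.ip_network("100.65.0.0/16")   # AWS pod subnet (VPC-CNI)


def rule_matches(source_ip, allowed_cidrs):
    """Return True if source_ip falls inside any allowed source CIDR."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in cidr for cidr in allowed_cidrs)


# Hypothetical addresses, picked only to land inside the ranges above.
pod_ip = "100.65.12.34"
node_ip = "10.0.200.7"

node_only_rule = [NODE_SUBNET]
node_and_pod_rule = [NODE_SUBNET, POD_SUBNET]

# Without SNAT, the NATS host sees the pod IP as the source address.
print(rule_matches(pod_ip, node_only_rule))     # False: node-only rule misses pod traffic
print(rule_matches(pod_ip, node_and_pod_rule))  # True: pod CIDR is now a permitted source
print(rule_matches(node_ip, node_only_rule))    # True: node-originated traffic still matches
```

This models only source-CIDR membership; it does not evaluate ports, protocols, or the actual security-group state in the cluster.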
# Config-only change to UAT infra YAML; no code paths exercised by make qualify.
Risk Assessment
Rollout notes: Applied when the UAT clusters are next provisioned/reconciled. N/A for runtime code.
Checklist
- make test with -race)
- make lint)
- git commit -S)