docs(integrator): document Dynamo 1.2 NATS SG symptoms on EKS #1369
ff14297 to 0b698ca
…-metrics affinity wording

The EKS Dynamo networking guide already specifies the required NATS (4222) and Prometheus (9090) security-group rules, but the symptom list only mentioned 'JetStream not available'. Add the failure signature actually observed when the GPU->system SG blocks 4222 after the Dynamo 1.2 bump (which switched discovery to the NATS event plane): frontend/worker CrashLoopBackOff with 'Failed to connect to NATS: timed out', and the inference-perf validator failing (~15 min, after the workload-ready/health gates lapse) while deployment/conformance pass. Add a busybox reachability probe (with a catch-all toleration so it schedules on tainted GPU nodes).

Also refresh the stale ai-service-metrics/Prometheus scheduling notes in both eks-dynamo-networking.md and user/validation.md: the orchestrator now sets a preferred dependencyAffinity toward Prometheus (issue NVIDIA#933, resolved). Reword the 'no node-affinity / no scheduling preference' claims as best-effort fallback behavior; the 9090 SG rule remains the reliable guarantee.
0b698ca to b2bd003
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/integrator/eks-dynamo-networking.md`:
- Around line 29-38: The kubectl verification command in the shell script
references an outdated busybox image tag (1.36) that is no longer available on
Docker Hub. Update the --image parameter from busybox:1.36 to either
busybox:latest or busybox:1.38 (the current latest stable version) to ensure the
command works for users. This change ensures the probe pod can be successfully
pulled and executed when running the network reachability verification from a
GPU node.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 146b6626-4819-411b-a8f2-151af8bad53b
📒 Files selected for processing (1)
docs/integrator/eks-dynamo-networking.md
You can confirm reachability directly from a GPU node before re-running. The
toleration is required because the GPU node groups on these clusters are
tainted (`NoSchedule`/`NoExecute`); without it the probe pod stays `Pending`
and never runs:

```shell
kubectl run nats-probe --rm -i --restart=Never --image=busybox:1.36 \
  --overrides='{"spec":{"nodeSelector":{"<gpu-node-label-key>":"<value>"},"tolerations":[{"operator":"Exists"}]}}' \
  -- sh -c 'nc -zv -w 5 dynamo-platform-nats.dynamo-system.svc.cluster.local 4222'
```
🧩 Analysis chain
🏁 Script executed:
# Search for the NATS service definition in the Dynamo Helm chart or manifests
rg -n "dynamo-platform.*nats" --type yaml
# Or search for services in the target namespace
rg -n "namespace.*dynamo" --type yaml
Repository: NVIDIA/aicr
Length of output: 2108
🏁 Script executed:
# Look at the actual service definition in the assert-dynamo files
cat tests/chainsaw/ai-conformance/cluster/assert-dynamo.yaml | head -50
Repository: NVIDIA/aicr
Length of output: 1796
🏁 Script executed:
# Check for port 4222 references in NATS service definitions
rg -n "4222|nats.*port" --type yaml -B 2 -A 2
Repository: NVIDIA/aicr
Length of output: 37
🏁 Script executed:
# Verify busybox image tag availability and appropriateness
web_search
busybox 1.36 alpine docker image latest version 2024
Repository: NVIDIA/aicr
Length of output: 152
🏁 Script executed:
# Search for the Service definition that exposes the NATS port
rg -n "kind.*Service|dynamo-platform-nats" --type yaml -A 5 | grep -A 10 "Service"
# Also search for port 4222 in manifests
rg -n "port.*4222|4222.*port" --type yaml
Repository: NVIDIA/aicr
Length of output: 14824
🏁 Script executed:
# Search for NATS configuration in Helm charts or values files
fd -e yaml -e yml | xargs grep -l "nats" | head -5
Repository: NVIDIA/aicr
Length of output: 252
🌐 Web query:
busybox 1.36 docker image official latest 2024 2025
💡 Result:
The official Docker image for BusyBox is maintained by the Docker Community under the "Docker Official Images" program [1][2]. As of June 15, 2026, the official busybox repository on Docker Hub does not include specific tags for version 1.36 [3][4][5]. The Docker official busybox image tags generally track newer releases. The repository currently supports versions including 1.37 and 1.38 (along with various glibc, musl, and uclibc variants) [6][5]. Version 1.36, which was released by the upstream BusyBox project on January 3, 2023 (with a 1.36.1 stable release on May 19, 2023) [7][8][9], has been superseded in the official Docker image library by these later versions [6][5]. If you require version 1.36 for a specific project, you would need to use a historical image (if available in your local registry) or build a custom Docker image using the BusyBox 1.36 source code, which remains available from the official BusyBox project website [7][8]. The "latest" tag in the official busybox repository currently points to the most recent supported version, which is 1.38.0 [3][5].
Citations:
- 1: https://github.com/docker-library/busybox
- 2: https://hub.docker.com/_/busybox?xk=ShowRecommendedBadge&xt=Enabled
- 3: https://hub.docker.com/_/busybox?tab=tags
- 4: https://hub.docker.com/_/busybox/tags
- 5: https://github.com/docker-library/repo-info/blob/master/repos/busybox/tag-details.md
- 6: https://github.com/docker-library/official-images/blob/master/library/busybox
- 7: https://launchpad.net/busybox/+download
- 8: https://busybox.net/
- 9: https://www.phoronix.com/news/BusyBox-1.36-Released
🏁 Script executed:
# Check the Dynamo platform values for NATS configuration and port
cat recipes/components/dynamo-platform/values.yaml | grep -A 10 -B 2 "nats\|4222\|port" | head -60
Repository: NVIDIA/aicr
Length of output: 928
🏁 Script executed:
# Also check the registry entry for dynamo-platform
grep -A 20 "dynamo-platform:" recipes/registry.yaml
Repository: NVIDIA/aicr
Length of output: 37
🏁 Script executed:
# Search for NATS Helm chart files in the repo to find port configuration
fd -type f -name "*.yaml" -o -name "*.yml" | xargs grep -l "nats.*service\|service.*nats" 2>/dev/null | head -5
Repository: NVIDIA/aicr
Length of output: 226
🏁 Script executed:
# Look for NATS port configuration more broadly
rg -n "nats.*4222|4222.*nats" --type yaml
Repository: NVIDIA/aicr
Length of output: 37
🏁 Script executed:
# Check if there's any documentation about the port in the file itself
cat docs/integrator/eks-dynamo-networking.md | head -100
Repository: NVIDIA/aicr
Length of output: 5003
Update the busybox image tag from 1.36 to a currently available version.
The kubectl verification command references busybox:1.36, which is no longer available on Docker Hub (as of June 2026). Version 1.36 was released in January 2023 and has been superseded; the official Docker repository currently provides 1.37, 1.38, and later versions, with 1.38.0 as the latest.
Change --image=busybox:1.36 to --image=busybox:latest or --image=busybox:1.38 to ensure the command works for users.
The service name dynamo-platform-nats and namespace dynamo-system are correct per the codebase.
♻️ Duplicate comments (1)
docs/integrator/eks-dynamo-networking.md (1)
35-35: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Update busybox image tag to a currently available version.
The kubectl verification command on line 35 references `busybox:1.36`, which is no longer available on Docker Hub. Version 1.36 was released in January 2023 and has been superseded; the official Docker repository currently provides 1.37, 1.38.0, and later versions.
Change `--image=busybox:1.36` to `--image=busybox:latest` (to use the current stable version) or pin to `--image=busybox:1.38` for reproducibility. This ensures the probe pod can be successfully pulled and executed when users run the network reachability verification from a GPU node.
🔧 Proposed fix
-kubectl run nats-probe --rm -i --restart=Never --image=busybox:1.36 \
+kubectl run nats-probe --rm -i --restart=Never --image=busybox:latest \
Or pin to a stable version:
-kubectl run nats-probe --rm -i --restart=Never --image=busybox:1.36 \
+kubectl run nats-probe --rm -i --restart=Never --image=busybox:1.38 \
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@docs/integrator/eks-dynamo-networking.md` at line 35, Update the busybox image tag in the kubectl run command for the nats-probe pod verification. Change the `--image=busybox:1.36` parameter to either `--image=busybox:latest` for the current stable version or `--image=busybox:1.38` for reproducibility. Version 1.36 is no longer available on Docker Hub and will cause image pull failures when users execute this network reachability verification command.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d4da69bc-9661-44e9-9567-21fb6504bf2e
📒 Files selected for processing (2)
docs/integrator/eks-dynamo-networking.md
docs/user/validation.md
```shell
kubectl run nats-probe --rm -i --restart=Never --image=busybox:1.36 \
  --overrides='{"spec":{"nodeSelector":{"<gpu-node-label-key>":"<value>"},"tolerations":[{"operator":"Exists"}]}}' \
  -- sh -c 'nc -zv -w 5 dynamo-platform-nats.dynamo-system.svc.cluster.local 4222'
```
could be part of the automated validation, wdyt?
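A minimal sketch of what an automated version of this probe could look like, under stated assumptions. The function names (`probe_overrides`, `nats_probe`) and the `KUBECTL`, `PROBE_IMAGE`, `NATS_HOST`, and `NATS_PORT` variables are hypothetical and not part of the existing validator; the service name, port, image tag, and toleration come from the documented command. It also assumes `kubectl run` with `-i --restart=Never` returns the probe container's exit status, which is worth confirming on the kubectl version in use.

```shell
#!/bin/sh
# Hypothetical pre-flight wrapper around the documented NATS reachability probe.
# Every variable below is an illustrative knob, not an existing validator setting.
KUBECTL="${KUBECTL:-kubectl}"
PROBE_IMAGE="${PROBE_IMAGE:-busybox:1.36}"
NATS_HOST="${NATS_HOST:-dynamo-platform-nats.dynamo-system.svc.cluster.local}"
NATS_PORT="${NATS_PORT:-4222}"

# Build the pod-spec override: pin the pod to a GPU node label ($1=$2)
# and tolerate every taint so it can schedule on tainted GPU node groups.
probe_overrides() {
  printf '{"spec":{"nodeSelector":{"%s":"%s"},"tolerations":[{"operator":"Exists"}]}}' "$1" "$2"
}

# Run the probe pod from a GPU node.
# Returns 0 when the NATS port answers, 1 otherwise.
nats_probe() {
  if $KUBECTL run nats-probe --rm -i --restart=Never --image="$PROBE_IMAGE" \
      --overrides="$(probe_overrides "$1" "$2")" \
      -- sh -c "nc -zv -w 5 $NATS_HOST $NATS_PORT"; then
    echo "NATS reachable at ${NATS_HOST}:${NATS_PORT}"
  else
    echo "NATS not reachable at ${NATS_HOST}:${NATS_PORT}; check the GPU->system SG rule for ${NATS_PORT}" >&2
    return 1
  fi
}
```

A caller would invoke `nats_probe "<gpu-node-label-key>" "<value>"` before the workload-ready gates and fail fast on a non-zero status, rather than waiting for the `inference-perf` timeout.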
Summary
Extend `docs/integrator/eks-dynamo-networking.md` with the failure signature actually observed when the GPU→system security group blocks NATS `4222` after the Dynamo 1.2 bump. The required SG rules (4222 + 9090) were already documented; this adds the symptoms and a quick reachability probe so the failure is recognizable.

Motivation / Context
Dynamo 1.2 (#1308) switched discovery to the NATS event plane (TCP 4222). On DGXC EKS clusters whose system-node SG only allowlisted the pre-1.2 control-plane ports (e.g. 27017/6379/8080), 4222 is closed, so a bundle that worked on Dynamo 1.0.x starts failing purely from the version bump. The crash presents as a workload bug (CrashLoopBackOff) rather than a network/SG issue, which cost real debugging time. Reproduced on GB200 (yljtrxpmzu) and H100 (aicr3) clusters; the same rc2 bundle runs fine on GKE (flat network).
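As a rough illustration of the missing rule, the sketch below prints (without running) the AWS CLI calls that would open the two ports. It assumes both rules are ingress rules on the system-node security group with the GPU-node security group as the source; the function name `sg_rule_commands` and the security-group IDs passed to it are placeholders, and the guide itself remains the reference for the exact rules.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the aws CLI commands for the two
# GPU->system ingress rules (NATS 4222, Prometheus 9090) instead of running them.
#   $1 = system-node security group ID (placeholder)
#   $2 = GPU-node security group ID (placeholder)
sg_rule_commands() {
  system_sg="$1"
  gpu_sg="$2"
  for port in 4222 9090; do
    printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port %s --source-group %s\n' \
      "$system_sg" "$port" "$gpu_sg"
  done
}
```

Printing the commands first keeps the change reviewable: `sg_rule_commands <system-sg-id> <gpu-sg-id>` emits one line per port, which can be inspected before anything is applied to the cluster's security groups.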
Related: #1308
Type of Change
Components Affected
docs/)

Implementation Notes
- `Failed to connect to NATS: timed out`, worker `CrashLoopBackOff`, startup-probe `connection refused`, and an `inference-perf` timeout while deployment/conformance pass.
- `nc` reachability probe from a GPU node.

Testing
Doc-only change. Verified Markdown renders; no headings renamed/removed (no anchor-link impact). Full `make qualify` not run: tests/e2e/Go-lint cannot regress from a docs-only edit; CI lychee covers link checks.

Risk Assessment
Minimal: additive documentation only.

Checklist