
docs(integrator): document Dynamo 1.2 NATS SG symptoms on EKS #1369

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:docs/eks-dynamo-nats-symptoms on Jun 15, 2026

Conversation

@yuanchen8911

Contributor

Summary

Extend docs/integrator/eks-dynamo-networking.md with the failure signature actually observed when the GPU→system security group blocks NATS 4222 after the Dynamo 1.2 bump. The required SG rules (4222 + 9090) were already documented; this adds the symptoms and a quick reachability probe so the failure is recognizable.

Motivation / Context

Dynamo 1.2 (#1308) switched discovery to the NATS event plane (TCP 4222). On DGXC EKS clusters whose system-node SG only allowlisted the pre-1.2 control-plane ports (e.g. 27017/6379/8080), 4222 is closed, so a bundle that worked on Dynamo 1.0.x starts failing purely from the version bump. The crash presents as a workload bug (CrashLoopBackOff) rather than a network/SG issue, which cost real debugging time. Reproduced on GB200 (yljtrxpmzu) and H100 (aicr3) clusters; the same rc2 bundle runs fine on GKE (flat network).
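For clusters where the rule is missing, the fix amounts to one ingress rule on the system-node SG. The sketch below only prints the AWS CLI call so it can be reviewed first; the helper name `print_nats_sg_rule_cmd` and both security-group IDs are hypothetical placeholders, and it assumes the SG is managed by hand rather than through infrastructure-as-code.

```shell
# Hypothetical helper: prints (does not run) the AWS CLI call that would allow
# TCP 4222 from the GPU-node SG into the system-node SG hosting NATS.
#   $1 = system-node security group ID (placeholder, supplied by the caller)
#   $2 = GPU-node security group ID (placeholder, supplied by the caller)
print_nats_sg_rule_cmd() {
  printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port 4222 --source-group %s\n' "$1" "$2"
}

# Example with made-up IDs; inspect the output before running it yourself.
print_nats_sg_rule_cmd sg-0123system sg-0456gpu
```

Printing instead of executing is deliberate: on clusters where the SG is owned by Terraform or a similar tool, the rule belongs in that configuration rather than in an ad hoc CLI call.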

Related: #1308

Type of Change

  • Documentation update

Components Affected

  • Docs/examples (docs/)

Implementation Notes

  • Adds a note that the 4222 requirement is new as of the Dynamo 1.2 NATS switch.
  • Expands the symptom list: frontend `Failed to connect to NATS: timed out`, worker `CrashLoopBackOff`, startup-probe `connection refused`, and an inference-perf timeout while deployment/conformance pass.
  • Adds a one-line busybox nc reachability probe from a GPU node.
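Because the crash looks like a workload bug, a quick log check can help tell the two apart. The following is a minimal sketch, assuming the frontend log contains the exact string quoted above; the function name `classify_nats_log` and its output labels are hypothetical.

```shell
# Hypothetical helper: reads pod log text on stdin and prints a one-word verdict.
# It matches only the log line quoted in the symptom list above.
classify_nats_log() {
  if grep -q 'Failed to connect to NATS: timed out'; then
    echo "nats-unreachable"
  else
    echo "no-nats-timeout"
  fi
}

# Intended usage (pod name and namespace are placeholders):
#   kubectl logs <frontend-pod> -n <namespace> | classify_nats_log
```

A `nats-unreachable` verdict points toward the SG rule rather than the workload; a `no-nats-timeout` verdict does not rule out a network problem, it only means this particular signature is absent.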

Testing

Doc-only change. Verified Markdown renders; no headings renamed/removed (no anchor-link impact). Full make qualify not run — tests/e2e/Go-lint cannot regress from a docs-only edit; CI lychee covers link checks.

Risk Assessment

Minimal — additive documentation only.

Checklist

  • Docs updated in the same PR as the behavior they describe
  • No code changes

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner June 15, 2026 18:38
@coderabbitai

coderabbitai Bot commented Jun 15, 2026


📝 Walkthrough


The EKS Dynamo networking prerequisites document (docs/integrator/eks-dynamo-networking.md) adds 37 net lines documenting Dynamo 1.2 NATS requirements and verification procedures. New content includes:

  • A Dynamo 1.2–specific explanation that NATS port 4222 must be allowlisted because KV-cache events now route through the NATS-backed event plane, while DYN_REQUEST_PLANE inference traffic remains on TCP.
  • An expanded "typical symptoms" section detailing crash loops, probe failures, and performance-validator timeouts when NATS is unreachable.
  • A concrete kubectl run BusyBox verification command using nc to test TCP connectivity from GPU nodes to the dynamo-platform NATS service on port 4222.
  • Revised Prometheus conformance-validator prerequisites documenting the ai-service-metrics dial requirement, orchestrator Job tolerations, preferred dependencyAffinity placement, and the trade-off that fallback placement remains non-deterministic while the 9090 security group ingress rule provides the reliable cluster-side guarantee.

Updated security group rules guidance clarifies that TCP 9090 must be reachable from any worker node group capable of hosting the orchestrator, including cross-subnet scenarios. The validation troubleshooting guide (docs/user/validation.md) is updated to reflect the new preferred affinity behavior and emphasize that security group ingress rules remain the reliable guarantee.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'docs(integrator): document Dynamo 1.2 NATS SG symptoms on EKS' directly captures the main objective of the PR: documenting failure symptoms for NATS security group issues in Dynamo 1.2 on EKS. |
| Description check | ✅ Passed | The description comprehensively explains the context, motivation, and specific changes made to the documentation, including the Dynamo 1.2 NATS switch, failure symptoms, and diagnostic probe. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@yuanchen8911 yuanchen8911 marked this pull request as draft June 15, 2026 18:51
@yuanchen8911 yuanchen8911 force-pushed the docs/eks-dynamo-nats-symptoms branch from ff14297 to 0b698ca Compare June 15, 2026 18:53
…-metrics affinity wording

The EKS Dynamo networking guide already specifies the required NATS (4222)
and Prometheus (9090) security-group rules, but the symptom list only
mentioned 'JetStream not available'. Add the failure signature actually
observed when the GPU->system SG blocks 4222 after the Dynamo 1.2 bump
(which switched discovery to the NATS event plane): frontend/worker
CrashLoopBackOff with 'Failed to connect to NATS: timed out', and the
inference-perf validator failing (~15 min, after the workload-ready/health
gates lapse) while deployment/conformance pass. Add a busybox reachability
probe (with a catch-all toleration so it schedules on tainted GPU nodes).

Also refresh the stale ai-service-metrics/Prometheus scheduling notes in both
eks-dynamo-networking.md and user/validation.md: the orchestrator now sets a
preferred dependencyAffinity toward Prometheus (issue NVIDIA#933, resolved). Reword
the 'no node-affinity / no scheduling preference' claims as best-effort
fallback behavior; the 9090 SG rule remains the reliable guarantee.
@yuanchen8911 yuanchen8911 force-pushed the docs/eks-dynamo-nats-symptoms branch from 0b698ca to b2bd003 Compare June 15, 2026 18:57

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/integrator/eks-dynamo-networking.md`:
- Around line 29-38: The kubectl verification command in the shell script
references an outdated busybox image tag (1.36) that is no longer available on
Docker Hub. Update the --image parameter from busybox:1.36 to either
busybox:latest or busybox:1.38 (the current latest stable version) to ensure the
command works for users. This change ensures the probe pod can be successfully
pulled and executed when running the network reachability verification from a
GPU node.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 146b6626-4819-411b-a8f2-151af8bad53b

📥 Commits

Reviewing files that changed from the base of the PR and between ff14297 and 0b698ca.

📒 Files selected for processing (1)
  • docs/integrator/eks-dynamo-networking.md

Comment on lines +29 to +38
You can confirm reachability directly from a GPU node before re-running. The
toleration is required because the GPU node groups on these clusters are
tainted (`NoSchedule`/`NoExecute`); without it the probe pod stays `Pending`
and never runs:

```shell
kubectl run nats-probe --rm -i --restart=Never --image=busybox:1.36 \
--overrides='{"spec":{"nodeSelector":{"<gpu-node-label-key>":"<value>"},"tolerations":[{"operator":"Exists"}]}}' \
-- sh -c 'nc -zv -w 5 dynamo-platform-nats.dynamo-system.svc.cluster.local 4222'
```
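The `--overrides` JSON in the command above is easy to mistype when substituting the placeholders by hand. As a minimal sketch, a small helper can build it from a label key and value; the function name `nats_probe_overrides` is hypothetical, and the label shown in the example is only an illustration, not necessarily the one these clusters use.

```shell
# Hypothetical helper: builds the --overrides JSON for the probe pod.
#   $1 = GPU node label key, $2 = label value (both supplied by the caller)
# The catch-all toleration matches the one in the command above.
nats_probe_overrides() {
  printf '{"spec":{"nodeSelector":{"%s":"%s"},"tolerations":[{"operator":"Exists"}]}}' "$1" "$2"
}

# Illustrative label only; substitute whatever label selects your GPU nodes.
nats_probe_overrides node.kubernetes.io/instance-type p5.48xlarge

# Intended usage with the probe command:
#   kubectl run nats-probe --rm -i --restart=Never --image=busybox:1.36 \
#     --overrides="$(nats_probe_overrides <gpu-node-label-key> <value>)" \
#     -- sh -c 'nc -zv -w 5 dynamo-platform-nats.dynamo-system.svc.cluster.local 4222'
```

The helper does no JSON escaping, so it assumes label keys and values contain no double quotes or backslashes, which holds for valid Kubernetes labels.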


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Search for the NATS service definition in the Dynamo Helm chart or manifests
rg -n "dynamo-platform.*nats" --type yaml

# Or search for services in the target namespace
rg -n "namespace.*dynamo" --type yaml

Repository: NVIDIA/aicr

Length of output: 2108


🏁 Script executed:

# Look at the actual service definition in the assert-dynamo files
cat tests/chainsaw/ai-conformance/cluster/assert-dynamo.yaml | head -50

Repository: NVIDIA/aicr

Length of output: 1796


🏁 Script executed:

# Check for port 4222 references in NATS service definitions
rg -n "4222|nats.*port" --type yaml -B 2 -A 2

Repository: NVIDIA/aicr

Length of output: 37


🏁 Script executed:

# Verify busybox image tag availability and appropriateness
web_search
busybox 1.36 alpine docker image latest version 2024

Repository: NVIDIA/aicr

Length of output: 152


🏁 Script executed:

# Search for the Service definition that exposes the NATS port
rg -n "kind.*Service|dynamo-platform-nats" --type yaml -A 5 | grep -A 10 "Service"

# Also search for port 4222 in manifests
rg -n "port.*4222|4222.*port" --type yaml

Repository: NVIDIA/aicr

Length of output: 14824


🏁 Script executed:

# Search for NATS configuration in Helm charts or values files
fd -e yaml -e yml | xargs grep -l "nats" | head -5

Repository: NVIDIA/aicr

Length of output: 252


🌐 Web query:

busybox 1.36 docker image official latest 2024 2025

💡 Result:

The official Docker image for BusyBox is maintained by the Docker Community under the "Docker Official Images" program [1][2]. As of June 15, 2026, the official busybox repository on Docker Hub does not include specific tags for version 1.36 [3][4][5]. The Docker official busybox image tags generally track newer releases. The repository currently supports versions including 1.37 and 1.38 (along with various glibc, musl, and uclibc variants) [6][5]. Version 1.36, which was released by the upstream BusyBox project on January 3, 2023 (with a 1.36.1 stable release on May 19, 2023) [7][8][9], has been superseded in the official Docker image library by these later versions [6][5]. If you require version 1.36 for a specific project, you would need to use a historical image (if available in your local registry) or build a custom Docker image using the BusyBox 1.36 source code, which remains available from the official BusyBox project website [7][8]. The "latest" tag in the official busybox repository currently points to the most recent supported version, which is 1.38.0 [3][5].



🏁 Script executed:

# Check the Dynamo platform values for NATS configuration and port
cat recipes/components/dynamo-platform/values.yaml | grep -A 10 -B 2 "nats\|4222\|port" | head -60

Repository: NVIDIA/aicr

Length of output: 928


🏁 Script executed:

# Also check the registry entry for dynamo-platform
grep -A 20 "dynamo-platform:" recipes/registry.yaml

Repository: NVIDIA/aicr

Length of output: 37


🏁 Script executed:

# Search for NATS Helm chart files in the repo to find port configuration
fd -type f -name "*.yaml" -o -name "*.yml" | xargs grep -l "nats.*service\|service.*nats" 2>/dev/null | head -5

Repository: NVIDIA/aicr

Length of output: 226


🏁 Script executed:

# Look for NATS port configuration more broadly
rg -n "nats.*4222|4222.*nats" --type yaml

Repository: NVIDIA/aicr

Length of output: 37


🏁 Script executed:

# Check if there's any documentation about the port in the file itself
cat docs/integrator/eks-dynamo-networking.md | head -100

Repository: NVIDIA/aicr

Length of output: 5003


Update the busybox image tag from 1.36 to a currently available version.

The kubectl verification command references busybox:1.36, which is no longer available on Docker Hub (as of June 2026). Version 1.36 was released in January 2023 and has been superseded; the official Docker repository currently provides 1.37 and 1.38, with 1.38.0 as the latest.

Change --image=busybox:1.36 to --image=busybox:latest or --image=busybox:1.38 to ensure the command works for users.

The service name dynamo-platform-nats and namespace dynamo-system are correct per the codebase.


@github-actions github-actions Bot added size/M and removed size/S labels Jun 15, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 15, 2026 18:58
@mchmarny mchmarny merged commit 180a596 into NVIDIA:main Jun 15, 2026
29 of 30 checks passed

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
docs/integrator/eks-dynamo-networking.md (1)

35-35: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update busybox image tag to a currently available version.

The kubectl verification command on line 35 references busybox:1.36, which is no longer available on Docker Hub. Version 1.36 was released in January 2023 and has been superseded; the official Docker repository currently provides 1.37, 1.38.0, and later versions.

Change --image=busybox:1.36 to --image=busybox:latest (to use the current stable version) or pin to --image=busybox:1.38 for reproducibility. This ensures the probe pod can be successfully pulled and executed when users run the network reachability verification from a GPU node.

🔧 Proposed fix
- kubectl run nats-probe --rm -i --restart=Never --image=busybox:1.36 \
+ kubectl run nats-probe --rm -i --restart=Never --image=busybox:latest \

Or pin to a stable version:

- kubectl run nats-probe --rm -i --restart=Never --image=busybox:1.36 \
+ kubectl run nats-probe --rm -i --restart=Never --image=busybox:1.38 \
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/integrator/eks-dynamo-networking.md` at line 35, Update the busybox
image tag in the kubectl run command for the nats-probe pod verification. Change
the `--image=busybox:1.36` parameter to either `--image=busybox:latest` for the
current stable version or `--image=busybox:1.38` for reproducibility. Version
1.36 is no longer available on Docker Hub and will cause image pull failures
when users execute this network reachability verification command.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d4da69bc-9661-44e9-9567-21fb6504bf2e

📥 Commits

Reviewing files that changed from the base of the PR and between 0b698ca and b2bd003.

📒 Files selected for processing (2)
  • docs/integrator/eks-dynamo-networking.md
  • docs/user/validation.md

```shell
kubectl run nats-probe --rm -i --restart=Never --image=busybox:1.36 \
--overrides='{"spec":{"nodeSelector":{"<gpu-node-label-key>":"<value>"},"tolerations":[{"operator":"Exists"}]}}' \
-- sh -c 'nc -zv -w 5 dynamo-platform-nats.dynamo-system.svc.cluster.local 4222'
```

Contributor


could be part of the automated validation, wdyt?
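One possible shape for such a check, as a rough sketch only: wrap the probe in a function that returns a pass/fail exit code and a hint pointing at the SG rule. The function name `check_nats_reachable` and the message wording are hypothetical, and nothing here reflects how the project's validators are actually structured. The default probe reuses the `nc` invocation from the doc and would need to run from a pod on a GPU node.

```shell
# Hypothetical check: check_nats_reachable HOST PORT [PROBE_CMD...]
# With no PROBE_CMD, falls back to the nc probe used in the doc (minus -v).
# Passing a custom probe command makes the function testable without a cluster.
check_nats_reachable() {
  local host="$1" port="$2"
  shift 2
  if [ "$#" -eq 0 ]; then
    set -- nc -z -w 5 "$host" "$port"
  fi
  if "$@" >/dev/null 2>&1; then
    echo "PASS: ${host}:${port} reachable"
  else
    echo "FAIL: ${host}:${port} unreachable (check the GPU->system SG rule for TCP ${port})"
    return 1
  fi
}

# Intended usage from a pod scheduled on a GPU node:
#   check_nats_reachable dynamo-platform-nats.dynamo-system.svc.cluster.local 4222
```

Failing early with an SG hint would surface the problem before the inference-perf validator times out.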
