-
Notifications
You must be signed in to change notification settings - Fork 219
docs: Adding k8 guide #1764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
docs: Adding k8 guide #1764
Conversation
Signed-off-by: vinhn <vinhn@nvidia.com>
📝 WalkthroughWalkthroughDocumentation update that replaces a Kubernetes section placeholder with a comprehensive guide covering NemoRL training job deployment on Kubernetes using Ray with NVIDIA GPUs, including setup, configuration, and operational procedures. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @docs/cluster.md:
- Line 331: The manifest's image field is hardcoded to
nvcr.io/nvidian/nemo-rl:latest which will mismatch the build tag
nvcr.io/${NGC_ORG}/nemo-rl:latest and cause ImagePullBackOff; update the image
entry to use the same placeholder used during build (e.g.,
nvcr.io/${NGC_ORG}/nemo-rl:latest or a clear placeholder like
<YOUR_NGC_ORG>/nemo-rl:latest) and add a short IMPORTANT note above the YAML
telling users to replace <YOUR_NGC_ORG> (or set NGC_ORG) prior to applying, or
alternatively document using envsubst for variable substitution so the
deployment image matches the built image.
- Line 325: YAML contains markdown-style resource names like
"[nvidia.com/gpu](https://nvidia.com/gpu)" which is invalid; replace each
occurrence with the plain resource name nvidia.com/gpu (e.g., change the key
from "[nvidia.com/gpu](https://nvidia.com/gpu)" to "nvidia.com/gpu") in the
cluster and worker specs and apply the same replacement for all other
occurrences of the markdown link form in the file.
- Line 306: The cluster config pins rayVersion: '2.49.2', which has a critical
CVE (ShadowRay); update the version string (rayVersion) to '2.52.0' or later in
the cluster configuration and any other places that reference rayVersion (e.g.,
docs/cluster.md and matched entries in pyproject.toml or deployment manifests),
or alternatively add notes/instructions to enforce strict network/API access
controls for the jobs/dashboard if you cannot upgrade; ensure consistency across
all files referencing the rayVersion symbol.
🧹 Nitpick comments (2)
docs/cluster.md (2)
346-346: Network interface configuration requires validation.The
bond0interface (lines 346, 412) is not universal across all Kubernetes clusters. While line 278 mentions checking with the admin, users might miss this note and experience NCCL communication failures that are difficult to debug.💡 Add more prominent configuration note
Consider adding a more visible warning directly in the YAML comments:
env: - name: NVIDIA_VISIBLE_DEVICES value: "all" + # IMPORTANT: Verify the correct network interface with your cluster admin + # Common values: bond0, eth0, ib0 (for InfiniBand) + # Run 'ip addr' or 'ifconfig' on a node to identify available interfaces - name: NCCL_SOCKET_IFNAME value: bond0 - name: NCCL_SHM_DISABLEAlso applies to: 412-412
203-203: Consider using proper heading for better document structure.Line 203 uses bold emphasis for "Login to the Registry" which could be a proper heading (e.g.,
#### Login to the Registry) for better document structure and navigation.📝 Convert to proper heading
### 2. Build and Push the Docker Container We will use the NVIDIA cloud registry (`nvcr.io`) for this guide. From your client machine: -**Login to the Registry** +#### Login to the Registry ```bash # Set up Docker and nvcr.io with your NGC_API_KEY docker login nvcr.ioApply similar changes to "Build and Push" on line 212 if desired.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/cluster.md
🧰 Additional context used
📓 Path-based instructions (2)
docs/**/*.md
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section
Files:
docs/cluster.md
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
docs/cluster.md
🪛 LanguageTool
docs/cluster.md
[style] ~235-~235: ‘exact same’ might be wordy. Consider a shorter alternative.
Context: ... the Head node and Worker nodes see the exact same files (code, data, checkpoints). This p...
(EN_WORDINESS_PREMIUM_EXACT_SAME)
🪛 markdownlint-cli2 (0.18.1)
docs/cluster.md
203-203: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (2)
docs/cluster.md (2)
517-560: Helpful utility for PVC debugging.The busybox helper pod is a practical addition that allows users to inspect and manage PVC contents without spinning up expensive GPU pods. The implementation is clean and the usage examples are clear.
1-3: Thedocs/index.mdfile has been properly updated with the cluster setup guide. The newdocs/cluster.mddocument is referenced in the "Environment Start" section of the index and appears in the grid card for cluster setup under "Training and Generation," satisfying the coding guideline requirement.
Signed-off-by: vinhn <vinhn@nvidia.com>
Signed-off-by: vinhn <vinhn@nvidia.com>
What does this PR do ?
Adding k8 setup and job execution guide
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.