@vinhngx vinhngx commented Jan 13, 2026

What does this PR do?

Adds a Kubernetes (k8s) setup and job execution guide.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Documentation
    • Added comprehensive Kubernetes deployment guide including prerequisites, container management, shared storage setup, Ray cluster configuration with detailed YAML examples, job submission procedures, and monitoring for NVIDIA GPU-accelerated NemoRL training.

Signed-off-by: vinhn <vinhn@nvidia.com>
@vinhngx vinhngx requested a review from a team as a code owner January 13, 2026 05:32
@github-actions github-actions bot added the documentation label (Improvements or additions to documentation) Jan 13, 2026
coderabbitai bot commented Jan 13, 2026

Walkthrough

Documentation update that replaces a Kubernetes section placeholder with a comprehensive guide covering NemoRL training job deployment on Kubernetes using Ray with NVIDIA GPUs, including setup, configuration, and operational procedures.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation: docs/cluster.md | Replaced TBD Kubernetes section with detailed migration guide including prerequisites, cluster setup phases (storage, Ray cluster, workload deployment), YAML configurations, management commands, and monitoring procedures |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

documentation

Suggested reviewers

  • lbliii
  • terrykong
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Test Results For Major Changes | ✅ Passed | Pull request contains documentation-only changes adding a Kubernetes deployment guide, which does not affect functionality or require testing. |
| Title check | ✅ Passed | The title 'docs: Adding k8 guide' accurately describes the main change: adding Kubernetes documentation to the repository. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @docs/cluster.md:
- Line 331: The manifest's image field is hardcoded to
nvcr.io/nvidian/nemo-rl:latest, which does not match the build tag
nvcr.io/${NGC_ORG}/nemo-rl:latest and can cause ImagePullBackOff. Update the
image entry to use the same placeholder used during the build (e.g.,
nvcr.io/${NGC_ORG}/nemo-rl:latest or a clear placeholder like
nvcr.io/<YOUR_NGC_ORG>/nemo-rl:latest) and add a short IMPORTANT note above the
YAML telling users to replace <YOUR_NGC_ORG> (or set NGC_ORG) before applying.
Alternatively, document using envsubst for variable substitution so the
deployment image matches the built image.
- Line 325: YAML contains markdown-style resource names like
"[nvidia.com/gpu](https://nvidia.com/gpu)" which is invalid; replace each
occurrence with the plain resource name nvidia.com/gpu (e.g., change the key
from "[nvidia.com/gpu](https://nvidia.com/gpu)" to "nvidia.com/gpu") in the
cluster and worker specs and apply the same replacement for all other
occurrences of the markdown link form in the file.
- Line 306: The cluster config pins rayVersion: '2.49.2', which has a critical
CVE (ShadowRay); update the version string (rayVersion) to '2.52.0' or later in
the cluster configuration and any other places that reference rayVersion (e.g.,
docs/cluster.md and matched entries in pyproject.toml or deployment manifests),
or alternatively add notes/instructions to enforce strict network/API access
controls for the jobs/dashboard if you cannot upgrade; ensure consistency across
all files referencing the rayVersion symbol.
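
The first two fixes above (plain resource names and a substituted image org) could be sketched as a small pre-apply step. This is only an illustration under stated assumptions: the file names, the stand-in manifest fragment, and the `my-org` default are hypothetical, and `sed` is used here in place of the `envsubst` approach mentioned in the comment so the sketch has no extra dependencies.

```shell
#!/bin/sh
# Sketch: clean up a manifest template before applying it.
# File names and the default org below are hypothetical.
set -eu

NGC_ORG="${NGC_ORG:-my-org}"   # assumption: replace with your real NGC org

workdir="$(mktemp -d)"
template="$workdir/ray-cluster.template.yaml"
rendered="$workdir/ray-cluster.yaml"

# A tiny stand-in for the manifest fragment discussed in the review.
cat > "$template" <<'EOF'
image: nvcr.io/${NGC_ORG}/nemo-rl:latest
resources:
  limits:
    [nvidia.com/gpu](https://nvidia.com/gpu): 8
EOF

# 1. Replace markdown-style links "[text](url)" with the plain text.
# 2. Substitute the ${NGC_ORG} placeholder (sed stands in for envsubst).
sed -e 's/\[\([^]]*\)]([^)]*)/\1/g' \
    -e "s|\${NGC_ORG}|${NGC_ORG}|g" \
    "$template" > "$rendered"

cat "$rendered"
```

With cluster access, the rendered file would then be the one passed to `kubectl apply -f`, so the deployed image matches the tag that was built and pushed.
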
🧹 Nitpick comments (2)
docs/cluster.md (2)

346-346: Network interface configuration requires validation.

The bond0 interface (lines 346, 412) is not universal across all Kubernetes clusters. While line 278 mentions checking with the admin, users might miss this note and experience NCCL communication failures that are difficult to debug.

💡 Add more prominent configuration note

Consider adding a more visible warning directly in the YAML comments:

           env:
             - name: NVIDIA_VISIBLE_DEVICES
               value: "all"
+            # IMPORTANT: Verify the correct network interface with your cluster admin
+            # Common values: bond0, eth0, ib0 (for InfiniBand)
+            # Run 'ip addr' or 'ifconfig' on a node to identify available interfaces
             - name: NCCL_SOCKET_IFNAME
               value: bond0
             - name: NCCL_SHM_DISABLE

Also applies to: 412-412
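
As a rough sketch of the check suggested in the YAML comment, the interface name could be verified on a Linux node before it is written into `NCCL_SOCKET_IFNAME`. The helper names `list_ifaces` and `has_iface` and the optional directory argument are hypothetical; the sketch reads `/sys/class/net`, which holds one entry per interface on Linux, instead of parsing `ip addr` output.

```shell
# Sketch: confirm that the interface intended for NCCL exists on this node.
# Helper names and the optional directory argument are hypothetical.

list_ifaces() {
    # Print one interface name per line from the given directory
    # (defaults to /sys/class/net on Linux).
    net_dir="${1:-/sys/class/net}"
    for entry in "$net_dir"/*; do
        [ -e "$entry" ] && basename "$entry"
    done
    return 0
}

has_iface() {
    # Succeeds when interface $1 appears in the listing for directory $2.
    list_ifaces "${2:-/sys/class/net}" | grep -qxF -- "$1"
}

wanted="${NCCL_SOCKET_IFNAME:-bond0}"
if has_iface "$wanted"; then
    echo "interface $wanted found"
else
    echo "interface $wanted not found; available interfaces:"
    list_ifaces
fi
```

Running this inside a pod on each node type (head and worker) would show whether `bond0` is actually present there, or whether another name such as `eth0` or `ib0` should be used.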


203-203: Consider using proper heading for better document structure.

Line 203 uses bold emphasis for "Login to the Registry" which could be a proper heading (e.g., #### Login to the Registry) for better document structure and navigation.

📝 Convert to proper heading
 ### 2. Build and Push the Docker Container
 We will use the NVIDIA cloud registry (`nvcr.io`) for this guide. From your client machine:
 
-**Login to the Registry**
+#### Login to the Registry
 ```bash
 # Set up Docker and nvcr.io with your NGC_API_KEY
 docker login nvcr.io

Apply similar changes to "Build and Push" on line 212 if desired.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2e4265 and 8d0564d.

📒 Files selected for processing (1)
  • docs/cluster.md
🧰 Additional context used
📓 Path-based instructions (2)
docs/**/*.md

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section

Files:

  • docs/cluster.md
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • docs/cluster.md
🪛 LanguageTool
docs/cluster.md

[style] ~235-~235: ‘exact same’ might be wordy. Consider a shorter alternative.
Context: ... the Head node and Worker nodes see the exact same files (code, data, checkpoints). This p...

(EN_WORDINESS_PREMIUM_EXACT_SAME)

🪛 markdownlint-cli2 (0.18.1)
docs/cluster.md

203-203: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (2)
docs/cluster.md (2)

517-560: Helpful utility for PVC debugging.

The busybox helper pod is a practical addition that allows users to inspect and manage PVC contents without spinning up expensive GPU pods. The implementation is clean and the usage examples are clear.
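
For readers without the guide at hand, a helper pod of this kind might look roughly like the sketch below. It is not the manifest from docs/cluster.md: the pod name, the claim name `nemo-rl-pvc`, and the mount path are hypothetical placeholders, and the script only writes and prints the manifest.

```shell
# Sketch: generate a minimal busybox pod manifest that mounts a PVC.
# PVC_NAME, POD_NAME, and the mount path are hypothetical placeholders.
PVC_NAME="${PVC_NAME:-nemo-rl-pvc}"
POD_NAME="${POD_NAME:-pvc-inspector}"
manifest="$(mktemp)"

cat > "$manifest" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${POD_NAME}
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /mnt/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: ${PVC_NAME}
EOF

cat "$manifest"

# With cluster access, typical next steps would be:
#   kubectl apply -f "$manifest"
#   kubectl exec -it "$POD_NAME" -- ls /mnt/data
#   kubectl delete pod "$POD_NAME"
```

Because the pod requests no GPU resources, it can be scheduled cheaply just to list or clean up files on the shared volume.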


1-3: The docs/index.md file has been properly updated with the cluster setup guide. The new docs/cluster.md document is referenced in the "Environment Start" section of the index and appears in the grid card for cluster setup under "Training and Generation," satisfying the coding guideline requirement.

Signed-off-by: vinhn <vinhn@nvidia.com>
@vinhngx vinhngx changed the title from "[Doc] Adding k8 guide" to "docs Adding k8 guide" Jan 13, 2026
@vinhngx vinhngx changed the title from "docs Adding k8 guide" to "docs: Adding k8 guide" Jan 13, 2026
vinhngx and others added 2 commits January 13, 2026 11:35
@chtruong814 chtruong814 added the needs-follow-up label (Issue needs follow-up) Jan 15, 2026
@shashank3959 shashank3959 requested a review from lbliii January 16, 2026 21:36
Labels

community-request, documentation (Improvements or additions to documentation), needs-follow-up (Issue needs follow-up)

2 participants