WIP: feat(recipes): GB300 EKS service-bound overlays (re-file of #1319)#1336

Draft
yuanchen8911 wants to merge 2 commits into
NVIDIA:mainfrom
yuanchen8911:feat/gb300-eks-overlays-v2
Conversation

@yuanchen8911 yuanchen8911 commented Jun 12, 2026

Contributor

WIP / Draft. Re-files #1319 (reverted in #1328) so the GB300 EKS recipe work isn't lost. Do not merge until the issues below are resolved and GB300 is validated on a clean cluster.

Summary

Re-introduces the GB300 EKS service-bound overlays + accelerator plumbing from #1319 (which was reverted in #1328 because GB300 wasn't ready). This PR is the working branch to land GB300 once the open issues are fixed.

Open issues blocking this

Issues have been filed. The first is a genuine bug (P1). This work is still in progress and should land in the next release.

Status

Restores the #1319 content verbatim (overlays, gb300 accelerator enum, NCCL validator wiring, OpenAPI/CLI/docs). The fixes for the issues above will be added as follow-up commits, plus a clean-GB300-cluster validation, before this is taken out of draft.

Tracking issue: #1318 (reopened after the revert). Reverted by: #1328.

github-actions Bot commented Jun 12, 2026

Contributor

Recipe evidence check

Affected leaf overlays: 6

| Recipe | Pointer | Verify | Digest match |
|---|---|---|---|
| gb300-eks-inference | ⚠️ missing | | |
| gb300-eks-training | ⚠️ missing | | |
| gb300-eks-ubuntu-inference-dynamo | ⚠️ missing | | |
| gb300-eks-ubuntu-inference | ⚠️ missing | | |
| gb300-eks-ubuntu-training-kubeflow | ⚠️ missing | | |
| gb300-eks-ubuntu-training | ⚠️ missing | | |

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

coderabbitai Bot commented Jun 12, 2026

Walkthrough

This PR adds gb300 support end-to-end: a new exported CriteriaAcceleratorGB300 and parsing/listing, GB300 SKU matching in fingerprinting, OpenAPI/CLI/docs updates to allow gb300, seven GB300 RecipeMetadata overlays and overlay-test coverage, NCCL/performance validator changes (template aliasing, combination matrix and preflight applicability), and pinned Dynamo vLLM image tag updates in validator fixtures.
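
As a rough illustration of the "exported constant plus parsing/listing" part of the walkthrough, the sketch below shows one way an accelerator enum can accept gb300. Only the constant names `CriteriaAcceleratorGB300` and `CriteriaAcceleratorGB200` come from this PR; the type name, `knownAccelerators`, and `ParseAccelerator` are hypothetical stand-ins, not the actual `pkg/recipe` API.

```go
package main

import (
	"fmt"
	"strings"
)

// CriteriaAcceleratorType is a hypothetical stand-in for the recipe
// package's accelerator enum type.
type CriteriaAcceleratorType string

const (
	CriteriaAcceleratorH100  CriteriaAcceleratorType = "h100"
	CriteriaAcceleratorGB200 CriteriaAcceleratorType = "gb200"
	CriteriaAcceleratorGB300 CriteriaAcceleratorType = "gb300"
)

// knownAccelerators is the (assumed) list used for both parsing and listing.
var knownAccelerators = []CriteriaAcceleratorType{
	CriteriaAcceleratorH100,
	CriteriaAcceleratorGB200,
	CriteriaAcceleratorGB300,
}

// ParseAccelerator normalizes the input and returns the matching enum
// value, or an error that lists the allowed values.
func ParseAccelerator(s string) (CriteriaAcceleratorType, error) {
	v := CriteriaAcceleratorType(strings.ToLower(strings.TrimSpace(s)))
	for _, a := range knownAccelerators {
		if a == v {
			return a, nil
		}
	}
	return "", fmt.Errorf("unknown accelerator %q (allowed: %v)", s, knownAccelerators)
}

func main() {
	for _, in := range []string{"GB300", "gb200", "tpu"} {
		a, err := ParseAccelerator(in)
		fmt.Printf("input=%q value=%q err=%v\n", in, a, err)
	}
}
```

Keeping a single list for parsing and listing means a new accelerator only has to be added in one place, which is the property the OpenAPI, CLI, and docs updates in this PR rely on staying in sync with.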

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/aicr#1319: Parallel/overlapping implementation of GB300 support (types, SKU matching, validators, overlays).
  • NVIDIA/aicr#1328: Reverts/removes GB300 codepaths and overlays that overlap with the additions in this PR.
  • NVIDIA/aicr#1308: Overlaps on Dynamo/vLLM runtime image/tag updates used by validator fixtures.

Suggested labels

area/tests, theme/recipes

Suggested reviewers

  • xdu31
  • ayuskauskas
  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'WIP: feat(recipes): GB300 EKS service-bound overlays (re-file of #1319)' clearly summarizes the main change: re-introducing GB300 EKS overlays and related plumbing previously reverted. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the re-introduction of GB300 EKS overlays, open blocking issues, and reasons for WIP status. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
recipes/overlays/gb300-eks-training.yaml (1)

121-131: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

GB300/EKS still encodes DRA as the required allocation model across recipes/overlays/gb300-eks-training.yaml and recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml. The training recipe enforces dra-support, and the Dynamo leaf hard-codes nvidia-dra-driver-gpu plus the DRA-driven >= 1.34 floor. That shared root cause conflicts with the PR’s own open blockers (#1327/#1326) about GB300/EKS allocation and networking, so these overlays will validate the wrong deployment shape until the device-plugin-vs-DRA decision is resolved.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/gb300-eks-training.yaml` around lines 121 - 131, The overlay
still forces DRA by listing the conformance check "dra-support" and hard-coding
the Dynamo leaf to use "nvidia-dra-driver-gpu" with a ">= 1.34" floor; remove or
gate the "dra-support" entry from the conformance.checks array in
recipes/overlays/gb300-eks-training.yaml and remove the hard-coded
"nvidia-dra-driver-gpu" and its ">= 1.34" version constraint from
recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml (or replace both with a
feature-flag/conditional that defers to the device-plugin vs DRA decision), so
the overlays no longer validate deployments assuming DRA until the
device-plugin/DRA decision in PRs #1327/#1326 is resolved.
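
The "gate the dra-support entry" option above could look roughly like the following sketch. It is not the overlay mechanism used by this repository: `AllocationModel`, its values, and `conformanceChecks` are hypothetical names, and the other check names are placeholders. Only the check name `dra-support` comes from the review comment.

```go
package main

import "fmt"

// AllocationModel is a hypothetical setting that records whether the
// device-plugin vs DRA decision has been made for a recipe.
type AllocationModel string

const (
	AllocationUndecided    AllocationModel = ""
	AllocationDRA          AllocationModel = "dra"
	AllocationDevicePlugin AllocationModel = "device-plugin"
)

// checkDRASupport is the conformance check named in the review comment.
const checkDRASupport = "dra-support"

// conformanceChecks returns the base checks with dra-support removed,
// then re-adds it only when DRA has been explicitly chosen.
func conformanceChecks(base []string, model AllocationModel) []string {
	out := make([]string, 0, len(base)+1)
	for _, c := range base {
		if c == checkDRASupport {
			continue
		}
		out = append(out, c)
	}
	if model == AllocationDRA {
		out = append(out, checkDRASupport)
	}
	return out
}

func main() {
	// Placeholder check names; the real overlay's list is not shown here.
	base := []string{"example-check-a", checkDRASupport, "example-check-b"}
	fmt.Println(conformanceChecks(base, AllocationUndecided))
	fmt.Println(conformanceChecks(base, AllocationDRA))
}
```

With the decision left undecided, the overlay would stop asserting a DRA deployment shape without having to be rewritten again once the blockers are resolved.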
validators/performance/nccl_all_reduce_bw_constraint.go (1)

138-169: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't bake GB300 into the GB200/EFA NCCL path yet.

Lines 138-169 hard-code GB300 onto the GB200 templates and advertise EKS NET/NVLS support as if the transport/runtime shape were identical. That conflicts with the PR objective's open blocker #1326, which explicitly calls out that GB300 on AWS is RoCE rather than EFA. As written, this can route GB300 validation through the wrong runtime assets and also triggers the GB200-specific NVreg preflight downstream in validators/performance/nccl_preflight_nvreg.go. Either keep GB300 out of the EKS NET/NVLS matrix until that blocker is closed, or add GB300-specific templates/preflight once the transport is validated.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/nccl_all_reduce_bw_constraint.go` around lines 138 -
169, Remove the temporary GB300-to-GB200 alias and any advertised GB300 support
in the NCCL combinations until AWS RoCE behavior is resolved: stop mapping
accelerator == recipe.CriteriaAcceleratorGB300 to
recipe.CriteriaAcceleratorGB200 in the templatePath logic (the block that
mutates the accelerator variable before returning filepath.Join) and remove
recipe.CriteriaAcceleratorGB300 from the lists under variantNET and variantNVLS
in the supportedNCCLCombinations map so GB300 is not listed for
recipe.CriteriaServiceEKS; keep GB300 out of any GB200-specific NVReg/preflight
paths until separate GB300 templates/preflight are added.
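
A minimal sketch of the shape this comment asks for, assuming the combination matrix is a map from variant to service to accelerators. The identifiers `supportedNCCLCombinations`, `variantNET`, `variantNVLS`, and `templatePath` echo names from the review comment, but the types, the map layout, and the template directory layout below are assumptions, not the actual validator code.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Hypothetical stand-ins for the recipe criteria and NCCL variant types.
type (
	Accelerator string
	Service     string
	Variant     string
)

const (
	AcceleratorGB200 Accelerator = "gb200"
	AcceleratorGB300 Accelerator = "gb300"
	ServiceEKS       Service     = "eks"
	variantNET       Variant     = "net"
	variantNVLS      Variant     = "nvls"
)

// supportedNCCLCombinations lists GB200 only for EKS. GB300 is left out
// until its transport (RoCE vs EFA) is validated.
var supportedNCCLCombinations = map[Variant]map[Service][]Accelerator{
	variantNET:  {ServiceEKS: {AcceleratorGB200}},
	variantNVLS: {ServiceEKS: {AcceleratorGB200}},
}

// isSupported reports whether the combination appears in the matrix.
func isSupported(v Variant, s Service, a Accelerator) bool {
	for _, c := range supportedNCCLCombinations[v][s] {
		if c == a {
			return true
		}
	}
	return false
}

// templatePath resolves a runtime template without aliasing GB300 to
// GB200; unsupported combinations return an error instead.
// The directory layout is an assumption for illustration.
func templatePath(a Accelerator, s Service, v Variant) (string, error) {
	if !isSupported(v, s, a) {
		return "", fmt.Errorf("no NCCL template for accelerator=%s service=%s variant=%s", a, s, v)
	}
	return filepath.Join("testdata", string(a), string(s), "runtime-"+string(v)+".yaml"), nil
}

func main() {
	for _, a := range []Accelerator{AcceleratorGB200, AcceleratorGB300} {
		p, err := templatePath(a, ServiceEKS, variantNET)
		fmt.Printf("accelerator=%s path=%q err=%v\n", a, p, err)
	}
}
```

Failing fast for GB300 keeps it out of the GB200 templates and, by extension, out of any GB200-specific preflight that keys off a successful lookup.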
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 338fc99e-d81b-4381-93f2-0049b29848de

📥 Commits

Reviewing files that changed from the base of the PR and between e6d9301 and 9ad2d74.

📒 Files selected for processing (27)
  • .claude/skills/aicr-analyzing-snapshots/SKILL.md
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/client/v1/types.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/gpu_sku.go
  • pkg/fingerprint/gpu_sku_test.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/recipe/metadata_test.go
  • recipes/overlays/gb300-any.yaml
  • recipes/overlays/gb300-eks-inference.yaml
  • recipes/overlays/gb300-eks-training.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/gb300-eks-ubuntu-inference.yaml
  • recipes/overlays/gb300-eks-ubuntu-training-kubeflow.yaml
  • recipes/overlays/gb300-eks-ubuntu-training.yaml
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_preflight_nvreg.go
  • validators/performance/nccl_preflight_nvreg_test.go
  • validators/performance/nccl_test.go

The default vllm-runtime:1.2.0 is a CUDA 12.9 build whose flashinfer
kernels lack sm_103 (GB300/Blackwell Ultra), so the inference-perf worker
crash-loops with "no kernel image is available for execution on the
device". Switch the inference workload + model-cache populate image to
vllm-runtime:1.2.0-cuda13 (CUDA 13.0), which covers the Blackwell family.

Verified end-to-end: GB300 (sm_103) and RTX Pro 6000 (sm_120) both serve
Qwen3-8B and pass the inference-perf gate with the cuda13 image.

Refs NVIDIA#1318

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c03a6f6f-b82b-459c-859e-63ae0caa61f3

📥 Commits

Reviewing files that changed from the base of the PR and between 9ad2d74 and a0e4f2c.

📒 Files selected for processing (3)
  • validators/performance/model_cache.go
  • validators/performance/testdata/inference/dynamo-deployment-gateway-epp.yaml
  • validators/performance/testdata/inference/dynamo-deployment.yaml

// registry parity alone is not air-gap support: the populate Job's
// snapshot_download still reaches huggingface.co for the weights.
-cacheWorkerImage = "nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0"
+cacheWorkerImage = "nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-cuda13"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all inference recipes and their targeted accelerators to assess
# the blast radius of the CUDA 13 image update.

echo "=== Inference Recipes Affected by vLLM Runtime Update ==="
echo ""

# Find all recipes with inference in the name
echo "Searching for inference recipes in overlays..."
fd -e yaml 'inference' recipes/overlays/ | sort

echo ""
echo "=== Accelerator Types in Inference Recipes ==="
echo ""

# Check which accelerators are mentioned in those recipes
for recipe in $(fd -e yaml 'inference' recipes/overlays/ | sort); do
  echo "Recipe: $(basename $recipe)"
  rg -o 'accelerator:\s*\w+' "$recipe" 2>/dev/null | head -1 || echo "  (no accelerator field found)"
  echo ""
done

echo "=== CUDA 13 Compute Capability Coverage ==="
echo "CUDA 13 should support compute capabilities sm_75 and above."
echo "Common AICR accelerators:"
echo "  - A100:  sm_80"
echo "  - H100:  sm_90"
echo "  - GB200: sm_100"
echo "  - GB300: sm_103 (requires CUDA 13)"
echo ""
echo "ACTION REQUIRED: Manually verify inference-perf validation succeeds"
echo "on clusters with H100, GB200, and A100 to confirm backward compatibility"
echo "with the CUDA 13 runtime before merging this PR."

Repository: NVIDIA/aicr

Length of output: 4251


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Locate cacheWorkerImage usage + the function that uses it
rg -n "cacheWorkerImage|vllm-runtime:1\.2\.0-cuda13" validators/performance/model_cache.go
rg -n "buildModelCachePopulateJob" validators/performance/model_cache.go

echo "----"

# Print the relevant section around cacheWorkerImage and its call site(s)
# (keep output small via line ranges)
python3 - <<'PY'
import re, subprocess
path="validators/performance/model_cache.go"
# Collect the line numbers of cacheWorkerImage occurrences. rg omits the file
# name when given a single file, so the "path:" prefix is treated as optional.
out=subprocess.check_output(["rg","-n","cacheWorkerImage",path],text=True)
lns=[]
for line in out.strip().splitlines():
    m=re.match(r"(?:[^:]*:)?(\d+):",line)
    if m: lns.append(int(m.group(1)))
lns=sorted(set(lns))
if not lns:
    print("No cacheWorkerImage occurrences found")
    raise SystemExit(0)
for ln in lns:
    print(f"---- context around cacheWorkerImage at line {ln} ----")
    subprocess.run(["sed","-n",f"{max(1,ln-15)},{ln+25}p",path],check=True)
PY

echo "----"

# 2) Find other references across repo (if any) to vllm-runtime:1.2.0-cuda13
rg -n "vllm-runtime:1\.2\.0-cuda13" .

echo "----"

# 3) Check the specific YAML files mentioned in the original comment, if they exist.
# Use fd to locate files by exact basename.
fd -t f "dynamo-deployment.yaml" testdata | head -n 20 || true
fd -t f "dynamo-deployment-gateway-epp.yaml" testdata | head -n 20 || true

# If found, show those line ranges
for f in $(fd -t f "dynamo-deployment.yaml" testdata | head -n 5); do
  echo "---- showing $f: lines 1-120 ----"
  sed -n '1,120p' "$f"
done
for f in $(fd -t f "dynamo-deployment-gateway-epp.yaml" testdata | head -n 5); do
  echo "---- showing $f: lines 1-160 ----"
  sed -n '1,160p' "$f"
done

echo "----"

# 4) Look for any conditional selection of CUDA runtime / image by accelerator/GPU type
# (search for "cuda" / "accelerator" / "cacheWorkerImage" in model_cache or nearby files)
rg -n "cuda1[0-9]|cuda|sm_|accelerator|compute capability" validators/performance validators | head -n 200

Repository: NVIDIA/aicr

Length of output: 26683


🏁 Script executed:

#!/bin/bash
set -euo pipefail

path_go="validators/performance/model_cache.go"

echo "=== model_cache.go: cacheWorkerImage + call site + buildModelCachePopulateJob ==="
# Show the constant and surrounding context
sed -n '70,140p' "$path_go"

echo "----"
# Show the call site around where buildModelCachePopulateJob is invoked (search hit near 221/295)
sed -n '180,260p' "$path_go"

echo "----"
# Show the whole buildModelCachePopulateJob function (line numbers around 269)
sed -n '260,360p' "$path_go"

echo "----"
# Show any other references to buildModelCachePopulateJob / cacheWorkerImage in this file
echo "Occurrences:"
rg -n "buildModelCachePopulateJob|cacheWorkerImage" "$path_go"

echo "----"
echo "=== All vllm-runtime image references in the repo ==="
rg -n "vllm-runtime:" . | head -n 200

echo "----"
echo "=== vllm-runtime references restricted to validators/performance/testdata ==="
rg -n "vllm-runtime:" validators/performance/testdata | head -n 200

Repository: NVIDIA/aicr

Length of output: 15552


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "ensureModelCache\\(" validators/performance | head -n 200
rg -n "injectModelCacheMounts\\(" validators/performance | head -n 200
rg -n "modelCacheEnabled\\(|envModelCache|MODEL_CACHE" validators/performance/model_cache.go validators/performance/inference_perf_test.go validators/performance/inference_perf_constraint.go 2>/dev/null || true

echo "---- ensureModelCache call site context ----"
# show a bit of context around each ensureModelCache call
rg -n "ensureModelCache\\(" -n validators/performance | while read -r line; do
  file="$(echo "$line" | cut -d: -f1)"
  ln="$(echo "$line" | cut -d: -f2)"
  echo "## $file:$ln"
  sed -n "$((ln-20)),$((ln+20))p" "$file"
done | head -n 400

Repository: NVIDIA/aicr

Length of output: 11271


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find where the dynamo-deployment templates are referenced from Go code
rg -n "dynamo-deployment\\.yaml|dynamo-deployment-gateway-epp\\.yaml" validators/performance | head -n 200

# Also check how the inference deployment template is selected (if at all) by accelerator
rg -n "DynamoGraphDeployment|applyInference|inference.*deployment|template.*inference" validators/performance | head -n 200

Repository: NVIDIA/aicr

Length of output: 8259


Ensure CUDA 13 vLLM runtime works across all targeted GPU accelerators

validators/performance/model_cache.go pins cacheWorkerImage to nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-cuda13 and uses it unconditionally in buildModelCachePopulateJob (no accelerator/GPU-type branching). The worker templates used alongside it (validators/performance/testdata/inference/dynamo-deployment*.yaml) are also pinned to the same CUDA 13 image, and the populate Job is created via ensureModelCache (model cache is enabled by default unless explicitly disabled).

Run/confirm inference-perf validation on H100, GB200, and A100 clusters (and any other accelerators covered by inference-perf overlays, e.g., GB300/B200/H200/RTX Pro) to ensure this CUDA 13 runtime doesn’t break model-cache population or offline serving on older device generations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/model_cache.go` at line 95, The code pins
cacheWorkerImage to a CUDA 13 vLLM runtime unconditionally which can break
model-cache population on older GPU accelerators; update
buildModelCachePopulateJob (and any code that references cacheWorkerImage) to
select the runtime image based on the cluster/node accelerator type (e.g., H100,
A100, GB200, GB300, H200, B200, RTX Pro) or a configurable override, and update
the worker YAML templates
(validators/performance/testdata/inference/dynamo-deployment*.yaml) to support
multiple runtime images or accept an image parameter; specifically, make
cacheWorkerImage a function or switch keyed by accelerator label, wire that into
ensureModelCache which creates the populate Job, and allow an env/flag to force
a particular image for testing so older device generations use compatible CUDA
runtimes.
