[WIP] DAS + Kueue Integration #972

Open
sohankunkerkar wants to merge 7 commits into openshift:next from sohankunkerkar:kueue-integration-structured

Conversation

@sohankunkerkar

Supersedes #910

Copilot AI review requested due to automatic review settings December 19, 2025 20:06
openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Dec 19, 2025
Copilot AI left a comment

Pull request overview

This work-in-progress PR integrates the Dynamic Accelerator Slicer (DAS) operator with Kueue for GPU quota management and workload scheduling. Kueue manages GPU resources through a virtual, memory-based quota resource (gpu.das.openshift.io/mem) before the actual MIG slices are dynamically created. The integration supports multiple workload types, including Jobs, PyTorchJob, RayJob, and others.

Key Changes:

  • Added webhook transformation infrastructure to convert MIG resource requests to a DAS-compatible format that uses the gpu.das.openshift.io/mem resource
  • Updated scheduler plugin to read MIG profiles from annotations for Kueue-managed pods
  • Introduced GPU memory device plugin for Kueue quota tracking
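The transformation described in the first bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the `nvidia.com/mig-` resource prefix, the `das.example/mig-profile` annotation key, the `transformRequests` name, and the choice of gigabytes as the unit for gpu.das.openshift.io/mem are all hypothetical. Only the gpu.das.openshift.io/mem resource name and the profile pattern come from this page.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// gpuMemResource is the quota resource named in the PR overview.
const gpuMemResource = "gpu.das.openshift.io/mem"

// Hypothetical names, chosen only for this sketch.
const (
	migResourcePrefix = "nvidia.com/mig-"
	profileAnnotation = "das.example/mig-profile"
)

// Same pattern as the migProfileRegex quoted in the review comments.
var migProfileRegex = regexp.MustCompile(`(?:\d+c\.)?(\d+)g\.(\d+)gb`)

// transformRequests replaces each MIG request with an equivalent memory
// request (assumed to be in GB) and records the profile in an annotation.
// With several different profiles, the annotation keeps only the last one seen.
func transformRequests(requests map[string]int64, annotations map[string]string) error {
	// Collect MIG keys first so the map is not modified while it is iterated.
	var migKeys []string
	for name := range requests {
		if strings.HasPrefix(name, migResourcePrefix) {
			migKeys = append(migKeys, name)
		}
	}
	for _, name := range migKeys {
		profile := strings.TrimPrefix(name, migResourcePrefix)
		m := migProfileRegex.FindStringSubmatch(profile)
		if m == nil {
			return fmt.Errorf("unrecognized MIG profile %q", profile)
		}
		memGB, err := strconv.ParseInt(m[2], 10, 64)
		if err != nil {
			return err
		}
		requests[gpuMemResource] += memGB * requests[name]
		delete(requests, name)
		annotations[profileAnnotation] = profile
	}
	return nil
}

func main() {
	requests := map[string]int64{"cpu": 2, "nvidia.com/mig-1g.5gb": 1}
	annotations := map[string]string{}
	if err := transformRequests(requests, annotations); err != nil {
		panic(err)
	}
	fmt.Println(requests[gpuMemResource], annotations[profileAnnotation])
}
```

Keeping the original profile in an annotation is what would let a scheduler plugin recover it later, which matches the second bullet's description of reading MIG profiles from annotations.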

Reviewed changes

Copilot reviewed 34 out of 2969 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pkg/webhook/*.go | New webhook handlers for Job, Kubeflow, Ray, JobSet, and long-running workloads with shared template transformation logic |
| pkg/scheduler/plugins/mig/mig.go | Enhanced scheduler to support annotation-based MIG profile extraction and updated framework imports |
| pkg/daemonset/deviceplugins/*.go | Added GPU memory device support and renamed CDI-related functions for consistency |
| pkg/constants/resources.go | Centralized resource name constants for DAS resources |
| go.mod | Updated Go version and added dependencies for Kueue, Kubeflow, Ray, and other operators |
| bindata/assets/instaslice-operator/*.yaml | Updated webhook configuration and scheduler deployment with new feature gate |
| test/e2e/e2e.go | Enhanced MIG placement tests and added GPU memory capacity verification tests |
| samples/kueue/*.yaml | Sample configurations for Kueue setup and various workload types |
| docs/kueue-integration.md | Comprehensive documentation of the DAS + Kueue integration |

  }
- score, st := p.Score(ctx, cycle, pod, n)
- if st != nil && st.Code() != framework.Success {
+ score, st := p.Score(ctx, cycle, pod, nodeInfos[n])
Copilot AI Dec 19, 2025

The Score method signature has changed from accepting a nodeName string to accepting a NodeInfo parameter. The test is now passing nodeInfos[n] which is correct for the new signature, but the variable name 'n' suggests it should be a node name string. Verify that this test correctly validates the scoring behavior with the new signature.

// migProfileRegex is compiled once at package init for efficient reuse.
// Matches patterns like "1g.5gb", "2g.10gb", "1c.1g.5gb", etc.
var migProfileRegex = regexp.MustCompile(`(?:\d+c\.)?(\d+)g\.(\d+)gb`)
Copilot AI Dec 19, 2025

The regex pattern for MIG profiles is duplicated - it's defined here and also in pkg/scheduler/plugins/mig/mig.go at line 591. Consider moving this to pkg/constants/resources.go or creating a shared utility package to avoid duplication and ensure consistency.
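One possible shape for such a shared helper is sketched below. The `MIGProfile` type and `ParseMIGProfile` function are hypothetical names, not code from this PR, and the sketch is written as `package main` only so that it runs on its own. In the repository it would live in whichever shared package is chosen. The pattern is copied unchanged from the diff above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// migProfileRegex is the single shared definition of the pattern that is
// currently duplicated. It is compiled once at package init.
var migProfileRegex = regexp.MustCompile(`(?:\d+c\.)?(\d+)g\.(\d+)gb`)

// MIGProfile holds the numeric parts of a profile such as "2g.10gb".
type MIGProfile struct {
	GPUSlices int // the "<n>g" component
	MemoryGB  int // the "<n>gb" component
}

// ParseMIGProfile extracts the slice count and memory size from s.
// It reports false when s contains no recognizable profile.
func ParseMIGProfile(s string) (MIGProfile, bool) {
	m := migProfileRegex.FindStringSubmatch(s)
	if m == nil {
		return MIGProfile{}, false
	}
	slices, err := strconv.Atoi(m[1])
	if err != nil {
		return MIGProfile{}, false
	}
	mem, err := strconv.Atoi(m[2])
	if err != nil {
		return MIGProfile{}, false
	}
	return MIGProfile{GPUSlices: slices, MemoryGB: mem}, true
}

func main() {
	for _, s := range []string{"1g.5gb", "1c.1g.5gb", "not-a-profile"} {
		p, ok := ParseMIGProfile(s)
		fmt.Println(s, p, ok)
	}
}
```

Exposing a parse function rather than the raw regex would keep both call sites from depending on capture-group indices.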

total := 0
for _, cr := range req.GetContainerRequests() {
- ids := cr.GetDevicesIDs()
+ ids := cr.GetDevicesIds()
Copilot AI Dec 19, 2025

The method name 'GetDevicesIds' should be 'GetDeviceIDs' following Go naming conventions where 'ID' is capitalized as a single unit. This appears to be from the upstream API, but the inconsistency with standard Go conventions is worth noting.

}

// Always use Object (the new/current state) for both CREATE and UPDATE operations.
// Using OldObject on UPDATE would cause us to lose changes made by other controllers.
Copilot AI Dec 19, 2025

While the comment explains why Object is used instead of OldObject, it would be helpful to document what specific changes from other controllers might be lost. This would help future maintainers understand the reasoning behind this decision more clearly.

Suggested change
- // Using OldObject on UPDATE would cause us to lose changes made by other controllers.
+ // Using OldObject on UPDATE would discard mutations applied by other components (for example,
+ // changes to annotations/labels from other admission webhooks, schedulerName/runtimeClassName
+ // set by scheduling controllers, or resource requests/limits adjusted by cluster policies).

- "--config=/etc/das-scheduler/config.yaml"
# Disable DynamicResourceAllocation feature gate to prevent scheduler
# startup failures when DRA CRDs are not installed in the cluster.
# K8s 1.32+ enables DRA by default, which causes informer sync issues.
Copilot AI Dec 19, 2025

The comment mentions K8s 1.32+ but the go.mod shows a dependency on k8s.io/kubernetes v1.34.1. Consider updating the comment to reflect the actual Kubernetes version being used or explaining the version range where this issue applies.

Suggested change
- # K8s 1.32+ enables DRA by default, which causes informer sync issues.
+ # K8s 1.32+ (including 1.34.x) enables DRA by default, which causes informer sync issues.

openshift-ci bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Dec 19, 2025
@openshift-ci openshift-ci bot requested review from empovit and harche December 19, 2025 20:08
openshift-ci bot commented Dec 19, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sohankunkerkar
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sohankunkerkar sohankunkerkar force-pushed the kueue-integration-structured branch from 76b1bcd to 6889cd2 Compare December 19, 2025 20:53
openshift-ci bot commented Feb 6, 2026

@sohankunkerkar: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-bundle-runc | 6889cd2 | link | true | /test e2e-bundle-runc |
| ci/prow/images | 6889cd2 | link | true | /test images |
| ci/prow/e2e-bundle-4-19-runc | 6889cd2 | link | true | /test e2e-bundle-4-19-runc |
| ci/prow/ci-index-das-operator-bundle | 6889cd2 | link | true | /test ci-index-das-operator-bundle |
| ci/prow/e2e-bundle-4-20-runc | 6889cd2 | link | true | /test e2e-bundle-4-20-runc |
| ci/prow/4.18-periodics-ci-index-das-operator-bundle | 6889cd2 | link | true | /test 4.18-periodics-ci-index-das-operator-bundle |
| ci/prow/4.18-periodics-images | 6889cd2 | link | true | /test 4.18-periodics-images |
| ci/prow/4.19-periodics-ci-index-das-operator-bundle | 6889cd2 | link | true | /test 4.19-periodics-ci-index-das-operator-bundle |
| ci/prow/4.19-periodics-images | 6889cd2 | link | true | /test 4.19-periodics-images |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
