Conversation
Pull request overview
This work-in-progress PR introduces integration between the Dynamic Accelerator Slicer (DAS) operator and Kueue for GPU quota management and workload scheduling. The integration enables Kueue to manage GPU resources using a virtual memory-based quota resource (gpu.das.openshift.io/mem) before actual MIG slices are dynamically created, supporting multiple workload types including Jobs, PyTorchJob, RayJob, and others.
Key Changes:
- Added webhook transformation infrastructure to convert MIG resource requests to DAS-compatible format with gpu.das.openshift.io/mem resource
- Updated scheduler plugin to read MIG profiles from annotations for Kueue-managed pods
- Introduced GPU memory device plugin for Kueue quota tracking
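To make the first bullet concrete, here is a minimal sketch of how a webhook could derive the virtual memory quota from the MIG profile embedded in a resource name. This is an illustration only: `memFromMIGResource` and the `nvidia.com/mig-*` resource naming are assumptions, not the PR's actual code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Matches MIG profile names such as "1g.5gb" or "1c.2g.10gb", capturing the
// GPU-slice count and the memory size in GB.
var migProfileRegex = regexp.MustCompile(`(?:\d+c\.)?(\d+)g\.(\d+)gb`)

// memFromMIGResource converts a MIG resource name like "nvidia.com/mig-1g.5gb"
// plus a requested quantity into the total GB it implies, so a webhook could
// request the same amount under gpu.das.openshift.io/mem for Kueue to track.
func memFromMIGResource(resource string, quantity int64) (int64, error) {
	m := migProfileRegex.FindStringSubmatch(resource)
	if m == nil {
		return 0, fmt.Errorf("not a MIG resource: %s", resource)
	}
	gb, err := strconv.ParseInt(m[2], 10, 64)
	if err != nil {
		return 0, err
	}
	return gb * quantity, nil
}

func main() {
	// Two 1g.5gb slices translate to 10 GB of virtual memory quota.
	gb, _ := memFromMIGResource("nvidia.com/mig-1g.5gb", 2)
	fmt.Println(gb) // 10
}
```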
Reviewed changes
Copilot reviewed 34 out of 2969 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/webhook/*.go | New webhook handlers for Job, Kubeflow, Ray, JobSet, and long-running workloads with shared template transformation logic |
| pkg/scheduler/plugins/mig/mig.go | Enhanced scheduler to support annotation-based MIG profile extraction and updated framework imports |
| pkg/daemonset/deviceplugins/*.go | Added GPU memory device support and renamed CDI-related functions for consistency |
| pkg/constants/resources.go | Centralized resource name constants for DAS resources |
| go.mod | Updated Go version and added dependencies for Kueue, Kubeflow, Ray, and other operators |
| bindata/assets/instaslice-operator/*.yaml | Updated webhook configuration and scheduler deployment with new feature gate |
| test/e2e/e2e.go | Enhanced MIG placement tests and added GPU memory capacity verification tests |
| samples/kueue/*.yaml | Sample configurations for Kueue setup and various workload types |
| docs/kueue-integration.md | Comprehensive documentation of the DAS + Kueue integration |
```diff
 	}
-	score, st := p.Score(ctx, cycle, pod, n)
+	score, st := p.Score(ctx, cycle, pod, nodeInfos[n])
 	if st != nil && st.Code() != framework.Success {
```
The Score method signature has changed from accepting a nodeName string to accepting a NodeInfo parameter. The test is now passing nodeInfos[n] which is correct for the new signature, but the variable name 'n' suggests it should be a node name string. Verify that this test correctly validates the scoring behavior with the new signature.
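To make the signature change concrete, here is a minimal, self-contained sketch. The `NodeInfo` and `migPlugin` types are stand-ins for the real scheduler framework types, not the PR's code; the point is only that the new signature takes the node object, so the test must index a `nodeInfos` map by the node name `n`.

```go
package main

import "fmt"

// NodeInfo is a stand-in for the scheduler framework's node wrapper.
type NodeInfo struct{ Name string }

// migPlugin mimics a plugin whose Score now accepts *NodeInfo instead of a
// node-name string.
type migPlugin struct{}

func (p *migPlugin) Score(node *NodeInfo) int64 {
	// Real logic would inspect node capacity; here we just show that the
	// node object, not its name, is what gets passed in.
	return int64(len(node.Name))
}

func main() {
	nodeInfos := map[string]*NodeInfo{"worker-0": {Name: "worker-0"}}
	p := &migPlugin{}
	for n := range nodeInfos {
		// n is still a node-name string; nodeInfos[n] resolves it to the
		// object the new signature requires.
		fmt.Println(p.Score(nodeInfos[n]))
	}
}
```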
```go
// migProfileRegex is compiled once at package init for efficient reuse.
// Matches patterns like "1g.5gb", "2g.10gb", "1c.1g.5gb", etc.
var migProfileRegex = regexp.MustCompile(`(?:\d+c\.)?(\d+)g\.(\d+)gb`)
```
The regex pattern for MIG profiles is duplicated - it's defined here and also in pkg/scheduler/plugins/mig/mig.go at line 591. Consider moving this to pkg/constants/resources.go or creating a shared utility package to avoid duplication and ensure consistency.
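A shared helper along the lines suggested might look like the sketch below. The function name `ParseMIGProfile` and its placement are hypothetical; the regex itself is the one from the PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// A single exported helper that both the webhook package and the scheduler
// plugin could import, instead of each compiling its own copy of the pattern.
var migProfileRegex = regexp.MustCompile(`(?:\d+c\.)?(\d+)g\.(\d+)gb`)

// ParseMIGProfile returns the GPU-slice count and memory (GB) encoded in a
// MIG profile name, or ok=false if the string is not a MIG profile.
func ParseMIGProfile(s string) (slices, memGB int, ok bool) {
	m := migProfileRegex.FindStringSubmatch(s)
	if m == nil {
		return 0, 0, false
	}
	slices, _ = strconv.Atoi(m[1])
	memGB, _ = strconv.Atoi(m[2])
	return slices, memGB, true
}

func main() {
	// The optional compute prefix ("1c.") is accepted but not captured.
	g, mem, ok := ParseMIGProfile("1c.1g.5gb")
	fmt.Println(g, mem, ok) // 1 5 true
}
```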
```diff
 	total := 0
 	for _, cr := range req.GetContainerRequests() {
-		ids := cr.GetDevicesIDs()
+		ids := cr.GetDevicesIds()
```
The method name `GetDevicesIds` does not follow the Go convention of treating `ID` as a single capitalized unit (which would give `GetDevicesIDs`). This getter comes from protobuf-generated upstream API code, where the capitalization is fixed by the generator, but the inconsistency with standard Go conventions is worth noting.
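For context, the loop under review totals the device IDs requested across all containers in an allocate call. A self-contained sketch using stand-in types (the real getters come from protobuf-generated code, so the type and field names below are illustrative):

```go
package main

import "fmt"

// ContainerRequest stands in for a generated device-plugin API message;
// nil-safe getters mirror the protobuf-generated style.
type ContainerRequest struct{ DevicesIds []string }

func (c *ContainerRequest) GetDevicesIds() []string {
	if c == nil {
		return nil
	}
	return c.DevicesIds
}

type AllocateRequest struct{ ContainerRequests []*ContainerRequest }

func (r *AllocateRequest) GetContainerRequests() []*ContainerRequest {
	if r == nil {
		return nil
	}
	return r.ContainerRequests
}

// countDevices mirrors the loop under review: total devices requested
// across all containers in one Allocate call.
func countDevices(req *AllocateRequest) int {
	total := 0
	for _, cr := range req.GetContainerRequests() {
		total += len(cr.GetDevicesIds())
	}
	return total
}

func main() {
	req := &AllocateRequest{ContainerRequests: []*ContainerRequest{
		{DevicesIds: []string{"gpu-0", "gpu-1"}},
		{DevicesIds: []string{"gpu-2"}},
	}}
	fmt.Println(countDevices(req)) // 3
}
```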
```go
// Always use Object (the new/current state) for both CREATE and UPDATE operations.
// Using OldObject on UPDATE would cause us to lose changes made by other controllers.
```
While the comment explains why Object is used instead of OldObject, it would be helpful to document what specific changes from other controllers might be lost. This would help future maintainers understand the reasoning behind this decision more clearly.
Suggested change:

```diff
-// Using OldObject on UPDATE would cause us to lose changes made by other controllers.
+// Using OldObject on UPDATE would discard mutations applied by other components (for example,
+// changes to annotations/labels from other admission webhooks, schedulerName/runtimeClassName
+// set by scheduling controllers, or resource requests/limits adjusted by cluster policies).
```
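The rule can be illustrated with a reduced model of an admission request. `AdmissionRequest` here is a simplified stand-in, not the real admission/v1 type, and the field values are made up for the example.

```go
package main

import "fmt"

// AdmissionRequest is a reduced stand-in: Object is the state being admitted,
// OldObject the previous state (only populated on UPDATE).
type AdmissionRequest struct {
	Operation string
	Object    map[string]string // stand-in for the serialized pod template
	OldObject map[string]string
}

// templateToMutate always returns Object: on UPDATE, OldObject predates
// mutations from other webhooks and controllers, so patching it would
// silently revert their changes.
func templateToMutate(req AdmissionRequest) map[string]string {
	return req.Object
}

func main() {
	req := AdmissionRequest{
		Operation: "UPDATE",
		OldObject: map[string]string{"schedulerName": "default-scheduler"},
		Object:    map[string]string{"schedulerName": "das-scheduler"},
	}
	// Mutating OldObject here would drop the schedulerName another
	// controller already set; Object preserves it.
	fmt.Println(templateToMutate(req)["schedulerName"]) // das-scheduler
}
```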
```yaml
- "--config=/etc/das-scheduler/config.yaml"
# Disable DynamicResourceAllocation feature gate to prevent scheduler
# startup failures when DRA CRDs are not installed in the cluster.
# K8s 1.32+ enables DRA by default, which causes informer sync issues.
```
The comment mentions K8s 1.32+ but the go.mod shows a dependency on k8s.io/kubernetes v1.34.1. Consider updating the comment to reflect the actual Kubernetes version being used or explaining the version range where this issue applies.
Suggested change:

```diff
-# K8s 1.32+ enables DRA by default, which causes informer sync issues.
+# K8s 1.32+ (including 1.34.x) enables DRA by default, which causes informer sync issues.
```
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sohankunkerkar. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed: 76b1bcd to 6889cd2
@sohankunkerkar: The following tests failed; say `/retest` to rerun all failed tests.

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Supersedes #910