feat: Added subgroups in pytorch jobs #935
base: main
Merging this branch will increase overall coverage
Coverage by file
Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
    ReplicaTypeLabel = pytorchv1.ReplicaTypeLabel

    ReplicaTypeMaster = pytorchv1.PyTorchJobReplicaTypeMaster
    ReplicaTypeWorker = pytorchv1.PyTorchJobReplicaTypeWorker
What do you think about making these private?
Fixed
        masterReplicas = 0
    }

    workerReplicas := totalMinAvailable - int32(masterReplicas)
What happens if totalMinAvailable < masterReplicas?
Fixed
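For readers following this thread: the actual fix is not shown in the conversation, so the following is only a minimal sketch of one way to guard the subtraction. It assumes the job-wide minimum is filled by master replicas first and that the worker share is clamped at zero. The helper name `splitMinAvailable` is hypothetical and does not come from the PR.

```go
package main

// splitMinAvailable is a hypothetical helper, not code from this PR.
// It divides a job-wide minAvailable between the master and worker
// subgroups, filling master slots first, and never returns a negative
// worker count even when totalMinAvailable < masterReplicas.
func splitMinAvailable(totalMinAvailable, masterReplicas int32) (masterMin, workerMin int32) {
	// Treat negative inputs as zero so the arithmetic below stays non-negative.
	if totalMinAvailable < 0 {
		totalMinAvailable = 0
	}
	if masterReplicas < 0 {
		masterReplicas = 0
	}

	// The master subgroup cannot require more pods than the job-wide minimum.
	masterMin = masterReplicas
	if masterMin > totalMinAvailable {
		masterMin = totalMinAvailable
	}

	// Whatever remains of the minimum is assigned to the workers.
	workerMin = totalMinAvailable - masterMin
	return masterMin, workerMin
}
```

With this shape, `splitMinAvailable(0, 1)` yields zero for both subgroups instead of a worker minimum of -1.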
    subGroups = append(subGroups, &podgroup.SubGroupMetadata{
        Name:           replicaType,
        MinAvailable:   minAvailable,
        PodsReferences: podReferences,
Shouldn't we aggregate all of this replica type's pod references?
PodReferences are aggregated: the podgroup handler uses them to assign pods to subgroups, and once a pod is assigned there's no need to re-assign it. Doing so would require us to list the pods in the namespace and iterate over them. See https://github.com/NVIDIA/KAI-Scheduler/blob/main/pkg/podgrouper/podgroup/handler.go#L143
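As a rough illustration of the aggregation described in this reply, the sketch below merges newly reported pod references into a subgroup's existing list without duplicating pods that were already assigned. It is not the code at the linked handler line. The `SubGroup` type is a simplified stand-in for `podgroup.SubGroupMetadata`, the use of plain strings for references is an assumption, and `mergePodReferences` is a hypothetical name.

```go
package main

// SubGroup is a simplified stand-in for podgroup.SubGroupMetadata.
// The real type lives in KAI-Scheduler and its field types may differ.
type SubGroup struct {
	Name           string
	MinAvailable   int32
	PodsReferences []string // assumption: references modeled as pod names
}

// mergePodReferences is a hypothetical helper. It appends incoming
// references to the existing ones, preserving order and skipping names
// that are already present, so a pod assigned once is not listed twice.
func mergePodReferences(existing, incoming []string) []string {
	seen := make(map[string]struct{}, len(existing)+len(incoming))
	merged := make([]string, 0, len(existing)+len(incoming))
	for _, refs := range [][]string{existing, incoming} {
		for _, name := range refs {
			if _, ok := seen[name]; ok {
				continue // already assigned to this subgroup
			}
			seen[name] = struct{}{}
			merged = append(merged, name)
		}
	}
	return merged
}

// addPods returns a copy of sg with the incoming references merged in.
func addPods(sg SubGroup, incoming []string) SubGroup {
	sg.PodsReferences = mergePodReferences(sg.PodsReferences, incoming)
	return sg
}
```

Under these assumptions, each reconcile only has to report the pod it is currently handling, which avoids listing every pod in the namespace.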
rich7420
left a comment
@itsomri thanks for the patch!
LGTM
Description
Implemented subgroups in pytorch:
Related Issues
Fixes #
Checklist
Breaking Changes
Additional Notes