Collaborator

@itsomri itsomri commented Jan 28, 2026

Description

Implemented subgroups in the PyTorch pod grouper.

Related Issues

Fixes #

Checklist

Note: Ensure your PR title follows the Conventional Commits format (e.g., feat(scheduler): add new feature)

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

Additional Notes

@itsomri itsomri closed this Jan 28, 2026
@itsomri itsomri force-pushed the omric/pytorch-subgrouping branch from b3330df to c25da43 Compare January 28, 2026 14:49
@itsomri itsomri reopened this Jan 28, 2026
@itsomri itsomri force-pushed the omric/pytorch-subgrouping branch from 5d9476f to 9ad48a5 Compare January 28, 2026 15:08
@github-actions
Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/KAI-scheduler/pkg/podgrouper/podgrouper/plugins/kubeflow/pytorch 84.31% (+0.98%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/KAI-scheduler/pkg/podgrouper/podgrouper/plugins/kubeflow/pytorch/pytorch_grouper.go 84.31% (+0.98%) 51 (+27) 43 (+23) 8 (+4) 👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/KAI-scheduler/pkg/podgrouper/podgrouper/plugins/kubeflow/pytorch/pytorch_grouper_test.go

@itsomri itsomri enabled auto-merge January 29, 2026 09:26
@itsomri itsomri disabled auto-merge January 29, 2026 09:26
@itsomri itsomri enabled auto-merge January 29, 2026 09:26
@github-actions
Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/KAI-scheduler/pkg/podgrouper/podgrouper/plugins/kubeflow/pytorch 84.31% (+0.98%) 👍
github.com/NVIDIA/KAI-scheduler/test/e2e/modules/resources/rd/pod_group 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/KAI-scheduler/pkg/podgrouper/podgrouper/plugins/kubeflow/pytorch/pytorch_grouper.go 84.31% (+0.98%) 51 (+27) 43 (+23) 8 (+4) 👍
github.com/NVIDIA/KAI-scheduler/test/e2e/modules/resources/rd/pod_group/pod_group.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/KAI-scheduler/pkg/podgrouper/podgrouper/plugins/kubeflow/pytorch/pytorch_grouper_test.go

@itsomri itsomri force-pushed the omric/pytorch-subgrouping branch from 8e0c415 to 7794422 Compare January 29, 2026 11:22
Comment on lines 22 to 25
ReplicaTypeLabel = pytorchv1.ReplicaTypeLabel

ReplicaTypeMaster = pytorchv1.PyTorchJobReplicaTypeMaster
ReplicaTypeWorker = pytorchv1.PyTorchJobReplicaTypeWorker
Collaborator

What do you think about making these private?

Collaborator Author

Fixed

masterReplicas = 0
}

workerReplicas := totalMinAvailable - int32(masterReplicas)
Collaborator

What happens if totalMinAvailable < masterReplicas?

Collaborator Author

@itsomri itsomri Jan 29, 2026

Fixed
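
The thread does not show the fix itself. As a minimal sketch of one way to guard the subtraction quoted above, the helper below clamps both values so neither subgroup ends up with a negative minimum. The name `splitMinAvailable` and its signature are hypothetical and are not taken from this PR.

```go
package main

import "fmt"

// splitMinAvailable is a hypothetical helper, not the code from this PR.
// It divides totalMinAvailable between the master and worker subgroups
// and clamps the result so that neither share is negative.
func splitMinAvailable(totalMinAvailable, masterReplicas int32) (master, worker int32) {
	if totalMinAvailable < 0 {
		totalMinAvailable = 0
	}
	if masterReplicas < 0 {
		masterReplicas = 0
	}
	// The master share can never exceed the total.
	master = masterReplicas
	if master > totalMinAvailable {
		master = totalMinAvailable
	}
	// Whatever is left goes to the workers; this is >= 0 by construction.
	worker = totalMinAvailable - master
	return master, worker
}

func main() {
	m, w := splitMinAvailable(3, 1)
	fmt.Println(m, w) // prints "1 2"

	// The case raised in the review: totalMinAvailable < masterReplicas.
	m, w = splitMinAvailable(1, 2)
	fmt.Println(m, w) // prints "1 0"
}
```

Whether clamping, returning an error, or rejecting the job is the right behavior depends on how the grouper is expected to treat inconsistent specs; the sketch only illustrates the clamping option.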

subGroups = append(subGroups, &podgroup.SubGroupMetadata{
	Name:           replicaType,
	MinAvailable:   minAvailable,
	PodsReferences: podReferences,
Collaborator

Shouldn't we aggregate all the pod references for this replica type?

Collaborator Author

PodReferences are aggregated: the podgroup handler uses them to assign pods to subgroups, and once a pod is assigned there is no need to reassign it. Reassigning on every call would require us to list the pods in the namespace and iterate over them. See https://github.com/NVIDIA/KAI-Scheduler/blob/main/pkg/podgrouper/podgroup/handler.go#L143
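
To illustrate the accumulation idea described in this reply, here is a minimal sketch. It is not the code in handler.go: the `subGroup` type and the `mergePodReference` function are hypothetical stand-ins, and pod references are simplified to plain strings.

```go
package main

import "fmt"

// subGroup is a simplified, hypothetical stand-in for the subgroup
// metadata discussed above; it is not the real podgroup type.
type subGroup struct {
	Name           string
	MinAvailable   int32
	PodsReferences []string
}

// mergePodReference records podName under the subgroup called name.
// References that are already stored are kept, so earlier assignments
// survive later calls without listing every pod in the namespace.
func mergePodReference(groups []subGroup, name, podName string) []subGroup {
	for i := range groups {
		if groups[i].Name != name {
			continue
		}
		for _, ref := range groups[i].PodsReferences {
			if ref == podName {
				return groups // already assigned, nothing to do
			}
		}
		groups[i].PodsReferences = append(groups[i].PodsReferences, podName)
		return groups
	}
	// Unknown subgroup: create it with this single reference.
	return append(groups, subGroup{Name: name, PodsReferences: []string{podName}})
}

func main() {
	groups := []subGroup{{Name: "worker", MinAvailable: 2}}
	// Each call handles one pod, as if pods were processed one at a time.
	for _, pod := range []string{"job-worker-0", "job-worker-1", "job-worker-0"} {
		groups = mergePodReference(groups, "worker", pod)
	}
	fmt.Println(groups[0].PodsReferences) // prints "[job-worker-0 job-worker-1]"
}
```

Under these assumptions, each call only needs the single pod being handled, which is the point made above about avoiding a namespace-wide listing.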

Contributor

@rich7420 rich7420 left a comment

@itsomri thanks for the patch!
LGTM

3 participants