Add replica groups in dstack-service #3408
base: master
Will be solving merge conflicts as review continues.
Related PRs: #3205 from @DragonStuff
@Bihan Do we really need replica group names?
@Bihan Also please check the conflicts with
Cosmetics only: I would rename
Yes, will rename it.
Yes. Without replica group names, we would have to rely on indices, which are position-dependent. If users reorder groups during manual scaling, the indices shift, but existing jobs and persisted state (such as desired_replica_counts) still reference the old positions. This mismatch makes it impossible to reliably identify which group a job belongs to, which leads to incorrect scaling decisions. Replica group names are not affected by reordering in the YAML file.
Initial Manual Scaling
Instead of relying on a replica group's position in the config, another option is to match job specs to identify replicas, but this approach fails during rolling deployments because old and new jobs from the same group have different specs.
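The reordering problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ReplicaGroup`, the two lookup helpers, and the group names and counts are hypothetical and are not taken from the dstack code base; only the idea of persisted desired replica counts comes from the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaGroup:
    # Hypothetical stand-in for one replica group entry in the service config.
    name: str
    commands: tuple


def desired_counts_by_index(groups, persisted):
    """Look up persisted counts by position in the config (fragile)."""
    return {g.name: persisted[i] for i, g in enumerate(groups)}


def desired_counts_by_name(groups, persisted):
    """Look up persisted counts by group name (stable under reordering)."""
    return {g.name: persisted[g.name] for g in groups}


gpu = ReplicaGroup("gpu", ("python serve.py --gpu",))
cpu = ReplicaGroup("cpu", ("python serve.py --cpu",))

# State saved while the config listed the groups as [gpu, cpu].
persisted_by_index = [2, 3]
persisted_by_name = {"gpu": 2, "cpu": 3}

# The user swaps the two groups in the YAML file.
reordered = [cpu, gpu]

# Index-based lookup hands each group the other group's count.
print(desired_counts_by_index(reordered, persisted_by_index))
# Name-based lookup keeps each count attached to its group.
print(desired_counts_by_name(reordered, persisted_by_name))
```

With index-based lookup the `cpu` group inherits the count that was stored for `gpu`, which is the kind of mismatch that leads to wrong scaling decisions.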
As a user, I find it unnecessary to give names. I would prefer not to ask for names if this is technically possible.
If a user changes commands for group 0 and reorders groups at the same time, they expect a rolling deployment for group 0 only. However, the system detects the order change and triggers a full redeployment for all groups. Users may find this implicit behavior annoying because it provisions extra instances for each group.
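A small sketch of the scenario above, assuming a config is represented as a list of `(name, commands)` pairs. Both diff helpers and the example commands are hypothetical and only illustrate why a position-based comparison flags every group while a name-based comparison flags only the edited one.

```python
def changed_positions(old_commands, new_commands):
    """Index-based diff: compare groups pairwise by position.

    Assumes both lists have the same length (zip stops at the shorter one).
    """
    return [
        i
        for i, (old, new) in enumerate(zip(old_commands, new_commands))
        if old != new
    ]


def changed_names(old_groups, new_groups):
    """Name-based diff: compare each group with its namesake, ignoring order.

    A group whose name did not exist before is reported as changed.
    """
    old_by_name = dict(old_groups)
    return [
        name
        for name, commands in new_groups
        if old_by_name.get(name) != commands
    ]


old_config = [("a", "serve --version 0"), ("b", "serve")]
# The user edits the commands of group "a" and also swaps the two groups.
new_config = [("b", "serve"), ("a", "serve --version 1")]

by_position = changed_positions(
    [commands for _, commands in old_config],
    [commands for _, commands in new_config],
)
by_name = changed_names(old_config, new_config)

print(by_position)  # every position differs, so every group would be redeployed
print(by_name)      # only the group whose commands were edited
```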
Perhaps we could make these names optional? |
Yes, we can make it optional. |
add_replica_groups_model Replica Groups AutoScaling Rolling deployment and UI Replica Groups implementation clean up
86139c5 to 5abbcad
Steps To Test
Step 1: Create replica-groups-service.yml
Step 2: dstack apply -f replica-groups-service.yml
Step 3: Run load_test_replica_groups.py, substituting your URL and TOKEN
Expected Output
Initially, each group gets one replica.
Later, both groups scale according to their own group configs:
group0 scales to 2 replicas,
and group1 scales to 3.
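The per-group behavior above can be illustrated with a rough sketch. All numbers here are assumptions chosen to reproduce the counts in this description: a replica range of 1..2 for group0, 1..3 for group1, and a target of 1 request per second per replica. The sketch also assumes that each group is evaluated against the same observed RPS; it does not reproduce the actual dstack autoscaler.

```python
import math


def desired_replicas(observed_rps, target_rps_per_replica, min_replicas, max_replicas):
    """Replicas needed for the observed load, clamped to the group's range."""
    wanted = math.ceil(observed_rps / target_rps_per_replica)
    return max(min_replicas, min(wanted, max_replicas))


# Hypothetical per-group settings (not taken from replica-groups-service.yml).
GROUPS = {
    "group0": {"min": 1, "max": 2, "target": 1.0},
    "group1": {"min": 1, "max": 3, "target": 1.0},
}


def plan(observed_rps):
    """Compute the desired replica count of every group for one RPS reading."""
    return {
        name: desired_replicas(observed_rps, cfg["target"], cfg["min"], cfg["max"])
        for name, cfg in GROUPS.items()
    }


print(plan(0.5))  # low load: every group stays at its minimum
print(plan(5.0))  # high load: every group is capped by its own maximum
```

Under these assumed settings the same load leaves group0 at 2 replicas and group1 at 3, because each group is clamped by its own range.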
Below is the expected output
Step 4: Check whether replica-specific commands were executed.
Attach to the desired replica, e.g.:
dstack attach --replica 2 replica-groups-test
ssh replica-groups-test-0-2 'cat /tmp/version.txt'
output: Group 1 - Version 0
Step 5: Check rolling deployment.
Important:
Rolling deployments are currently affected by a race condition that also impacts the non–replica group implementation and must be addressed separately (issue). However, when each replica group is configured with a single replica, this race condition does not affect rolling deployments.
Testing instructions:
Scale down each replica group to 1 replica.
Restart the load-testing script with RPS = 2.
After all groups have scaled down to a single replica, re-apply the configuration:
dstack apply -f replica-groups-service.yml
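The load-testing script referenced in the steps above (load_test_replica_groups.py) is not reproduced in this description. For readers who want to drive a fixed request rate themselves, a minimal stand-in might look like the sketch below. It is not the PR's script: `build_request`, `run_load`, and the placeholder `URL` and `TOKEN` values are hypothetical, and the `Authorization: Bearer` header is an assumption about how the service endpoint authenticates.

```python
import time
import urllib.request

URL = "https://example.invalid/"  # placeholder: substitute your service URL
TOKEN = "YOUR_TOKEN"              # placeholder: substitute your token


def build_request(url, token):
    """Build a GET request carrying the token as a bearer header (assumed scheme)."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def run_load(send, rps, duration_s, sleep=time.sleep):
    """Call send() roughly rps times per second for duration_s seconds.

    Requests are issued sequentially, so the achieved rate drops below rps
    when a single request takes longer than the pause between requests.
    """
    interval = 1.0 / rps
    total = int(rps * duration_s)
    sent = 0
    for _ in range(total):
        send()
        sent += 1
        sleep(interval)
    return sent


def send_once():
    """Send one request and discard the response body."""
    with urllib.request.urlopen(build_request(URL, TOKEN), timeout=10) as response:
        response.read()


if __name__ == "__main__":
    # RPS = 2, as in the rolling deployment test above.
    run_load(send_once, rps=2, duration_s=60)
```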