Prevent scaling of cluster if count / resources exceed account resource limits by Pearl1594 · Pull Request #12167 · apache/cloudstack

Pearl1594 · 2025-11-28T22:54:24Z

Description

This PR fixes: #12123

This PR prevents scaling of the cluster when the overall resources / VM count would exceed the account's resource limits, so that the cluster is not left in an incorrect state.

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)
Build/CI
Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Bug Severity

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

Before fix:

Set the account's resource limit for VM count to 3.
Deployed a CKS cluster with 1 worker node and then attempted to scale it to 3; the scale operation fails.

It scales by 1 node but increases the overall size of the cluster to 4 and puts the cluster in an incorrect state.

After fix:

Performed the same test as above. Here scaling is prevented altogether, because the check calculates up front whether the overall resources would exceed what is set for the account.
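The difference between the two behaviours can be sketched as follows. This is a minimal illustration, not the code in this PR: `ScaleSimulation`, `scaleIncrementally`, `scaleWithPreCheck` and `LimitExceededException` are hypothetical names, and the starting size of 2 VMs (one control node plus one worker) is an assumption chosen to match the numbers in the test above. The sketch models only the VM count, not how the recorded cluster size diverges.

```java
/** Hypothetical model of an account-level VM limit; not CloudStack code. */
final class ScaleSimulation {

    /** Signals that a request would push the account past its limit. */
    static final class LimitExceededException extends RuntimeException {
        LimitExceededException(String message) {
            super(message);
        }
    }

    /**
     * Checks the limit once per VM while deploying. Returns the VM count
     * reached when the limit stops the loop, so a request that cannot be
     * satisfied in full still leaves the cluster partially scaled.
     */
    static int scaleIncrementally(int currentVms, int additionalVms, int vmLimit) {
        int vms = currentVms;
        for (int i = 0; i < additionalVms; i++) {
            if (vms + 1 > vmLimit) {
                break; // limit hit mid-operation; earlier VMs already exist
            }
            vms++;
        }
        return vms;
    }

    /**
     * Checks the whole request before deploying anything. Either every
     * requested VM fits under the limit or the cluster is left untouched.
     */
    static int scaleWithPreCheck(int currentVms, int additionalVms, int vmLimit) {
        int total = currentVms + additionalVms;
        if (total > vmLimit) {
            throw new LimitExceededException(
                    "Requested total of " + total + " VMs exceeds limit of " + vmLimit);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed scenario: limit 3, 2 existing VMs, 2 more requested.
        System.out.println("incremental: " + scaleIncrementally(2, 2, 3) + " VMs");
        try {
            scaleWithPreCheck(2, 2, 3);
        } catch (LimitExceededException e) {
            System.out.println("pre-check: rejected, cluster unchanged");
        }
    }
}
```

With the assumed numbers, the incremental path stops at 3 VMs (one short of the request), while the pre-check path rejects the request and leaves the count at 2.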

codecov · 2025-11-28T22:58:22Z

Codecov Report

❌ Patch coverage is 0% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.56%. Comparing base (e23c7ef) to head (0688240).
⚠️ Report is 11 commits behind head on 4.22.

Files with missing lines	Patch %	Lines
...bernetes/cluster/KubernetesClusterManagerImpl.java	0.00%	33 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               4.22   #12167      +/-   ##
============================================
- Coverage     17.56%   17.56%   -0.01%     
- Complexity    15545    15546       +1     
============================================
  Files          5910     5910              
  Lines        529123   529164      +41     
  Branches      64627    64636       +9     
============================================
+ Hits          92937    92940       +3     
- Misses       425733   425771      +38     
  Partials      10453    10453

Flag	Coverage Δ
uitests	`3.58% <ø> (ø)`
unittests	`18.63% <0.00%> (-0.01%)`	⬇️

sonarqubecloud · 2025-11-28T23:54:14Z

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube Cloud

weizhouapache · 2025-12-01T11:30:36Z

@Pearl1594
thanks for the fix

IMHO, resource limits could be one of the reasons that cause CKS scaling to fail.
Do we need to consider other cases? Maybe change the scaling process?

DaanHoogland · 2025-12-01T12:08:24Z

...ernetes-service/src/main/java/com/cloud/kubernetes/cluster/KubernetesClusterManagerImpl.java

+        }
    }

+


Suggested change

DaanHoogland

clgtm

DaanHoogland · 2025-12-01T12:12:58Z

...ernetes-service/src/main/java/com/cloud/kubernetes/cluster/KubernetesClusterManagerImpl.java

+    @Inject
+    ResourceLimitService resourceLimitService;


the sonar warning here is new to me. If we need to deal with this that is going to be a huge effort as this pattern is all over the project.

Pearl1594 · 2025-12-01T13:01:49Z

> @Pearl1594 thanks for the fix
>
> IMHO, resource limits could be one of the reasons that cause CKS scaling to fail. Do we need to consider other cases? Maybe change the scaling process?

I don't quite understand what you mean, @weizhouapache. Which issues, and what change in process, are we talking about? Could you please clarify? Thanks.

weizhouapache · 2025-12-01T13:18:10Z

>> @Pearl1594 thanks for the fix
>> IMHO, resource limits could be one of the reasons that cause CKS scaling to fail. Do we need to consider other cases? Maybe change the scaling process?
>
> I don't quite understand what you mean, @weizhouapache. Which issues, and what change in process, are we talking about? Could you please clarify? Thanks.

I was thinking of other possible edge cases, for example, the 2nd VM not being deployed because of system/storage capacity.
Never mind, the current changes look good.

weizhouapache

code lgtm

not tested yet

shwstppr · 2025-12-04T07:04:24Z

@blueorangutan package

blueorangutan · 2025-12-04T07:06:04Z

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2025-12-04T08:19:34Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15915

weizhouapache · 2025-12-04T08:24:46Z

@blueorangutan test

blueorangutan · 2025-12-04T08:26:04Z

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

blueorangutan · 2025-12-05T01:25:15Z

[SF] Trillian test result (tid-14908)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 58094 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr12167-t14908-kvm-ol8.zip
Smoke tests completed. 148 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_03_deploy_and_scale_kubernetes_cluster	`Failure`	37.95	test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster	`Failure`	930.00	test_kubernetes_clusters.py

kiranchavala

LGTM

Create an account and configure a limit, for example Max user instances.

Before fix

Launch a CKS cluster and enable autoscaling
Increase the number of deployed pods
Autoscaling gets triggered
Cluster goes into the Alert state

After fix

An exception is logged and the cluster is not autoscaled


[root@ref-trl-10374-k-Mol8-kiran-chavala-mgmt1 ~]# cat  /var/log/cloudstack/management/management-server.log |grep -i "logid:196de984"
2025-12-12 12:31:23,819 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl$5] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Executing AsyncJob {"accountId":4,"cmd":"org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd","cmdInfo":"{\"apiKey\":\"v5PVPKtgKh8Iz70i-LqYaEFaCnR5IjxdP1MdVHIQOpVIiaxsg3p6uhbOLbIs8krSlfvGDWm0z9RM7NIMEJS5Rw\",\"size\":\"3\",\"signature\":\"or0aJMb9bLG2nctdPiHYVwoYuWQ\\u003d\",\"response\":\"json\",\"ctxUserId\":\"5\",\"httpmethod\":\"GET\",\"ctxStartEventId\":\"487\",\"id\":\"c544662d-a17a-4321-aaad-5e3041e889df\",\"ctxDetails\":\"{\\\"interface com.cloud.kubernetes.cluster.KubernetesCluster\\\":\\\"c544662d-a17a-4321-aaad-5e3041e889df\\\"}\",\"ctxAccountId\":\"4\",\"uuid\":\"c544662d-a17a-4321-aaad-5e3041e889df\",\"cmdEventType\":\"KUBERNETES.CLUSTER.SCALE\"}","cmdVersion":0,"completeMsid":null,"created":null,"id":127,"initMsid":32985801818546,"instanceId":null,"instanceType":"KubernetesCluster","lastPolled":null,"lastUpdated":null,"processStatus":0,"removed":null,"result":null,"resultCode":0,"status":"IN_PROGRESS","userId":5,"uuid":"196de984-75a2-4880-bb89-9be54d9a74b0"}
2025-12-12 12:31:23,827 DEBUG [c.c.u.AccountManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127, ctx-36d1db15]) (logid:196de984) Access to KubernetesCluster {"id":4,"name":"cks2","uuid":"c544662d-a17a-4321-aaad-5e3041e889df"} granted to Account [{"accountName":"ACSUser","id":4,"uuid":"32b34dc5-8e40-453a-b437-74f0022c403d"}] by DomainChecker on behalf of user ACSUser-kubeadmin
2025-12-12 12:31:23,837 DEBUG [c.c.r.ResourceLimitManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127, ctx-36d1db15]) (logid:196de984) Checking if  amount of resources of Type = 'user_vm', tag = 'null' for Account Name = ACSUser in Domain Id = 1 is exceeded: Account Resource Limit = 4, Current Account Resource Amount = 2, Current Account Resource Reservation = 0, Requested Resource Amount = 2.
2025-12-12 12:31:23,839 DEBUG [c.c.r.ResourceLimitManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127, ctx-36d1db15]) (logid:196de984) Checking if  amount of resources of Type = 'cpu', tag = 'null' for Account Name = ACSUser in Domain Id = 1 is exceeded: Account Resource Limit = 40, Current Account Resource Amount = 4, Current Account Resource Reservation = 0, Requested Resource Amount = 4000.
2025-12-12 12:31:23,839 ERROR [c.c.r.ResourceLimitManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127, ctx-36d1db15]) (logid:196de984) Maximum amount of resources of Type = 'cpu', tag = 'null' for Account Name = ACSUser in Domain Id = 1 is exceeded: Account Resource Limit = 40, Current Account Resource Amount = 4, Current Account Resource Reservation = 0, Requested Resource Amount = 4000. com.cloud.exception.ResourceAllocationException: Maximum amount of resources of Type = 'cpu', tag = 'null' for Account Name = ACSUser in Domain Id = 1 is exceeded: Account Resource Limit = 40, Current Account Resource Amount = 4, Current Account Resource Reservation = 0, Requested Resource Amount = 4000.
2025-12-12 12:31:23,840 DEBUG [c.c.u.d.T.Transaction] (API-Job-Executor-26:[ctx-ed5daccd, job-127, ctx-36d1db15]) (logid:196de984) Rolling back the transaction: Time = 2 Name =  API-Job-Executor-26; called by -TransactionLegacy.rollback:905-TransactionLegacy.removeUpTo:848-TransactionLegacy.close:672-Transaction.execute:36-ResourceLimitManagerImpl.checkResourceLimitWithTag:660-ResourceLimitManagerImpl.checkResourceLimit:642-KubernetesClusterManagerImpl.ensureResourceLimitsForScale:1401-KubernetesClusterManagerImpl.validateKubernetesClusterScaleParameters:1358-KubernetesClusterManagerImpl.scaleKubernetesCluster:2180-NativeMethodAccessorImpl.invoke0:-2-NativeMethodAccessorImpl.invoke:77-DelegatingMethodAccessorImpl.invoke:43
2025-12-12 12:31:23,845 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Complete async job-127, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":"530","errortext":"Resource limits prevent scaling the cluster: Maximum amount of resources of Type = 'cpu', tag = 'null' for Account Name = ACSUser in Domain Id = 1 is exceeded: Account Resource Limit = 40, Current Account Resource Amount = 4, Current Account Resource Reservation = 0, Requested Resource Amount = 4000."}
2025-12-12 12:31:23,846 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Publish async job-127 complete on message bus
2025-12-12 12:31:23,846 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Wake up jobs related to job-127
2025-12-12 12:31:23,846 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Update db status for job-127
2025-12-12 12:31:23,847 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Wake up jobs joined with job-127 and disjoin all subjobs created from job- 127
2025-12-12 12:31:23,851 DEBUG [c.c.a.ApiServer] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Retrieved cmdEventType from job info: KUBERNETES.CLUSTER.SCALE
2025-12-12 12:31:23,853 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl$5] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Done executing org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd for job-127
2025-12-12 12:31:23,853 INFO  [o.a.c.f.j.i.AsyncJobMonitor] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Remove job-127 from job monitoring
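
The two `ResourceLimitManagerImpl` checks in the log above can be reproduced with simple arithmetic. The sketch below assumes a request is rejected when current amount + reservation + requested amount is greater than the limit. That rule is inferred from the log lines (the `user_vm` check reaches exactly 4 of 4 and no error is logged for it, while the `cpu` check fails) and is not taken from the CloudStack source. `LimitCheck` and `exceeds` are hypothetical names.

```java
/** Hypothetical helper; the comparison rule is inferred from the log. */
final class LimitCheck {

    /** True when the request would take the account over its limit. */
    static boolean exceeds(long limit, long current, long reserved, long requested) {
        return current + reserved + requested > limit;
    }

    public static void main(String[] args) {
        // user_vm: limit 4, current 2, reservation 0, requested 2
        System.out.println("user_vm exceeded: " + exceeds(4, 2, 0, 2));
        // cpu: limit 40, current 4, reservation 0, requested 4000
        System.out.println("cpu exceeded: " + exceeds(40, 4, 0, 4000));
    }
}
```

Under this assumed rule the `user_vm` request fits and the `cpu` request does not, which matches the order of the DEBUG and ERROR lines in the log.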

weizhouapache · 2025-12-21T09:08:35Z

...ernetes-service/src/main/java/com/cloud/kubernetes/cluster/KubernetesClusterManagerImpl.java

+
+        try {
+            resourceLimitService.checkResourceLimit(accountDao.findById(accountId), Resource.ResourceType.user_vm, totalAdditionalVms);
+            resourceLimitService.checkResourceLimit(accountDao.findById(accountId), Resource.ResourceType.cpu, totalAdditionalCpuUnits);


@Pearl1594

so.getSpeed() should not be added
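
The effect of including the offering's clock speed in the requested `cpu` amount can be sketched as below. This is an illustration under assumptions, not the PR's code: `CpuRequest`, `coresTimesSpeed` and `coresOnly` are hypothetical names, and the offering of 2 cores at 1000 MHz is an assumed value chosen because it reproduces the `Requested Resource Amount = 4000` seen in the earlier test log for 2 additional VMs. That log also suggests the account's `cpu` amount is counted in cores (4 for 2 VMs).

```java
/** Hypothetical comparison of two ways to size a cpu limit request. */
final class CpuRequest {

    /** Cores multiplied by clock speed, which yields MHz-scale numbers. */
    static long coresTimesSpeed(int additionalNodes, int coresPerNode, int speedMhz) {
        return (long) additionalNodes * coresPerNode * speedMhz;
    }

    /** Cores only, the same unit as the account's cpu amount in the log. */
    static long coresOnly(int additionalNodes, int coresPerNode) {
        return (long) additionalNodes * coresPerNode;
    }

    public static void main(String[] args) {
        // Assumed offering: 2 cores at 1000 MHz; 2 additional nodes.
        System.out.println("cores x speed: " + coresTimesSpeed(2, 2, 1000));
        System.out.println("cores only:    " + coresOnly(2, 2));
    }
}
```

With these assumed values the first method requests 4000 against a limit of 40, while the second requests 4, which would fit alongside the 4 cores already in use.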

Prevent scaling of cluster if count / resources exceed account resour…

0688240

…ce limits

Pearl1594 requested review from DaanHoogland and weizhouapache November 28, 2025 22:54

boring-cyborg bot added the component:kubernetes label Nov 28, 2025

DaanHoogland reviewed Dec 1, 2025

DaanHoogland mentioned this pull request Dec 1, 2025

CKS: new created nodes are not added to CKS cluster if scaling fails #12123

Closed

DaanHoogland approved these changes Dec 1, 2025

DaanHoogland reviewed Dec 1, 2025

weizhouapache approved these changes Dec 1, 2025

DaanHoogland added the status:needs-testing label Dec 1, 2025

weizhouapache linked an issue Dec 4, 2025 that may be closed by this pull request

CKS: new created nodes are not added to CKS cluster if scaling fails #12123

Closed

kiranchavala approved these changes Dec 12, 2025

DaanHoogland removed the status:needs-testing label Dec 12, 2025

DaanHoogland merged commit 0a13fb2 into 4.22 Dec 12, 2025
63 of 66 checks passed

DaanHoogland deleted the fix-cks-scaling-resource-limit branch December 12, 2025 12:57

weizhouapache reviewed Dec 21, 2025

weizhouapache mentioned this pull request Jan 14, 2026

CKS: fix resource limitation check on cpu when scale cks cluster #12379

Merged

14 tasks

sandeeplocharla pushed a commit to NetApp/cloudstack that referenced this pull request Feb 6, 2026

Prevent scaling of cluster if count / resources exceed account resour…

27b137f

…ce limits (apache#12167)

6 participants
