Prevent scaling of cluster if count / resources exceed account resource limits#12167
DaanHoogland merged 1 commit into 4.22
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##               4.22   #12167      +/-   ##
============================================
- Coverage     17.56%   17.56%   -0.01%
- Complexity    15545    15546       +1
============================================
  Files          5910     5910
  Lines        529123   529164      +41
  Branches      64627    64636       +9
============================================
+ Hits          92937    92940       +3
- Misses       425733   425771      +38
  Partials      10453    10453
@Pearl1594 IMHO, resource limits could be one of the reasons that cause CKS scaling to fail.
    }
}

@Inject
ResourceLimitService resourceLimitService;
The Sonar warning here is new to me. If we need to deal with it, that is going to be a huge effort, as this pattern is all over the project.
I don't quite understand what you mean, @weizhouapache. What issues and what change in process are we talking about? Could you please give some clarity on that? Thanks.
I was thinking of other possible edge cases; for example, the second VM is not deployed because of insufficient system/storage capacity.
weizhouapache left a comment
code lgtm
not tested yet
@blueorangutan package
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15915
@blueorangutan test
@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-14908)
kiranchavala left a comment
LGTM
Create an account and configure a limit, for example Max user instances.
Before fix
- Launch a CKS cluster and enable autoscaling
- Increase the number of deployed pods
- Autoscaling gets triggered
- Cluster goes into Alert state
After fix
- An exception is logged and the cluster is not autoscaled
[root@ref-trl-10374-k-Mol8-kiran-chavala-mgmt1 ~]# cat /var/log/cloudstack/management/management-server.log |grep -i "logid:196de984"
2025-12-12 12:31:23,819 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl$5] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Executing AsyncJob {"accountId":4,"cmd":"org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd","cmdInfo":"{\"apiKey\":\"v5PVPKtgKh8Iz70i-LqYaEFaCnR5IjxdP1MdVHIQOpVIiaxsg3p6uhbOLbIs8krSlfvGDWm0z9RM7NIMEJS5Rw\",\"size\":\"3\",\"signature\":\"or0aJMb9bLG2nctdPiHYVwoYuWQ\\u003d\",\"response\":\"json\",\"ctxUserId\":\"5\",\"httpmethod\":\"GET\",\"ctxStartEventId\":\"487\",\"id\":\"c544662d-a17a-4321-aaad-5e3041e889df\",\"ctxDetails\":\"{\\\"interface com.cloud.kubernetes.cluster.KubernetesCluster\\\":\\\"c544662d-a17a-4321-aaad-5e3041e889df\\\"}\",\"ctxAccountId\":\"4\",\"uuid\":\"c544662d-a17a-4321-aaad-5e3041e889df\",\"cmdEventType\":\"KUBERNETES.CLUSTER.SCALE\"}","cmdVersion":0,"completeMsid":null,"created":null,"id":127,"initMsid":32985801818546,"instanceId":null,"instanceType":"KubernetesCluster","lastPolled":null,"lastUpdated":null,"processStatus":0,"removed":null,"result":null,"resultCode":0,"status":"IN_PROGRESS","userId":5,"uuid":"196de984-75a2-4880-bb89-9be54d9a74b0"}
2025-12-12 12:31:23,827 DEBUG [c.c.u.AccountManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127, ctx-36d1db15]) (logid:196de984) Access to KubernetesCluster {"id":4,"name":"cks2","uuid":"c544662d-a17a-4321-aaad-5e3041e889df"} granted to Account [{"accountName":"ACSUser","id":4,"uuid":"32b34dc5-8e40-453a-b437-74f0022c403d"}] by DomainChecker on behalf of user ACSUser-kubeadmin
2025-12-12 12:31:23,837 DEBUG [c.c.r.ResourceLimitManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127, ctx-36d1db15]) (logid:196de984) Checking if amount of resources of Type = 'user_vm', tag = 'null' for Account Name = ACSUser in Domain Id = 1 is exceeded: Account Resource Limit = 4, Current Account Resource Amount = 2, Current Account Resource Reservation = 0, Requested Resource Amount = 2.
2025-12-12 12:31:23,839 DEBUG [c.c.r.ResourceLimitManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127, ctx-36d1db15]) (logid:196de984) Checking if amount of resources of Type = 'cpu', tag = 'null' for Account Name = ACSUser in Domain Id = 1 is exceeded: Account Resource Limit = 40, Current Account Resource Amount = 4, Current Account Resource Reservation = 0, Requested Resource Amount = 4000.
2025-12-12 12:31:23,839 ERROR [c.c.r.ResourceLimitManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127, ctx-36d1db15]) (logid:196de984) Maximum amount of resources of Type = 'cpu', tag = 'null' for Account Name = ACSUser in Domain Id = 1 is exceeded: Account Resource Limit = 40, Current Account Resource Amount = 4, Current Account Resource Reservation = 0, Requested Resource Amount = 4000. com.cloud.exception.ResourceAllocationException: Maximum amount of resources of Type = 'cpu', tag = 'null' for Account Name = ACSUser in Domain Id = 1 is exceeded: Account Resource Limit = 40, Current Account Resource Amount = 4, Current Account Resource Reservation = 0, Requested Resource Amount = 4000.
2025-12-12 12:31:23,840 DEBUG [c.c.u.d.T.Transaction] (API-Job-Executor-26:[ctx-ed5daccd, job-127, ctx-36d1db15]) (logid:196de984) Rolling back the transaction: Time = 2 Name = API-Job-Executor-26; called by -TransactionLegacy.rollback:905-TransactionLegacy.removeUpTo:848-TransactionLegacy.close:672-Transaction.execute:36-ResourceLimitManagerImpl.checkResourceLimitWithTag:660-ResourceLimitManagerImpl.checkResourceLimit:642-KubernetesClusterManagerImpl.ensureResourceLimitsForScale:1401-KubernetesClusterManagerImpl.validateKubernetesClusterScaleParameters:1358-KubernetesClusterManagerImpl.scaleKubernetesCluster:2180-NativeMethodAccessorImpl.invoke0:-2-NativeMethodAccessorImpl.invoke:77-DelegatingMethodAccessorImpl.invoke:43
2025-12-12 12:31:23,845 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Complete async job-127, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":"530","errortext":"Resource limits prevent scaling the cluster: Maximum amount of resources of Type = 'cpu', tag = 'null' for Account Name = ACSUser in Domain Id = 1 is exceeded: Account Resource Limit = 40, Current Account Resource Amount = 4, Current Account Resource Reservation = 0, Requested Resource Amount = 4000."}
2025-12-12 12:31:23,846 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Publish async job-127 complete on message bus
2025-12-12 12:31:23,846 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Wake up jobs related to job-127
2025-12-12 12:31:23,846 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Update db status for job-127
2025-12-12 12:31:23,847 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Wake up jobs joined with job-127 and disjoin all subjobs created from job- 127
2025-12-12 12:31:23,851 DEBUG [c.c.a.ApiServer] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Retrieved cmdEventType from job info: KUBERNETES.CLUSTER.SCALE
2025-12-12 12:31:23,853 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl$5] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Done executing org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd for job-127
2025-12-12 12:31:23,853 INFO [o.a.c.f.j.i.AsyncJobMonitor] (API-Job-Executor-26:[ctx-ed5daccd, job-127]) (logid:196de984) Remove job-127 from job monitoring
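The two "Checking if amount of resources" lines in the log above can be read as a simple comparison. The following is a minimal sketch of that arithmetic, assuming the check is `current + reserved + requested > limit` and that a negative limit means "unlimited"; the class and method names are hypothetical and this is not the `ResourceLimitManagerImpl` code.

```java
// Illustrative sketch only; names are hypothetical, not CloudStack classes.
class LimitCheckSketch {

    /**
     * Returns true when granting `requested` more units would push the
     * account past `limit`. A negative limit is treated as "unlimited"
     * (assumption made for this sketch).
     */
    static boolean exceedsLimit(long limit, long current, long reserved, long requested) {
        if (limit < 0) {
            return false;
        }
        return current + reserved + requested > limit;
    }

    public static void main(String[] args) {
        // user_vm values from the log: limit 4, current 2, reservation 0, requested 2
        System.out.println("user_vm exceeded: " + exceedsLimit(4, 2, 0, 2));
        // cpu values from the log: limit 40, current 4, reservation 0, requested 4000
        System.out.println("cpu exceeded: " + exceedsLimit(40, 4, 0, 4000));
    }
}
```

With the logged values, the `user_vm` request fits exactly (2 + 0 + 2 = 4) while the `cpu` request of 4000 does not, which matches the ERROR line and the 530 result of job-127.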
try {
    resourceLimitService.checkResourceLimit(accountDao.findById(accountId), Resource.ResourceType.user_vm, totalAdditionalVms);
    resourceLimitService.checkResourceLimit(accountDao.findById(accountId), Resource.ResourceType.cpu, totalAdditionalCpuUnits);
- so.getSpeed() should not be added
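To illustrate this review note: the log earlier in the thread shows a `cpu` request of 4000 against a limit of 40, while the account's two existing VMs count as 4. That would be consistent with the request being computed as cores × speed × nodes rather than cores × nodes. The sketch below compares the two calculations; the `Offering` type and its values (2 cores, 1000 MHz) are assumptions chosen to reproduce the logged numbers, not values stated in the thread.

```java
// Illustrative sketch only; Offering is a hypothetical stand-in,
// not the CloudStack ServiceOffering class.
class CpuRequestSketch {

    static final class Offering {
        final int cpuCores;
        final int speedMhz;

        Offering(int cpuCores, int speedMhz) {
            this.cpuCores = cpuCores;
            this.speedMhz = speedMhz;
        }
    }

    /** CPU request counted in cores, the unit the account limit appears to use. */
    static long cpuCoresRequested(Offering so, long additionalNodes) {
        return (long) so.cpuCores * additionalNodes;
    }

    /** CPU request with the clock speed multiplied in, as the reviewer advises against. */
    static long cpuTimesSpeedRequested(Offering so, long additionalNodes) {
        return (long) so.cpuCores * so.speedMhz * additionalNodes;
    }

    public static void main(String[] args) {
        Offering so = new Offering(2, 1000); // assumed offering values
        System.out.println("cores only: " + cpuCoresRequested(so, 2));
        System.out.println("cores x speed: " + cpuTimesSpeedRequested(so, 2));
    }
}
```

Under these assumed values, the cores-only request for two extra nodes is 4, which would fit comfortably under a limit of 40, whereas the speed-multiplied request is 4000.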


Description
This PR fixes: #12123
This PR prevents scaling of the cluster when the overall resources / VM count would exceed the account's resource limits, so that the cluster does not end up in an incorrect state.
Types of changes
Feature/Enhancement Scale or Bug Severity
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
How did you try to break this feature and the system with this change?
Before fix:
Set the account's resource limit for VM count to 3.

Deployed a CKS cluster with 1 worker node and then attempted to scale it to 3. The operation fails:
it scales by 1 node but increases the overall size of the cluster to 4, which puts the cluster in an incorrect state.
After fix:
Performed the same test as above. Scaling is now prevented altogether, because the check pre-emptively calculates whether the overall resources would exceed what is set for the account.
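The difference between the two behaviours can be sketched as follows. This is a simplified model, not the CloudStack implementation: it assumes the cluster initially uses 2 VMs (one control node plus one worker; the control-node count is not stated in the description) and that, before the fix, the limit was only hit while nodes were being deployed one at a time. All names are hypothetical.

```java
// Illustrative sketch only; a simplified model with hypothetical names.
class ScalePrecheckSketch {

    /**
     * Before-fix model: nodes are deployed one by one and the limit is only
     * hit mid-scale. Returns how many of the requested nodes were deployed.
     */
    static int scaleNodeByNode(int vmLimit, int currentVms, int additionalNodes) {
        int deployed = 0;
        for (int i = 0; i < additionalNodes; i++) {
            if (currentVms + deployed + 1 > vmLimit) {
                break; // limit reached after a partial scale
            }
            deployed++;
        }
        return deployed;
    }

    /**
     * After-fix model: the whole request is checked up front and rejected
     * before any node is deployed.
     */
    static int scaleWithPrecheck(int vmLimit, int currentVms, int additionalNodes) {
        if (currentVms + additionalNodes > vmLimit) {
            throw new IllegalStateException("Resource limits prevent scaling the cluster");
        }
        return additionalNodes;
    }

    public static void main(String[] args) {
        // VM limit 3, 2 VMs in use (assumed), scaling from 1 to 3 workers adds 2 nodes.
        System.out.println("node-by-node deployed: " + scaleNodeByNode(3, 2, 2));
        try {
            scaleWithPrecheck(3, 2, 2);
        } catch (IllegalStateException e) {
            System.out.println("precheck rejected: " + e.getMessage());
        }
    }
}
```

In the node-by-node model only one of the two requested nodes is deployed, which mirrors the partial scale described under "Before fix"; the up-front check rejects the request without changing anything.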