Scale to breaking testing
Recent events have highlighted some performance issues with clusters at scale.
See "Review resource limits on DNS Operator" for details on that issue.
Through that work a new expected cluster size has been found.
This cluster size contains:
- 6000 secrets
- 6000 configmaps
- 2000 deployments
- 4000 services
- 2000 endpoints
- 2000 networking.istio.io
- 1000 networkpolicies
(numbers have been rounded up)
Along with thousands of operators, pods, and rolebindings.
For the kuadrant-operator there is a scale test; that testing is focused on the interaction of Policies.
This gives good insight into Kuadrant as a whole.
However, it gives no insight into the levels of performance that the dns-operator can manage.
In the scale test linked above, the number of records created would have been 128.
This issue wants to bring the focus to understanding the performance limits of the dns-operator.
It aims to answer some of the questions that today are not answered, and possibly are not even being asked.
Possible scenarios
Large number of dnsrecords with a shared secret
Given the resources listed above, it would not be unexpected if there were 2000 to 4000 dnsrecords created on the cluster.
In this scenario these records could share a providerRef secret, or use the default.
From a memory standpoint I expect the dns-operator to manage that number of resources without issue.
Processing that number of records is what I am unsure about.
How long would a reconcile loop take?
What is our expected reconcile loop time?
How would pod restarts handle the load?
What will the re-queue system do?
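To make this scenario concrete, here is a minimal sketch of a load generator that creates a few thousand DNSRecords that all reference one shared providerRef secret, using the dynamic client against the kuadrant.io/v1alpha1 DNSRecord API. The namespace, secret name, hostnames, and exact spec field layout are assumptions for illustration, not a confirmed schema.

```go
// loadgen_shared.go: create N DNSRecords that all share one providerRef secret.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)
	gvr := schema.GroupVersionResource{Group: "kuadrant.io", Version: "v1alpha1", Resource: "dnsrecords"}
	ctx := context.Background()
	ns := "scale-test" // placeholder namespace

	for i := 0; i < 4000; i++ {
		record := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "kuadrant.io/v1alpha1",
			"kind":       "DNSRecord",
			"metadata":   map[string]interface{}{"name": fmt.Sprintf("record-%d", i), "namespace": ns},
			"spec": map[string]interface{}{
				"rootHost": fmt.Sprintf("app-%d.example.com", i),
				// every record points at the same provider secret
				"providerRef": map[string]interface{}{"name": "shared-provider-credentials"},
				"endpoints": []interface{}{
					map[string]interface{}{
						"dnsName":    fmt.Sprintf("app-%d.example.com", i),
						"recordType": "A",
						"recordTTL":  int64(60),
						"targets":    []interface{}{"203.0.113.1"},
					},
				},
			},
		}}
		if _, err := client.Resource(gvr).Namespace(ns).Create(ctx, record, metav1.CreateOptions{}); err != nil {
			log.Printf("record %d: %v", i, err)
		}
	}
}
```

Timing how long the operator takes to mark all of these records ready, across a cold start and a pod restart, would give first answers to the questions above.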
Large number of dnsrecords without a shared secret
Like the first scenario, but this time each dnsrecord is given its own providerRef secret.
This doubles the number of resources that the operator is expected to handle.
Again I expect the dns-operator to handle 2000 to 4000 dnsrecords along with a corresponding 2000 to 4000 secrets.
What does the CPU do?
How long would a reconcile loop take?
What is our expected reconcile loop time?
How would pod restarts handle the load?
What will the re-queue system do?
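For this scenario the only structural change to the sketch above is that the load generator creates one provider secret per record and sets that record's providerRef.name to it. A minimal sketch of the per-record step, with placeholder data keys rather than a real provider credential format:

```go
package scaletest

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createProviderSecret creates the per-record secret for record i and returns its
// name; the caller would set spec.providerRef.name on DNSRecord i to this value
// instead of the shared secret used in the previous sketch.
func createProviderSecret(ctx context.Context, c kubernetes.Interface, ns string, i int) (string, error) {
	name := fmt.Sprintf("provider-credentials-%d", i)
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
		// placeholder keys; a real run would need valid credentials for the chosen provider
		StringData: map[string]string{"credentials": "placeholder"},
	}
	_, err := c.CoreV1().Secrets(ns).Create(ctx, secret, metav1.CreateOptions{})
	return name, err
}
```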
Large number of dnsrecords on a secondary cluster
Our multi-cluster support allows users to create a large number of records on a secondary cluster.
What happens when that large number of records is rolled into one authoritative record?
Etcd has a limit of 1.5 MiB for resource size.
How many dnsrecords would be required to hit that limit?
What would the operator do?
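One rough way to approach the etcd question is to marshal a representative endpoint entry and divide the 1.5 MiB limit by its size. The endpoint shape, hostname lengths, and overhead allowance below are assumptions, so treat the output as an order-of-magnitude guide only, not a measured limit.

```go
// etcd_estimate.go: back-of-the-envelope estimate of how many merged endpoints
// fit in one authoritative record before hitting the etcd request size limit.
package main

import (
	"encoding/json"
	"fmt"
)

// endpoint mirrors the external-dns style entries a DNSRecord carries; the exact
// field set in the merged authoritative record may differ.
type endpoint struct {
	DNSName    string   `json:"dnsName"`
	RecordType string   `json:"recordType"`
	RecordTTL  int      `json:"recordTTL"`
	Targets    []string `json:"targets"`
}

func main() {
	sample := endpoint{
		DNSName:    "klb.app-1234.example.com", // assumed hostname length
		RecordType: "A",
		RecordTTL:  60,
		Targets:    []string{"203.0.113.10", "203.0.113.11"},
	}
	b, _ := json.Marshal(sample)
	perEndpoint := len(b)

	const etcdLimit = 1536 * 1024 // 1.5 MiB default request size limit
	const overhead = 4 * 1024     // rough allowance for metadata, status, managedFields

	fmt.Printf("bytes per endpoint: %d\n", perEndpoint)
	fmt.Printf("endpoints before hitting the limit: ~%d\n", (etcdLimit-overhead)/perEndpoint)
}
```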
If there are 2000 to 4000 records on the secondary cluster:
What is the CPU usage on both the secondary and primary?
What is the memory profile like?
What kills the operator?
In all these cases I am assuming the operator can manage 2000 to 4000 dnsrecords.
We don't know what that limit is today.
In this scenario, we push the operator past its breaking point.
At what point does k8s restart the pods?
At what point can the CPU no longer get through the backlog?
What metrics would give an early warning of the system failing?
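One way to observe this scenario is to keep applying load while polling the operator pods for restarts, and at the same time scraping the standard controller-runtime metrics (workqueue depth, reconcile duration, reconcile errors) from the operator's metrics endpoint as candidate early-warning signals. Below is a minimal polling sketch; the namespace and label selector are assumptions about how the dns-operator is deployed.

```go
// breakpoint_watch.go: report dns-operator pod restarts while load is applied.
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	for {
		pods, err := client.CoreV1().Pods("dns-operator-system").List(ctx, metav1.ListOptions{
			LabelSelector: "control-plane=dns-operator-controller-manager", // assumed label
		})
		if err != nil {
			log.Printf("list pods: %v", err)
		} else {
			for _, p := range pods.Items {
				for _, cs := range p.Status.ContainerStatuses {
					if cs.RestartCount > 0 {
						log.Printf("pod %s container %s restarted %d times (last state: %+v)",
							p.Name, cs.Name, cs.RestartCount, cs.LastTerminationState)
					}
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```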