Use C* authentication with Reaper #222
Conversation
johntrimble left a comment
It's looking good John! I left some comments.
// Requeue with a delay to give a chance for C* schema changes to propagate
//
// TODO Should the delay be adjusted based on the C* cluster size?
return result.RequeueSoon(5)
Is there any way to ask Cassandra if the changes are done propagating? Like a query we can run or something?
There is. C* drivers provide an API to check for schema agreement. With the Java driver, for example, it is just a single method call, checkSchemaAgreement(), which returns a boolean. Internally, the driver executes the following queries:
SELECT peer, rpc_address, schema_version FROM system.peers
and
SELECT schema_version FROM system.local WHERE key='local'
It checks that schema_version is the same across all rows.
It would be easy to just add the checkSchemaAgreement call to the schema init job; however, I plan to get rid of the schema init job now that the management API supports creating keyspaces. Given that, I will implement a checkSchemaAgreement function in the operator.
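For illustration, a rough sketch of what such a checkSchemaAgreement function could look like, assuming direct CQL access through the gocql driver (github.com/gocql/gocql); the actual operator implementation may instead route this through the management API, and only the function name comes from the comment above:

func checkSchemaAgreement(session *gocql.Session) (bool, error) {
	// Collect schema_version from every peer.
	versions := make(map[string]struct{})
	var version string
	iter := session.Query("SELECT schema_version FROM system.peers").Iter()
	for iter.Scan(&version) {
		versions[version] = struct{}{}
	}
	if err := iter.Close(); err != nil {
		return false, err
	}
	// Include the local node's schema_version.
	if err := session.Query("SELECT schema_version FROM system.local WHERE key='local'").Scan(&version); err != nil {
		return false, err
	}
	versions[version] = struct{}{}
	// The schema has propagated once every node reports the same version.
	return len(versions) == 1, nil
}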
s := string(b)

return strings.Contains(s, "PasswordAuthenticator"), nil
Hmmm... so this config you have here isn't the full story. It will get handed to the config builder, which actually generates the cassandra.yaml. In the past, the config builder's defaults have sometimes varied from the actual C* defaults (as we do with the SeedProvider impl). Right now we use the "AllowAllAuthenticator" default in configbuilder, but that's the sort of thing I could see changing. Is there any reason we cannot just assume authentication is enabled? I think we always create a superuser with credentials irrespective of the authentication setting.
Also, if you look at json.RawMessage you'll see it's just a []byte:
type RawMessage []byte
So, if you are going to cheat and not actually parse the JSON, you can just do:
return strings.Contains(string(dc.Spec.Config), "PasswordAuthenticator")
Then you don't have to return an error from the function either. If the JSON is ill-formed, it will blow up elsewhere.
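To make that concrete, a minimal sketch of the simplified helper; the wrapper function name is made up here, while dc.Spec.Config and the string check come from the suggestion above:

func usesPasswordAuthenticator(dc *api.CassandraDatacenter) bool {
	// dc.Spec.Config is a json.RawMessage, i.e. just a []byte,
	// so it can be converted to a string without parsing the JSON.
	return strings.Contains(string(dc.Spec.Config), "PasswordAuthenticator")
}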
I agree with assuming authentication is enabled - but will all of this work if they toggle it on after setup?
//
//}
//
//return nil, nil
From debugging?
- name: r3
  config:
    cassandra-yaml:
      authenticator: org.apache.cassandra.auth.PasswordAuthenticator
If you tell reaper to use auth and give it the credentials for an actual user, but the PasswordAuthenticator is not actually enabled, will it still work?
| "os" | ||
| "strings" |
nit - stdlib imports should be the first group
Before Reaper can successfully deploy, Cassandra roles have to be created and the reaper keyspace needs to be initialized. When the statefulset is created, the reaper container is no longer deployed initially. When the schema init job completes, the new ReaperStatus.SchemaInitialized field is set to true. On the subsequent reconciliation loop, the statefulset(s) are updated to deploy the reaper container.
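Roughly, that gating might look like the sketch below; only the ReaperStatus.SchemaInitialized field is named in this PR, so the surrounding field path and helper are placeholders:

// Sketch only: addReaperContainer and the exact status path are placeholders.
if rc.Datacenter.Status.ReaperStatus.SchemaInitialized {
	// Schema init job has finished; deploy the reaper container on this pass.
	addReaperContainer(&statefulSet.Spec.Template.Spec)
}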
This commit adds a function for altering a keyspace, which I needed in order to configure the RF of system_auth. When authentication is enabled, system_auth should never have RF = 1. Because the amount of data is very small, increasing the RF even when authentication is disabled is not a big deal. In my testing of the reaper sidecar, I found that the RF of system_auth has to be changed when authentication is enabled; otherwise, there will be lots of flapping of containers, as only one C* node will have the credentials and it might not be the node that is up.
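As a usage sketch of the new alter-keyspace helper (the function and ReplicationFactor field appear later in this PR's diff; the call site, pod selection, and any other ReplicationSetting fields are assumptions):

rf := getSystemAuthRF(dc) // min(3, dc.Spec.Size) per this PR
settings := []httphelper.ReplicationSetting{{ReplicationFactor: rf}}
// Placeholder call site: error handling and pod selection will differ in the operator.
if err := nodeMgmtClient.CallAlterKeyspaceEndpoint(pod, "system_auth", settings); err != nil {
	return err
}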
Reaper coordinates schema changes with distributed locks implemented in a C* table. This is done in order to avoid concurrent schema updates, which are problematic in C*. The locks table uses IP addresses. When (not if) the reaper sidecar container's liveness and/or readiness probe fails and the container restarts, it will come back up with a different IP address. Reaper will then have to wait to reacquire the lock, which can lead to additional liveness and/or readiness probe failures. By applying schema updates with a single instance via a job, these problems go away. Lastly, based on this as well as on previous experience, I firmly believe that C* schema changes in a k8s deployment should always be done in a k8s job to avoid these types of problems.
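As a rough illustration of running schema changes from a single instance via a Job, such a Job could be constructed along these lines using client-go's batchv1/corev1/metav1 types; the name, image, and command are placeholders, not this PR's actual values:

job := &batchv1.Job{
	ObjectMeta: metav1.ObjectMeta{
		Name:      dc.Name + "-reaper-schema-init", // placeholder name
		Namespace: dc.Namespace,
	},
	Spec: batchv1.JobSpec{
		Template: corev1.PodTemplateSpec{
			Spec: corev1.PodSpec{
				RestartPolicy: corev1.RestartPolicyOnFailure,
				Containers: []corev1.Container{{
					Name:    "reaper-schema-init",
					Image:   "example/reaper-schema-migrator:latest", // placeholder image
					Command: []string{"schema-migrate"},              // placeholder command
				}},
			},
		},
	},
}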
So.... This has turned into quite a bit of work, which includes changes in Reaper. Please see thelastpickle/cassandra-reaper#956 for some additional context. There are two things I want to point out with the latest commits I pushed. First, I have temporarily added a second job for applying Reaper's schema. The original job runs a python script that creates the keyspace. The second job runs a modified version of Reaper that only applies schema updates. I plan to merge all of this into a single image which can then be run in one job. Secondly, I have added code for configuring the replication_factor of the system_auth keyspace. If authentication is enabled and system_auth is using RF = 1, there is a lot of flapping with the Reaper containers.
| "github.com/datastax/cass-operator/operator/pkg/serverconfig" | ||
| "github.com/datastax/cass-operator/operator/pkg/utils" | ||
| "os" |
nit: this import switch should be reverted
@jimdickinson by any chance are there any project style settings that I can import into GoLand? Or are there style guidelines available so that I can configure GoLand to be consistent?
operator/pkg/httphelper/client.go (outdated)
	ReplicationFactor int
}

func (client *NodeMgmtClient) CallAlterKeyspaceEndpoint(pod *corev1.Pod, keyspace string, replicationSettings[]ReplicationSetting) error {
Suggested change:
- func (client *NodeMgmtClient) CallAlterKeyspaceEndpoint(pod *corev1.Pod, keyspace string, replicationSettings[]ReplicationSetting) error {
+ func (client *NodeMgmtClient) CallAlterKeyspaceEndpoint(pod *corev1.Pod, keyspace string, replicationSettings []ReplicationSetting) error {
}

func getSystemAuthRF(dc *api.CassandraDatacenter) int {
	return int(math.Min(float64(3), float64(dc.Spec.Size)))
I wonder if we should use the number of racks, instead of the number of nodes, in this math.Min() call. In the isClusterHealthy() function, we seem to use the rack count as just such a proxy for the RF:
func (rc *ReconciliationContext) isClusterHealthy() bool {
	pods := FilterPodListByCassNodeState(rc.clusterPods, stateStarted)

	numRacks := len(rc.Datacenter.GetRacks())

	for _, pod := range pods {
		err := rc.NodeMgmtClient.CallProbeClusterEndpoint(pod, "LOCAL_QUORUM", numRacks)
		if err != nil {
			return false
		}
	}

	return true
}

So maybe return int(math.Min(float64(3), float64(len(dc.GetRacks())))) instead?
@johntrimble what about the scenario though in which we have just a single rack? We would end up with RF = 1 which is what we want to avoid.
What I'm observing in the code is that we use the number of racks to indicate the intended RF, as in the example above. Is using the number of racks not a suitable proxy for the intended RF? My expectation would be that the only case in which we would not have 3 racks would be the case in which a user is creating a cluster of only 1 node. Is there a reason a user would want to create a cluster of 3+ nodes with only a single rack?
It's possible the code example I gave above is what needs to change, but I'd like for us to do this in a consistent manner.
Is using the number of racks not a suitable proxy for the intended RF?
Having RF = num_racks is definitely a good practice.
My expectation would be that the only case in which we would not have 3 racks would be the case in which a user is creating a cluster of only 1 node. Is there a reason a user would want to create a cluster of 3+ nodes with only a single rack?
My understanding is that the operator allows you to specify both the number of nodes and the number of racks. I know that the operator will try to create balanced racks, but the key thing for me here is that there is nothing preventing me from creating a C* cluster with just one rack. If I am only running in a single zone, then a single rack is fine with respect to C* topology.
If we determine the RF of system_auth by the number of racks, then we can wind up with RF = 1 for system_auth even if there are multiple nodes. That would be bad, and not just for Reaper.
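A quick numeric sketch of that scenario, using a hypothetical 3-node, single-rack datacenter (values chosen for illustration only):

size, racks := 3, 1
rfFromSize := int(math.Min(3, float64(size)))   // 3: what getSystemAuthRF does in this PR
rfFromRacks := int(math.Min(3, float64(racks))) // 1: would leave system_auth under-replicated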
#307 supersedes this one.
This is for #211.