Add compatibility test framework #205

Draft · liam-lowe wants to merge 1 commit into main from liam-lowe/compatibility-tests


@liam-lowe liam-lowe commented Mar 20, 2026

Background

The S2S Proxy enables server-to-server communication between Temporal Servers - where each server could have differing infrastructure, security, and application configurations.

This PR outlines Compatibility Tests for the S2S proxy - with the intent to validate the compatibility of different Temporal Cluster specifications when fronted via the S2S Proxy. This type of test is crucial, as Temporal only guarantees compatibility between adjacent versions - but users of the proxy connect all sorts of server versions and configurations!

Primarily, this PR is concerned with testing compatibility of differing Temporal Server versions - but the framework is extensible to support differences across the entire cluster specification - such as database type, proxy setup, and application settings.

This PR is large - I'm sorry. I'll give a summary of my mental model, and I'll comment code in-line to make it easier to grok. Thanks!


Mental Model

When fronting a Temporal Server instance with the S2S Proxy, we end up with a high-level structure where the server and its backing database are grouped into a cluster, and that cluster is fronted by a proxy.

```mermaid
graph LR
    P[S2S Proxy] -->|fronts| TS[Temporal Server]
    TS --> DB[(Database)]
    subgraph Cluster
        TS
        DB
    end
```

When we connect multiple of these clusters together, we create a topology. A topology has n clusters paired with n proxies on a shared network. The proxies are connected to each other in a full-mesh structure, where along each edge one proxy acts as the mux "server" and the other as the mux "client". The proxy is bidirectional, so it doesn't matter which side is denoted server or client.
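
That full mesh means exactly one mux link per unordered pair of proxies - n(n-1)/2 links in total. A minimal sketch of the pairing, with the server/client roles assigned arbitrarily (all names here are illustrative, not from this PR):

```go
package main

import "fmt"

// edge models one mux link in the full mesh. Along each edge one proxy
// plays the mux "server" and the other the mux "client"; since the link
// is bidirectional, which side gets which role is arbitrary.
type edge struct {
	server, client string // proxy names
}

// fullMesh returns one edge per unordered pair of proxies, with the
// earlier proxy arbitrarily picked as the mux server.
func fullMesh(proxies []string) []edge {
	var edges []edge
	for i := 0; i < len(proxies); i++ {
		for j := i + 1; j < len(proxies); j++ {
			edges = append(edges, edge{server: proxies[i], client: proxies[j]})
		}
	}
	return edges
}

func main() {
	for _, e := range fullMesh([]string{"Proxy A", "Proxy B", "Proxy C"}) {
		fmt.Printf("%s <-> %s\n", e.server, e.client)
	}
}
```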

As an example, here's a 3-cluster topology:

```mermaid
graph TB
    subgraph A["Cluster A"]
        PA[Proxy A] --> SA[Temporal Server A]
        SA --> DBA[(Database A)]
    end
    subgraph B["Cluster B"]
        PB[Proxy B] --> SB[Temporal Server B]
        SB --> DBB[(Database B)]
    end
    subgraph C["Cluster C"]
        PC[Proxy C] --> SC[Temporal Server C]
        SC --> DBC[(Database C)]
    end
    PA <-->|mux| PB
    PB <-->|mux| PC
    PA <-->|mux| PC
```

This multi-cluster connectivity is possible without the S2S proxy, of course! But the proxy serves as a front for configuration differences between clusters - Temporal server version, security, infrastructure. I call these configuration differences a cluster's specification - the exact configuration that defines it.

Here is an example topology where each cluster has a different specification:

```mermaid
graph TB
    subgraph A["Cluster A - temporal:1.29.2 + postgres:15"]
        PA[Proxy A] --> SA[Temporal 1.29.2]
        SA --> DBA[(Postgres 15)]
    end
    subgraph B["Cluster B - temporal:1.27.4 + postgres:12"]
        PB[Proxy B] --> SB[Temporal 1.27.4]
        SB --> DBB[(Postgres 12)]
    end
    subgraph C["Cluster C - temporal:1.29.2 + postgres:15"]
        PC[Proxy C] --> SC[Temporal 1.29.2]
        SC --> DBC[(Postgres 15)]
    end
    PA <-->|mux| PB
    PB <-->|mux| PC
    PA <-->|mux| PC
```

As such, my semantics are as follows:

  • cluster: A Temporal Server instance paired with its backing database, representing a single logical deployment unit.
  • specification: The exact configuration that defines a cluster - server version, Docker image, database type/version, config templates, and schema setup scripts.
  • topology: The complete test environment - N Clusters each fronted by an S2S Proxy, all joined to a shared Docker network, with proxies interconnected in a full-mesh arrangement.

Code Structure

The code in this PR follows the mental model outlined above.

The Compatibility Test Suite defines three things:

  1. A series of specifications that can be composed into a topology.
  2. A series of topologies that can be tested.
  3. A series of tests to run against each topology.

As a result, this PR introduces the following structure:

```
tests/compatibility/
    specifications/   # Cluster configurations (image, DB, setup scripts)
    topology/         # Logic to instantiate a topology from specifications
    suite/            # Test suites to run against any topology
    matrix/           # Topology combinations and test orchestration
```

Specifications

Each specification is structured in a self-contained package:

```
tests/compatibility/
    specifications/
        temporal_1_27_4_postgres12/
        temporal_1_29_2_postgres15/
        ...
```

Each package exports a New() ClusterSpec function and bundles the server config template, schema setup script, and Docker image reference for that exact server+database combination. Adding a new specification is as simple as adding a new package here.

Topology

The topology package contains the logic to build a running topology from a list of ClusterSpec values. It:

  1. Creates a shared Docker network
  2. Starts Database containers for each cluster
  3. Starts Temporal Server containers
  4. Starts the S2S Proxy containers
  5. Registers each remote cluster via the proxies, yielding a multi-cluster Temporal topology

Understanding this PR

Sorry - it's large. Here's where I'd start:

tests/compatibility/matrix/matrix_test.go

Here is where we define the entry point for a topology.

For example, this is testing two 1.29.2 clusters:

```go
func TestTopology_1_29_2_same(t *testing.T) {
	t.Parallel()
	run(t, []specifications.ClusterSpec{
		temporal_1_29_2_postgres15.New(),
		temporal_1_29_2_postgres15.New(),
	})
}
```

tests/compatibility/specifications/temporal_1_29_2_postgres15/config.go

Here we define the cluster specification for this 1.29.2 cluster:

```go
func New() specifications.ClusterSpec {
	return specifications.ClusterSpec{
		Server: specifications.ServerSpec{
			Image:               "temporalio/server:1.29.2",
			AdminToolsImage:     "temporalio/admin-tools:1.29",
			ConfigTemplate:      serverConfigTemplate,
			SetupSchemaTemplate: setupSchemaTemplate,
		},
		Database: specifications.DatabaseSpec{
			Image: "postgres:15",
		},
	}
}
```

tests/compatibility/matrix/run.go

Here we define the exact tests being run on each topology:

```go
func run(t *testing.T, specs []specifications.ClusterSpec) {
	t.Helper()

	top := topology.NewTopology(t, newTopologyID(), specs)

	healthPassed := t.Run("Health", func(t *testing.T) {
		for _, cluster := range top.Clusters() {
			if !t.Run(cluster.Name(), func(t *testing.T) { suite.RunClusterHealthSuite(t, top, cluster) }) {
				return
			}
		}

		for _, proxy := range top.Proxies() {
			if !t.Run(proxy.Name(), func(t *testing.T) { suite.RunProxyHealthSuite(t, top, proxy) }) {
				return
			}
		}
	})

	// If the health checks fail, skip the remaining suites to avoid misleading
	// failures: a health or proxy failure is a likely indication of cluster
	// configuration issues.
	if !healthPassed {
		return
	}

	t.Run("Connectivity", func(t *testing.T) {
		for _, proxy := range top.Proxies() {
			t.Run(proxy.Name(), func(t *testing.T) {
				suite.RunConnectivitySuite(t, top, proxy)
			})
		}
	})

	t.Run("Replication", func(t *testing.T) {
		for _, cluster := range top.Clusters() {
			t.Run(cluster.Name(), func(t *testing.T) {
				suite.RunReplicationSuite(t, top, cluster)
			})
		}
	})
}
```
tests/compatibility/suite/suite_replication.go

Here is an example of a test that would be run against a topology.

```go
func (s *ReplicationSuite) TestWorkflowReplication() {
	ctx := context.Background()

	// Step 1: Register a global namespace with s.active as the initial active cluster.
	ns := s.Ops().RegisterNamespace(ctx, s.active, "compatibility-wf-ns")

	// Step 2: Validate that the active cluster for this namespace is the one we passed in.
	active := s.Ops().Active(ns)
	s.Require().NotNil(active)
	s.Require().Equal(s.active.Name(), active.Name())

	// Step 3: Wait for namespace replication to all passive clusters.
	passives := s.Ops().Passives(ns)
	for _, c := range passives {
		s.Ops().WaitForNamespace(c, ns)
	}

	// Step 4: Start a workflow on the active cluster.
	workflowID := fmt.Sprintf("compatibility-wf-%s-%d", s.active.Name(), time.Now().UnixNano())
	startCtx, cancel := context.WithTimeout(ctx, writeTimeout)
	defer cancel()
	_, err := s.active.FrontendClient().StartWorkflowExecution(startCtx,
		&workflowservicev1.StartWorkflowExecutionRequest{
			Namespace:    ns,
			WorkflowId:   workflowID,
			WorkflowType: &commonv1.WorkflowType{Name: "compatibility-wf"},
			TaskQueue:    &taskqueuev1.TaskQueue{Name: "compatibility-tq"},
			RequestId:    workflowID + "-start",
		})
	s.Require().NoError(err, "start workflow on %s", s.active.Name())

	// Step 5: Verify workflow history has replicated to all passive clusters.
	s.Ops().WaitForWorkflowVisible(ctx, passives, ns, workflowID)

	// Step 6: Terminate the workflow on the active cluster.
	s.Ops().TerminateWorkflow(ctx, s.active, ns, workflowID)

	// Step 7: Verify termination has replicated to all passive clusters.
	s.Ops().WaitForWorkflowTerminated(ctx, passives, ns, workflowID)
}
```
tests/compatibility/topology/topology.go

Here is where we actually create the topology.

```go
func NewTopology(t *testing.T, topologyID string, specs []specifications.ClusterSpec) Topology {

	// Step 1. Network: Create a docker network.

	// Step 2a. Clusters: Start all clusters in parallel.
	// Step 2b. Clusters: Wait for all clusters to start.

	// Step 3a. Proxies: Build proxy image.
	// Step 3b. Proxies: Build proxy specs (full-mesh, fully rendered).
	// Step 3c. Proxies: Start all proxies in parallel.
	// Step 3d. Proxies: Wait for all proxies to start.

	// Step 4. Connectivity: Register each remote cluster via proxies.

	// ...
}
```