[spanner-to-sourcedb] 1 billion row backlog accumulation load test #3769
aasthabharill wants to merge 6 commits into
Conversation
Codecov Report
❌ Your patch check has failed because the patch coverage (73.68%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files:
@@ Coverage Diff @@
## main #3769 +/- ##
============================================
+ Coverage 53.17% 53.60% +0.43%
- Complexity 6499 6696 +197
============================================
Files 1075 1086 +11
Lines 65263 66692 +1429
Branches 7239 7429 +190
============================================
+ Hits 34701 35748 +1047
- Misses 28234 28535 +301
- Partials 2328 2409 +81
darshan-sj left a comment:
Looks good overall. Left a few comments.
protected void createPhysicalInstance(String instanceName)
    throws IOException, InterruptedException {
-   String tier = sqlDialect == SQLDialect.MYSQL ? "db-n1-standard-2" : "db-custom-2-7680";
+   String tier = databaseType == DatabaseType.MYSQL ? "db-n1-standard-2" : "db-custom-2-7680";
Shouldn't the tier be parameter driven?
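For illustration, a minimal sketch of a parameter-driven tier, reusing the getProperty helper seen elsewhere in this test; the property name cloudSqlTier is an assumption, not something this PR defines:

```java
// Hypothetical: read the CloudSQL machine tier from a test property instead of hard-coding it.
// "cloudSqlTier" is an assumed property name; the defaults mirror the current hard-coded values.
String defaultTier =
    databaseType == DatabaseType.MYSQL ? "db-n1-standard-2" : "db-custom-2-7680";
String tier = getProperty("cloudSqlTier", defaultTier, TestProperties.Type.PROPERTY);
```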
@Category(TemplateLoadTest.class)
@TemplateLoadTest(SpannerToSourceDb.class)
@RunWith(JUnit4.class)
public class SpannerToSourceDbBacklogLT extends SpannerToSourceDbLTBase {
nit: SpannerToSourceDbBacklogLT -> SpannerToSourceDbLargeBacklogLT
createAndUploadShardConfigToGcs();

// Store original node counts for cleanup
originalSpannerNodeCount = getSpannerNodeCount(spannerResourceManager.getInstanceId());
Which spanner instance is used?

It randomly selects one of the shared static test instances - since we don't need any particular specifications for the Spanner instance except the node count, it should be fine to reuse.
// Reset Spanner instance to its original node count if it was modified
if (originalSpannerNodeCount != null && spannerResourceManager != null) {
  try {
    updateSpannerNodeCount(spannerResourceManager.getInstanceId(), originalSpannerNodeCount);
What if originalSpannerInstance and metadataSpannerInstance are the same?
    Long.parseLong(
        getProperty("expectedSpannerCount", "1000000000", TestProperties.Type.PROPERTY));
int importTimeoutMinutes =
    Integer.parseInt(getProperty("importTimeoutMinutes", "120", TestProperties.Type.PROPERTY));
nit: Please define all the default values used in getProperty as private static final variables above.
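A small sketch of the suggested cleanup; the constant names are illustrative, not taken from this PR:

```java
// Hypothetical constants hoisting the getProperty defaults, as suggested above.
private static final String DEFAULT_EXPECTED_SPANNER_COUNT = "1000000000";
private static final String DEFAULT_IMPORT_TIMEOUT_MINUTES = "120";
private static final String DEFAULT_REVERSE_TIMEOUT_MINUTES = "600";

// Usage at the call site:
long expectedSpannerCount =
    Long.parseLong(
        getProperty(
            "expectedSpannerCount", DEFAULT_EXPECTED_SPANNER_COUNT, TestProperties.Type.PROPERTY));
```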
updateSpannerNodeCount(spannerMetadataResourceManager.getInstanceId(), metadataScaleNodes);

int reverseTimeoutMinutes =
    Integer.parseInt(getProperty("reverseTimeoutMinutes", "600", TestProperties.Type.PROPERTY));
How many runs did we do? Is it consistently under 10 hours?

5 runs of the RR job - all were safely under 10 hours:
- Manual Run 1 - 8 hours
- Manual Run 2 - 8 hours
- VM Run 1 - 8 hours 10 minutes
- GitHub Workflow Run 1 - 8 hours 30 minutes
- GitHub Workflow Run 2 - 9 hours 12 minutes
while (polledCount < metricThreshold) {
  if (System.currentTimeMillis() - startTimeMillis > reverseTimeoutMinutes * 60 * 1000) {
    throw new RuntimeException(
        "Reverse replication load check timed out after "
            + reverseTimeoutMinutes
            + " minutes.");
  }

  Double successRecordsCount =
      pipelineLauncher.getMetric(
          project, region, reverseJobInfo.jobId(), "success_record_count");
  polledCount = successRecordsCount != null ? successRecordsCount.longValue() : 0;

  LOG.info("--- PIPELINE PROGRESS UPDATE ---");
  LOG.info(
      "Time Elapsed: {} minutes / {} minutes",
      (System.currentTimeMillis() - startTimeMillis) / 60000,
      reverseTimeoutMinutes);
  LOG.info(
      "Polled success_record_count: {}. Target threshold: {}", polledCount, metricThreshold);
  LOG.info("---------------------------------");

  if (polledCount >= metricThreshold) {
    break;
  }

  // Poll every 15 minutes. Since the test runs for 7-8 hours, 15-minute intervals
  // print exactly 4 logs per hour, preventing clutter and API call costs.
  Thread.sleep(900000);
}
We should implement this as a ConditionCheck.
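A rough sketch of how the polling above might be folded into a ConditionCheck so the framework drives the waiting instead of a hand-rolled loop; the base-class contract assumed here (getDescription()/check() returning CheckResult) should be verified against the actual testing framework before use:

```java
// Hypothetical ConditionCheck-style wrapper around the success_record_count polling.
// The method names and CheckResult constructor are assumptions about the framework's contract.
ConditionCheck backlogDrained =
    new ConditionCheck() {
      @Override
      protected String getDescription() {
        return "Reverse replication success_record_count reaches " + metricThreshold;
      }

      @Override
      protected CheckResult check() {
        Double successRecordsCount;
        try {
          successRecordsCount =
              pipelineLauncher.getMetric(
                  project, region, reverseJobInfo.jobId(), "success_record_count");
        } catch (Exception e) {
          throw new RuntimeException("Failed to fetch success_record_count metric", e);
        }
        long polled = successRecordsCount != null ? successRecordsCount.longValue() : 0;
        return new CheckResult(
            polled >= metricThreshold,
            String.format("Polled %d of %d expected records", polled, metricThreshold));
      }
    };
```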
LOG.info(
    "Database counts have not reached exact target parity yet. Retrying in 1 minute...");
Thread.sleep(60000); // Polling retry interval of 1 minute
Is a one minute interval between each poll enough? If the fetch itself takes ~12 minutes, maybe we should wait more here?

- The fetch used to take 12 minutes when executed sequentially; now that all 4 shards are queried in parallel, it takes at most 4 minutes.
- The queries are collected using .join(), which ensures the code does not progress until all query results are returned. This 1-minute sleep is just a delay between subsequent iterations of the while loop.
- In the successful runs I've done, this while loop actually succeeds on the first try (because we start querying after success_record_count has already reached 1 billion); the 30-minute tries are configured just as a fallback.
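For reference, a minimal sketch of the parallel per-shard count described above, assuming a CompletableFuture-based fan-out; the cloudSqlResourceManagers list and getRowCount helper are illustrative names, not necessarily what this PR uses:

```java
// Hypothetical sketch: run the per-shard COUNT(*) queries in parallel and block with join()
// until every shard has answered, so the slowest shard (~4 minutes) bounds the total fetch time.
List<CompletableFuture<Long>> countFutures =
    cloudSqlResourceManagers.stream()
        .map(shard -> CompletableFuture.supplyAsync(() -> getRowCount(shard)))
        .collect(Collectors.toList());

// join() blocks per future; execution cannot proceed past this line until all queries return.
long totalSourceCount = countFutures.stream().mapToLong(CompletableFuture::join).sum();
```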
public void updateSpannerNodeCount(String instanceId, int nodeCount) {
  SpannerOptions options = SpannerOptions.newBuilder().setProjectId(project).build();
  try (Spanner spanner = options.getService()) {
    InstanceAdminClient instanceAdminClient = spanner.getInstanceAdminClient();

    LOG.info("Updating Spanner instance {} node count to {}...", instanceId, nodeCount);
    InstanceInfo instanceInfo =
        InstanceInfo.newBuilder(InstanceId.of(project, instanceId))
            .setNodeCount(nodeCount)
            .build();
    instanceAdminClient.updateInstance(instanceInfo, InstanceField.NODE_COUNT).get();
    LOG.info("Successfully updated Spanner instance {} node count to {}.", instanceId, nodeCount);
  } catch (Exception e) {
    LOG.error("Failed to update Spanner instance node count.", e);
    throw new RuntimeException("Failed to update Spanner node count", e);
  }
}
We should add a couple of retries here.
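For illustration, a minimal sketch of a bounded retry around the updateInstance call; the attempt count and backoff are assumptions, not values from this PR:

```java
// Hypothetical retry wrapper around the node-count update: a few bounded attempts with a
// linear backoff, re-throwing only after the final attempt fails.
int maxAttempts = 3;
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
  try {
    instanceAdminClient.updateInstance(instanceInfo, InstanceField.NODE_COUNT).get();
    LOG.info("Successfully updated Spanner instance {} node count to {}.", instanceId, nodeCount);
    break;
  } catch (Exception e) {
    if (attempt == maxAttempts) {
      throw new RuntimeException("Failed to update Spanner node count after retries", e);
    }
    LOG.warn("updateInstance attempt {}/{} failed, retrying...", attempt, maxAttempts, e);
    try {
      Thread.sleep(30_000L * attempt); // illustrative backoff between attempts
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      throw new RuntimeException("Interrupted while waiting to retry", ie);
    }
  }
}
```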
b/496066723
This load test covers the use case where a customer pauses and resumes a reverse replication pipeline. In that case, there might be a massive backlog accumulated from a certain start timestamp that needs to be processed by the pipeline. The test simulates this situation by importing 1 billion 1 KB rows to Spanner without an active reverse replication pipeline. The pipeline is then started with the appropriate start timestamp. The expected result is that all the rows are successfully replicated to the CloudSQL shards.
Run time of test: around 9-10 hours
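For orientation, the flow described above roughly breaks down into three phases; the outline below is a hypothetical sketch and the method names are illustrative, not the actual test code:

```java
// Hypothetical outline of the backlog load test phases; all method names are illustrative.
@Test
public void testBacklogReplication() throws Exception {
  // Phase 1: record the start timestamp, then import ~1 billion 1 KB rows into Spanner while
  // no reverse replication pipeline is running, so a backlog accumulates after that timestamp.
  com.google.cloud.Timestamp startTimestamp = com.google.cloud.Timestamp.now();
  importBacklogIntoSpanner(); // e.g. an import job reading pre-generated data from GCS

  // Phase 2: launch the reverse replication pipeline with that start timestamp, so it must
  // drain the entire accumulated backlog.
  launchReverseReplicationPipeline(startTimestamp);

  // Phase 3: poll the pipeline's success_record_count metric until it reaches 1 billion, then
  // assert that the CloudSQL shard row counts add up to the Spanner row count.
  waitForBacklogToDrain();
  assertShardCountsMatchSpannerCount();
}
```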
Successful run in GitHub
Test Specifications
Resources created and used for this test in project cloud-teleport-testing (more details in the last section).
FAQ
Why was a sharded setup used?
This test is very heavy on the CloudSQL instances due to the massive amount of writes. If we were to use a single CloudSQL instance, the test would be extremely slow and performance would bottleneck on the source. More shards mean better performance and less time taken by the test. (Some manual runs were done on a non-sharded setup and they were taking 16 hours without even completing the entire workload.)
Why were static CloudSQL instances used?
The CloudSQL instances are resource intensive and are configured with 96 vCPUs, 768 GB memory, and 2.93 TB SSD storage. Dynamically spinning up CloudSQL instances and then configuring them with such large resources has proven to be flaky even in manual runs. Another potential risk is the quota being silently downgraded due to low usage.
Why import the data to GCS manually and not inside the test itself?
It seems like an unnecessary step that would bloat the test further. The idea of the test is to exercise reverse replication, not BigQuery and exports. We would also have to run a script to generate the manifest file, adding even more bloat.
Manual Test Runs
Run documentation + steps: https://docs.google.com/document/d/1U1NKCQl08FSfBbdUnFcccblxgHWzE-ptwZAfQY40IJI/edit?tab=t.0
Run results: go/reverse-backlog-manual-tests
Successful runs from which configuration for the test was derived:
Configuration of resources created
(Documenting steps in case of accidental deletion, etc.)
- default subnet
- cloudsql_iam_authentication: on
- max_connections 10000: prevents connections from being a bottleneck during the RR job; actual connections used are only ~4000
- innodb_parallel_read_threads 96: speeds up the count(*) query during the final test assertion
- innodb_flush_log_at_trx_commit 2: dramatically increases INSERT throughput by writing transaction logs to the OS cache and flushing to disk once per second, rather than on every commit
- innodb_io_capacity 40000: guides InnoDB to perform background page flushing more aggressively, utilizing the provisioned disk IOPS to keep the buffer pool clean and sustain write rates
- innodb_io_capacity_max 80000: allows InnoDB to burst beyond the baseline IOPS for flushing operations when necessary, helping to quickly clear dirty pages from the buffer pool during peak load