[spanner-to-sourcedb] 1 billion row backlog accumulation load test #3769

Open

aasthabharill wants to merge 6 commits into main from aastha-reverse-backlog-lt

Conversation

@aasthabharill
Member

@aasthabharill aasthabharill commented May 7, 2026

b/496066723

This load test covers the use case where a customer pauses and resumes a reverse replication pipeline. In that case, a massive backlog can accumulate from a given start timestamp and must be processed once the pipeline resumes. The test simulates this by importing 1 billion 1 KB rows into Spanner without an active reverse replication pipeline, then starting the pipeline with the appropriate start timestamp. The expected result is that all rows are successfully replicated to the CloudSQL shards.

Run time of test: around 9-10 hours

Successful Run in GitHub

Test Specifications

  • Row Count: 1 billion rows
  • Row size: 1 KB
  • Time of test completion: around 9 hours
  • Time split: 45 minutes of data population in Spanner + 8 hours of the Reverse Replication job
  • Weekly recurring load test
  • Test steps:
    • Done manually on the project: export data from BigQuery to GCS in Avro format
    • Data Population: Dataflow job to import the Avro files to Spanner
    • Start the Reverse Replication job
    • Assert on row count for all shards

Resources created and used for this test in project cloud-teleport-testing

(More details in the last section)

FAQ

Why was a sharded setup used?

This test is very write-heavy on the CloudSQL instances. With a single CloudSQL instance, the test would be extremely slow and the performance would bottleneck on the source. More shards mean better performance and a shorter test run. (Some manual runs were done on a non-sharded setup and took 16 hours without even completing the entire workload.)

Why were static CloudSQL instances used?

The CloudSQL instances are resource intensive, configured with 96 vCPUs, 768 GB of memory, and 2.93 TB of SSD storage. Dynamically spinning up CloudSQL instances and then configuring them with such large resources has proven flaky even in manual runs. Another potential risk is the quota being silently downgraded due to low usage.

Why export the data to GCS manually and not inside the test itself?

It would be an unnecessary step that bloats the test further. The goal of the test is to exercise reverse replication, not BigQuery and exports. We would also have to run a script to generate the manifest file, adding even more bloat.

Manual Test Runs

Run documentation + steps: https://docs.google.com/document/d/1U1NKCQl08FSfBbdUnFcccblxgHWzE-ptwZAfQY40IJI/edit?tab=t.0

Run results: go/reverse-backlog-manual-tests

Successful runs from which the test configuration was derived:

| Job | Duration | Spanner Nodes | Spanner CPU | DF Worker type | numWorkers | Stable DF workers | Stable Throughput | CPU Util |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Import Job 2 | 41 mins | 25 | 50% | n2-highmem-8 | 80 | 90 | 750k/s | <15% |

| Job | Duration | Spanner Nodes | Spanner CPU | Metadata nodes | Metadata CPU | SQL vCPU | SQL MaxConnections (per shard) | maxShardConnections | SQL connections used | SQL CPU util | DF Worker type | numWorkers | maxWorkers | Stable DF workers | Worker CPU | Stable Throughput |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sharded RR Job 5 | 7.5 hours | 5 | <10% | 20 | 40% | 96 | 100000 | 2000 | 4000 | 60% | n2-highmem-8 | 200 | 200 | 200 | 10% | 50k |

Configuration of resources created

(Documenting steps in case of accidental deletion, etc.)

  • The Avro data in the GCS bucket was generated using the queries present in the Load Test Run Doc after changing the parameters. The steps to generate the manifest file are also documented.
  • This is the configuration of the CloudSQL instances that were created:
    • vCPUs: 96
    • Memory: 768 GB
    • SSD storage: 2.93 TB
    • Enterprise Plus edition
    • Private IP with default subnet
    • Database flags and parameters (a quick verification sketch follows this list)
      - cloudsql_iam_authentication on
      - max_connections 10000 : prevents connections from being a bottleneck during RR job - actual connections used are only 4000
      - innodb_parallel_read_threads 96 : to speed up count(*) query during final test assertion
      - innodb_flush_log_at_trx_commit 2 : Dramatically increases INSERT throughput by writing transaction logs to the OS cache and flushing to disk once per second, rather than on every commit.
      - innodb_io_capacity 40000 : Guides InnoDB to perform background page flushing more aggressively, utilizing the provisioned disk IOPS to keep the buffer pool clean and sustain write rates.
      - innodb_io_capacity_max 80000 : Allows InnoDB to burst beyond the baseline IOPS for flushing operations when necessary, helping to quickly clear dirty pages from the buffer pool during peak load.
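
As a quick sanity check after an instance is (re)created, the effective flag values can be read back over a plain JDBC connection. This is only a sketch; the JDBC URL and credentials are placeholders, not the test's actual configuration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShardFlagCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details -- substitute a shard's private IP and real credentials.
    try (Connection conn =
            DriverManager.getConnection("jdbc:mysql://<shard-ip>:3306/", "<user>", "<password>");
        Statement stmt = conn.createStatement();
        // Returns (variable_name, value) pairs for the InnoDB flags listed above;
        // max_connections can be checked the same way.
        ResultSet rs = stmt.executeQuery("SHOW GLOBAL VARIABLES LIKE 'innodb%'")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + " = " + rs.getString(2));
      }
    }
  }
}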

@codecov

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.60%. Comparing base (1826073) to head (d04aff3).
⚠️ Report is 16 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...eam/it/gcp/cloudsql/CloudSqlShardOrchestrator.java | 73.68% | 3 Missing and 2 partials ⚠️ |

❌ Your patch check has failed because the patch coverage (73.68%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3769      +/-   ##
============================================
+ Coverage     53.17%   53.60%   +0.43%     
- Complexity     6499     6696     +197     
============================================
  Files          1075     1086      +11     
  Lines         65263    66692    +1429     
  Branches       7239     7429     +190     
============================================
+ Hits          34701    35748    +1047     
- Misses        28234    28535     +301     
- Partials       2328     2409      +81     
| Components | Coverage Δ |
| --- | --- |
| spanner-templates | 72.80% <ø> (-0.02%) ⬇️ |
| spanner-import-export | 68.54% <ø> (-0.10%) ⬇️ |
| spanner-live-forward-migration | 80.94% <ø> (ø) |
| spanner-live-reverse-replication | 77.09% <ø> (+0.02%) ⬆️ |
| spanner-bulk-migration | 91.11% <ø> (ø) |
| gcs-spanner-dv | 85.76% <ø> (ø) |

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...eam/it/gcp/cloudsql/CloudSqlShardOrchestrator.java | 86.34% <73.68%> (ø) |

... and 31 files with indirect coverage changes


@aasthabharill aasthabharill added the addition (New feature or request) label May 7, 2026
@aasthabharill aasthabharill changed the title setup test [spanner-to-sourcedb] 1 billion row backlog accumulation load test May 8, 2026
@aasthabharill aasthabharill force-pushed the aastha-reverse-backlog-lt branch from 3318af3 to 5bbe6f0 on May 8, 2026 07:39
@aasthabharill aasthabharill force-pushed the aastha-reverse-backlog-lt branch 3 times, most recently from 7edb6da to ebba822 on May 9, 2026 07:19
@aasthabharill aasthabharill force-pushed the aastha-reverse-backlog-lt branch from ebba822 to 731188d on May 10, 2026 04:49
@aasthabharill aasthabharill marked this pull request as ready for review May 10, 2026 04:50
@aasthabharill aasthabharill requested a review from a team as a code owner May 10, 2026 04:50
@gemini-code-assist

Warning

Gemini encountered an error creating the summary. You can try again by commenting /gemini summary.

Contributor

@darshan-sj darshan-sj left a comment


Looks good overall. Left a few comments.

protected void createPhysicalInstance(String instanceName)
    throws IOException, InterruptedException {
-  String tier = sqlDialect == SQLDialect.MYSQL ? "db-n1-standard-2" : "db-custom-2-7680";
+  String tier = databaseType == DatabaseType.MYSQL ? "db-n1-standard-2" : "db-custom-2-7680";
Contributor


Shouldn't the tier be parameter driven?
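
For illustration, one way to make it parameter driven; the cloudSqlTier property name, and whether a getProperty helper is reachable from this class, are assumptions:

// Hypothetical: read the tier from a test property, falling back to the current defaults.
String defaultTier =
    databaseType == DatabaseType.MYSQL ? "db-n1-standard-2" : "db-custom-2-7680";
String tier = getProperty("cloudSqlTier", defaultTier, TestProperties.Type.PROPERTY);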

@Category(TemplateLoadTest.class)
@TemplateLoadTest(SpannerToSourceDb.class)
@RunWith(JUnit4.class)
public class SpannerToSourceDbBacklogLT extends SpannerToSourceDbLTBase {
Contributor


nit: SpannerToSourceDbBacklogLT -> SpannerToSourceDbLargeBacklogLT

createAndUploadShardConfigToGcs();

// Store original node counts for cleanup
originalSpannerNodeCount = getSpannerNodeCount(spannerResourceManager.getInstanceId());
Contributor


Which Spanner instance is used?

Member Author


It randomly selects one of the shared static test instances. Since we don't need any particular specifications in the Spanner instance apart from the node count, this should be fine to reuse.

// Reset Spanner instance to its original node count if it was modified
if (originalSpannerNodeCount != null && spannerResourceManager != null) {
  try {
    updateSpannerNodeCount(spannerResourceManager.getInstanceId(), originalSpannerNodeCount);
Contributor


What if originalSpannerInstance and metadataSpannerInstance are same?

Long.parseLong(
getProperty("expectedSpannerCount", "1000000000", TestProperties.Type.PROPERTY));
int importTimeoutMinutes =
Integer.parseInt(getProperty("importTimeoutMinutes", "120", TestProperties.Type.PROPERTY));
Contributor


nit: Please define all the default values used in getProperty as private static final variables above.
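
For illustration, a minimal sketch of that shape (the constant names are illustrative):

// Illustrative constant names; values mirror the defaults passed to getProperty above.
private static final String DEFAULT_EXPECTED_SPANNER_COUNT = "1000000000";
private static final String DEFAULT_IMPORT_TIMEOUT_MINUTES = "120";

long expectedSpannerCount =
    Long.parseLong(
        getProperty(
            "expectedSpannerCount", DEFAULT_EXPECTED_SPANNER_COUNT, TestProperties.Type.PROPERTY));
int importTimeoutMinutes =
    Integer.parseInt(
        getProperty(
            "importTimeoutMinutes", DEFAULT_IMPORT_TIMEOUT_MINUTES, TestProperties.Type.PROPERTY));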

updateSpannerNodeCount(spannerMetadataResourceManager.getInstanceId(), metadataScaleNodes);

int reverseTimeoutMinutes =
Integer.parseInt(getProperty("reverseTimeoutMinutes", "600", TestProperties.Type.PROPERTY));
Contributor


How many runs did we do? Is it consistently under 10 hours?

Member Author

@aasthabharill aasthabharill May 13, 2026


5 runs of the RR job - all were safely under 10 hours:

  1. Manual Run 1 - 8 hours
  2. Manual Run 2 - 8 hours
  3. VM Run 1 - 8 hours 10 mins
  4. GitHub Workflow Run 1 - 8 hours 30 mins
  5. GitHub Workflow Run 2 - 9 hours 12 mins

Comment on lines +270 to +299
while (polledCount < metricThreshold) {
  if (System.currentTimeMillis() - startTimeMillis > reverseTimeoutMinutes * 60 * 1000) {
    throw new RuntimeException(
        "Reverse replication load check timed out after " + reverseTimeoutMinutes + " minutes.");
  }

  Double successRecordsCount =
      pipelineLauncher.getMetric(project, region, reverseJobInfo.jobId(), "success_record_count");
  polledCount = successRecordsCount != null ? successRecordsCount.longValue() : 0;

  LOG.info("--- PIPELINE PROGRESS UPDATE ---");
  LOG.info(
      "Time Elapsed: {} minutes / {} minutes",
      (System.currentTimeMillis() - startTimeMillis) / 60000,
      reverseTimeoutMinutes);
  LOG.info("Polled success_record_count: {}. Target threshold: {}", polledCount, metricThreshold);
  LOG.info("---------------------------------");

  if (polledCount >= metricThreshold) {
    break;
  }

  // Poll every 15 minutes. Since the test runs for 7-8 hours, 15-minute intervals
  // print exactly 4 logs per hour, preventing clutter and API call costs.
  Thread.sleep(900000);
}
Contributor


We should implement this as a ConditionCheck
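
For illustration, a rough sketch of how the loop could fold into one, assuming the it-framework's ConditionCheck contract of getDescription() plus check() returning a CheckResult (the exact API may differ):

ConditionCheck backlogDrained =
    new ConditionCheck() {
      @Override
      protected String getDescription() {
        return "Check if success_record_count has reached " + metricThreshold;
      }

      @Override
      protected CheckResult check() {
        try {
          Double count =
              pipelineLauncher.getMetric(
                  project, region, reverseJobInfo.jobId(), "success_record_count");
          long polled = count != null ? count.longValue() : 0;
          return new CheckResult(
              polled >= metricThreshold, "Polled " + polled + " of " + metricThreshold);
        } catch (IOException e) {
          // Treat a failed metric fetch as "not satisfied yet" and let the operator retry.
          return new CheckResult(false, "Metric fetch failed: " + e.getMessage());
        }
      }
    };

The operator's waitForCondition would then own the timeout and polling interval instead of the hand-rolled loop.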


LOG.info(
    "Database counts have not reached exact target parity yet. Retrying in 1 minute...");
Thread.sleep(60000); // Polling retry interval of 1 minute
Contributor


Is a one-minute interval between polls enough if the fetch itself takes ~12 minutes? Maybe we should wait longer here?

Member Author


  • The fetch used to take 12 minutes when executed sequentially; now that all 4 shards are queried in parallel, it takes at most 4 minutes (see the sketch after this list)
  • The queries are collected using .join(), which ensures the code does not progress until all query results are returned. The 1-minute sleep is just a delay between subsequent iterations of the while loop
  • In the successful runs I've done, this while loop actually succeeds on the first try (because we only start querying after success_record_count has already reached 1 billion); the 30 minutes of retries are configured just as a fallback
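
For reference, a sketch of the fan-out described above; shards (the list of per-shard resource managers) and the countRows helper are stand-ins for whatever the test actually uses:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Fire one count query per shard, then block on join() until every shard has answered.
List<CompletableFuture<Long>> shardCounts =
    shards.stream()
        .map(shard -> CompletableFuture.supplyAsync(() -> countRows(shard)))
        .collect(Collectors.toList());
// Wall time is bounded by the slowest shard (~4 minutes), not the sum (~12 minutes).
long totalRows = shardCounts.stream().mapToLong(CompletableFuture::join).sum();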

Comment on lines +503 to +519
public void updateSpannerNodeCount(String instanceId, int nodeCount) {
  SpannerOptions options = SpannerOptions.newBuilder().setProjectId(project).build();
  try (Spanner spanner = options.getService()) {
    InstanceAdminClient instanceAdminClient = spanner.getInstanceAdminClient();

    LOG.info("Updating Spanner instance {} node count to {}...", instanceId, nodeCount);
    InstanceInfo instanceInfo =
        InstanceInfo.newBuilder(InstanceId.of(project, instanceId))
            .setNodeCount(nodeCount)
            .build();
    instanceAdminClient.updateInstance(instanceInfo, InstanceField.NODE_COUNT).get();
    LOG.info("Successfully updated Spanner instance {} node count to {}.", instanceId, nodeCount);
  } catch (Exception e) {
    LOG.error("Failed to update Spanner instance node count.", e);
    throw new RuntimeException("Failed to update Spanner node count", e);
  }
}
Contributor


We should add a couple of retries here.
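
For illustration, a bounded-retry wrapper along those lines (the retry count and backoff are arbitrary choices):

private static final int MAX_NODE_UPDATE_ATTEMPTS = 3;

public void updateSpannerNodeCountWithRetries(String instanceId, int nodeCount)
    throws InterruptedException {
  for (int attempt = 1; attempt <= MAX_NODE_UPDATE_ATTEMPTS; attempt++) {
    try {
      updateSpannerNodeCount(instanceId, nodeCount);
      return;
    } catch (RuntimeException e) {
      if (attempt == MAX_NODE_UPDATE_ATTEMPTS) {
        throw e; // Retries exhausted; surface the original failure.
      }
      LOG.warn("Node count update attempt {} failed; retrying...", attempt, e);
      Thread.sleep(10_000L * attempt); // Simple linear backoff between attempts.
    }
  }
}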

@aasthabharill aasthabharill force-pushed the aastha-reverse-backlog-lt branch from 4ad65b7 to d04aff3 on May 13, 2026 06:18

Labels

addition New feature or request size/XXL
