[spanner-to-sourcedb] 1 billion row backlog accumulation load test #3769
aasthabharill wants to merge 6 commits into
Conversation
Codecov Report
❌ Your patch check has failed because the patch coverage (73.68%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files:
@@ Coverage Diff @@
## main #3769 +/- ##
============================================
+ Coverage 53.17% 53.60% +0.43%
- Complexity 6499 6696 +197
============================================
Files 1075 1086 +11
Lines 65263 66692 +1429
Branches 7239 7429 +190
============================================
+ Hits 34701 35748 +1047
- Misses 28234 28535 +301
- Partials 2328 2409 +81
darshan-sj left a comment:
Looks good overall. Left a few comments.
protected void createPhysicalInstance(String instanceName)
    throws IOException, InterruptedException {
-   String tier = sqlDialect == SQLDialect.MYSQL ? "db-n1-standard-2" : "db-custom-2-7680";
+   String tier = databaseType == DatabaseType.MYSQL ? "db-n1-standard-2" : "db-custom-2-7680";
Shouldn't the tier be parameter driven?
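For illustration, a minimal sketch of a parameter-driven tier, reusing the getProperty helper seen elsewhere in this test; the property name cloudSqlTier is an assumption, not something this PR defines:

```java
// Hypothetical: read the CloudSQL machine tier from a test property instead of hard-coding it.
// "cloudSqlTier" is an assumed property name; the defaults mirror the current hard-coded values.
String defaultTier =
    databaseType == DatabaseType.MYSQL ? "db-n1-standard-2" : "db-custom-2-7680";
String tier = getProperty("cloudSqlTier", defaultTier, TestProperties.Type.PROPERTY);
```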
@Category(TemplateLoadTest.class)
@TemplateLoadTest(SpannerToSourceDb.class)
@RunWith(JUnit4.class)
public class SpannerToSourceDbBacklogLT extends SpannerToSourceDbLTBase {
nit: SpannerToSourceDbBacklogLT -> SpannerToSourceDbLargeBacklogLT
createAndUploadShardConfigToGcs();

// Store original node counts for cleanup
originalSpannerNodeCount = getSpannerNodeCount(spannerResourceManager.getInstanceId());
Which spanner instance is used?

It randomly selects one of the shared static test instances - since we don't need any particular specifications for the Spanner instance except the node count, it should be fine to reuse.
// Reset Spanner instance to its original node count if it was modified
if (originalSpannerNodeCount != null && spannerResourceManager != null) {
  try {
    updateSpannerNodeCount(spannerResourceManager.getInstanceId(), originalSpannerNodeCount);
What if originalSpannerInstance and metadataSpannerInstance are the same?
    Long.parseLong(
        getProperty("expectedSpannerCount", "1000000000", TestProperties.Type.PROPERTY));
int importTimeoutMinutes =
    Integer.parseInt(getProperty("importTimeoutMinutes", "120", TestProperties.Type.PROPERTY));
nit: Please define all the default values used in getProperty as private static final variables above.
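A small sketch of the suggested cleanup; the constant names are illustrative, not taken from this PR:

```java
// Hypothetical constants hoisting the getProperty defaults, as suggested above.
private static final String DEFAULT_EXPECTED_SPANNER_COUNT = "1000000000";
private static final String DEFAULT_IMPORT_TIMEOUT_MINUTES = "120";
private static final String DEFAULT_REVERSE_TIMEOUT_MINUTES = "600";

// Usage at the call site:
long expectedSpannerCount =
    Long.parseLong(
        getProperty(
            "expectedSpannerCount", DEFAULT_EXPECTED_SPANNER_COUNT, TestProperties.Type.PROPERTY));
```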
updateSpannerNodeCount(spannerMetadataResourceManager.getInstanceId(), metadataScaleNodes);

int reverseTimeoutMinutes =
    Integer.parseInt(getProperty("reverseTimeoutMinutes", "600", TestProperties.Type.PROPERTY));
How many runs did we do? Is it consistently under 10 hours?

5 runs of the RR job - all were safely under 10 hours:
- Manual Run 1 - 8 hours
- Manual Run 2 - 8 hours
- VM Run 1 - 8 hours 10 minutes
- GitHub Workflow Run 1 - 8 hours 30 minutes
- GitHub Workflow Run 2 - 9 hours 12 minutes
while (polledCount < metricThreshold) {
  if (System.currentTimeMillis() - startTimeMillis > reverseTimeoutMinutes * 60 * 1000) {
    throw new RuntimeException(
        "Reverse replication load check timed out after "
            + reverseTimeoutMinutes
            + " minutes.");
  }

  Double successRecordsCount =
      pipelineLauncher.getMetric(
          project, region, reverseJobInfo.jobId(), "success_record_count");
  polledCount = successRecordsCount != null ? successRecordsCount.longValue() : 0;

  LOG.info("--- PIPELINE PROGRESS UPDATE ---");
  LOG.info(
      "Time Elapsed: {} minutes / {} minutes",
      (System.currentTimeMillis() - startTimeMillis) / 60000,
      reverseTimeoutMinutes);
  LOG.info(
      "Polled success_record_count: {}. Target threshold: {}", polledCount, metricThreshold);
  LOG.info("---------------------------------");

  if (polledCount >= metricThreshold) {
    break;
  }

  // Poll every 15 minutes. Since the test runs for 7-8 hours, 15-minute intervals
  // print exactly 4 logs per hour, preventing clutter and API call costs.
  Thread.sleep(900000);
}
We should implement this as a ConditionCheck.
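A rough sketch of how the polling above might be folded into a ConditionCheck so the framework drives the waiting instead of a hand-rolled loop; the base-class contract assumed here (getDescription()/check() returning CheckResult) should be verified against the actual testing framework before use:

```java
// Hypothetical ConditionCheck-style wrapper around the success_record_count polling.
// The method names and CheckResult constructor are assumptions about the framework's contract.
ConditionCheck backlogDrained =
    new ConditionCheck() {
      @Override
      protected String getDescription() {
        return "Reverse replication success_record_count reaches " + metricThreshold;
      }

      @Override
      protected CheckResult check() {
        Double successRecordsCount;
        try {
          successRecordsCount =
              pipelineLauncher.getMetric(
                  project, region, reverseJobInfo.jobId(), "success_record_count");
        } catch (Exception e) {
          throw new RuntimeException("Failed to fetch success_record_count metric", e);
        }
        long polled = successRecordsCount != null ? successRecordsCount.longValue() : 0;
        return new CheckResult(
            polled >= metricThreshold,
            String.format("Polled %d of %d expected records", polled, metricThreshold));
      }
    };
```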
LOG.info(
    "Database counts have not reached exact target parity yet. Retrying in 1 minute...");
Thread.sleep(60000); // Polling retry interval of 1 minute
Is a one minute interval between each poll enough? If the fetch itself takes ~12 minutes, maybe we should wait more here?

- The fetch used to take 12 minutes when executed sequentially; now that all 4 shards are queried in parallel, it takes at most 4 minutes.
- The queries are collected using .join(), which ensures the code does not progress until all query results are returned. This 1-minute sleep is just a delay between subsequent iterations of the while loop.
- In the successful runs I've done, this while loop actually succeeds on the first try (because we start querying after success_record_count has already reached 1 billion); the 30-minute tries are configured just as a fallback.
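For reference, a minimal sketch of the parallel per-shard count described above, assuming a CompletableFuture-based fan-out; the cloudSqlResourceManagers list and getRowCount helper are illustrative names, not necessarily what this PR uses:

```java
// Hypothetical sketch: run the per-shard COUNT(*) queries in parallel and block with join()
// until every shard has answered, so the slowest shard (~4 minutes) bounds the total fetch time.
List<CompletableFuture<Long>> countFutures =
    cloudSqlResourceManagers.stream()
        .map(shard -> CompletableFuture.supplyAsync(() -> getRowCount(shard)))
        .collect(Collectors.toList());

// join() blocks per future; execution cannot proceed past this line until all queries return.
long totalSourceCount = countFutures.stream().mapToLong(CompletableFuture::join).sum();
```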
public void updateSpannerNodeCount(String instanceId, int nodeCount) {
  SpannerOptions options = SpannerOptions.newBuilder().setProjectId(project).build();
  try (Spanner spanner = options.getService()) {
    InstanceAdminClient instanceAdminClient = spanner.getInstanceAdminClient();

    LOG.info("Updating Spanner instance {} node count to {}...", instanceId, nodeCount);
    InstanceInfo instanceInfo =
        InstanceInfo.newBuilder(InstanceId.of(project, instanceId))
            .setNodeCount(nodeCount)
            .build();
    instanceAdminClient.updateInstance(instanceInfo, InstanceField.NODE_COUNT).get();
    LOG.info("Successfully updated Spanner instance {} node count to {}.", instanceId, nodeCount);
  } catch (Exception e) {
    LOG.error("Failed to update Spanner instance node count.", e);
    throw new RuntimeException("Failed to update Spanner node count", e);
  }
}
We should add a couple of retries here.
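For illustration, a minimal sketch of a bounded retry around the updateInstance call; the attempt count and backoff are assumptions, not values from this PR:

```java
// Hypothetical retry wrapper around the node-count update: a few bounded attempts with a
// linear backoff, re-throwing only after the final attempt fails.
int maxAttempts = 3;
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
  try {
    instanceAdminClient.updateInstance(instanceInfo, InstanceField.NODE_COUNT).get();
    LOG.info("Successfully updated Spanner instance {} node count to {}.", instanceId, nodeCount);
    break;
  } catch (Exception e) {
    if (attempt == maxAttempts) {
      throw new RuntimeException("Failed to update Spanner node count after retries", e);
    }
    LOG.warn("updateInstance attempt {}/{} failed, retrying...", attempt, maxAttempts, e);
    try {
      Thread.sleep(30_000L * attempt); // illustrative backoff between attempts
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      throw new RuntimeException("Interrupted while waiting to retry", ie);
    }
  }
}
```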
b/496066723
This load test covers the use case where a customer pauses and resumes a reverse replication pipeline. In that case, there might be a massive backlog accumulated from a certain start timestamp that needs to be processed by the pipeline. The test simulates this situation by importing 1 billion 1 KB rows to Spanner without an active reverse replication pipeline. The pipeline is then started with the appropriate start timestamp. The expected result is that all the rows are successfully replicated to the CloudSQL shards.
Run time of test: around 9-10 hours
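For orientation, the flow described above roughly breaks down into three phases; the outline below is a hypothetical sketch and the method names are illustrative, not the actual test code:

```java
// Hypothetical outline of the backlog load test phases; all method names are illustrative.
@Test
public void testBacklogReplication() throws Exception {
  // Phase 1: record the start timestamp, then import ~1 billion 1 KB rows into Spanner while
  // no reverse replication pipeline is running, so a backlog accumulates after that timestamp.
  com.google.cloud.Timestamp startTimestamp = com.google.cloud.Timestamp.now();
  importBacklogIntoSpanner(); // e.g. an import job reading pre-generated data from GCS

  // Phase 2: launch the reverse replication pipeline with that start timestamp, so it must
  // drain the entire accumulated backlog.
  launchReverseReplicationPipeline(startTimestamp);

  // Phase 3: poll the pipeline's success_record_count metric until it reaches 1 billion, then
  // assert that the CloudSQL shard row counts add up to the Spanner row count.
  waitForBacklogToDrain();
  assertShardCountsMatchSpannerCount();
}
```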
Successful run in GitHub
Test Specifications
Resources created and used for this test in project cloud-teleport-testing (more details in the last section).
FAQ
Why was a sharded setup used?
This test is very heavy on the CloudSQL instances due to the massive amount of writes. If we were to use a single CloudSQL instance, the test would be extremely slow and performance would bottleneck on the source. More shards mean better performance and less time taken by the test. (Some manual runs were done on a non-sharded setup and they were taking 16 hours without even completing the entire workload.)
Why were static CloudSQL instances used?
The CloudSQL instances are resource intensive and are configured with 96 vCPUs, 768 GB memory, and 2.93 TB SSD storage. Dynamically spinning up CloudSQL instances and then configuring them with such large resources has proven to be flaky even in manual runs. Another potential risk is the quota being silently downgraded due to low usage.
Why import the data to GCS manually and not inside the test itself?
It seems like an unnecessary step that would bloat the test further. The idea of the test is to exercise reverse replication, not BigQuery and exports. We would also have to run a script to generate the manifest file, adding even more bloat.
Manual Test Runs
Run documentation + steps: https://docs.google.com/document/d/1U1NKCQl08FSfBbdUnFcccblxgHWzE-ptwZAfQY40IJI/edit?tab=t.0
Run results: go/reverse-backlog-manual-tests
Successful runs from which configuration for the test was derived:
Configuration of resources created
(Documenting steps in case of accidental deletion, etc.)
- default subnet
- cloudsql_iam_authentication: on
- max_connections 10000: prevents connections from being a bottleneck during the RR job; actual connections used are only ~4000
- innodb_parallel_read_threads 96: speeds up the count(*) query during the final test assertion
- innodb_flush_log_at_trx_commit 2: dramatically increases INSERT throughput by writing transaction logs to the OS cache and flushing to disk once per second, rather than on every commit
- innodb_io_capacity 40000: guides InnoDB to perform background page flushing more aggressively, utilizing the provisioned disk IOPS to keep the buffer pool clean and sustain write rates
- innodb_io_capacity_max 80000: allows InnoDB to burst beyond the baseline IOPS for flushing operations when necessary, helping to quickly clear dirty pages from the buffer pool during peak load