Batch and Write transform code #3797
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request implements the core batching and writing logic for the CDC data generator. It introduces a stateful DoFn to manage record processing, batching, and sink interaction, along with a PTransform to integrate this logic into the broader pipeline. The changes also include a new data model for records and unit tests to validate the new components.
Code Review
This pull request introduces the BatchAndWrite transform and its corresponding DoFn, BatchAndWriteFn, which manage stateful data generation and writing to MySQL and Spanner sinks. The changes include the GeneratedRecord data model and comprehensive unit tests. Review feedback highlights potential NullPointerException risks in the BatchAndWriteFn constructor due to auto-unboxing of nullable Integer parameters and suggests providing default values. Additionally, a test case was identified as misleading because it fails to assert the specific fallback behavior it claims to verify.
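The auto-unboxing risk flagged above can be reproduced in isolation. The sketch below is a minimal illustration, not the PR's actual constructor: the field name `batchSize` and the default value 100 are made up for the example.

```java
// Sketch of the auto-unboxing hazard: assigning a null Integer to an int
// throws NullPointerException. Guarding with a default avoids it.
class UnboxingDemo {
  private final int batchSize;

  // If the nullable option is unset, writing "this.batchSize = batchSize;"
  // directly would unbox null and throw NullPointerException.
  UnboxingDemo(Integer batchSize) {
    this.batchSize = (batchSize != null) ? batchSize : 100; // explicit default
  }

  int batchSize() {
    return batchSize;
  }
}
```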
Codecov Report

❌ Patch coverage details and impacted files:

@@ Coverage Diff @@
## main #3797 +/- ##
============================================
+ Coverage 53.62% 53.79% +0.17%
- Complexity 6323 6368 +45
============================================
Files 1087 1090 +3
Lines 66762 66917 +155
Branches 7476 7485 +9
============================================
+ Hits 35801 36001 +200
+ Misses 28534 28483 -51
- Partials 2427 2433 +6
```java
private transient volatile DataGeneratorSchema schema;
private transient volatile List<String> insertTopoOrder;
```
Using volatile guarantees that once schema and insertTopoOrder are published by the first bundle, all subsequent bundles and timer callbacks on that worker instance will deterministically observe the fully constructed objects.
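The safe-publication guarantee described above can be sketched with plain Java. This is a minimal illustration of the lazy-init-behind-volatile (double-checked locking) pattern, assuming that is how the fields are initialized; `List<String>` stands in for the pipeline's own types, and the table names are invented.

```java
import java.util.List;

// Lazy initialization behind a transient volatile field: the volatile
// write at the end of construction publishes the fully built object,
// so later readers never see a partially constructed value.
class LazySchemaHolder {
  private transient volatile List<String> insertTopoOrder;

  List<String> insertTopoOrder() {
    List<String> local = insertTopoOrder; // single volatile read
    if (local == null) {
      synchronized (this) {
        local = insertTopoOrder;
        if (local == null) {
          // In reality this would be an expensive schema/topo-sort build.
          local = List.of("parent_table", "child_table");
          insertTopoOrder = local; // volatile write publishes the object
        }
      }
    }
    return local;
  }
}
```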
```java
protected DataWriter createWriter(SinkType type, String configPath) {
  switch (type) {
    case MYSQL:
      return new MySqlDataWriter(configPath);
    case SPANNER:
      return new SpannerDataWriter(configPath);
    default:
      throw new IllegalArgumentException("Unsupported sink type: " + type);
  }
}
```
Consider moving this out to a factory for cleaner separation. In the future, constructing writers may become more complex, and the separation will help.
Makes sense, will move.
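A minimal sketch of the factory extraction being suggested, assuming the construction logic simply moves behind a dedicated class. The `DataWriter` interface, `SinkType` enum, and lambda bodies here are stand-ins, since the real `MySqlDataWriter` and `SpannerDataWriter` are not shown in this diff.

```java
// Stand-in for the real writer interface; only a single write method.
interface DataWriter {
  void write(String record);
}

enum SinkType { MYSQL, SPANNER }

// The factory owns sink-specific construction, so BatchAndWriteFn no
// longer needs to switch on SinkType itself. As construction grows
// (connection pooling, credentials, retries), only this class changes.
final class DataWriterFactory {
  static DataWriter create(SinkType type, String configPath) {
    switch (type) {
      case MYSQL:
        return record -> System.out.println("mysql <- " + record); // new MySqlDataWriter(configPath)
      case SPANNER:
        return record -> System.out.println("spanner <- " + record); // new SpannerDataWriter(configPath)
      default:
        throw new IllegalArgumentException("Unsupported sink type: " + type);
    }
  }
}
```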
```java
FailureRecord.toJson(
    "UNKNOWN_TABLE", FailureRecord.OPERATION_GENERATION, null, timerError));
```
Why is this always UNKNOWN_TABLE?
All event-specific and table-specific errors are caught inside DataGeneratorEngine. Any exception that escapes to the outer catch (Exception timerError) block in BatchAndWriteFn.java represents a system failure, so it has no associated table name.
```java
private void writeFailedRecords(Consumer<String> sink) {
  List<String> dlq = batcher.getFailedRecords();
  if (dlq == null || dlq.isEmpty()) {
    return;
  }
  for (String record : dlq) {
    sink.accept(record);
  }
  batcher.clearDlq();
}
```
I didn't get what this method is doing (I don't fully understand the Consumer parameter). Where are the failed records from the DLQ being written to?
Also, what is the function of the DLQ? Are users expected to re-run the DLQ, or is it more for reporting purposes?
It outputs the failed records from the DoFn; they are written to a GCS file and are used only for reporting, not retrying.
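The `Consumer<String>` mechanics asked about above can be shown in isolation: the method never decides where records go; the caller passes the destination in. This is a minimal sketch with an in-memory list standing in for the DoFn's output receiver and the eventual GCS sink; the sample record and class names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Demonstrates the inversion of control: writeFailedRecords only iterates
// and hands each record to whatever sink the caller supplies.
class DlqDemo {
  private final List<String> failed =
      new ArrayList<>(List.of("{\"table\":\"t1\",\"error\":\"timeout\"}"));

  List<String> getFailedRecords() {
    return failed;
  }

  void writeFailedRecords(Consumer<String> sink) {
    for (String record : getFailedRecords()) {
      sink.accept(record); // in the DoFn this would be the output receiver
    }
    failed.clear(); // analogous to batcher.clearDlq()
  }
}
```

A caller collects the records simply by passing a method reference, e.g. `demo.writeFailedRecords(collected::add)`.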
```java
/** Type-safe container wrapping table names and primary key values. */
@AutoValue
public abstract class GeneratedRecord implements Serializable {
```
The more Serializable you introduce now, the harder it will be to properly clean up later... just keep it in mind.
I understand, but I have retained it in classes that use Beam Row, because my analysis suggests it might require a custom coder. Solving it for one class will help me solve it for all the others, so I would like to take them up at once.
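For readers unfamiliar with why the class implements Serializable: Beam must be able to move these records between workers. Below is a plain-Java stand-in (without the AutoValue machinery) showing a serialization round trip; the field names are inferred from the class comment, not taken from the diff.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Immutable Serializable value class, analogous to what @AutoValue
// generates for GeneratedRecord.
final class GeneratedRecordSketch implements Serializable {
  private static final long serialVersionUID = 1L;
  final String tableName;
  final String primaryKey;

  GeneratedRecordSketch(String tableName, String primaryKey) {
    this.tableName = tableName;
    this.primaryKey = primaryKey;
  }

  // Serialize and deserialize, as a runner would when shuffling records.
  static GeneratedRecordSketch roundTrip(GeneratedRecordSketch in) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(in);
      }
      try (ObjectInputStream oin =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (GeneratedRecordSketch) oin.readObject();
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```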
```java
public class BatchAndWrite
    extends PTransform<PCollection<KV<Integer, GeneratedRecord>>, PCollection<String>> {
```
This seems to only contain the BatchAndWriteFn. Given the tightly coupled, near-identical naming (BatchAndWrite and BatchAndWriteFn), do we need this transform at all? Can we wire the DoFn directly? Is there some future extensibility use case?