Batch and Write transform code #3797

Open: shreyakhajanchi wants to merge 6 commits into main from batch-write

Conversation

@shreyakhajanchi
Contributor

No description provided.

Base automatically changed from generator-engine to main May 13, 2026 09:21
@shreyakhajanchi changed the title from "batch and write code" to "Batch and Write transform code" May 13, 2026
@shreyakhajanchi marked this pull request as ready for review May 13, 2026 15:22
@shreyakhajanchi requested a review from a team as a code owner May 13, 2026 15:22
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements the core batching and writing logic for the CDC data generator. It introduces a stateful DoFn to manage record processing, batching, and sink interaction, along with a PTransform to integrate this logic into the broader pipeline. The changes also include a new data model for records and robust unit testing to validate the new components.

Highlights

  • New Data Transformation Pipeline: Introduced the BatchAndWrite PTransform and corresponding BatchAndWriteFn DoFn to handle batching and writing of generated records to configured sinks.
  • Data Model Enhancement: Added the GeneratedRecord class as a type-safe container for table names and primary key values.
  • Testing Infrastructure: Added comprehensive unit tests for both the BatchAndWrite transform and the BatchAndWriteFn logic to ensure correct behavior and lifecycle management.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the BatchAndWrite transform and its corresponding DoFn, BatchAndWriteFn, which manage stateful data generation and writing to MySQL and Spanner sinks. The changes include the GeneratedRecord data model and comprehensive unit tests. Review feedback highlights potential NullPointerException risks in the BatchAndWriteFn constructor due to auto-unboxing of nullable Integer parameters and suggests providing default values. Additionally, a test case was identified as misleading because it fails to assert the specific fallback behavior it claims to verify.
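
The auto-unboxing risk mentioned in the review can be illustrated with a minimal sketch. The class, field, and default value below (BatchConfig, batchSize, DEFAULT_BATCH_SIZE) are hypothetical and are not taken from the PR; the sketch only shows why passing a null Integer where an int is expected throws, and how substituting a default avoids it.

```java
/** Hypothetical config holder for illustration; this is not the PR's BatchAndWriteFn. */
final class BatchConfig {
  // Assumed default, chosen only for this example.
  static final int DEFAULT_BATCH_SIZE = 100;

  final int batchSize;

  private BatchConfig(int batchSize) {
    this.batchSize = batchSize;
  }

  /** Unsafe: auto-unboxing a null Integer throws NullPointerException. */
  static BatchConfig unsafe(Integer batchSize) {
    return new BatchConfig(batchSize); // implicit batchSize.intValue()
  }

  /** Safe: fall back to a default before the value is unboxed. */
  static BatchConfig withDefault(Integer batchSize) {
    return new BatchConfig(batchSize != null ? batchSize : DEFAULT_BATCH_SIZE);
  }
}
```

With this shape, BatchConfig.withDefault(null) yields the default batch size, while BatchConfig.unsafe(null) fails at construction time.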

@codecov

codecov Bot commented May 13, 2026

Codecov Report

❌ Patch coverage is 88.49558% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.79%. Comparing base (0025d7e) to head (f4a73ec).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| ...ud/teleport/v2/templates/dofn/BatchAndWriteFn.java | 89.71% | 8 Missing and 3 partials ⚠️ |
| .../teleport/v2/templates/sink/DataWriterFactory.java | 50.00% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3797      +/-   ##
============================================
+ Coverage     53.62%   53.79%   +0.17%     
- Complexity     6323     6368      +45     
============================================
  Files          1087     1090       +3     
  Lines         66762    66917     +155     
  Branches       7476     7485       +9     
============================================
+ Hits          35801    36001     +200     
+ Misses        28534    28483      -51     
- Partials       2427     2433       +6     
| Components | Coverage Δ |
| spanner-templates | 72.84% <ø> (+0.04%) ⬆️ |
| spanner-import-export | 68.66% <ø> (+0.13%) ⬆️ |
| spanner-live-forward-migration | 80.93% <ø> (-0.02%) ⬇️ |
| spanner-live-reverse-replication | 77.08% <ø> (-0.02%) ⬇️ |
| spanner-bulk-migration | 91.10% <ø> (-0.01%) ⬇️ |
| gcs-spanner-dv | 85.74% <ø> (-0.02%) ⬇️ |

| Files with missing lines | Coverage Δ |
| ...d/teleport/v2/templates/model/GeneratedRecord.java | 100.00% <100.00%> (ø) |
| .../teleport/v2/templates/sink/DataWriterFactory.java | 50.00% <50.00%> (ø) |
| ...ud/teleport/v2/templates/dofn/BatchAndWriteFn.java | 89.71% <89.71%> (ø) |

... and 12 files with indirect coverage changes


Comment on lines +73 to +74
private transient volatile DataGeneratorSchema schema;
private transient volatile List<String> insertTopoOrder;
Member

What does volatile do here?

Contributor Author

Using volatile guarantees that once schema and insertTopoOrder are published by the first bundle, all subsequent bundles and timer callbacks on that worker instance will observe the fully constructed objects.
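
The visibility guarantee described in this reply can be sketched as follows. This is a minimal illustration with hypothetical names (LazyHolder, insertOrder, and the placeholder list contents), not the PR's actual initialization code. It assumes the field is lazily initialized on the worker, which is why it is transient. Note that volatile by itself only provides safe publication; the synchronized block is what prevents two threads from initializing twice.

```java
import java.io.Serializable;
import java.util.List;

/** Hypothetical holder; a sketch of lazily published transient state. */
class LazyHolder implements Serializable {
  private static final long serialVersionUID = 1L;

  // transient: not serialized with the instance, so it is rebuilt on each worker.
  // volatile: a write to this field happens-before any later read of it by another
  // thread, so readers never observe a partially constructed list.
  private transient volatile List<String> insertOrder;

  /** Returns the cached order, initializing it at most once per instance. */
  List<String> getInsertOrder() {
    List<String> local = insertOrder; // single volatile read on the fast path
    if (local == null) {
      synchronized (this) {
        local = insertOrder;
        if (local == null) {
          local = List.of("parent", "child"); // placeholder for real schema loading
          insertOrder = local; // volatile write publishes the fully built list
        }
      }
    }
    return local;
  }
}
```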

Comment on lines +131 to +140
protected DataWriter createWriter(SinkType type, String configPath) {
  switch (type) {
    case MYSQL:
      return new MySqlDataWriter(configPath);
    case SPANNER:
      return new SpannerDataWriter(configPath);
    default:
      throw new IllegalArgumentException("Unsupported sink type: " + type);
  }
}
Member

Consider moving this out to a factory for cleaner separation. In the future, constructing writers may become more complex, and the separation will help.

Contributor Author

Makes sense, will move.
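
One possible shape for the factory discussed in this thread is sketched below. The name DataWriterFactory matches a file listed in the coverage report, but the body here is an assumption, and the enum, interface, and writer classes are simplified stand-ins so that the example compiles on its own. The describe() method is invented purely for demonstration.

```java
/** Stand-in types so the sketch is self-contained; the PR's real classes differ. */
enum SinkType { MYSQL, SPANNER }

interface DataWriter {
  String describe(); // hypothetical method, used only to make the sketch observable
}

final class MySqlDataWriter implements DataWriter {
  private final String configPath;

  MySqlDataWriter(String configPath) {
    this.configPath = configPath;
  }

  @Override
  public String describe() {
    return "mysql:" + configPath;
  }
}

final class SpannerDataWriter implements DataWriter {
  private final String configPath;

  SpannerDataWriter(String configPath) {
    this.configPath = configPath;
  }

  @Override
  public String describe() {
    return "spanner:" + configPath;
  }
}

/** Assumed factory shape: owns writer construction so the DoFn does not have to. */
final class DataWriterFactory {
  private DataWriterFactory() {}

  static DataWriter create(SinkType type, String configPath) {
    switch (type) {
      case MYSQL:
        return new MySqlDataWriter(configPath);
      case SPANNER:
        return new SpannerDataWriter(configPath);
      default:
        throw new IllegalArgumentException("Unsupported sink type: " + type);
    }
  }
}
```

Keeping construction in one place means a new sink type, or more involved setup such as connection pooling, would only touch the factory.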

Comment on lines +214 to +215
FailureRecord.toJson(
    "UNKNOWN_TABLE", FailureRecord.OPERATION_GENERATION, null, timerError));
Member

Why is this always UNKNOWN_TABLE?

Contributor Author

All event-specific and table-specific errors are caught inside DataGeneratorEngine. Any exception that escapes to the outer catch (Exception timerError) block in BatchAndWriteFn.java therefore represents a system-level failure, so it has no table name.
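
The two-level error handling described in this reply can be sketched roughly as below. All names here (TimerErrorSketch, generateForTable, onTimer, and the string format of the failure entries) are hypothetical; the real code uses DataGeneratorEngine and FailureRecord.toJson, whose internals are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of inner (per-table) and outer (system) error handling. */
final class TimerErrorSketch {
  static final String UNKNOWN_TABLE = "UNKNOWN_TABLE";

  final List<String> failures = new ArrayList<>();

  /** Inner layer: errors are caught where the table name is still known. */
  void generateForTable(String table, Runnable work) {
    try {
      work.run();
    } catch (RuntimeException e) {
      failures.add(table + ": " + e.getMessage());
    }
  }

  /** Outer layer: anything that escapes to here has no table context. */
  void onTimer(Runnable tick) {
    try {
      tick.run();
    } catch (Exception timerError) {
      failures.add(UNKNOWN_TABLE + ": " + timerError.getMessage());
    }
  }
}
```

In this sketch a failure raised inside generateForTable is recorded under its table, while a failure raised directly in the timer body is recorded under UNKNOWN_TABLE.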

Comment on lines +259 to +268
private void writeFailedRecords(Consumer<String> sink) {
  List<String> dlq = batcher.getFailedRecords();
  if (dlq == null || dlq.isEmpty()) {
    return;
  }
  for (String record : dlq) {
    sink.accept(record);
  }
  batcher.clearDlq();
}
Member

I didn't get what this method is doing (I don't fully understand the Consumer function). Where are the failed records from the DLQ being written to?

Also, what is the function of the DLQ? Are users expected to re-run the DLQ, or is it more for reporting purpose?

Contributor Author

It emits the failed records as output of the DoFn. They are written to a GCS file and are only for reporting, not for retrying.
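
Since the question was partly about how Consumer works, here is a minimal sketch of the pattern. The class DlqSketch and its methods are hypothetical stand-ins for the batcher. The point is that the method does not know where records go; the caller decides by passing in a function that accepts a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for the batcher's failed-record buffer. */
final class DlqSketch {
  private final List<String> failed = new ArrayList<>();

  void recordFailure(String json) {
    failed.add(json);
  }

  /** Hands each failed record to the caller-supplied sink, then clears the buffer. */
  void drainTo(Consumer<String> sink) {
    for (String record : failed) {
      sink.accept(record);
    }
    failed.clear();
  }

  int pending() {
    return failed.size();
  }
}
```

A caller could pass out::add to collect into a list in a test, or, inside a Beam DoFn, presumably a method reference to an output receiver's output method so that each record becomes an element of the output PCollection.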


/** Type-safe container wrapping table names and primary key values. */
@AutoValue
public abstract class GeneratedRecord implements Serializable {
Member

The more Serializable you introduce now the harder it will be to properly clean it up later...just keep it in mind.

Contributor Author

I understand, but I have retained it in the classes that use Beam Row. My analysis suggests they might require a custom coder, and solving it for one class will help me solve it for all the others, so I would like to take them up together.

Comment on lines +39 to +40
public class BatchAndWrite
    extends PTransform<PCollection<KV<Integer, GeneratedRecord>>, PCollection<String>> {
Member

Seems to only contain the BatchAndWriteFn. Given the strictly coupled/identical naming (BatchAndWrite and BatchAndWriteFn), do we need this transform at all? Can we wire the DoFn directly? Is there some future extensibility use-case?
