
Move UDF after context creation and update consumers to use modified … #3784

Open
namita-l wants to merge 1 commit into GoogleCloudPlatform:main from namita-l:nll-npe

Conversation

@namita-l
Contributor

…JSON data in DataStreamMongoDBToFirestore

@namita-l namita-l requested a review from a team as a code owner May 11, 2026 16:19
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the DataStreamMongoDBToFirestore pipeline to improve the flexibility and reliability of JavaScript UDF transformations. By shifting the UDF execution to follow the creation of the event context, the pipeline can now operate on structured document data rather than raw JSON strings. Additionally, the changes introduce robust error handling for UDF failures and better management of specific event types like deletes and updates with null payloads.

Highlights

  • UDF Execution Timing: Moved the JavaScript UDF execution to occur after the creation of MongoDbChangeEventContext objects, allowing for more granular transformation of document data.
  • Data Transformation Pipeline: Introduced new DoFns (ExtractUdfInputFn and MergeUdfResultFn) to handle the extraction of JSON document data for UDF processing and merging the transformed output back into the event context.
  • Improved Handling of Deletes: Added logic to bypass UDF execution for delete events, which lack document data, preventing unnecessary processing errors.
  • Robustness and Error Handling: Implemented specific DLQ handling for UDF execution and parsing failures, ensuring data integrity during transformation.
  • Event Filtering: Updated CreateMongoDbChangeEventContextFn to ignore update events with null data, adding a metric counter to track these occurrences.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command                Description
Code Review           /gemini review         Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary        Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist    Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help           Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist (Bot) left a comment


Code Review

This pull request refactors the JavaScript UDF application logic in the MongoDB to Firestore template, moving the transformation stage to occur after the creation of MongoDbChangeEventContext objects. Key changes include bypassing UDFs for delete events, merging transformed data back into the event context, and adding a filter for UPDATE events with null data. Feedback from the review highlights opportunities to reduce code duplication by refactoring the UDF pipeline logic into a shared method, improving DLQ usability by ensuring outputs are valid JSON, and maintaining BSON type fidelity by using specialized BSON utilities for JSON serialization.

Comment on lines +562 to +624
if (!Strings.isNullOrEmpty(options.getJavascriptTextTransformGcsPath())) {
  LOG.info("Applying Javascript UDF for Document transformation after context creation");

  // Split the stream into deletes and non-deletes before UDF.
  // Delete events produce null docJson and would be dropped by ExtractUdfInputFn
  // if not handled. We bypass UDF for deletes to avoid sending empty payloads to UDF.
  PCollectionTuple udfPreparation =
      successfulContexts.apply(
          "Prepare UDF Input",
          ParDo.of(new ExtractUdfInputFn())
              .withOutputTags(
                  ExtractUdfInputFn.UDF_INPUT_TAG,
                  TupleTagList.of(ExtractUdfInputFn.DELETES_TAG)));

  // The String in FailsafeElement is the JSON representation of the document
  // data extracted by ExtractUdfInputFn.
  PCollection<FailsafeElement<MongoDbChangeEventContext, String>> udfInput =
      udfPreparation.get(ExtractUdfInputFn.UDF_INPUT_TAG);
  PCollection<MongoDbChangeEventContext> deletes =
      udfPreparation.get(ExtractUdfInputFn.DELETES_TAG);

  // Apply the JavaScript UDF to the JSON payload extracted from the document.
  PCollectionTuple udfResult =
      udfInput.apply(
          "Run UDF on Document",
          FailsafeJavascriptUdf.<MongoDbChangeEventContext>newBuilder()
              .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
              .setFunctionName(options.getJavascriptTextTransformFunctionName())
              .setReloadIntervalMinutes(
                  options.getJavascriptTextTransformReloadIntervalMinutes())
              .setSuccessTag(UDF_SUCCESS_TAG)
              .setFailureTag(UDF_FAILURE_TAG)
              .build());

  // After successful UDF execution, we update the MongoDbChangeEventContext
  // with the modified JSON string so that subsequent stages use the transformed data.
  TupleTag<MongoDbChangeEventContext> parseSuccessTag =
      new TupleTag<MongoDbChangeEventContext>() {};
  TupleTag<FailsafeElement<MongoDbChangeEventContext, String>> parseFailureTag =
      new TupleTag<FailsafeElement<MongoDbChangeEventContext, String>>() {};

  PCollectionTuple mergeResult =
      udfResult
          .get(UDF_SUCCESS_TAG)
          .setCoder(
              FailsafeElementCoder.of(
                  SerializableCoder.of(MongoDbChangeEventContext.class),
                  StringUtf8Coder.of()))
          .apply(
              "Merge UDF Result",
              ParDo.of(new MergeUdfResultFn(parseFailureTag, options.getShadowCollectionPrefix()))
                  .withOutputTags(parseSuccessTag, TupleTagList.of(parseFailureTag)));

  successfulContexts =
      PCollectionList.of(deletes)
          .and(mergeResult.get(parseSuccessTag))
          .apply("Merge Deletes and UDF Results", Flatten.pCollections());

  // Handle failed UDF processing (both execution and parse failures).
  PCollection<FailsafeElement<MongoDbChangeEventContext, String>> executionFailures =
      udfResult.get(UDF_FAILURE_TAG);
  PCollection<FailsafeElement<MongoDbChangeEventContext, String>> parseFailures =
      mergeResult.get(parseFailureTag);

  writeFailedUDFToDlq(
      options,
      executionFailures,
      dlqManager.getSevereDlqDirectoryWithDateTime() + "udf_execution_failures/",
      "tmp_udf_execution_failed");
  writeFailedUDFToDlq(
      options,
      parseFailures,
      dlqManager.getSevereDlqDirectoryWithDateTime() + "udf_parse_failures/",
      "tmp_udf_parse_failed");
}


medium

The logic for applying the JavaScript UDF is duplicated in both runWithBackfillFirst and runAllEventsTogether. This block is substantial and involves multiple pipeline stages (splitting deletes, UDF execution, result merging, and multi-stage DLQ handling). To improve maintainability and ensure that any future changes to the UDF logic are applied consistently across both processing modes, this should be refactored into a shared private static method.
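
One way to act on this suggestion is to pull the entire quoted block into a single helper that both entry points call. This is only a sketch: the method name `applyJavascriptUdf`, its parameter list, and the `Options`/`DeadLetterQueueManager` types are assumptions about the template's internals, not its actual API.

```java
// Hypothetical shared helper extracting the duplicated UDF stage.
// Names and signature are illustrative; the real template may differ.
private static PCollection<MongoDbChangeEventContext> applyJavascriptUdf(
    Options options,
    PCollection<MongoDbChangeEventContext> successfulContexts,
    DeadLetterQueueManager dlqManager) {
  // 1. Split deletes from non-deletes via ExtractUdfInputFn.
  // 2. Run FailsafeJavascriptUdf on the extracted JSON payloads.
  // 3. Merge UDF output back into contexts via MergeUdfResultFn.
  // 4. Route execution and parse failures to their DLQ directories.
  // ... body identical to the block quoted in this comment ...
  return mergedContexts;
}
```

Both `runWithBackfillFirst` and `runAllEventsTogether` would then reduce to a single call, so future changes to the UDF stage land in one place.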

new DoFn<FailsafeElement<MongoDbChangeEventContext, String>, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(c.element().toString());
  }
}


medium

Using c.element().toString() for DLQ output results in a non-JSON string (typically formatted as FailsafeElement{originalPayload=..., payload=..., ...}). This makes the dead-letter queue difficult to process with automated tools. It is recommended to use a JSON serializer or a dedicated sanitizer (similar to MongoDbEventDeadLetterQueueSanitizer used in other parts of this template) to ensure the DLQ output is valid JSON.
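
A minimal sketch of the JSON-emitting alternative, using Jackson. It assumes the template's `FailsafeElement` exposes `getPayload()`, `getErrorMessage()`, and `getStacktrace()`; the output field names here are illustrative, and the existing `MongoDbEventDeadLetterQueueSanitizer` may already provide an equivalent:

```java
// Sketch only: serialize the failed element as a JSON object instead of
// relying on FailsafeElement#toString(), so DLQ records stay machine-readable.
new DoFn<FailsafeElement<MongoDbChangeEventContext, String>, String>() {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  @ProcessElement
  public void processElement(ProcessContext c) throws JsonProcessingException {
    FailsafeElement<MongoDbChangeEventContext, String> element = c.element();
    ObjectNode record = MAPPER.createObjectNode();
    record.put("payload", element.getPayload());        // transformed JSON, if any
    record.put("errorMessage", element.getErrorMessage());
    record.put("stacktrace", element.getStacktrace());
    c.output(MAPPER.writeValueAsString(record));
  }
}
```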

Comment on lines +245 to +249
public String getDocumentDataAsJsonString() throws JsonProcessingException {
  JsonNode eventNode = this.getChangeEvent();
  JsonNode dataNode = eventNode.get("data");
  return dataNode != null ? OBJECT_MAPPER.writeValueAsString(dataNode) : null;
}


medium

The getDocumentDataAsJsonString method uses a standard Jackson ObjectMapper to serialize the data node. For MongoDB documents, this may lose type fidelity for BSON-specific types (e.g., $date, $numberLong, $oid) if the mapper is not specifically configured for MongoDB Extended JSON. Since the pipeline elsewhere relies on org.bson.Document and JsonMode.EXTENDED, consider using BSON utilities to extract this string to ensure the UDF receives a correctly formatted Extended JSON representation.
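
A hedged sketch of the BSON-aware alternative: round-trip the `data` node through `org.bson.Document` and emit it with `JsonMode.EXTENDED`, so types like `$date` and `$oid` keep their Extended JSON representation. Whether this method lives on the context class, and whether the incoming node already carries Extended JSON keys, are assumptions:

```java
// Illustrative only: serialize via org.bson.Document rather than a plain
// Jackson ObjectMapper to preserve BSON type fidelity for the UDF input.
public String getDocumentDataAsJsonString() {
  JsonNode dataNode = this.getChangeEvent().get("data");
  if (dataNode == null) {
    return null;
  }
  // Document.parse understands Extended JSON keys ($date, $numberLong, ...),
  // and toJson re-emits them in the requested Extended JSON mode.
  Document document = Document.parse(dataNode.toString());
  return document.toJson(
      JsonWriterSettings.builder().outputMode(JsonMode.EXTENDED).build());
}
```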
