fix(sourcedb-to-spanner): Add support for partitioning PostgreSQL UUID data type by aasthabharill · Pull Request #3796 · GoogleCloudPlatform/DataflowTemplates

aasthabharill · 2026-05-12T05:13:58Z

https://b.corp.google.com/issues/512087945

The Issue

During a bulk migration from PostgreSQL to Cloud Spanner using the sourcedb-to-spanner template, the pipeline launch failed with a SuitableIndexNotFoundException (thrown in JdbcIoWrapper / PipelineController) when a table’s primary key or unique index was of UUID data type.

Changes Made & Rationale

1. Map UUID Columns to a Virtual `"UUID"` Collation

Change: In PostgreSQLDialectAdapter.discoverTableIndexes, if the typeName of a column is "uuid", we assign "UUID" as its collation reference.
Why: CollationMapper.fromDB expects a virtual "UUID" collation tag to trigger the static hexadecimal (base-16) mapper (buildStaticUuidMapper). Assigning "UUID" during discovery lets the splitter skip the database query that fetches collation rankings, which would fail or be extremely slow for a native UUID type that has no physical collation.

2. Configure Virtual Type Length to `32` for UUID Columns

Change: In PostgreSQLDialectAdapter.discoverTableIndexes, if typeLength is null and typeName is "uuid", we set typeLength = 32.
Why: While a standard canonical UUID is 36 characters long (including hyphens), CollationMapper strips the hyphens out during mapping, leaving exactly 32 hexadecimal characters. Overriding the discovered length to 32 means no additional padding (virtual zero-rank characters) is appended during range-partitioning calculations, which gives a clean 1-to-1 mapping and unmapping.
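
The hyphen stripping and the 32-character length described above can be illustrated with a minimal sketch. This is not the CollationMapper code from this PR; the class and method names (UuidRangeSketch, toBigInteger, toCanonicalUuid) are hypothetical, and it only assumes what the description states: hyphens are removed, the remaining 32 hex digits are read as base 16, and hyphens are reinserted on the way back.

```java
import java.math.BigInteger;

/** Hypothetical illustration only; not the CollationMapper implementation. */
final class UuidRangeSketch {
  private static final int UUID_HEX_LENGTH = 32;

  /** Strips hyphens and reads the remaining 32 hex digits as an unsigned integer. */
  static BigInteger toBigInteger(String canonicalUuid) {
    String hex = canonicalUuid.replace("-", "");
    if (hex.length() != UUID_HEX_LENGTH) {
      throw new IllegalArgumentException("Expected 32 hex digits, got " + hex.length());
    }
    return new BigInteger(hex, 16);
  }

  /** Left-pads to 32 hex digits and reinserts hyphens in the 8-4-4-4-12 layout. */
  static String toCanonicalUuid(BigInteger value) {
    String hex = value.toString(16); // lowercase hex digits
    if (value.signum() < 0 || hex.length() > UUID_HEX_LENGTH) {
      throw new IllegalArgumentException("Value does not fit in 128 bits");
    }
    String padded = String.format("%32s", hex).replace(' ', '0');
    return padded.substring(0, 8) + "-" + padded.substring(8, 12) + "-"
        + padded.substring(12, 16) + "-" + padded.substring(16, 20) + "-"
        + padded.substring(20);
  }

  public static void main(String[] args) {
    BigInteger min = toBigInteger("00000000-0000-0000-0000-000000000000");
    BigInteger max = toBigInteger("ffffffff-ffff-ffff-ffff-ffffffffffff");
    // Midpoint of the full range, as a range splitter might compute it.
    BigInteger mid = min.add(max).shiftRight(1);
    System.out.println(toCanonicalUuid(mid)); // 7fffffff-ffff-ffff-ffff-ffffffffffff
  }
}
```

Because every UUID contributes exactly 32 hex digits, two boundaries can be compared and bisected as plain integers without any padding step. Note that this sketch always emits lowercase hex, so an uppercase input would not round-trip byte for byte.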

3. Register State-Based Query and Parameter Cast Wrappers for UUID

Change: In PostgreSQLDialectAdapter.discoverTableIndexes, if typeName is "uuid", we register explicit SQL cast statements in columnCastWrappers and columnParameterCastWrappers maps.
Why:
- columnCastWrappers (CAST(%s AS TEXT)): Used in getBoundaryQuery to query MIN(CAST(col AS TEXT)) and MAX(CAST(col AS TEXT)). This is necessary to retrieve the UUID boundaries safely as standard text strings compatible with JDBC, because the uuid type doesn't have a MIN or MAX.
- columnParameterCastWrappers (CAST(? AS uuid)): Used in getReadQuery and getCountQuery to bind parameter boundary placeholders as col >= CAST(? AS uuid). This is necessary because PostgreSQL does not support implicit comparison of standard JDBC string parameter bindings against native uuid column types.
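
As a rough sketch of how the two wrappers above shape the generated SQL: the template strings CAST(%s AS TEXT) and CAST(? AS uuid) come from the description, but the class, method names, and the simplified query shapes below are hypothetical and much smaller than what getBoundaryQuery and getReadQuery actually produce. Identifier quoting is omitted for brevity.

```java
/** Hypothetical illustration only; not the PostgreSQLDialectAdapter implementation. */
final class UuidCastSketch {
  // Wrapper templates as given in the PR description.
  static final String COLUMN_CAST = "CAST(%s AS TEXT)";
  static final String PARAMETER_CAST = "CAST(? AS uuid)";

  /** Boundary query: the column is cast to text before MIN and MAX are applied. */
  static String boundaryQuery(String table, String column) {
    String wrapped = String.format(COLUMN_CAST, column);
    return String.format("SELECT MIN(%s), MAX(%s) FROM %s", wrapped, wrapped, table);
  }

  /** Range read: each bound string parameter is cast back to uuid before comparison. */
  static String readQuery(String table, String column) {
    return String.format(
        "SELECT * FROM %s WHERE %s >= %s AND %s < %s",
        table, column, PARAMETER_CAST, column, PARAMETER_CAST);
  }

  public static void main(String[] args) {
    System.out.println(boundaryQuery("orders", "id"));
    System.out.println(readQuery("orders", "id"));
  }
}
```

The cast direction differs on purpose: results are cast to text so that JDBC can read them as strings, while parameters are cast to uuid so that the comparison runs against the native column type.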

4. Verify Changes with Unit & Integration Tests

Collation Mapper Test: Added testUuidCollationMapper in CollationMapperTest.java to verify that canonical UUID strings are mapped to 128-bit BigIntegers and unmapped back with correct formatting and hyphen insertion.
Dialect Adapter Test: Added testDiscoverTableIndexesWithUuid in PostgreSQLDialectAdapterTest.java verifying index discovery mappings, boundary query wrapping, and read/count query parameter bindings.
Integration Test expected data logic: Updated getExpectedData in PostgreSQLWithUniformizationIT.java to support assertions for tables with non-integer primary keys (uuid_pk).

gemini-code-assist · 2026-05-12T05:17:18Z

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a failure in the sourcedb-to-spanner pipeline when migrating tables with UUID primary keys or unique indexes. By default, PostgreSQL classifies UUIDs as an unknown type category, which previously caused the pipeline to misidentify them as non-partitionable. The changes ensure these columns are treated as strings, allowing the splitting engine to successfully partition the data for parallel processing.

Highlights

UUID Data Type Support: Updated the PostgreSQL dialect adapter to explicitly recognize the 'uuid' data type, mapping it to IndexType.STRING to enable proper partitioning during bulk migrations.
Improved Index Discovery: Modified the index discovery logic to pass the column type name alongside the type category, allowing for more granular identification of database types.
Test Coverage: Added a new unit test case in PostgreSQLDialectAdapterTest to verify that UUID-based indexes are correctly identified and handled.

gemini-code-assist

Code Review

This pull request updates the PostgreSQL dialect adapter to support UUID columns by mapping them to the STRING index type, which allows them to be used for partitioning. The reviewer suggested optimizing the implementation by caching the 'type_name' value to avoid redundant JDBC calls and explicitly setting the string length to 36 for UUIDs to improve partitioning performance.

codecov · 2026-05-12T06:30:04Z

Codecov Report

❌ Patch coverage is 60.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.43%. Comparing base (f8472fe) to head (8d62dc2).
⚠️ Report is 8 commits behind head on main.

Files with missing lines	Patch %	Lines
...ctadapter/postgresql/PostgreSQLDialectAdapter.java	60.00%	2 Missing and 2 partials ⚠️

❌ Your patch check has failed because the patch coverage (60.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3796      +/-   ##
============================================
+ Coverage     53.41%   59.43%   +6.01%     
+ Complexity     6629     2173    -4456     
============================================
  Files          1082      506     -576     
  Lines         65795    29460   -36335     
  Branches       7328     3229    -4099     
============================================
- Hits          35147    17509   -17638     
+ Misses        28288    10968   -17320     
+ Partials       2360      983    -1377

Components	Coverage Δ
spanner-templates	`74.81% <60.00%> (+1.99%)`	⬆️
spanner-import-export	`∅ <ø> (∅)`
spanner-live-forward-migration	`80.84% <ø> (-0.09%)`	⬇️
spanner-live-reverse-replication	`77.01% <ø> (-0.04%)`	⬇️
spanner-bulk-migration	`91.02% <60.00%> (-0.08%)`	⬇️
gcs-spanner-dv	`86.67% <ø> (+0.92%)`	⬆️

Files with missing lines	Coverage Δ
...ctadapter/postgresql/PostgreSQLDialectAdapter.java	`95.33% <60.00%> (-1.97%)`	⬇️

... and 603 files with indirect coverage changes

gemini-code-assist

Code Review

This pull request updates the PostgreSQLDialectAdapter to correctly handle PostgreSQL uuid types during table index discovery. Specifically, it now identifies uuid columns as STRING index types with a fixed maximum length of 36 characters. Corresponding unit tests were added to verify this behavior. I have no feedback to provide as there were no review comments to evaluate.

VardhanThigle

Looks good overall, a small open point.
Could be considered a nit, depending on the turnaround time needed for this issue.

gemini-code-assist

Code Review

This pull request introduces support for PostgreSQL UUID types in the sourcedb-to-spanner template. Key changes include the implementation of a virtual "UUID" collation in the CollationMapper to handle bidirectional mapping between UUID strings and BigInteger, and updates to the PostgreSQLDialectAdapter to handle schema discovery and SQL casting for UUID columns. Feedback focuses on potential issues with the stateful management of column casting wrappers in instance-level maps, which could lead to memory pressure or serialization issues. Additionally, the identifier cleaning logic in the new ColumnKey class does not correctly handle PostgreSQL's case-sensitive quoted identifiers or schema namespaces, and the use of the magic string "UUID" should be replaced with a shared constant.

gemini-code-assist · 2026-05-12T18:27:17Z

+  private final Map<ColumnKey, String> columnCastWrappers =
+      new java.util.concurrent.ConcurrentHashMap<>();
+  private final Map<ColumnKey, String> columnParameterCastWrappers =
+      new java.util.concurrent.ConcurrentHashMap<>();


The use of instance-level maps to store per-column casting state discovered during schema discovery is potentially problematic. While PostgreSQLDialectAdapter is serializable, relying on side effects from discoverTableIndexes to populate state used in query generation can lead to issues if the adapter is re-instantiated or if discovery is skipped. Additionally, these maps grow indefinitely as more tables are discovered, which could lead to memory pressure in jobs processing a very large number of tables. Consider passing the necessary type information directly to the query generation methods or using a more stateless approach.

References

In non-performance-critical code paths or corner cases, avoid micro-optimizations (such as caching a method result used only a few times) if they do not significantly improve readability or performance.

gemini-code-assist · 2026-05-12T18:27:17Z

+    private static String clean(String identifier) {
+      if (identifier == null) {
+        return "";
+      }
+      return identifier.replace("\"", "").toLowerCase();
+    }


The clean method aggressively converts identifiers to lowercase. In PostgreSQL, identifiers are case-insensitive only when unquoted; quoted identifiers (e.g., "MyTable") are case-sensitive. This implementation will cause collisions if a schema contains both mytable and "MyTable", and will fail to match mixed-case quoted identifiers correctly. Furthermore, the ColumnKey does not include the schema/namespace, which could lead to collisions if multiple schemas are processed using the same adapter instance.
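
One possible direction for the concern above, shown as a minimal sketch rather than a proposed patch: fold only unquoted identifiers to lowercase, keep quoted identifiers verbatim, and include the schema in the key. The class name PgColumnKeySketch and its fields are hypothetical, and the sketch assumes identifiers reach the key either bare or wrapped in double quotes, which may not match how the adapter actually receives them.

```java
import java.util.Locale;
import java.util.Objects;

/** Hypothetical illustration only; not the ColumnKey class from this PR. */
final class PgColumnKeySketch {
  private final String schema;
  private final String table;
  private final String column;

  PgColumnKeySketch(String schema, String table, String column) {
    this.schema = normalize(schema);
    this.table = normalize(table);
    this.column = normalize(column);
  }

  /** Quoted identifiers keep their case; unquoted identifiers fold to lowercase. */
  static String normalize(String identifier) {
    if (identifier == null) {
      return "";
    }
    boolean quoted =
        identifier.length() >= 2 && identifier.startsWith("\"") && identifier.endsWith("\"");
    if (quoted) {
      // Drop the surrounding quotes and unescape doubled quotes inside.
      return identifier.substring(1, identifier.length() - 1).replace("\"\"", "\"");
    }
    return identifier.toLowerCase(Locale.ROOT);
  }

  @Override
  public boolean equals(Object other) {
    if (!(other instanceof PgColumnKeySketch)) {
      return false;
    }
    PgColumnKeySketch that = (PgColumnKeySketch) other;
    return schema.equals(that.schema) && table.equals(that.table) && column.equals(that.column);
  }

  @Override
  public int hashCode() {
    return Objects.hash(schema, table, column);
  }
}
```

With this shape, mytable and "MyTable" produce different keys, while MYTABLE and mytable still collapse to the same key, and the same table name in two schemas no longer collides.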

gemini-code-assist · 2026-05-12T18:27:17Z

    if (element == null) {
      return BigInteger.valueOf(-1);
    }
+    if ("UUID".equalsIgnoreCase(this.collationReference().dbCollation())) {


The string literal "UUID" is used as a 'virtual' collation name. It would be better to define this as a constant (e.g., VIRTUAL_UUID_COLLATION) in a shared location to avoid magic strings and ensure consistency between the DialectAdapter and the CollationMapper.

Suggested change

if ("UUID".equalsIgnoreCase(this.collationReference().dbCollation())) {

if (VIRTUAL_UUID_COLLATION.equalsIgnoreCase(this.collationReference().dbCollation())) {

aasthabharill requested a review from a team as a code owner May 12, 2026 05:13

aasthabharill requested review from pratickchokhani and rohitwali May 12, 2026 05:13

pull-request-size Bot added the size/M label May 12, 2026

aasthabharill marked this pull request as draft May 12, 2026 05:14

aasthabharill added the bug-fix label May 12, 2026

gemini-code-assist Bot reviewed May 12, 2026

Comment thread ...ud/teleport/v2/source/reader/io/jdbc/dialectadapter/postgresql/PostgreSQLDialectAdapter.java Outdated

initial changes

2929e21

aasthabharill force-pushed the bulk-partition-uuid branch from 0d25f6f to 2929e21 Compare May 12, 2026 05:36

initial changes

dfa515f

aasthabharill marked this pull request as ready for review May 12, 2026 08:05

gemini-code-assist Bot reviewed May 12, 2026

aasthabharill requested a review from VardhanThigle May 12, 2026 08:18

VardhanThigle previously approved these changes May 12, 2026

Comment thread ...ud/teleport/v2/source/reader/io/jdbc/dialectadapter/postgresql/PostgreSQLDialectAdapter.java Outdated

review changes

8d62dc2

aasthabharill dismissed VardhanThigle’s stale review via 8d62dc2 May 12, 2026 09:58

aasthabharill added 2 commits May 12, 2026 15:43

add IT

dbe9799

fix IT

41dea52

pull-request-size Bot added size/L and removed size/M labels May 12, 2026

aasthabharill added the improvement Making existing code better label May 12, 2026

VardhanThigle previously approved these changes May 12, 2026

aasthabharill marked this pull request as draft May 12, 2026 12:38

fix IT

4fd97fa

aasthabharill dismissed VardhanThigle’s stale review via 4fd97fa May 12, 2026 18:16

aasthabharill force-pushed the bulk-partition-uuid branch from ee56ef6 to 4fd97fa Compare May 12, 2026 18:16

aasthabharill marked this pull request as ready for review May 12, 2026 18:16

aasthabharill requested a review from VardhanThigle May 12, 2026 18:16

gemini-code-assist Bot reviewed May 12, 2026

aasthabharill wants to merge 6 commits into GoogleCloudPlatform:main from aasthabharill:bulk-partition-uuid
