fix(sourcedb-to-spanner): Add support for partitioning PostgreSQL UUID data type#3796

Open
aasthabharill wants to merge 6 commits into GoogleCloudPlatform:main from aasthabharill:bulk-partition-uuid

Conversation

Member

@aasthabharill aasthabharill commented May 12, 2026

https://b.corp.google.com/issues/512087945

The Issue

During a bulk migration from PostgreSQL to Cloud Spanner using the sourcedb-to-spanner template, the pipeline launch failed with a SuitableIndexNotFoundException (thrown in JdbcIoWrapper / PipelineController) when a table’s primary key or unique index was of UUID data type.

Changes Made & Rationale

1. Map UUID Columns to a Virtual "UUID" Collation

  • Change: In PostgreSQLDialectAdapter.discoverTableIndexes, if the typeName of a column is "uuid", we assign "UUID" as its collation reference.
  • Why: CollationMapper.fromDB expects a virtual "UUID" collation tag to trigger the static hexadecimal base-16 mapper (buildStaticUuidMapper). By assigning "UUID" during discovery, the splitter bypasses executing a database query to fetch collation rankings (which would fail or be extremely slow for a native UUID type that has no physical collation).
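A minimal sketch of the discovery-time decision described in item 1, assuming a simplified helper rather than the real PostgreSQLDialectAdapter.discoverTableIndexes code. The class name, method names, and the VIRTUAL_UUID_COLLATION constant are hypothetical and only illustrate how a uuid column could be tagged so that a mapper can skip the collation-ranking query.

```java
import java.util.Locale;

/** Hypothetical sketch; none of these names come from the template's actual API. */
final class UuidCollationSketch {
  // Assumed constant name for the virtual collation tag described in the PR.
  static final String VIRTUAL_UUID_COLLATION = "UUID";

  /** Returns the collation to record for an index column during discovery. */
  static String resolveCollation(String typeName, String discoveredCollation) {
    if (typeName != null && "uuid".equals(typeName.toLowerCase(Locale.ROOT))) {
      // A native uuid column has no physical collation, so tag it with the virtual one.
      return VIRTUAL_UUID_COLLATION;
    }
    return discoveredCollation;
  }

  /** True when a mapper could use a static hex mapping instead of querying the database. */
  static boolean usesStaticMapper(String collation) {
    return VIRTUAL_UUID_COLLATION.equalsIgnoreCase(collation);
  }

  public static void main(String[] args) {
    System.out.println(resolveCollation("uuid", null));
    System.out.println(resolveCollation("varchar", "en_US.utf8"));
  }
}
```

In this sketch a non-uuid column keeps whatever collation discovery found, so existing string columns are unaffected.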

2. Configure Virtual Type Length to 32 for UUID Columns

  • Change: In PostgreSQLDialectAdapter.discoverTableIndexes, if typeLength is null and typeName is "uuid", we set typeLength = 32.
  • Why: While a standard canonical UUID is 36 characters long (including hyphens), CollationMapper strips the hyphens out during mapping, leaving exactly 32 hexadecimal characters. Overriding the discovered length to 32 ensures that no additional padding (virtual zero-rank characters) is appended during range partitioning calculations, ensuring a clean 1-to-1 mapping and unmapping.
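To illustrate why 32 is the natural length, here is a rough sketch of a hyphen-stripping base-16 mapping between a canonical UUID string and a BigInteger. It is not the CollationMapper implementation; the class and method names are hypothetical, and the sketch assumes well-formed canonical input.

```java
import java.math.BigInteger;

/** Hypothetical sketch of a UUID <-> BigInteger mapping; not the template's code. */
final class UuidHexMappingSketch {
  // 36 canonical characters minus 4 hyphens leaves 32 hex digits (128 bits).
  static final int HEX_LENGTH = 32;

  /** Strips hyphens and reads the remaining 32 hex digits as a base-16 number. */
  static BigInteger toBigInteger(String canonicalUuid) {
    String hex = canonicalUuid.replace("-", "");
    if (hex.length() != HEX_LENGTH) {
      throw new IllegalArgumentException("Expected 32 hex digits, got " + hex.length());
    }
    return new BigInteger(hex, 16);
  }

  /** Left-pads to 32 hex digits and re-inserts hyphens in the 8-4-4-4-12 layout. */
  static String toCanonicalUuid(BigInteger value) {
    if (value.signum() < 0 || value.bitLength() > 128) {
      throw new IllegalArgumentException("Value does not fit in 128 bits");
    }
    String hex = String.format("%32s", value.toString(16)).replace(' ', '0');
    return hex.substring(0, 8) + "-" + hex.substring(8, 12) + "-"
        + hex.substring(12, 16) + "-" + hex.substring(16, 20) + "-" + hex.substring(20);
  }

  public static void main(String[] args) {
    String uuid = "123e4567-e89b-12d3-a456-426614174000";
    BigInteger mapped = toBigInteger(uuid);
    System.out.println(mapped.toString(16));
    System.out.println(toCanonicalUuid(mapped));
  }
}
```

Because every digit position carries real data, a length of 32 needs no padding characters on either side of the round trip. Note that this sketch always emits lowercase hex, so uppercase input would not round-trip byte for byte.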

3. Register State-Based Query and Parameter Cast Wrappers for UUID

  • Change: In PostgreSQLDialectAdapter.discoverTableIndexes, if typeName is "uuid", we register explicit SQL cast statements in columnCastWrappers and columnParameterCastWrappers maps.
  • Why:
    • columnCastWrappers (CAST(%s AS TEXT)): Used in getBoundaryQuery to query MIN(CAST(col AS TEXT)) and MAX(CAST(col AS TEXT)). This is necessary to retrieve the UUID boundaries safely as standard text strings compatible with JDBC, because UUID doesn't have a MIN or MAX.
    • columnParameterCastWrappers (CAST(? AS uuid)): Used in getReadQuery and getCountQuery to bind parameter boundary placeholders as col >= CAST(? AS uuid). This is necessary because PostgreSQL does not support implicit comparison of standard JDBC string parameter bindings against native uuid column types.
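The two wrappers can be pictured with a simplified sketch. The wrapper strings come from the description above; the surrounding query shapes, the class name, and the method names are assumptions made for illustration and are much simpler than what getBoundaryQuery and getReadQuery actually produce.

```java
/** Hypothetical sketch of applying the cast wrappers; query shapes are simplified. */
final class UuidCastWrapperSketch {
  // Wrapper strings as described in the PR.
  static final String COLUMN_CAST = "CAST(%s AS TEXT)";
  static final String PARAMETER_CAST = "CAST(? AS uuid)";

  /** Boundary query: aggregate over the text form of the column. */
  static String boundaryQuery(String table, String column) {
    String wrapped = String.format(COLUMN_CAST, column);
    return String.format("SELECT MIN(%s), MAX(%s) FROM %s", wrapped, wrapped, table);
  }

  /** Read query: compare the native uuid column against cast parameters. */
  static String readQuery(String table, String column) {
    return String.format(
        "SELECT * FROM %s WHERE %s >= %s AND %s < %s",
        table, column, PARAMETER_CAST, column, PARAMETER_CAST);
  }

  public static void main(String[] args) {
    System.out.println(boundaryQuery("orders", "id"));
    System.out.println(readQuery("orders", "id"));
  }
}
```

The column side of the read query stays a native uuid, so an index on the column remains usable; only the bound parameters are cast.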

4. Verify Changes with Unit & Integration Tests

  • Collation Mapper Test: Added testUuidCollationMapper in CollationMapperTest.java to verify that canonical UUID strings are mapped to 128-bit BigIntegers and unmapped back with correct formatting and hyphen insertion.
  • Dialect Adapter Test: Added testDiscoverTableIndexesWithUuid in PostgreSQLDialectAdapterTest.java verifying index discovery mappings, boundary query wrapping, and read/count query parameter bindings.
  • Integration Test expected data logic: Updated getExpectedData in PostgreSQLWithUniformizationIT.java to support assertions for tables with non-integer primary keys (uuid_pk).

@aasthabharill aasthabharill requested a review from a team as a code owner May 12, 2026 05:13
@aasthabharill aasthabharill marked this pull request as draft May 12, 2026 05:14
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a failure in the sourcedb-to-spanner pipeline when migrating tables with UUID primary keys or unique indexes. By default, PostgreSQL classifies UUIDs as an unknown type category, which previously caused the pipeline to misidentify them as non-partitionable. The changes ensure these columns are treated as strings, allowing the splitting engine to successfully partition the data for parallel processing.

Highlights

  • UUID Data Type Support: Updated the PostgreSQL dialect adapter to explicitly recognize the 'uuid' data type, mapping it to IndexType.STRING to enable proper partitioning during bulk migrations.
  • Improved Index Discovery: Modified the index discovery logic to pass the column type name alongside the type category, allowing for more granular identification of database types.
  • Test Coverage: Added a new unit test case in PostgreSQLDialectAdapterTest to verify that UUID-based indexes are correctly identified and handled.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the PostgreSQL dialect adapter to support UUID columns by mapping them to the STRING index type, which allows them to be used for partitioning. The reviewer suggested optimizing the implementation by caching the 'type_name' value to avoid redundant JDBC calls and explicitly setting the string length to 36 for UUIDs to improve partitioning performance.

@aasthabharill aasthabharill force-pushed the bulk-partition-uuid branch from 0d25f6f to 2929e21 Compare May 12, 2026 05:36
@codecov

codecov Bot commented May 12, 2026

Codecov Report

❌ Patch coverage is 60.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.43%. Comparing base (f8472fe) to head (8d62dc2).
⚠️ Report is 8 commits behind head on main.

Files with missing lines | Patch % | Lines
...ctadapter/postgresql/PostgreSQLDialectAdapter.java | 60.00% | 2 Missing and 2 partials ⚠️

❌ Your patch check has failed because the patch coverage (60.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3796      +/-   ##
============================================
+ Coverage     53.41%   59.43%   +6.01%     
+ Complexity     6629     2173    -4456     
============================================
  Files          1082      506     -576     
  Lines         65795    29460   -36335     
  Branches       7328     3229    -4099     
============================================
- Hits          35147    17509   -17638     
+ Misses        28288    10968   -17320     
+ Partials       2360      983    -1377     

Components | Coverage Δ
spanner-templates | 74.81% <60.00%> (+1.99%) ⬆️
spanner-import-export | ∅ <ø> (∅)
spanner-live-forward-migration | 80.84% <ø> (-0.09%) ⬇️
spanner-live-reverse-replication | 77.01% <ø> (-0.04%) ⬇️
spanner-bulk-migration | 91.02% <60.00%> (-0.08%) ⬇️
gcs-spanner-dv | 86.67% <ø> (+0.92%) ⬆️

Files with missing lines | Coverage Δ
...ctadapter/postgresql/PostgreSQLDialectAdapter.java | 95.33% <60.00%> (-1.97%) ⬇️

... and 603 files with indirect coverage changes

@aasthabharill aasthabharill marked this pull request as ready for review May 12, 2026 08:05

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the PostgreSQLDialectAdapter to correctly handle PostgreSQL uuid types during table index discovery. Specifically, it now identifies uuid columns as STRING index types with a fixed maximum length of 36 characters. Corresponding unit tests were added to verify this behavior. I have no feedback to provide as there were no review comments to evaluate.

VardhanThigle previously approved these changes May 12, 2026
Contributor

@VardhanThigle VardhanThigle left a comment

Looks good overall, with one small open point.
It could be considered a nit, depending on the turnaround time needed for this issue.

@pull-request-size pull-request-size Bot added size/L and removed size/M labels May 12, 2026
@aasthabharill aasthabharill added the improvement Making existing code better label May 12, 2026
VardhanThigle previously approved these changes May 12, 2026
@aasthabharill aasthabharill marked this pull request as draft May 12, 2026 12:38
@aasthabharill aasthabharill force-pushed the bulk-partition-uuid branch from ee56ef6 to 4fd97fa Compare May 12, 2026 18:16
@aasthabharill aasthabharill marked this pull request as ready for review May 12, 2026 18:16

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for PostgreSQL UUID types in the sourcedb-to-spanner template. Key changes include the implementation of a virtual "UUID" collation in the CollationMapper to handle bidirectional mapping between UUID strings and BigInteger, and updates to the PostgreSQLDialectAdapter to handle schema discovery and SQL casting for UUID columns. Feedback focuses on potential issues with the stateful management of column casting wrappers in instance-level maps, which could lead to memory pressure or serialization issues. Additionally, the identifier cleaning logic in the new ColumnKey class does not correctly handle PostgreSQL's case-sensitive quoted identifiers or schema namespaces, and the use of the magic string "UUID" should be replaced with a shared constant.

Comment on lines +86 to +89
  private final Map<ColumnKey, String> columnCastWrappers =
      new java.util.concurrent.ConcurrentHashMap<>();
  private final Map<ColumnKey, String> columnParameterCastWrappers =
      new java.util.concurrent.ConcurrentHashMap<>();

medium

The use of instance-level maps to store per-column casting state discovered during schema discovery is potentially problematic. While PostgreSQLDialectAdapter is serializable, relying on side effects from discoverTableIndexes to populate state used in query generation can lead to issues if the adapter is re-instantiated or if discovery is skipped. Additionally, these maps grow indefinitely as more tables are discovered, which could lead to memory pressure in jobs processing a very large number of tables. Consider passing the necessary type information directly to the query generation methods or using a more stateless approach.
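One way to picture the stateless alternative suggested here is to carry the column's type name in an immutable descriptor and derive the casts at query-generation time. This is only a sketch under that assumption; PartitionColumn and the helper methods are hypothetical and do not exist in the template.

```java
import java.util.Locale;

/** Hypothetical sketch of deriving casts from column metadata instead of adapter state. */
final class StatelessCastSketch {

  /** Immutable description of a partition column, passed to query generation. */
  static final class PartitionColumn {
    final String name;
    final String typeName;

    PartitionColumn(String name, String typeName) {
      this.name = name;
      this.typeName = typeName;
    }

    boolean isUuid() {
      return typeName != null && "uuid".equals(typeName.toLowerCase(Locale.ROOT));
    }
  }

  /** Expression used when selecting boundary values for the column. */
  static String selectExpression(PartitionColumn column) {
    return column.isUuid() ? "CAST(" + column.name + " AS TEXT)" : column.name;
  }

  /** Placeholder used when binding a boundary parameter for the column. */
  static String parameterPlaceholder(PartitionColumn column) {
    return column.isUuid() ? "CAST(? AS uuid)" : "?";
  }

  public static void main(String[] args) {
    PartitionColumn id = new PartitionColumn("id", "uuid");
    System.out.println(selectExpression(id));
    System.out.println(parameterPlaceholder(id));
  }
}
```

With this shape nothing accumulates on the adapter instance, and the result does not depend on discovery having run earlier on the same object. The trade-off is that the type name has to travel with the column through the query-generation call chain.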

References
  1. In non-performance-critical code paths or corner cases, avoid micro-optimizations (such as caching a method result used only a few times) if they do not significantly improve readability or performance.

Comment on lines +626 to +631
  private static String clean(String identifier) {
    if (identifier == null) {
      return "";
    }
    return identifier.replace("\"", "").toLowerCase();
  }

medium

The clean method aggressively converts identifiers to lowercase. In PostgreSQL, identifiers are case-insensitive only when unquoted; quoted identifiers (e.g., "MyTable") are case-sensitive. This implementation will cause collisions if a schema contains both mytable and "MyTable", and will fail to match mixed-case quoted identifiers correctly. Furthermore, the ColumnKey does not include the schema/namespace, which could lead to collisions if multiple schemas are processed using the same adapter instance.
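A possible direction for addressing this, sketched under the assumption that identifiers arrive in SQL syntax (quoted or unquoted): fold only unquoted identifiers to lowercase, preserve the case of quoted ones, and include the schema in the key. The class below is a hypothetical stand-in, not the PR's ColumnKey. If the names instead come from JDBC metadata, which typically returns already-resolved names without quotes, the folding step would need to be skipped.

```java
import java.util.Locale;
import java.util.Objects;

/** Hypothetical sketch of a case-aware, schema-qualified column key. */
final class IdentifierKeySketch {

  /** Folds unquoted identifiers to lowercase; keeps quoted identifiers case-sensitive. */
  static String normalize(String identifier) {
    if (identifier == null) {
      return "";
    }
    boolean quoted =
        identifier.length() >= 2 && identifier.startsWith("\"") && identifier.endsWith("\"");
    if (quoted) {
      // Drop the outer quotes and unescape doubled quotes, preserving case.
      return identifier.substring(1, identifier.length() - 1).replace("\"\"", "\"");
    }
    return identifier.toLowerCase(Locale.ROOT);
  }

  /** Key that includes the schema so equal table names in different schemas stay distinct. */
  static final class ColumnKey {
    final String schema;
    final String table;
    final String column;

    ColumnKey(String schema, String table, String column) {
      this.schema = normalize(schema);
      this.table = normalize(table);
      this.column = normalize(column);
    }

    @Override
    public boolean equals(Object other) {
      if (!(other instanceof ColumnKey)) {
        return false;
      }
      ColumnKey that = (ColumnKey) other;
      return schema.equals(that.schema) && table.equals(that.table) && column.equals(that.column);
    }

    @Override
    public int hashCode() {
      return Objects.hash(schema, table, column);
    }
  }

  public static void main(String[] args) {
    System.out.println(normalize("MyTable"));
    System.out.println(normalize("\"MyTable\""));
  }
}
```

Here mytable and "MyTable" produce different keys, while MyTable and mytable (both unquoted) produce the same one, which matches how PostgreSQL folds unquoted identifiers.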

if (element == null) {
return BigInteger.valueOf(-1);
}
if ("UUID".equalsIgnoreCase(this.collationReference().dbCollation())) {

medium

The string literal "UUID" is used as a 'virtual' collation name. It would be better to define this as a constant (e.g., VIRTUAL_UUID_COLLATION) in a shared location to avoid magic strings and ensure consistency between the DialectAdapter and the CollationMapper.

Suggested change
- if ("UUID".equalsIgnoreCase(this.collationReference().dbCollation())) {
+ if (VIRTUAL_UUID_COLLATION.equalsIgnoreCase(this.collationReference().dbCollation())) {

Labels

bug-fix, improvement (Making existing code better), size/L
