fix(sourcedb-to-spanner): Add support for partitioning PostgreSQL UUID data type#3796
fix(sourcedb-to-spanner): Add support for partitioning PostgreSQL UUID data type#3796aasthabharill wants to merge 6 commits into
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a failure in the sourcedb-to-spanner pipeline when migrating tables with UUID primary keys or unique indexes. By default, PostgreSQL classifies UUIDs as an unknown type category, which previously caused the pipeline to misidentify them as non-partitionable. The changes ensure these columns are treated as strings, allowing the splitting engine to successfully partition the data for parallel processing. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request updates the PostgreSQL dialect adapter to support UUID columns by mapping them to the STRING index type, which allows them to be used for partitioning. The reviewer suggested optimizing the implementation by caching the 'type_name' value to avoid redundant JDBC calls and explicitly setting the string length to 36 for UUIDs to improve partitioning performance.
0d25f6f to
2929e21
Compare
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (60.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #3796 +/- ##
============================================
+ Coverage 53.41% 59.43% +6.01%
+ Complexity 6629 2173 -4456
============================================
Files 1082 506 -576
Lines 65795 29460 -36335
Branches 7328 3229 -4099
============================================
- Hits 35147 17509 -17638
+ Misses 28288 10968 -17320
+ Partials 2360 983 -1377
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Code Review
This pull request updates the PostgreSQLDialectAdapter to correctly handle PostgreSQL uuid types during table index discovery. Specifically, it now identifies uuid columns as STRING index types with a fixed maximum length of 36 characters. Corresponding unit tests were added to verify this behavior. I have no feedback to provide as there were no review comments to evaluate.
VardhanThigle
left a comment
There was a problem hiding this comment.
Looks good overall, a small open point.
Could be considered nit accroding to the turn around time needed for this issue.
ee56ef6 to
4fd97fa
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces support for PostgreSQL UUID types in the sourcedb-to-spanner template. Key changes include the implementation of a virtual "UUID" collation in the CollationMapper to handle bidirectional mapping between UUID strings and BigInteger, and updates to the PostgreSQLDialectAdapter to handle schema discovery and SQL casting for UUID columns. Feedback focuses on potential issues with the stateful management of column casting wrappers in instance-level maps, which could lead to memory pressure or serialization issues. Additionally, the identifier cleaning logic in the new ColumnKey class does not correctly handle PostgreSQL's case-sensitive quoted identifiers or schema namespaces, and the use of the magic string "UUID" should be replaced with a shared constant.
| private final Map<ColumnKey, String> columnCastWrappers = | ||
| new java.util.concurrent.ConcurrentHashMap<>(); | ||
| private final Map<ColumnKey, String> columnParameterCastWrappers = | ||
| new java.util.concurrent.ConcurrentHashMap<>(); |
There was a problem hiding this comment.
The use of instance-level maps to store per-column casting state discovered during schema discovery is potentially problematic. While PostgreSQLDialectAdapter is serializable, relying on side effects from discoverTableIndexes to populate state used in query generation can lead to issues if the adapter is re-instantiated or if discovery is skipped. Additionally, these maps grow indefinitely as more tables are discovered, which could lead to memory pressure in jobs processing a very large number of tables. Consider passing the necessary type information directly to the query generation methods or using a more stateless approach.
References
- In non-performance-critical code paths or corner cases, avoid micro-optimizations (such as caching a method result used only a few times) if they do not significantly improve readability or performance.
| private static String clean(String identifier) { | ||
| if (identifier == null) { | ||
| return ""; | ||
| } | ||
| return identifier.replace("\"", "").toLowerCase(); | ||
| } |
There was a problem hiding this comment.
The clean method aggressively converts identifiers to lowercase. In PostgreSQL, identifiers are case-insensitive only when unquoted; quoted identifiers (e.g., "MyTable") are case-sensitive. This implementation will cause collisions if a schema contains both mytable and "MyTable", and will fail to match mixed-case quoted identifiers correctly. Furthermore, the ColumnKey does not include the schema/namespace, which could lead to collisions if multiple schemas are processed using the same adapter instance.
| if (element == null) { | ||
| return BigInteger.valueOf(-1); | ||
| } | ||
| if ("UUID".equalsIgnoreCase(this.collationReference().dbCollation())) { |
There was a problem hiding this comment.
The string literal "UUID" is used as a 'virtual' collation name. It would be better to define this as a constant (e.g., VIRTUAL_UUID_COLLATION) in a shared location to avoid magic strings and ensure consistency between the DialectAdapter and the CollationMapper.
| if ("UUID".equalsIgnoreCase(this.collationReference().dbCollation())) { | |
| if (VIRTUAL_UUID_COLLATION.equalsIgnoreCase(this.collationReference().dbCollation())) { |
https://b.corp.google.com/issues/512087945
The Issue
During a bulk migration from PostgreSQL to Cloud Spanner using the
sourcedb-to-spannertemplate, the pipeline launch failed with aSuitableIndexNotFoundException(thrown inJdbcIoWrapper/PipelineController) when a table’s primary key or unique index was ofUUIDdata type.Changes Made & Rationale
1. Map UUID Columns to a Virtual
"UUID"CollationPostgreSQLDialectAdapter.discoverTableIndexes, if thetypeNameof a column is"uuid", we assign"UUID"as its collation reference.CollationMapper.fromDBexpects a virtual"UUID"collation tag to trigger the static hexadecimal base-16 mapper (buildStaticUuidMapper). By assigning"UUID"during discovery, the splitter bypasses executing a database query to fetch collation rankings (which would fail or be extremely slow for a native UUID type that has no physical collation).2. Configure Virtual Type Length to
32for UUID ColumnsPostgreSQLDialectAdapter.discoverTableIndexes, iftypeLengthisnullandtypeNameis"uuid", we settypeLength = 32.CollationMapperstrips the hyphens out during mapping, leaving exactly 32 hexadecimal characters. Overriding the discovered length to32ensures that no additional padding (virtual zero-rank characters) is appended during range partitioning calculations, ensuring a clean 1-to-1 mapping and unmapping.3. Register State-Based Query and Parameter Cast Wrappers for UUID
PostgreSQLDialectAdapter.discoverTableIndexes, iftypeNameis"uuid", we register explicit SQL cast statements incolumnCastWrappersandcolumnParameterCastWrappersmaps.columnCastWrappers(CAST(%s AS TEXT)): Used ingetBoundaryQueryto queryMIN(CAST(col AS TEXT))andMAX(CAST(col AS TEXT)). This is necessary to retrieve the UUID boundaries safely as standard text strings compatible with JDBC. UUID doesnt have a MIN or MAX.columnParameterCastWrappers(CAST(? AS uuid)): Used ingetReadQueryandgetCountQueryto bind parameter boundary placeholders ascol >= CAST(? AS uuid). This is necessary because PostgreSQL does not support implicit comparison of standard JDBC string parameter bindings against nativeuuidcolumn types.4. Verify Changes with Unit & Integration Tests
testUuidCollationMapperin CollationMapperTest.java to verify that canonical UUID strings are mapped to 128-bit BigIntegers and unmapped back with correct formatting and hyphen insertion.testDiscoverTableIndexesWithUuidin PostgreSQLDialectAdapterTest.java verifying index discovery mappings, boundary query wrapping, and read/count query parameter bindings.getExpectedDatain PostgreSQLWithUniformizationIT.java to support assertions for tables with non-integer primary keys (uuid_pk).