Skip to content

feat: Support uuid as split column in postgres#180

Open
rajivharlalka wants to merge 6 commits intodatazip-inc:stagingfrom
rajivharlalka:rajivharlalka/backfill-uuid
Open

feat: Support uuid as split column in postgres#180
rajivharlalka wants to merge 6 commits intodatazip-inc:stagingfrom
rajivharlalka:rajivharlalka/backfill-uuid

Conversation

@rajivharlalka
Copy link
Copy Markdown

@rajivharlalka rajivharlalka commented Mar 26, 2025

Description

Fixes #144

Added checks to modify the SQL query with typecasting to TEXT datatype if the splitcolumn is stored as a string datatype in Olake.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Scenario A
    Added a table with fields id uuid(Primary Key), name text and checked backfill with id as splitkey.
  • Scenario B
    Added a table with fields id uuid(Primary Key), name text, roll int and checked backfill with id as the splitkey

Screenshots or Recordings

Related PR's (If Any):

@rajivharlalka rajivharlalka changed the base branch from master to staging March 26, 2025 05:28
@rajivharlalka
Copy link
Copy Markdown
Author

@hash-data any reviews please?

@hash-data
Copy link
Copy Markdown
Collaborator

@rajivharlalka reviewing today

@rajivharlalka
Copy link
Copy Markdown
Author

@hash-data any updates?

@hash-data hash-data self-requested a review April 2, 2025 17:43
@hash-data
Copy link
Copy Markdown
Collaborator

@rajivharlalka can you also mention the cases in which you have tested it.
Thanks

@hash-data hash-data changed the title Enable Backfill in PostgreSQL with UUID splitColumn feat: Support uuid as split column in postgres Apr 3, 2025
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 3, 2025

CLA assistant check
All committers have signed the CLA.

@rajivharlalka
Copy link
Copy Markdown
Author

@hash-data any further updates here?

Copy link
Copy Markdown
Collaborator

@hash-data hash-data left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some Comments

Add checks if the splitColumn is a UUID(string in olake) dattatype
modify the SQL query to type-caste the value and column to TEXT for
comparisons between uuid.
Signed-off-by: Rajiv Harlalka <rajivharlalka009@gmail.com>
Signed-off-by: Rajiv Harlalka <rajivharlalka009@gmail.com>
@rajivharlalka rajivharlalka force-pushed the rajivharlalka/backfill-uuid branch from 8f51fec to 76054d3 Compare April 7, 2025 11:03
@hash-data hash-data requested a review from vikaxsh April 9, 2025 09:09
@rajivharlalka
Copy link
Copy Markdown
Author

@hash-data @vikash390 Any updates here? Could you test it out or should I help with a test PostgreSQL dump that could help.

@hash-data
Copy link
Copy Markdown
Collaborator

@rajivharlalka, we were busy with some ad hoc tasks and will try to finish testing today
Thanks

@zriyanshdz
Copy link
Copy Markdown
Contributor

hey @rajivharlalka as we are approaching towards final merge, could you just resolve the comments so we can get it merged soon?

Signed-off-by: Rajiv Harlalka <rajivharlalka009@gmail.com>
@rajivharlalka
Copy link
Copy Markdown
Author

@hash-data do let me know if there is any more changes needed.

@zriyanshdz zriyanshdz requested a review from hash-data May 20, 2025 12:14
return fmt.Sprintf("%s AND %s",
formatter(filterColumn, ">=", chunk.Min),
formatter(filterColumn, "<=", chunk.Max))
}
Copy link
Copy Markdown
Collaborator

@ImDoubD-datazip ImDoubD-datazip May 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildChunkCondition function has a problem. As of no, what it is doing is it includes both the chunk min and chunk max boundary in a single chunk, so there is data duplication. Basically the chunk min for every chunk is getting duplicated as the max of previous chunk is the min of current chunk.

Expected behaviour => When he first chunk is formed, only then the all the data between the chunk min and chunk max including must be in that chunk, from next chunk onwards the chunk min (which was chunk max of previous chunk) should not be included in the chunk. It should only be included in the first chunk made.

@rajivharlalka

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// check for data distribution
// TODO: remove chunk intersections where chunks can be {0, 100} {100, 200}. Need to {0, 99} {100, 200}
splitChunks, err = p.splitTableIntoChunks(stream)

I feel this is an already understood problem and it's solution isn't in the buildChunk function. I read the todo and hence left the known problem on the idea that it'll get fixed later.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already solved. We have merged it to staging. Please take the latest pull of staging and modify code logic to include the uuid as split column.
@rajivharlalka

@ImDoubD-datazip
Copy link
Copy Markdown
Collaborator

Hi @rajivharlalka , I have created a PR for this uuid thingy in postgres. If you can possibly pull the current staging and do changes then I will be able to merge your PR and your contribution will be reflected else the PR i have created will be merged.
My PR: #405

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: uuid as split column in postgres

5 participants