Improve performance of new transform for large tables #68
Merged
brycekbargar merged 18 commits into library-data-platform:release-v4.0.0 on Mar 10, 2026
d33f3c4 into library-data-platform:release-v4.0.0 (1 check passed)
The biggest draw of moving the transformation logic from Python into Postgres was the speedup. Unfortunately, as I originally implemented it, Postgres ran out of memory and died on any table whose row x column count approached a million. This PR is the result of a lot of performance tuning and optimization to bring memory usage down while staying fast. I've verified it on a table with 6 million rows and feel confident about the biggest tables, which will be tested soon.
During testing I realized I had forgotten the progress bars, which was really annoying because I had no idea whether the transformation was doing anything, so I added them back in this PR. I also realized that indexing did not work on tables with schemas and fixed that.
Note: Postgres 14 is required for negative indexing on the table name in the tcatalog table.
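For anyone wondering what "negative indexing" means here: this PR description does not show the query, so the following is only a guess at the semantics involved. Postgres 14 added support for negative field positions in `split_part`, which count fields from the end of the string. The Python stand-in below mimics that behavior; the function is illustrative and the table name `folio.users__t` is a made-up example, not taken from the code in this PR.

```python
def split_part(text: str, delimiter: str, n: int) -> str:
    """Illustrative Python stand-in for Postgres split_part(text, delimiter, n).

    Positive n counts fields from the start (1-based).
    Negative n counts fields from the end (Postgres 14 and later).
    """
    if n == 0:
        # Postgres also rejects a field position of zero.
        raise ValueError("field position must not be zero")
    # An empty delimiter leaves the string as a single field.
    fields = text.split(delimiter) if delimiter else [text]
    index = n - 1 if n > 0 else n
    try:
        return fields[index]
    except IndexError:
        # Out-of-range positions yield an empty string rather than an error.
        return ""


# Hypothetical table names, with and without a schema prefix.
print(split_part("folio.users__t", ".", -1))
print(split_part("users__t", ".", -1))
```

With a negative position the same expression returns the bare table name whether or not a schema prefix is present, which is presumably why it is convenient for schema-qualified tables.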