Merged
Conversation
Why these changes are being introduced: After it was discovered that a single ETL run (run_id) has multiple, different run_timestamps in the dataset, it was clear we needed a way to pass an explicit run_timestamp for writing instead of relying on a run_timestamp minted as part of the TIMDEXDataset.write() method. How this addresses that need: DatasetRecord was given an optional run_timestamp property, that defaults to using run_date and producing a full ISO timestamp from that. Side effects of this change: * Applications like Transmogrifier can pass an explicit run_timestamp when writing to the dataset, ensuring that even multiple invocations of Transmogrifier + TIMDEXDataset.write() can share the same timestamp for an ETL run. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-509
ghukill
commented
Jun 17, 2025
|
|
||
|
|
||
| @pytest.mark.freeze_time("2025-05-22 01:23:45.567890") | ||
| def test_dataset_write_includes_minted_run_timestamp(tmp_path): |
Contributor
Author
There was a problem hiding this comment.
This test no longer makes sense as TIMDEXDataset.write() no longer mints a timestamp if one is not provided.
Either run_timestamp is provided explicitly to each DatasetRecord for writing, or it defaults to an ISO timestamp version of run_date for that record (which is required).
Pull Request Test Coverage Report for Build 15737672508Details
💛 - Coveralls |
ghukill
commented
Jun 18, 2025
Comment on lines
-426
to
-436
| run_timestamp = datetime.now(UTC) | ||
| for i, record_batch in enumerate( | ||
| itertools.batched(records_iter, self.config.write_batch_size) | ||
| ): | ||
| record_dicts = [ | ||
| { | ||
| **record.to_dict(), | ||
| "run_timestamp": run_timestamp, | ||
| } | ||
| for record in record_batch | ||
| ] |
Contributor
Author
There was a problem hiding this comment.
This is really the most important part of the PR: we no longer mint a run_timestamp value in this method, but instead assume/require that all DatasetRecord's that are getting written will have a value already.
770f7c9 to
029506e
Compare
This was referenced Jun 18, 2025
ehanson8
approved these changes
Jun 24, 2025
ehanson8
left a comment
There was a problem hiding this comment.
The logic of this change is very clear and it does make sense to generate/pass run_timestamp in transmogrifier rather than here
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose and background context
This PR allows explicit
run_timestampto be passed when writing records to the dataset. Formerly, arun_timestampwas minted automatically duringTIMDEXDataset.write().This is achieved by adding
run_timestampto theDatasetRecordclass which allows it to organically get included for writing. If it is not set or passed, a timestamp from the requiredrun_datecolumn will be used (e.g.2025-01-01-->2025-01-01T00:00:00.000000).TIMX-509 is concerned with ensuring
run_timestampis the same for all rows, in all parquet files, for a single ETLrun_id. Because Transmogrifier can be invoked multiple times for an ETL run (e.g.almawhere there might be multiple source XML files), which in turn invokesTIMDEXDataset.write()multiple times, we need to be able to determine therun_timestampat the run level and then pass it along all the way to actual dataset writing performed by TDA. This establishes that for TDA, and future PRs will work it backwards through Transmog (update: Transmog PR) and the pipeline lambdas (update: lambda PR).How can a reviewer manually see the effects of these changes?
It's difficult to see easily. Looking at the test fixture
dataset_with_same_day_runsis probably the best way to get a feel for it.Now when we generate sample records for writing, we are including an explicit
run_timestampas part of theDatasetRecordthat we feed toTIMDEXDataset.write(). This is ultimately the whole change here: instead of arun_timestampgetting automatically generated for all records per.write()call,.write()now just expects that column to be set already.Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES: TIMX-509 will require these changes to TDA, and updates to Transmog and pipeline lambda, all of which should get deployed together.
What are the relevant tickets?
Developer
Code Reviewer(s)