
TIMX 509 - explicit run timestamp #149

Merged
ghukill merged 2 commits into main from TIMX-509-explicit-run-timestamp
Jun 24, 2025

Conversation


ghukill (Contributor) commented Jun 17, 2025

Purpose and background context

This PR allows explicit run_timestamp to be passed when writing records to the dataset. Formerly, a run_timestamp was minted automatically during TIMDEXDataset.write().

This is achieved by adding run_timestamp to the DatasetRecord class, which allows it to be included naturally when writing. If it is not set or passed, a timestamp derived from the required run_date column is used (e.g. 2025-01-01 --> 2025-01-01T00:00:00.000000).

TIMX-509 is concerned with ensuring run_timestamp is identical for all rows, across all parquet files, for a single ETL run_id. Because Transmogrifier can be invoked multiple times for one ETL run (e.g. Alma, where there may be multiple source XML files), which in turn invokes TIMDEXDataset.write() multiple times, the run_timestamp must be determined at the run level and then passed all the way through to the actual dataset writing performed by TDA. This PR establishes that for TDA; future PRs will work it backwards through Transmog (update: Transmog PR) and the pipeline lambdas (update: lambda PR).

How can a reviewer manually see the effects of these changes?

The effects are not easy to observe directly. Looking at the test fixture dataset_with_same_day_runs is probably the best way to get a feel for them.

Now when we generate sample records for writing, we include an explicit run_timestamp as part of the DatasetRecord that we feed to TIMDEXDataset.write(). This is ultimately the whole change: instead of a run_timestamp being generated automatically for all records per .write() call, .write() now expects that column to already be set.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: TIMX-509 will require these changes to TDA, and updates to Transmog and pipeline lambda, all of which should get deployed together.

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

After it was discovered that a single ETL run (run_id) has multiple, different
run_timestamps in the dataset, it was clear we needed a way to pass an explicit
run_timestamp for writing instead of relying on a run_timestamp minted as part
of the TIMDEXDataset.write() method.

How this addresses that need:

DatasetRecord was given an optional run_timestamp property that defaults
to deriving a full ISO timestamp from run_date.

Side effects of this change:
* Applications like Transmogrifier can pass an explicit run_timestamp
when writing to the dataset, ensuring that even multiple invocations
of Transmogrifier + TIMDEXDataset.write() can share the same timestamp
for an ETL run.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-509
Comment thread: tests/test_dataset.py


@pytest.mark.freeze_time("2025-05-22 01:23:45.567890")
def test_dataset_write_includes_minted_run_timestamp(tmp_path):
Contributor Author:
This test no longer makes sense as TIMDEXDataset.write() no longer mints a timestamp if one is not provided.

Either run_timestamp is provided explicitly to each DatasetRecord for writing, or it defaults to an ISO timestamp version of run_date for that record (which is required).

coveralls commented Jun 17, 2025

Pull Request Test Coverage Report for Build 15737672508

Details

  • 9 of 9 (100.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.08%) to 95.625%

Totals Coverage Status

  • Change from base Build 15424451546: 0.08%
  • Covered Lines: 306
  • Relevant Lines: 320

💛 - Coveralls

ghukill marked this pull request as ready for review June 18, 2025 13:11
Comment on lines -426 to -436
run_timestamp = datetime.now(UTC)
for i, record_batch in enumerate(
    itertools.batched(records_iter, self.config.write_batch_size)
):
    record_dicts = [
        {
            **record.to_dict(),
            "run_timestamp": run_timestamp,
        }
        for record in record_batch
    ]
Contributor Author:

This is really the most important part of the PR: we no longer mint a run_timestamp value in this method; instead we assume/require that every DatasetRecord being written already has a value.

ghukill requested a review from a team June 18, 2025 13:13

ehanson8 left a comment:


The logic of this change is very clear, and it does make sense to generate/pass run_timestamp in Transmogrifier rather than here.

ehanson8 self-assigned this Jun 24, 2025
ghukill merged commit a479435 into main Jun 24, 2025
2 checks passed