TIMX 509 - explicit run timestamp by ghukill · Pull Request #149 · MITLibraries/timdex-dataset-api

ghukill · 2025-06-17T21:01:17Z

Purpose and background context

This PR allows explicit run_timestamp to be passed when writing records to the dataset. Formerly, a run_timestamp was minted automatically during TIMDEXDataset.write().

This is achieved by adding run_timestamp to the DatasetRecord class which allows it to organically get included for writing. If it is not set or passed, a timestamp from the required run_date column will be used (e.g. 2025-01-01 --> 2025-01-01T00:00:00.000000).
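The fallback described above can be sketched roughly as follows. This is a minimal illustration and not the actual DatasetRecord implementation: `SketchRecord` and its reduced field set are hypothetical, and whether the real default is timezone-aware is not stated in this PR, so a naive datetime is assumed here.

```python
from dataclasses import asdict, dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class SketchRecord:
    """Hypothetical stand-in for DatasetRecord; the field set is illustrative only."""

    timdex_record_id: str
    run_date: date
    run_id: str
    run_timestamp: Optional[datetime] = None

    def __post_init__(self) -> None:
        # No explicit timestamp given: fall back to midnight of the required run_date.
        if self.run_timestamp is None:
            self.run_timestamp = datetime.combine(self.run_date, datetime.min.time())

    def to_dict(self) -> dict:
        return asdict(self)


# Default case: run_timestamp is derived from run_date.
defaulted = SketchRecord("alma:1", date(2025, 1, 1), "run-abc")
print(defaulted.run_timestamp.isoformat(timespec="microseconds"))

# Explicit case: the passed value is kept untouched.
explicit = SketchRecord(
    "alma:2", date(2025, 1, 1), "run-abc", run_timestamp=datetime(2025, 1, 1, 9, 30)
)
print(explicit.run_timestamp.isoformat(timespec="microseconds"))
```

Because the value lives on the record itself, it ends up in `to_dict()` alongside the other columns without any special handling at write time.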

TIMX-509 is concerned with ensuring run_timestamp is the same for all rows, in all parquet files, for a single ETL run_id. Because Transmogrifier can be invoked multiple times for an ETL run (e.g. alma where there might be multiple source XML files), which in turn invokes TIMDEXDataset.write() multiple times, we need to be able to determine the run_timestamp at the run level and then pass it along all the way to actual dataset writing performed by TDA. This establishes that for TDA, and future PRs will work it backwards through Transmog (update: Transmog PR) and the pipeline lambdas (update: lambda PR).
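To make the run-level idea concrete, here is a small sketch that assumes the timestamp is minted once by whatever orchestrates the run and then handed to every invocation. `build_rows` is a hypothetical stand-in for a single Transmogrifier invocation plus write; it is not part of TDA or Transmogrifier.

```python
from datetime import datetime, timezone


def build_rows(record_ids, run_id, run_timestamp):
    """Hypothetical stand-in for one invocation that produces rows for writing."""
    return [
        {"timdex_record_id": rid, "run_id": run_id, "run_timestamp": run_timestamp}
        for rid in record_ids
    ]


# Minted once, at the run level, before any invocation starts.
run_timestamp = datetime(2025, 6, 17, 21, 1, 17, tzinfo=timezone.utc)
run_id = "run-abc"

# Two invocations for the same ETL run, e.g. two source XML files.
written = []
for source_file_ids in (["alma:1", "alma:2"], ["alma:3"]):
    written.extend(build_rows(source_file_ids, run_id, run_timestamp))

# Every row for the run shares exactly one run_timestamp.
distinct = {row["run_timestamp"] for row in written if row["run_id"] == run_id}
print(len(distinct))
```

Had each invocation called `datetime.now()` on its own, the set above would hold one value per invocation, which is the situation TIMX-509 describes.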

How can a reviewer manually see the effects of these changes?

The effect is hard to observe directly. Looking at the test fixture dataset_with_same_day_runs is probably the best way to get a feel for it.

Now when we generate sample records for writing, we are including an explicit run_timestamp as part of the DatasetRecord that we feed to TIMDEXDataset.write(). This is ultimately the whole change here: instead of a run_timestamp getting automatically generated for all records per .write() call, .write() now just expects that column to be set already.
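As a rough illustration of generating sample records with an explicit timestamp, the sketch below builds rows for two runs that share a run_date. It is not the actual fixture: `generate_sample_rows`, the row shape, and the run identifiers are assumptions made for the example.

```python
from datetime import date, datetime, timezone


def generate_sample_rows(count, run_id, run_date, run_timestamp):
    """Hypothetical helper: yield rows that already carry run_timestamp."""
    for offset in range(count):
        yield {
            "timdex_record_id": f"{run_id}:{offset}",
            "run_id": run_id,
            "run_date": run_date,
            "run_timestamp": run_timestamp,
        }


same_day = date(2025, 1, 1)
runs = [
    ("run-1", datetime(2025, 1, 1, 1, 0, tzinfo=timezone.utc)),
    ("run-2", datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)),
]

rows = [
    row
    for run_id, run_timestamp in runs
    for row in generate_sample_rows(3, run_id, same_day, run_timestamp)
]

# With identical run_date values, run_timestamp is what tells the runs apart.
latest_run_id = max(rows, key=lambda row: row["run_timestamp"])["run_id"]
print(latest_run_id)
```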

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: TIMX-509 will require these changes to TDA, and updates to Transmog and pipeline lambda, all of which should get deployed together.

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-509

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed or provided examples verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced:
After it was discovered that a single ETL run (run_id) has multiple, different run_timestamps in the dataset, it was clear we needed a way to pass an explicit run_timestamp for writing instead of relying on a run_timestamp minted as part of the TIMDEXDataset.write() method.

How this addresses that need:
DatasetRecord was given an optional run_timestamp property, that defaults to using run_date and producing a full ISO timestamp from that.

Side effects of this change:
* Applications like Transmogrifier can pass an explicit run_timestamp when writing to the dataset, ensuring that even multiple invocations of Transmogrifier + TIMDEXDataset.write() can share the same timestamp for an ETL run.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-509

ghukill · 2025-06-17T21:02:35Z



-@pytest.mark.freeze_time("2025-05-22 01:23:45.567890")
-def test_dataset_write_includes_minted_run_timestamp(tmp_path):


This test no longer makes sense as TIMDEXDataset.write() no longer mints a timestamp if one is not provided.

Either run_timestamp is provided explicitly to each DatasetRecord for writing, or it defaults to an ISO timestamp version of run_date for that record (which is required).

coveralls · 2025-06-17T21:04:56Z

Pull Request Test Coverage Report for Build 15737672508

Details

9 of 9 (100.0%) changed or added relevant lines in 3 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.08%) to 95.625%

Totals
Change from base Build 15424451546:	0.08%
Covered Lines:	306
Relevant Lines:	320

💛 - Coveralls

ghukill · 2025-06-18T13:12:23Z

-        run_timestamp = datetime.now(UTC)
        for i, record_batch in enumerate(
            itertools.batched(records_iter, self.config.write_batch_size)
        ):
-            record_dicts = [
-                {
-                    **record.to_dict(),
-                    "run_timestamp": run_timestamp,
-                }
-                for record in record_batch
-            ]


This is really the most important part of the PR: we no longer mint a run_timestamp value in this method, but instead assume/require that all DatasetRecord instances being written already have a value.
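A sketch of what the batching loop might look like once the minting is gone, under the assumption that each record's to_dict() already includes run_timestamp. `SketchRecord`, `iter_record_dict_batches`, and the local `batched` helper are hypothetical; the helper only stands in for itertools.batched so the example also runs on Python versions before 3.12.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from itertools import islice


def batched(iterable, n):
    """Small stand-in for itertools.batched, which requires Python 3.12+."""
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch


@dataclass
class SketchRecord:
    """Hypothetical stand-in for DatasetRecord."""

    timdex_record_id: str
    run_id: str
    run_timestamp: datetime

    def to_dict(self) -> dict:
        return asdict(self)


def iter_record_dict_batches(records_iter, write_batch_size):
    """Yield (batch_index, record_dicts) without minting any timestamp."""
    for i, record_batch in enumerate(batched(records_iter, write_batch_size)):
        # Each record already carries run_timestamp, so to_dict() is enough.
        yield i, [record.to_dict() for record in record_batch]


explicit = datetime(2025, 6, 18, 13, 12, 23, tzinfo=timezone.utc)
records = (SketchRecord(f"alma:{n}", "run-abc", explicit) for n in range(5))
batches = list(iter_record_dict_batches(records, write_batch_size=2))
print([len(record_dicts) for _, record_dicts in batches])
```

The loop no longer knows anything about time; whatever value the caller set on the records is what reaches the written batch.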

ehanson8

The logic of this change is very clear, and it does make sense to generate/pass run_timestamp in Transmogrifier rather than here.

ghukill marked this pull request as ready for review June 18, 2025 13:11

ghukill requested a review from a team June 18, 2025 13:13

Bump version to 2.1.0

029506e

ghukill force-pushed the TIMX-509-explicit-run-timestamp branch from 770f7c9 to 029506e Compare June 18, 2025 15:54

This was referenced Jun 18, 2025

TIMX 509 - add run_timestamp CLI arg MITLibraries/transmogrifier#254

Merged

TIMX 509 - run-timestamp argument for all transform commands MITLibraries/timdex-pipeline-lambdas#320

Merged

ehanson8 approved these changes Jun 24, 2025

ehanson8 self-assigned this Jun 24, 2025

ghukill merged commit a479435 into main Jun 24, 2025
2 checks passed

ghukill merged 2 commits into main from TIMX-509-explicit-run-timestamp

3 participants


