@ghukill ghukill commented Jun 18, 2025

NOTE: this PR builds on TDA PR MITLibraries/timdex-dataset-api#149. If that PR is still awaiting code review and merge, this PR can still be tested as noted in the instructions below. However, GitHub CI tests will continue to fail until that PR is merged.

Purpose and background context

This PR adds a new CLI argument run_timestamp that is very similar to the CLI argument run_id. These CLI arguments are passed when invoking Transmogrifier and then get applied to all rows that are written to the parquet dataset.

As outlined in TIMX-509, this supports multiple runs of Transmogrifier applying the same run_timestamp to all rows, across all invocations.

A common scenario is a large Alma run. The StepFunction may invoke Transmogrifier 10 times, but we want the same run_timestamp applied to all rows written. Just as we pass the run_id each time, we can now also pass a run_timestamp. Most likely the pipeline lambda will mint this run_timestamp and pass it along.
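As a rough illustration of that scenario, the sketch below mints one timestamp and reuses it for every invocation. It is not the pipeline lambda's actual code: the function names, the bucket, and the input filenames are hypothetical, and only the CLI flags (-s, -i, -o, -r, -t) are taken from the testing instructions in this PR.

```python
from datetime import datetime, timezone


def mint_run_timestamp() -> str:
    """Mint a single ISO 8601 timestamp (UTC, no offset suffix) for a whole ETL run."""
    now_utc = datetime.now(timezone.utc).replace(tzinfo=None)
    return now_utc.isoformat(timespec="microseconds")


def build_transform_commands(source, input_files, output_location, run_id, run_timestamp):
    """Build one `transform` command per input file, all sharing the same run_timestamp."""
    return [
        [
            "pipenv", "run", "transform",
            "-s", source,
            "-i", input_file,
            "-o", output_location,
            "-r", run_id,
            "-t", run_timestamp,
        ]
        for input_file in input_files
    ]


# Hypothetical input files for a run that is split across three invocations.
input_files = [
    f"s3://example-bucket/alma/alma-2025-06-17-daily-extracted-records-to-index_{n:02d}.xml"
    for n in range(1, 4)
]

# Minted once, before any invocation, so every row written shares the value.
run_timestamp = mint_run_timestamp()
commands = build_transform_commands(
    "alma", input_files, "output/datasets/with-timestamps", "run-1", run_timestamp
)
```

The point of the design is that the timestamp is created by the caller rather than by each invocation, so it does not drift between invocations.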

How can a reviewer manually see the effects of these changes?

1- Clone the branch and run make install to pick up dependency updates.

2- If TDA PR MITLibraries/timdex-dataset-api#149 is not yet merged, manually install the TDA version from that branch via the following:

pipenv run pip uninstall timdex-dataset-api
pipenv run pip install git+https://github.com/MITLibraries/timdex-dataset-api.git@TIMX-509-explicit-run-timestamp

3- Set some env vars and prepare a dataset location

export DATASET_LOCATION=output/datasets/with-timestamps
export RUN_TIMESTAMP=2025-06-18T12:34:56.789000

mkdir -p $DATASET_LOCATION

4- Set AWS Dev TIMDEX credentials

5- Run Transmogrifier three times, each time passing the same run_timestamp value

pipenv run transform --verbose \
-s libguides \
-i s3://timdex-extract-dev-222053980223/libguides/libguides-2025-06-17-daily-extracted-records-to-index_01.xml \
-o $DATASET_LOCATION \
-r run-1 \
-t $RUN_TIMESTAMP

pipenv run transform --verbose \
-s libguides \
-i s3://timdex-extract-dev-222053980223/libguides/libguides-2025-06-17-daily-extracted-records-to-index_02.xml \
-o $DATASET_LOCATION \
-r run-2 \
-t $RUN_TIMESTAMP

pipenv run transform --verbose \
-s libguides \
-i s3://timdex-extract-dev-222053980223/libguides/libguides-2025-06-17-daily-extracted-records-to-index_03.xml \
-o $DATASET_LOCATION \
-r run-3 \
-t $RUN_TIMESTAMP

6- Run a DuckDB query that shows the results of the rows written to the parquet dataset

pipenv run duckdb -c "select timdex_record_id, run_id, run_date, run_timestamp, filename from read_parquet('$DATASET_LOCATION/**/*.parquet',filename=true);"

Output:

┌──────────────────────────┬─────────┬────────────┬────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│     timdex_record_id     │ run_id  │  run_date  │       run_timestamp        │                                                 filename                                                 │
│         varchar          │ varchar │    date    │  timestamp with time zone  │                                                 varchar                                                  │
├──────────────────────────┼─────────┼────────────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ libguides:guides-1385096 │ run-3   │ 2025-06-17 │ 2025-06-18 08:34:56.789-04 │ output/datasets/with-timestamps/year=2025/month=06/day=17/65de30f0-4e24-44b7-ba15-f572807acbb6-0.parquet │
│ libguides:guides-175853  │ run-1   │ 2025-06-17 │ 2025-06-18 08:34:56.789-04 │ output/datasets/with-timestamps/year=2025/month=06/day=17/6d450b22-4252-411d-8206-49cfddbb5bdf-0.parquet │
│ libguides:guides-1253643 │ run-2   │ 2025-06-17 │ 2025-06-18 08:34:56.789-04 │ output/datasets/with-timestamps/year=2025/month=06/day=17/c78f220b-ac3c-48e9-aa72-4a99673c7f62-0.parquet │
└──────────────────────────┴─────────┴────────────┴────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

This shows that across three distinct runs, each of which wrote its own parquet file, the run_timestamp value passed on the command line is applied to every row (each run had only one record).
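Note that the run_timestamp column shows 08:34:56.789-04 while the value passed was 12:34:56.789000. Both describe the same instant if the passed value is treated as UTC and DuckDB renders the "timestamp with time zone" column in a local UTC-4 zone. That interpretation is an assumption based on the output above, and the arithmetic can be checked with a short sketch:

```python
from datetime import datetime, timedelta, timezone

# Value exported as RUN_TIMESTAMP in step 3 (no offset given).
passed = datetime.fromisoformat("2025-06-18T12:34:56.789000")

# Assumption: the naive value is interpreted as UTC when written.
as_utc = passed.replace(tzinfo=timezone.utc)

# A fixed UTC-4 offset, matching the "-04" suffix shown in the query output.
utc_minus_4 = timezone(timedelta(hours=-4))
displayed = as_utc.astimezone(utc_minus_4)

print(displayed.isoformat(sep=" ", timespec="milliseconds"))
# prints "2025-06-18 08:34:56.789-04:00"
```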

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: Transmogrifier will still run without a --run-timestamp CLI argument, but it will now accept one.

What are the relevant tickets?

ghukill added 2 commits June 18, 2025 13:14
Why these changes are being introduced:

As outlined in TIMX-509, we need the ability for multiple Transmogrifier
runs to share the same run_timestamp.  Previously, the run_timestamp
was minted by the TIMDEXDataset.write() operation as part of the TDA
library.  With that now accepting an explicit run_timestamp for each
record, and Transmogrifier providing those records, we just needed
a way to pass a run_timestamp via the Transmogrifier CLI that would
get applied to all records.

How this addresses that need:
* A new run_timestamp CLI argument
* Passed along to Transformer.load() and init
* If not provided, defaults to run_date, which is required
in the input filename

Side effects of this change:
* Transmogrifier will now generate records with a run_timestamp
for writing, allowing multiple Transmogrifier runs to write the
same run_timestamp where applicable

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-509
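
The fallback described in the commit message can be sketched roughly as follows. This is not the code from this PR: the function name resolve_run_timestamp is hypothetical, and both the regex used to find the run date in the filename and the choice of midnight for the default are assumptions made for illustration.

```python
import re
from datetime import datetime
from typing import Optional


def resolve_run_timestamp(source_file: str, run_timestamp: Optional[str] = None) -> str:
    """Return the explicit run_timestamp, or fall back to the run date in the filename."""
    if run_timestamp is not None:
        # Parse and re-serialize so a malformed value fails early.
        return datetime.fromisoformat(run_timestamp).isoformat()

    # Assumption: the run date appears as YYYY-MM-DD in the file's basename.
    basename = source_file.rsplit("/", 1)[-1]
    match = re.search(r"\d{4}-\d{2}-\d{2}", basename)
    if match is None:
        raise ValueError(f"No run date found in filename: {basename}")

    # Assumption: a date-only default is taken as midnight of that day.
    return datetime.strptime(match.group(), "%Y-%m-%d").isoformat()


source_file = (
    "s3://timdex-extract-dev-222053980223/libguides/"
    "libguides-2025-06-17-daily-extracted-records-to-index_01.xml"
)
explicit = resolve_run_timestamp(source_file, "2025-06-18T12:34:56.789000")
fallback = resolve_run_timestamp(source_file)
```

With the input file from the testing instructions, the explicit value is passed through unchanged, while the fallback resolves to the run date 2025-06-17.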

@ehanson8 ehanson8 left a comment

Looks great!

source_records: A set of source records to be processed.
source_file: Filepath of the input source file.
- run_id: A unique identifier for this invocation of Transmogrifier.
+ run_id: A unique identifier associated with this ETL run.

Good docstring change for consistency and clarity!

@ghukill ghukill merged commit 9b301e4 into main Jun 24, 2025
5 checks passed