TIMX 509 - add run_timestamp CLI arg #254
Merged
NOTE: this PR builds on TDA PR MITLibraries/timdex-dataset-api#149. If that PR is still awaiting code review and merge, this PR can still be tested as noted in the instructions. However, GitHub CI tests will continue to fail until that PR is merged.
Purpose and background context
This PR adds a new CLI argument run_timestamp that is very similar to the CLI argument run_id. These CLI arguments are passed when invoking Transmogrifier and then get applied to all rows that are written to the parquet dataset.
As outlined in TIMX-509, this supports multiple runs of Transmogrifier applying the same run_timestamp to all rows, across all invocations.
A common scenario is a large alma run. The StepFunction may invoke Transmogrifier 10 times, but we want the same run_timestamp applied to all rows written. Just as we pass the run_id each time, now we can also pass a run_timestamp. Most likely the pipeline lambda will mint this run_timestamp and pass it along.
How can a reviewer manually see the effects of these changes?
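Before the manual steps, a toy sketch of the intended effect may help. It does not use Transmogrifier or TDA: the helper names, the run_id value, and the record ids are all hypothetical, and the "invocation" is a plain function standing in for one Transmogrifier run. It only illustrates the idea that an orchestrator mints one run_timestamp and every invocation stamps it on every row it writes.

```python
from datetime import datetime, timezone


def mint_run_timestamp() -> str:
    """Mint one ISO 8601 UTC timestamp for a whole run (hypothetical helper)."""
    return datetime.now(timezone.utc).isoformat()


def simulate_invocation(record_ids, run_id, run_timestamp):
    """Stand-in for one Transmogrifier invocation: stamp every row it writes."""
    return [
        {
            "timdex_record_id": record_id,
            "run_id": run_id,
            "run_timestamp": run_timestamp,
        }
        for record_id in record_ids
    ]


# The orchestrator (e.g. the pipeline lambda) mints the values once...
run_id = "run-abc123"  # hypothetical value
run_timestamp = mint_run_timestamp()

# ...and passes the same values to every invocation in the run.
batches = [["alma:1", "alma:2"], ["alma:3"], ["alma:4", "alma:5"]]  # hypothetical ids
rows = [
    row
    for batch in batches
    for row in simulate_invocation(batch, run_id, run_timestamp)
]

# Every row, regardless of which invocation wrote it, shares one timestamp.
distinct_timestamps = {row["run_timestamp"] for row in rows}
print(len(rows), len(distinct_timestamps))
```

If each invocation minted its own timestamp instead, the set of distinct values would generally grow with the number of invocations, which is the situation the shared argument is meant to avoid.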
1- Clone branch and run make install to pick up dependency updates.
2- If TDA PR MITLibraries/timdex-dataset-api#149 is not yet merged, manually install the TDA version from that branch via the following:
3- Set some env vars and prepare a dataset location
4- Set AWS Dev TIMDEX credentials
5- Run Transmogrifier three times, each time passing the same run_timestamp value
6- Run a DuckDB query that shows the results of the rows written to the parquet dataset
pipenv run duckdb -c "select timdex_record_id, run_id, run_date, run_timestamp, filename from read_parquet('$DATASET_LOCATION/**/*.parquet',filename=true);"
Output:
This shows that for three distinct runs, three distinct parquet files, the run_timestamp value passed is applied (each run only had one record).
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES: while Transmogrifier will happily still run without a --run-timestamp CLI argument, it will now accept one.
What are the relevant tickets?