TIMX 509 - run-timestamp argument for all transform commands #320

Merged
ghukill merged 4 commits into main from TIMX-509-run-timestamp-argument
Jun 24, 2025

@ghukill ghukill commented Jun 18, 2025

NOTE: this PR builds upon PRs MITLibraries/timdex-dataset-api#149 and MITLibraries/transmogrifier#254.

Purpose and background context

This PR updates the TIMDEX lambda to include a --run-timestamp argument in all transform commands generated for Transmogrifier.

Two scenarios are supported:

  1. The StepFunction includes run-timestamp in the input payload to the lambda, which picks it up and passes it along
  2. It is not included, so the lambda mints a timestamp and includes it in all transform commands generated
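
The two scenarios above can be sketched roughly as follows. This is a minimal illustration, not the lambda's actual code: the function names `resolve_run_timestamp` and `build_transform_commands` are hypothetical, the minted value is assumed to be a naive UTC ISO 8601 string (chosen only to match the shape of the example value `2025-06-17T12:34:56.789000`), and `--output-location` is left out for brevity.

```python
from datetime import datetime, timezone


def resolve_run_timestamp(payload: dict) -> str:
    """Return the caller-supplied run-timestamp, or mint one if absent."""
    supplied = payload.get("run-timestamp")
    if supplied:
        return supplied
    # Assumption: naive UTC, ISO 8601 with microseconds, mirroring the
    # shape of the example value "2025-06-17T12:34:56.789000".
    now = datetime.now(timezone.utc).replace(tzinfo=None)
    return now.isoformat(timespec="microseconds")


def build_transform_commands(payload: dict, input_files: list[str]) -> list[list[str]]:
    """Build one argument list per input file, all sharing one run-timestamp."""
    # Resolved once per invocation, so every command gets the same value.
    run_timestamp = resolve_run_timestamp(payload)
    return [
        [
            f"--input-file={input_file}",
            f"--source={payload['source']}",
            f"--run-id={payload['run-id']}",
            f"--run-timestamp={run_timestamp}",
        ]
        for input_file in input_files
    ]
```

The point of resolving the timestamp once, outside the per-file loop, is that a minted value cannot drift between commands in the same run.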

How can a reviewer manually see the effects of these changes?

NOTE: this is an example of something that moving to SAM will make simpler, but using the previous approach for now.

1- Build new docker image:

make dist-dev

2- Run docker image:

docker run \
-e AWS_ACCESS_KEY_ID=<...> \
-e AWS_SECRET_ACCESS_KEY=<...> \
-e AWS_SESSION_TOKEN=<...> \
-e TIMDEX_ALMA_EXPORT_BUCKET_ID=not-needed-here \
-e TIMDEX_S3_EXTRACT_BUCKET_ID=timdex-extract-dev-222053980223 \
-e WORKSPACE=dev \
-p 9000:8080 timdex-pipeline-lambdas-dev:latest

3- From another terminal, make a curl request to generate Transmogrifier transform commands:

curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{
  "next-step": "transform",
  "run-date": "2025-06-17",
  "run-type": "daily",
  "source": "libguides",
  "verbose": "true",
  "run-id": "abc123",
  "run-timestamp": "2025-06-17T12:34:56.789000"
}'

The formatted output looks like the following; note the inclusion of --run-timestamp=2025-06-17T12:34:56.789000 in each command:

{
  "run-date": "2025-06-17",
  "run-type": "daily",
  "source": "libguides",
  "verbose": true,
  "next-step": "load",
  "transform": {
    "files-to-transform": [
      {
        "transform-command": [
          "--input-file=s3://timdex-extract-dev-222053980223/libguides/libguides-2025-06-17-daily-extracted-records-to-index_01.xml",
          "--output-location=s3://timdex-extract-dev-222053980223/dataset",
          "--source=libguides",
          "--run-id=abc123",
          "--run-timestamp=2025-06-17T12:34:56.789000" <---------
        ]
      },
      {
        "transform-command": [
          "--input-file=s3://timdex-extract-dev-222053980223/libguides/libguides-2025-06-17-daily-extracted-records-to-index_02.xml",
          "--output-location=s3://timdex-extract-dev-222053980223/dataset",
          "--source=libguides",
          "--run-id=abc123",
          "--run-timestamp=2025-06-17T12:34:56.789000" <---------
        ]
      },
      {
        "transform-command": [
          "--input-file=s3://timdex-extract-dev-222053980223/libguides/libguides-2025-06-17-daily-extracted-records-to-index_03.xml",
          "--output-location=s3://timdex-extract-dev-222053980223/dataset",
          "--source=libguides",
          "--run-id=abc123",
          "--run-timestamp=2025-06-17T12:34:56.789000" <---------
        ]
      }
    ]
  }
}
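
For reviewers who would rather check this programmatically than by eye, here is a small sketch that parses a response of the shape shown above and confirms every transform command carries exactly one `--run-timestamp`. The helper name `run_timestamps` is hypothetical; the key names come from the example output.

```python
import json


def run_timestamps(response_body: str) -> list[str]:
    """Return the --run-timestamp value of each transform command, in order.

    Raises ValueError if any command has zero or multiple --run-timestamp args.
    """
    prefix = "--run-timestamp="
    response = json.loads(response_body)
    timestamps = []
    for entry in response["transform"]["files-to-transform"]:
        values = [
            arg[len(prefix):]
            for arg in entry["transform-command"]
            if arg.startswith(prefix)
        ]
        if len(values) != 1:
            raise ValueError(f"expected one --run-timestamp, found {len(values)}")
        timestamps.append(values[0])
    return timestamps
```

A run is consistent when `len(set(run_timestamps(body))) == 1`. The `<---------` markers in the output above are annotations for this PR description and are not part of the actual JSON response, so the raw response body can be passed in directly.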

Includes new or updated dependencies?

YES: dependencies updated

Changes expectations for external applications?

YES: Transmogrifier will now receive a --run-timestamp CLI argument

What are the relevant tickets?

ghukill added 2 commits June 18, 2025 15:09
Why these changes are being introduced:

Similar to when run-id was added as an allowed payload attribute and
then passed along to the various command generators, the same is needed
for run-timestamp when generating transform commands (for Transmogrifier).

How this addresses that need:
* "run-timestamp" is an allowed input payload attribute
* if included, passed to transform commands
* if absent, a run-timestamp is minted by the lambda for all transform
commands generated

The net effect is the lambda will provide the *same* run-timestamp for
all Transmogrifier commands it prepares, which ensures all writes for
the run get the same timestamp.

The only variation is whether the StepFunction passes the timestamp,
or the lambda mints it; both are supported.

Side effects of this change:
* All Transmogrifier commands will now include a --run-timestamp CLI
argument

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-509
@ghukill ghukill requested a review from a team June 18, 2025 19:23
@ghukill ghukill marked this pull request as ready for review June 18, 2025 19:34

coveralls commented Jun 18, 2025

Pull Request Test Coverage Report for Build 15742189993

Details

  • 3 of 3 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 95.785%

Totals Coverage Status
Change from base Build 15424731942: 0.03%
Covered Lines: 250
Relevant Lines: 261

💛 - Coveralls

Comment thread README.md
Comment on lines -113 to +159
```bash
docker run -e TIMDEX_ALMA_EXPORT_BUCKET_ID=alma-bucket-name \
-e TIMDEX_S3_EXTRACT_BUCKET_ID=timdex-bucket-name \
-e WORKSPACE=dev \
-p 9000:8080 timdex-pipeline-lambdas-dev:latest
```

- POST to the container
Note: running this with next-step transform or load involves an actual S3 connection and is thus tricky to test locally. Better to push the image to Dev1 and test there.

```bash
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{
"next-step": "extract",
"run-date": "2022-03-10T16:30:23Z",
"run-type": "daily",
"source": "YOURSOURCE",
"verbose": "true",
"oai-pmh-host": "https://YOUR-OAI-SOURCE/oai",
"oai-metadata-format": "oai_dc",
"oai-set-spec": "YOUR-SET-SPEC"
}'
```

- Observe output
```json
{
  "run-date": "2022-03-10",
  "run-type": "daily",
  "source": "YOURSOURCE",
  "verbose": true,
  "next-step": "transform",
  "extract": {
    "extract-command": [
      "--host=https://YOUR-OAI-SOURCE/oai",
      "--output-file=s3://timdex-bucket-name/YOURSOURCE/YOURSOURCE-2022-03-09-daily-extracted-records-to-index.xml",
      "--verbose",
      "harvest",
      "--metadata-format=oai_dc",
      "--set-spec=YOUR-SET-SPEC",
      "--from-date=2022-03-09"
    ]
  }
}
```
Contributor Author

Just formatting...

@ehanson8 ehanson8 self-assigned this Jun 24, 2025
Contributor

@ehanson8 ehanson8 left a comment

Great work updating all the apps to include this!

@ghukill ghukill merged commit 5619aad into main Jun 24, 2025
3 checks passed

3 participants