Refactor the data publication pipeline for KG to the cloud pipeline#2091

Open
matentzn wants to merge 4 commits into main from hf_kg_tag

@matentzn matentzn commented Feb 24, 2026

Description of the changes

This PR migrates the KG HF pipeline to the cloud environment so that the tagging system is fully supported.

Fixes / Resolves the following issues:

Checklist:

  • Added label to PR (e.g. enhancement or bug)
  • Ensured the PR is named descriptively. FYI: This name is used as part of our changelog & release notes.
  • Looked at the diff on github to make sure no unwanted files have been committed.
  • Made corresponding changes to the documentation
  • Added tests that prove my fix is effective or that my feature works
  • Any dependent changes have been merged and published in downstream modules
  • If breaking changes occur or you need everyone to run a command locally after
    pulling in latest main, uncomment the below "Merge Notification" section and
    describe steps necessary for people
  • Ran on sample data using kedro run -e sample -p test_sample (see sample environment guide)

I tested this with the script below, but note that the tag that ended up being used was v0.13.0 for some reason, even though the script sets RELEASE_VERSION=0.15.0:

#!/bin/bash

source .env

if [ -z "$HF_TOKEN" ]; then
  echo "Error: HF_TOKEN environment variable is required"
  exit 1
fi

RELEASE_VERSION=0.15.0

# GCS auth for Spark Hadoop connector (locally uses ADC instead of metadata server)
export GOOGLE_APPLICATION_CREDENTIALS="${GOOGLE_APPLICATION_CREDENTIALS:-$HOME/.config/gcloud/application_default_credentials.json}"

# Required by cloud globals.yml but not used by data_publication
export RUN_NAME="${RUN_NAME:-unused}"
export RUNTIME_GCP_PROJECT_ID="${RUNTIME_GCP_PROJECT_ID:-unused}"
export RUNTIME_GCP_BUCKET="${RUNTIME_GCP_BUCKET:-mtrx-us-central1-hub-dev-storage}"

uv run kedro run -e cloud --pipeline=data_publication
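
One possible contributor to the tag mismatch, offered only as a guess: the script assigns RELEASE_VERSION without `export`, and a plain shell assignment is not passed to child processes such as `uv run kedro run`. Whether the pipeline reads RELEASE_VERSION from the environment at all is an assumption, and `.env` may already export it. The sketch below only illustrates the shell behavior, using a hypothetical variable name DEMO_VERSION that is not part of the pipeline:

```shell
#!/bin/bash
# Illustration only: DEMO_VERSION is a hypothetical name, not used by the pipeline.

# Start from a clean state so an inherited value cannot affect the demo.
unset DEMO_VERSION

# Plain assignment: visible in this shell, but not in child processes.
DEMO_VERSION=0.15.0
seen_without_export="$(bash -c 'echo "${DEMO_VERSION:-unset}"')"

# After export, child processes inherit the value.
export DEMO_VERSION
seen_with_export="$(bash -c 'echo "${DEMO_VERSION:-unset}"')"

echo "without export: $seen_without_export"
echo "with export:    $seen_with_export"
```

If this turns out to be relevant, changing the line in the test script to `export RELEASE_VERSION=0.15.0` would be a cheap thing to try before digging into the tagging logic itself.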

@matentzn matentzn self-assigned this Feb 24, 2026
@matentzn matentzn requested a review from a team as a code owner February 24, 2026 18:48
@matentzn matentzn requested a review from lvijnck February 24, 2026 18:48
@matentzn matentzn added the enhancement label Feb 24, 2026
Labels

enhancement improving an existing system or feature to work better.
