This project automates the extraction, transformation, and loading (ETL) of data from multiple sources into Google Cloud Platform (GCP), using Cloud Run, Workload Identity Federation, and GitHub Actions.
The ETL pipeline:
- Extracts data from internal & external APIs.
- Transforms and cleans it using Python (Pandas / BigQuery / Dataflow-ready scripts).
- Loads the processed data into BigQuery and triggers further reporting processes.
Deployment and authentication are handled through GitHub Actions and Workload Identity Federation (WIF) — eliminating the need for static service account keys.
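The transform layer itself is plain Python, as noted above. Below is a minimal, illustrative sketch of the extract → transform → load flow; the API endpoint, column names, and BigQuery table ID are hypothetical placeholders, not this project's actual values:

```python
# Minimal ETL sketch: extract from a JSON API, clean with pandas,
# load into BigQuery. Endpoint, columns, and table ID are placeholders.
import pandas as pd
import requests
from google.cloud import bigquery

def run_etl() -> None:
    # Extract: pull raw records from an API (hypothetical endpoint)
    response = requests.get("https://api.example.com/metrics", timeout=30)
    response.raise_for_status()
    df = pd.DataFrame(response.json())

    # Transform: basic cleaning (hypothetical columns)
    df = df.dropna(subset=["event_date"])
    df["event_date"] = pd.to_datetime(df["event_date"])

    # Load: append the cleaned frame to a BigQuery table
    client = bigquery.Client(project="<PROJECT_ID>")
    load_job = client.load_table_from_dataframe(df, "<PROJECT_ID>.analytics.events")
    load_job.result()  # block until the load job completes

if __name__ == "__main__":
    run_etl()
```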
| Source | Integration Method | Description |
|---|---|---|
| Google Analytics (GA4) | GCP BigQuery Export / API | Transforms raw GA4 event data stored in GCP. |
| AdMob | REST API with API Key | Fetches monetization and ad revenue metrics programmatically. |
| Adjust | REST API with API Key | Retrieves mobile attribution and campaign performance data. |
| Apple App Store (App Store Connect) | Authenticated using .p8 files | Securely extracts app analytics and sales data using Apple’s authentication system. |
| Google Play Console | GCP Native Data Transfer Service | Integrates daily Play Store data directly into BigQuery for further processing. |

| Component | Description |
|---|---|
| GitHub Actions | CI/CD pipeline that triggers the ETL job deployment or run |
| Workload Identity Federation (WIF) | Secure authentication from GitHub to GCP without service account keys |
| Cloud Run Job | Executes the ETL code in a serverless container environment |
| BigQuery | Stores and aggregates the transformed data |
| Cloud Storage | Optional: stores raw data dumps or intermediate results |
Before proceeding, ensure you have:
- A Google Cloud Project
- Owner or Editor IAM access to create pools, providers, and service accounts
- A GitHub repository (e.g., mpfouziya/End-to-End-ETL-using-Google-Workflows)
- The gcloud CLI installed locally (if configuring via the terminal)
First, create the service account that will execute the ETL process inside Cloud Run:
```bash
gcloud iam service-accounts create <GITHUB_DEPLOYER_SERVICE_ACCOUNT> \
  --project="<PROJECT_ID>" \
  --display-name="GitHub Deployer"
```

Grant the necessary roles:
```bash
gcloud projects add-iam-policy-binding <PROJECT_ID> \
  --member="serviceAccount:<GITHUB_DEPLOYER_SERVICE_ACCOUNT>@<PROJECT_ID>.iam.gserviceaccount.com" \
  --role="roles/run.admin"

gcloud projects add-iam-policy-binding <PROJECT_ID> \
  --member="serviceAccount:<GITHUB_DEPLOYER_SERVICE_ACCOUNT>@<PROJECT_ID>.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"
```

2.1 Create a Workload Identity Pool
```bash
gcloud iam workload-identity-pools create <POOL_NAME> \
  --project="<PROJECT_ID>" \
  --location="global" \
  --display-name="GitHub Actions Pool"
```

2.2 Create an OIDC Provider for GitHub
```bash
gcloud iam workload-identity-pools providers create-oidc <PROVIDER_NAME> \
  --project="<PROJECT_ID>" \
  --location="global" \
  --workload-identity-pool="<POOL_NAME>" \
  --display-name="GitHub Provider" \
  --issuer-uri="https://token.actions.githubusercontent.com" \
  --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
  --attribute-condition="assertion.repository == '<GITHUB_USERNAME>/<REPO_NAME>'"
```

2.3 Bind GitHub Repository to Service Account
```bash
gcloud iam service-accounts add-iam-policy-binding \
  <GITHUB_DEPLOYER_SERVICE_ACCOUNT>@<PROJECT_ID>.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/projects/<PROJECT_NUMBER>/locations/global/workloadIdentityPools/<POOL_NAME>/attribute.repository/<GITHUB_USERNAME>/<REPO_NAME>"
```

Create the file: .github/workflows/deploy.yml
```bash
gcloud run jobs execute <CLOUD_RUN_JOB_NAME> \
  --region=${{ env.REGION }} \
  --project=${{ env.PROJECT_ID }}
```

✅ Verification
After pushing to main, go to GitHub → Actions → Deploy ETL to Cloud Run → Run workflow.
Expected outcomes:
- Successful authentication via OIDC
- Cloud Run job deployed
- Job visible under GCP Console → Cloud Run → Jobs
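You can also confirm the deployment from the terminal with a quick check (same job-name placeholder as above):

```bash
gcloud run jobs describe <CLOUD_RUN_JOB_NAME> \
  --region=<GCP_REGION> \
  --project=<PROJECT_ID>
```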
Once your ETL Python job is deployed to Cloud Run, you can automate its execution using Google Workflows and Cloud Scheduler. This approach keeps scheduling entirely on GCP, with no need for GitHub cron triggers.

5.1 Create a Workflow
In the Google Cloud Console, go to Navigation Menu → Workflows → Create Workflow. Name it something like <ETL_WORKFLOW_NAME> (e.g., etl-runner-workflow) and set the schedule for the trigger. If the result of one job influences the next step of the ETL, write the workflow so that the Cloud Run jobs run one after the other.
Example Google Workflows YAML definition for the Event and First Touch Event jobs:
```yaml
main:
  steps:
    - run_event:
        try:
          call: http.post
          args:
            url: https://run.googleapis.com/v2/projects/<PROJECT_ID>/locations/<GCP_REGION>/jobs/event:run
            auth:
              type: OAuth2
            body: {}
          result: event_response
        retry:
          predicate: ${http.default_retry_predicate}
          max_retries: 3
          backoff:
            initial_delay: 5
            max_delay: 30
            multiplier: 2
        except:
          as: e
          steps:
            - log_event_error:
                call: sys.log
                args:
                  text: ${"Failed to start event job: " + e.message}
                  severity: "ERROR"
            - return_event_error:
                return: ${"Workflow failed at event job step: " + e.message}
    - wait_event:
        call: wait_for_job
        args:
          operation_name: ${event_response.body.name}
    - run_first_touch_event:
        try:
          call: http.post
          args:
            url: https://run.googleapis.com/v2/projects/<PROJECT_ID>/locations/<GCP_REGION>/jobs/first-touch-event:run
            auth:
              type: OAuth2
            body: {}
          result: fte_response
        retry:
          predicate: ${http.default_retry_predicate}
          max_retries: 3
          backoff:
            initial_delay: 5
            max_delay: 30
            multiplier: 2
        except:
          as: e
          steps:
            - log_fte_error:
                call: sys.log
                args:
                  text: ${"Failed to start first-touch-event job: " + e.message}
                  severity: "ERROR"
            - return_fte_error:
                return: ${"Workflow failed at first-touch-event job step: " + e.message}
    - wait_first_touch_event:
        call: wait_for_job
        args:
          operation_name: ${fte_response.body.name}

# Helper subworkflow to wait for job completion
wait_for_job:
  params: [operation_name]
  steps:
    - poll:
        try:
          call: http.get
          args:
            url: ${"https://run.googleapis.com/v2/" + operation_name}
            auth:
              type: OAuth2
          result: status
        except:
          as: e
          steps:
            - log_poll_error:
                call: sys.log
                args:
                  text: ${"Polling failed for " + operation_name + ": " + e.message}
                  severity: "ERROR"
            - return_poll_error:
                return: ${"Polling failed: " + e.message}
    - check:
        switch:
          - condition: ${"done" in status.body and status.body.done == true}
            return: ${status.body}
    - sleep:
        call: sys.sleep
        args:
          seconds: 10
    - repeat:
        next: poll
```
This ensures your workflow triggers the Cloud Run job and waits until it finishes before continuing to any downstream steps (like sending notifications, updating BigQuery, or calling another service).
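If you prefer the terminal to the console for this step, the sketch below deploys the workflow definition and attaches a Cloud Scheduler trigger through the Workflow Executions API. The file name workflow.yaml, the cron schedule, and the scheduler service account are illustrative placeholders, not values taken from this project:

```bash
# Deploy (or update) the workflow definition from a local file
gcloud workflows deploy <ETL_WORKFLOW_NAME> \
  --source=workflow.yaml \
  --location=<GCP_REGION> \
  --project=<PROJECT_ID>

# Invoke the workflow on a schedule (here: daily at 06:00) via Cloud Scheduler
gcloud scheduler jobs create http <ETL_WORKFLOW_NAME>-trigger \
  --schedule="0 6 * * *" \
  --uri="https://workflowexecutions.googleapis.com/v1/projects/<PROJECT_ID>/locations/<GCP_REGION>/workflows/<ETL_WORKFLOW_NAME>/executions" \
  --http-method=POST \
  --oauth-service-account-email=<SCHEDULER_SERVICE_ACCOUNT>@<PROJECT_ID>.iam.gserviceaccount.com \
  --location=<GCP_REGION> \
  --project=<PROJECT_ID>
```

Whichever service account Cloud Scheduler uses needs roles/workflows.invoker, and the workflow's own service account needs permission to run and poll the Cloud Run jobs (for example, roles/run.developer).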