
CE-2642: Adding S3 bucket for MWAA ETL scripts #2064

Merged
LBH-wgreeff merged 6 commits into main from
CE-2642-adding-s3-bucket-for-mwaa-etl-scripts
Jan 16, 2025

Conversation

@LBH-wgreeff
Contributor

Adding a new S3 bucket for ETL scripts.

@LBH-wgreeff LBH-wgreeff self-assigned this Jan 15, 2025
@LBH-wgreeff LBH-wgreeff requested a review from Tian-2017 January 15, 2025 16:54
@sonarqubecloud

sonarqubecloud bot commented Jan 15, 2025

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Contributor

@Tian-2017 Tian-2017 left a comment


@LBH-wgreeff Looks good. Ready to approve, but I suggest renaming mwaa_etl_scripts_bucket to ecs_etl_scripts_bucket. Since most of the ETL scripts live in the airflow/dags folder, only the ETL scripts that require ECS compute resources should be placed here. Using mwaa_etl_scripts_bucket might confuse users, leaving them unsure where to store their ETL scripts.

Additionally, I will rename the etl_scripts folder in the dap-airflow repo to ecs_etl_scripts shortly. Apologies for the inconvenience.

@timburke-hackit
Contributor

I'm inclined to go a step further and say the name should be service-agnostic. To my mind, the logical grouping is the task/workflow, not the service the script is going to run on.

@Tian-2017
Contributor

Tian-2017 commented Jan 16, 2025

Hey Tim, do you mean keeping the name of the etl_scripts folder in the airflow repo the same as before and renaming the bucket to etl_scripts_bucket instead? @timburke-hackit

@timburke-hackit
Contributor

I meant the bucket name - I can quite easily envisage a world where either:

  • we move compute platform - for example to Kubernetes, or if Airflow sorts out multi-tenant environments in v3
  • workflows require or use more than one type of compute - a combination of ECS tasks, Glue / EMR jobs, etc.

I think, to your point about not confusing users by how things are named, it's much easier to have a series of scripts that are in a structure similar to:

etl_scripts/<department>/<workflow>/some_python_task.py
etl_scripts/<department>/<workflow>/some_spark_task.py
etl_scripts/<department>/<workflow>/sql/some_bit_of_sql.sql
...

Rather than having to sync them all to different buckets. This is particularly true while people are manually uploading files to stg for testing.
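The single-bucket layout proposed above can be sketched as a small key-builder. This is a hypothetical illustration only; the department and workflow names below are made up, not taken from the repo:

```python
# Hypothetical sketch of the single-bucket key convention proposed above:
# etl_scripts/<department>/<workflow>/<script>. The department and workflow
# names here are illustrative examples, not from this PR.

def etl_script_key(department: str, workflow: str, filename: str) -> str:
    """Build an S3 object key under the shared etl_scripts/ prefix."""
    return f"etl_scripts/{department}/{workflow}/{filename}"

print(etl_script_key("housing", "rent_sync", "some_python_task.py"))
# etl_scripts/housing/rent_sync/some_python_task.py
```

Keeping the key scheme in one helper like this means every workflow syncs to the same bucket, with the task/workflow grouping expressed purely in the key prefix.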

@Tian-2017
Contributor

@timburke-hackit sounds good. What you’ve said makes sense to me

@LBH-wgreeff
Contributor Author

LBH-wgreeff commented Jan 16, 2025

Just to add: in the Python script you can specify the bucket and an optional object key within the bucket, i.e. a folder in the bucket.
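A minimal sketch of that idea, assuming boto3 for the actual upload; the bucket and prefix names below are made up for illustration:

```python
from typing import Optional

# Hypothetical helper: resolve an optional "folder" (object key prefix)
# inside the bucket. The prefix and bucket names below are illustrative.

def build_object_key(filename: str, prefix: Optional[str] = None) -> str:
    """Return the S3 key, nesting under an optional prefix (folder)."""
    return f"{prefix.rstrip('/')}/{filename}" if prefix else filename

key = build_object_key("some_python_task.py", "etl_scripts/housing")
print(key)  # etl_scripts/housing/some_python_task.py

# With boto3, the upload itself would then be something like:
#   boto3.client("s3").upload_file("some_python_task.py", "my-etl-bucket", key)
```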

@LBH-wgreeff LBH-wgreeff merged commit 0925d2c into main Jan 16, 2025
6 checks passed
@LBH-wgreeff LBH-wgreeff deleted the CE-2642-adding-s3-bucket-for-mwaa-etl-scripts branch January 16, 2025 11:41
3 participants