xiangivyli/data_engineer_portfolio

Data Engineer Portfolio

This is the data engineering subdivision of my data science projects portfolio. It centres on Airflow and its high-level features, and brings in popular tools such as dbt (data transformation), Terraform (infrastructure as code), Soda (data quality), Streamlit and Power BI (data visualisation), and cloud-based data warehouses (BigQuery, Snowflake).

It also includes small projects for on-premises databases and Business Intelligence (BI) tools.

The purpose of each project is to automate an end-to-end data pipeline (from raw data to data reporting), backed by containerisation and infrastructure as code.

Some projects are folders in this repository, some are independent repositories for easy execution, and some small projects are written up as blog posts on my website.

Table of Contents

Part A Cloud-based Data Warehouse

Tools:

  • Python with Jupyter Notebook
  • Data Transformation: dbt
  • Data Loading: Airflow (Astro CLI)
  • Data Visualisation: Power BI
  • Data Quality Testing: Soda
  • Data Lake: Google Cloud Storage
  • Data Warehouse: BigQuery
  • Data Orchestration: Airflow

Objectives:

  • extract raw data from Kaggle and process it into a ready-to-use dataset
  • reduce file size and preserve the schema by using Parquet files
  • automate and monitor the pipeline with Airflow and dbt
  • visualise data for insights with Power BI

Tools:

  • Data Extraction, Transformation, Validation: API, Python
  • Data Orchestration: Airflow
  • Database: DuckDB
  • Data Reporting: Streamlit
  • Containerization: Docker and Docker Compose

Objectives:

  • Ingest PM2.5 data into DuckDB daily
  • Trigger the transformation from the data ingestion in Airflow
  • Keep a Streamlit container running to monitor the PM2.5 data in real time
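The ingest-then-transform pattern behind these objectives can be sketched in plain Python. This is a minimal, hedged illustration: the real project uses DuckDB with Airflow triggering the downstream task, whereas here the stdlib sqlite3 module stands in for DuckDB and a direct function call stands in for the Airflow trigger; table and column names are made up.

```python
# Sketch of daily ingestion into a local analytical database, with the
# transformation triggered by the ingest. sqlite3 stands in for DuckDB here
# so the example is self-contained; the SQL is deliberately kept portable.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_pm25 (city TEXT, pm25 REAL, day TEXT)")

def transform():
    """Transformation task: rebuild a daily-average reporting table."""
    conn.execute("DROP TABLE IF EXISTS daily_avg_pm25")
    conn.execute(
        "CREATE TABLE daily_avg_pm25 AS "
        "SELECT day, city, AVG(pm25) AS avg_pm25 "
        "FROM raw_pm25 GROUP BY day, city"
    )

def ingest(rows):
    """Daily ingest task: append the day's readings, then run the transform.

    In Airflow this hand-off would be a downstream task (e.g. dataset- or
    sensor-triggered) rather than a direct call.
    """
    conn.executemany("INSERT INTO raw_pm25 VALUES (?, ?, ?)", rows)
    transform()

ingest([("London", 12.5, "2024-01-01"), ("London", 15.5, "2024-01-01")])
avg = conn.execute("SELECT avg_pm25 FROM daily_avg_pm25").fetchone()[0]
print(avg)  # 14.0
```

The Streamlit container then only has to read `daily_avg_pm25` on a refresh interval; it never touches the raw table.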

Part B On-premises Database

Tool: MySQL

Objectives:

  • identify how diseases begin and progress
  • integrate genetics and healthcare data
  • produce research-ready, well-curated and well-documented data

Tool: SQL Server

Objectives:

  • Split a table into a fact table and dimension tables
  • Set datatype, primary key, foreign key and referential integrity
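The fact/dimension split and the key constraints above can be sketched as follows. This is an illustrative sketch only: the real project uses SQL Server, but the stdlib sqlite3 module is used here so the example runs self-contained, and all table and column names are hypothetical.

```python
# Sketch: split a flat sales table into a fact table plus a dimension table,
# with primary key, foreign key, and referential integrity enforced.
# sqlite3 stands in for SQL Server; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

# Dimension: one row per product, with a surrogate primary key.
conn.execute("""
    CREATE TABLE dim_product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL UNIQUE
    )""")

# Fact: one row per sale, referencing the dimension via a foreign key.
conn.execute("""
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        quantity   INTEGER NOT NULL,
        amount     REAL NOT NULL
    )""")

# Load the dimension first, then the facts that reference it.
conn.execute("INSERT INTO dim_product (product_name) VALUES ('widget')")
conn.execute(
    "INSERT INTO fact_sales (product_id, quantity, amount) VALUES (1, 3, 29.97)"
)

# Referential integrity: a fact row pointing at a missing product is rejected.
try:
    conn.execute(
        "INSERT INTO fact_sales (product_id, quantity, amount) VALUES (99, 1, 9.99)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

The same load order applies in the real project: dimensions before facts, so every foreign key has a row to point at.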

Part C Business Intelligence (BI) Tools

Tool: Google Analytics and Looker

Objectives:

  • map customer personas
  • measure product performance
  • identify activity patterns
  • build a funnel diagram showing the buyer's journey

Tool: Python and Power BI

Objectives:

  • Prepare a cleansed dataset for analysis
  • Build a logical story explaining how the mix and weighting of assessment types changed the final result

Tool: Tableau

Objectives:

  • Provide users a platform to retrieve GDP, Life Satisfaction, and Education Level information for countries in different years
  • Give a general picture of this information at the regional level
  • Examine the relationship between education level and GDP per capita

Part D Data Transformation - dbt bootcamp

Tools:

  • Python 3.10.13
  • Environment: Codespaces
  • Data warehouse: Snowflake
  • Data transformation: dbt
  • BI tool: Preset
  • Data Quality: Great Expectations
  • Orchestration: Dagster

Objectives:

  • Use dbt to connect to Snowflake and visualise the data transformation process
  • Process data through logical layers and visualise results from the golden-layer data
