This project presents an end-to-end ETL (extract, transform, load) workflow on Databricks using PySpark and SparkSQL, adopting a star schema design with a slight snowflake extension on the product dimension.
- The source data is stored in CSV format on the Databricks Distributed File System (DBFS).
- The pipeline ingests six CSV files into a Spark environment for further processing.
- PySpark’s distributed processing is used for data blending, filtering, aggregation, and applying business logic.
- Data is filtered to include only “Accessories” categories and “United States” country records.
- The data is rolled up to the customer level to produce aggregated total sales and total quantity metrics per customer across the filtered dataset.
- The star schema design employs one fact table (storing the core transactional data) and five dimension tables with a slight snowflake schema on the product dimension.
- The transformed data is then loaded into an output folder in DBFS’s FileStore.
In conclusion, by combining PySpark and SparkSQL with a well-defined star schema (plus a slight snowflake extension for the product dimension), the project demonstrates an efficient end-to-end ETL workflow on Databricks.
Fact Table:
Dimension Tables:
