This project is part of my learning journey through the Microsoft DP-203 Data Engineering course. In this lab, I learned how to analyze, transform, and store raw CSV data from a Data Lake as optimized Parquet files using Serverless SQL Pools in Azure Synapse Analytics.
The main focus of this exercise was to apply SQL transformations to raw data stored in a Data Lake. By using CETAS (Create External Table As Select), I was able to store the results directly in an external table.
- An active Azure account with sufficient permissions.
- Azure Synapse Analytics workspace connected to a Data Lake.
- Basic knowledge of SQL, PowerShell, and Azure Portal.
- GitHub repository dp-203 for the required setup scripts.
- Used Azure Cloud Shell (PowerShell) to deploy the lab environment.
- Explored the CSV files located in the
sales/csvfolder via Synapse Studio.
- Queried the file contents using
OPENROWSETto view the structure. - Analyzed sales data including Item, Quantity, OrderDate, UnitPrice, and TaxAmount.
- Created a new dedicated database in Synapse Studio.
- Defined an external data source and file format for Parquet.
- Wrote a CETAS query to aggregate sales by product.
- Added a computed column to extract the sales year (
YEAR(OrderDate)). - Filtered the data to include only rows with a valid EmailAddress.
- Built a stored procedure to automate the transformation process.
- Results were written into the Data Lake as partitioned Parquet files.
- Deleted the resource group in Azure Portal to avoid unnecessary charges.
- Microsoft Learn - DP-203
- Azure Synapse Analytics Documentation
- GitHub Repo: MicrosoftLearning/dp-203-Data-Engineer
This lab helped me:
- Apply the fundamentals of Serverless SQL Pools in a hands-on scenario.
- Understand the benefits of using Parquet for efficient storage.
- Perform SQL transformations on semi-structured data.
- Build data pipelines without relying on a dedicated Spark environment.