senmer5/DP-203-Lab7

DP-203-Lab7

Using Delta Lake with Apache Spark in Azure Synapse Analytics

📊 Overview

This exercise shows how to use Delta Lake with Apache Spark in Azure Synapse Analytics. You will build a lakehouse architecture that processes both batch and streaming data.

๐Ÿ“ What You Will Do

During this exercise, you will perform the following steps:

🔧 Set Up an Azure Synapse Analytics Workspace

  1. Create Data Lake Storage Gen2: Create a storage account with the hierarchical namespace enabled (Azure Data Lake Storage Gen2) to hold your data.
  2. Set Up Apache Spark Pool: Create an Apache Spark pool to provide the compute resources for Spark-based processing jobs.

๐Ÿ“ Explore Data in the Data Lake

  1. Load CSV Files as DataFrames: You will load CSV files into Spark as DataFrames, allowing you to process and transform the data.
  2. Convert Data to Delta Tables: After loading the CSV data, you will convert it into Delta tables, which offer more advanced features like ACID transactions and schema enforcement compared to standard Parquet or CSV formats.
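
The two steps above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the lab's exact code: it assumes a Synapse notebook where `spark` is already defined, and the container (`files`), storage account (`mydatalake`), and file paths are hypothetical placeholders.

```python
def abfss_path(container, account, relative_path):
    """Build an abfss:// URI for a path in Data Lake Storage Gen2."""
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{relative_path.lstrip('/')}"
    )


def csv_to_delta(spark, csv_path, delta_path):
    """Load a CSV file as a DataFrame, then save it in Delta format."""
    df = (
        spark.read.format("csv")
        .option("header", "true")       # first row holds the column names
        .option("inferSchema", "true")  # let Spark infer the column types
        .load(csv_path)
    )
    # Writing with format("delta") creates Parquet data files plus a
    # _delta_log folder that records every transaction.
    df.write.format("delta").mode("overwrite").save(delta_path)
    return df


# Hypothetical names; replace them with your own container, account, and files.
source = abfss_path("files", "mydatalake", "products/products.csv")
target = abfss_path("files", "mydatalake", "delta/products-delta")
print(source)
# In a Synapse notebook: df = csv_to_delta(spark, source, target)
```

Keeping the path construction separate from the Spark calls makes it easy to check the storage URI before running anything on the pool.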

💾 Work with Delta Tables

  1. Update Existing Data: Modify records in a Delta table in place, which keeps datasets current without rewriting them from scratch.
  2. Time Travel: Use Delta Lake's time travel feature to query previous versions of the data. This lets you recover an older state or debug an issue by comparing versions.
  3. Create Catalog Tables (External and Managed): Create external and managed Delta tables in the Synapse catalog. For a managed table, the catalog controls both the metadata and the data files, so dropping the table deletes the data. An external table only references data files at a location you specify, so dropping it removes the metadata and leaves the files in place.
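
The three operations above might look like this in PySpark. Treat it as a sketch under assumptions: `spark` and `df` come from the notebook session, the `delta` Python package is assumed to be available on the Spark pool, and the table names, path, and update condition are hypothetical examples.

```python
def update_rows(spark, delta_path, condition, changes):
    """Apply an in-place update to the Delta table stored at delta_path."""
    from delta.tables import DeltaTable  # assumed available on the Spark pool

    DeltaTable.forPath(spark, delta_path).update(condition=condition, set=changes)


def read_version(spark, delta_path, version):
    """Time travel: read the table as it was at a given version number."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(delta_path)
    )


def save_managed(df, table_name):
    """Managed table: the catalog decides where the data files live."""
    df.write.format("delta").saveAsTable(table_name)


def external_table_ddl(table_name, delta_path):
    """Spark SQL that registers existing Delta files as an external table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING DELTA LOCATION '{delta_path}'"
    )


# Hypothetical usage in a notebook where `spark` and `df` already exist:
# update_rows(spark, "/delta/products-delta",
#             "Category == 'Touring Bikes'", {"Price": "Price * 0.9"})
# original_df = read_version(spark, "/delta/products-delta", 0)
# save_managed(df, "ProductsManaged")
# spark.sql(external_table_ddl("ProductsExternal", "/delta/products-delta"))
print(external_table_ddl("ProductsExternal", "/delta/products-delta"))
```

Version 0 is the state written by the first commit to the table, so reading it after an update returns the data as it was before the change.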

📡 Simulate Streaming Data

  1. Process IoT Events with Delta as Sink: Simulate IoT events and write them in real time to a Delta table, which serves as a reliable sink for streaming data.
  2. Store and Analyze Real-Time Data: Store the streaming data and query it while it arrives to support real-time decisions.
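
A streaming pipeline of this kind could be sketched as below. The event shape (`device`, `status`), the folder-of-JSON-files source, and all paths are assumptions made for illustration; `spark` is the notebook session.

```python
import json


def simulated_events(count):
    """Return `count` fake IoT events as JSON strings (hypothetical shape)."""
    statuses = ["ok", "error"]
    return [
        json.dumps({"device": f"Dev{i % 2 + 1}", "status": statuses[i % 2]})
        for i in range(count)
    ]


def stream_json_to_delta(spark, input_path, delta_path, checkpoint_path):
    """Read JSON files as a stream and append each micro-batch to a Delta table."""
    from pyspark.sql.types import StringType, StructField, StructType

    # Streaming file sources need an explicit schema.
    schema = StructType([
        StructField("device", StringType(), False),
        StructField("status", StringType(), False),
    ])
    events = (
        spark.readStream.schema(schema)
        .option("maxFilesPerTrigger", 1)  # process one file per micro-batch
        .json(input_path)
    )
    # The checkpoint folder lets the query resume where it stopped.
    return (
        events.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .start(delta_path)
    )


print(simulated_events(2))
# In a notebook, after writing the events to files under /data/:
# query = stream_json_to_delta(spark, "/data/", "/delta/iotdevicedata", "/delta/checkpoint")
# spark.read.format("delta").load("/delta/iotdevicedata").show()
# query.stop()
```

Because the sink is a Delta table, the same path can be read as a regular batch DataFrame while the stream is still running.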

🧠 Use SQL

  1. Query Delta Files Using Serverless SQL Pools: Use the serverless SQL pool in Synapse to run SQL queries directly against Delta files. No dedicated resources need to be provisioned, which keeps the approach flexible and cost-efficient.
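
For illustration, the query sent to the serverless SQL pool typically uses `OPENROWSET` with `FORMAT = 'DELTA'`. The small Python helper below only builds that T-SQL text; the helper name and the storage URL are hypothetical, and the resulting query would be run from a SQL script in Synapse Studio or any client connected to the serverless endpoint.

```python
def delta_openrowset_query(delta_url, top=100):
    """Build a T-SQL query that reads a Delta folder through OPENROWSET."""
    return (
        f"SELECT TOP {top} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{delta_url}',\n"
        "    FORMAT = 'DELTA'\n"
        ") AS [result];"
    )


# Hypothetical storage URL; point it at the folder that contains _delta_log.
query = delta_openrowset_query(
    "https://mydatalake.dfs.core.windows.net/files/delta/products-delta/",
    top=5,
)
print(query)
```

The `BULK` path refers to the root folder of the Delta table rather than to individual Parquet files.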

🧹 Clean Up Resources

After completing the exercise, don't forget to clean up your resources to avoid unnecessary costs.

🔗 Resources

This exercise is based on the official Microsoft Learn material for the DP-203 certification.

🎯 Key Learning Outcomes

By the end of this project, you will be able to:

  • Create, manage, and modify Delta tables in Azure Synapse Analytics.
  • Write streaming data to Delta tables.
  • Use Delta Time Travel to query historical data versions.
  • Understand the differences between external and managed Delta tables.
  • Run SQL queries on Delta files using the serverless SQL pool in Synapse.

Screenshots

[Screenshots: 1, 2, 3, 4, 5, 7, 9, 10]
