CLOUD DATA ENGINEERING

⚠️  FOR FACULTY — GUIDELINES ON COURSE CONTENT

  • Do not teach topics outside the defined scope of this roadmap.
  • If you feel a topic is important or valuable for students, discuss and coordinate with Qasim before introducing it in class.
  • Do not modify, rename, or restructure any course content, file paths, or folder structure without prior approval.
  • Any suggestions or improvements are welcome — but must go through discussion before being applied.
  • Unauthorised changes to course material will not be accepted in Pull Requests.

Qasim Hassan · Lead Data Engineer Instructor · Saylani Welfare


📑 Table of Contents

  1. How to Use This Repo
  2. Course Summary
  3. Week 1 — Orientation
  4. Section 1 — SQL
  5. Section 2 — Python
  6. Section 3 — Airflow
  7. Section 4 — CI/CD, Docker & Bash Scripting
  8. Section 5 — Agentic Vibe Engineering
  9. Section 6 — Snowflake + DBT
  10. Section 7 — Kafka
  11. Section 8 — AWS
  12. Section 9 — Azure
  13. Final Hackathon
  14. Why These Technologies?
  15. Final Notes

🚀 How to Use This Repo

This repository is your personal workspace for the entire course. Follow the steps below to get started and submit your work.

Step 1 — Fork the Repository

Click the Fork button at the top-right of this page to create your own copy of the repo under your GitHub account.

https://github.com/aiwithqasim/cloud-data-engineering
         ↓  click Fork
https://github.com/<your-username>/cloud-data-engineering

Step 2 — Clone Your Fork Locally

git clone https://github.com/<your-username>/cloud-data-engineering.git
cd cloud-data-engineering

Step 3 — Create Your Batch Folder

Inside the repo, create a folder with your name and batch number. Keep all your class code, notes, and project files inside it throughout the course.

cloud-data-engineering/
└── students/
    └── <your-name>-<batch>/        ← your personal folder
        ├── sql/
        ├── python/
        ├── airflow/
        ├── snowflake-dbt/
        ├── kafka/
        ├── aws/
        └── azure/

Step 4 — Commit & Push After Every Class

After each class, stage your work and push it to your fork:

git add .
git commit -m "section: add <topic> notes and exercises"
git push origin main

Step 5 — Submit via Pull Request

Once you have completed the course (or a major section), open a Pull Request from your fork back to the main repo to submit your work for review.

  1. Go to your fork on GitHub
  2. Click Contribute → Open pull request
  3. Set the title to: [Batch X] <Your Name> — Course Submission
  4. In the description briefly mention: sections completed, projects built, and any highlights
  5. Submit — your instructor will review and provide feedback

Important — Course Content Changes: Any changes to course content, folder structure, file paths, or the roadmap must be discussed with Qasim first; do not modify, rename, or restructure existing content without prior approval. If you have a suggestion or improvement, raise it for discussion before making any changes. Unauthorised changes to course material will not be accepted in Pull Requests.

Note: Keep your fork up to date with the main repo as new content is added by running:

git remote add upstream https://github.com/aiwithqasim/cloud-data-engineering.git
git fetch upstream
git merge upstream/main

🗓 Course Summary

Welcome to the Cloud Data Engineering course — a comprehensive, instructor-led program designed to take you from zero to job-ready as a Cloud Data Engineer.

Section Topic Duration
Week 1 Orientation, Setup, GitHub, LinkedIn 1 week
Section 1 SQL 4 weeks
Section 2 Python 4 weeks
Section 3 Apache Airflow 2 weeks
Section 4 CI/CD, Docker & Bash Scripting 2 weeks
Section 5 Agentic Vibe Engineering 1 week
Section 6 Snowflake + DBT 4 weeks
Section 7 Apache Kafka 2 weeks
Section 8 AWS 4 weeks
Section 9 Azure 3 weeks
Total ~27 weeks (~7 months)

Delivery Approach

  • Format: Instructor-led live classes (3 hours each), recorded for replay
  • Frequency: 2 classes per week
  • Each section includes: Theory + hands-on coding + real-world projects
  • Projects: Every major section closes with at least one end-to-end project
  • Support: Community forum + office hours for doubt resolution
  • Prerequisites: Basic computer literacy; no prior data engineering experience needed

📂 Understanding Data Engineering (PPT)


🟢 Week 1 — Orientation

Duration: 1 week

  • Environment setup (VS Code, Git, Python, WSL)
  • GitHub account setup & repository basics
  • LinkedIn profile optimization for data engineering roles
  • Roadmap walkthrough — what to expect from the course

🗄️ Section 1 — SQL (4 weeks)

7 classes + 1 capstone project | 3 Snowflake badges

Class Topic Duration
Class 1 Querying, Sorting, Filtering & Set Operators 3 hrs
Class 2 Joins & Views 3 hrs
Class 3 Grouping, Subqueries & Useful Tips 3 hrs
Class 4 Modifying Data, DDL, Data Types & Constraints 3 hrs
Class 5 CTEs, Pivot, Expressions & Window Functions 3 hrs
Class 6 Indexes & Stored Procedures 3 hrs
Class 7 Interview Prep + Capstone Project 3 hrs

What you'll cover:

  • SELECT, filtering, sorting, set operators (UNION, INTERSECT, EXCEPT)
  • All JOIN types, Views (including indexed/materialized views)
  • GROUP BY, ROLLUP, CUBE, GROUPING SETS, subqueries, EXISTS/ANY/ALL
  • DML (INSERT, UPDATE, DELETE, MERGE), DDL, data types, constraints
  • CTEs (including recursive), PIVOT/UNPIVOT, CASE expressions
  • Window functions: ROW_NUMBER, RANK, LAG, LEAD, FIRST_VALUE, aggregate windows
  • Indexes (clustered, non-clustered, filtered, composite), stored procedures, error handling
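
Window functions are easiest to grasp on a tiny dataset. The sketch below uses Python's built-in sqlite3 module (SQLite 3.25+ supports window functions) to demonstrate ROW_NUMBER and LAG; the sales table and its columns are made up for illustration.

```python
import sqlite3

# In-memory database with a small, hypothetical "sales" table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, month TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', '2024-01', 100), ('east', '2024-02', 150),
        ('west', '2024-01', 200), ('west', '2024-02', 180);
""")

# ROW_NUMBER ranks rows within each region; LAG pulls the previous month's amount.
rows = conn.execute("""
    SELECT region, month, amount,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY month) AS rn,
           LAG(amount)  OVER (PARTITION BY region ORDER BY month) AS prev
    FROM sales
    ORDER BY region, month
""").fetchall()

for r in rows:
    print(r)  # first row per region has prev = None
```

The PARTITION BY clause restarts both the row numbering and the LAG lookback at each region boundary, which is the core idea behind every window function covered in this section.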

Capstone Project:

  • End-to-end project: schema design, data ingestion, analytical queries, views, stored procedures
  • Snowflake Badge preparation walkthrough (3 badges)

🐍 Section 2 — Python (4 weeks)

6 classes + 1 ETL project

Class Topic Duration
Class 1 Python Foundations 3 hrs
Class 2 Dictionaries, Input & String Handling 3 hrs
Class 3 Functions, Loops & OOP 3 hrs
Class 4 File Handling, CSV, JSON & Error Handling 3 hrs
Class 5 NumPy & Matplotlib 3 hrs
Class 6 Pandas 3 hrs
Supplementary topics via video resources: Classes, Web Scraping

What you'll cover:

  • Variables, control flow, lists, tuples, dictionaries, loops
  • Functions (default args, *args, closures), OOP (classes, methods, attributes)
  • File I/O, CSV, JSON, exception handling
  • NumPy arrays, statistics, random data generation
  • Matplotlib: line, scatter, histogram, chart customization
  • Pandas: DataFrames, indexing (loc/iloc), filtering, groupby, merging, visualization

Project: ETL pipeline with Python + Pandas + SQL
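
To show the shape of such a pipeline, here is a minimal extract-transform-load sketch. It uses only the standard library (csv and sqlite3 rather than Pandas) so it runs anywhere; the column names and the passing-score rule are invented for the example.

```python
import csv
import io
import sqlite3

# Extract: parse a small CSV (an in-memory string standing in for a real file).
raw = io.StringIO("name,score\nalice,90\nbob,75\ncara,88\n")
records = list(csv.DictReader(raw))

# Transform: cast types and keep only passing scores (>= 80).
passing = [(r["name"], int(r["score"])) for r in records if int(r["score"]) >= 80]

# Load: write the cleaned rows into SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO results VALUES (?, ?)", passing)

count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(count)  # 2 rows survive the filter
```

In the course project the extract step reads real files, the transform step uses Pandas DataFrames, and the load step targets a proper SQL database, but the three-stage structure is the same.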

⏳ Section 3 — Airflow (2 weeks)

3 classes

Class Topic Duration
Class 1 Introduction, Architecture & Setup (Docker + WSL) 3 hrs
Class 2 Weather ETL Project — End-to-End Airflow Pipeline 3 hrs
Class 3 FMP Parallel ETL Pipeline on AWS EC2 3 hrs

What you'll cover:

  • DAG concept, core components (Scheduler, Executor, Webserver, Metadata DB, XCom)
  • Executor types (Local, Celery, Kubernetes), task lifecycle, Connections & Variables
  • Airflow 2.x vs 3.0: TaskFlow API, event-driven scheduling (Assets), React UI
  • PythonOperator, HttpSensor, HttpOperator, SQLExecuteQueryOperator, PostgresHook
  • TaskGroups for parallel execution, retry policies, backfilling
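
Airflow's central abstraction is the DAG: tasks plus dependencies, run in dependency order. This toy sketch (plain Python with the stdlib's graphlib, deliberately not the Airflow API) shows the idea a real scheduler implements; the three-task pipeline is hypothetical.

```python
from graphlib import TopologicalSorter

# A hypothetical extract -> transform -> load pipeline. `results` plays the
# role of XCom: a place for downstream tasks to read upstream outputs.
results = {}

def extract():
    results["raw"] = [1, 2, 3]

def transform():
    results["clean"] = [x * 10 for x in results["raw"]]

def load():
    results["loaded"] = sum(results["clean"])

tasks = {"extract": extract, "transform": transform, "load": load}
# Each task maps to the set of tasks it depends on (its upstreams).
deps = {"transform": {"extract"}, "load": {"transform"}}

# A toy "scheduler": run each task only after all its upstreams finish.
order = list(TopologicalSorter(deps).static_order())
for name in order:
    tasks[name]()

print(order, results["loaded"])
```

A real Airflow deployment adds everything this toy lacks: scheduling intervals, retries, parallel executors, and a metadata database recording every run.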

Projects:

  • Weather ETL Pipeline — Daily pipeline using Open-Meteo API → pandas → SQLite, deployed via Docker Compose
  • Parallel ETL on AWS — Production-style parallel pipeline: FMP API + S3 CSV → RDS PostgreSQL → S3 export, using TaskGroups on AWS EC2

🐋 Section 4 — CI/CD, Docker & Bash Scripting (2 weeks)

2 classes

Class Topic Duration
Docker Class 1 Docker Fundamentals + PostgreSQL + Data Ingestion 3 hrs
CI/CD Class 1 Continuous Integration & Deployment for Data Engineers 3 hrs

What you'll cover:

Docker:

  • Containers vs VMs, core commands, volumes, networking
  • Dockerizing Python pipelines, multi-stage builds with uv
  • PostgreSQL in Docker, pgAdmin, Docker Compose for multi-container setups
  • NY Taxi dataset ingestion with pandas + SQLAlchemy (chunked, CLI-parameterized)
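
The class implements chunked ingestion with pandas and SQLAlchemy; the sketch below shows the same pattern using only the standard library, with a tiny in-memory CSV standing in for the NY Taxi file and an artificially small chunk size.

```python
import csv
import io
import itertools
import sqlite3

def ingest_chunked(reader, conn, chunk_size=2):
    """Insert rows in fixed-size batches so a huge file never sits in memory."""
    total = 0
    while True:
        chunk = list(itertools.islice(reader, chunk_size))
        if not chunk:
            break
        conn.executemany(
            "INSERT INTO trips (pickup, fare) VALUES (?, ?)",
            [(r["pickup"], float(r["fare"])) for r in chunk],
        )
        conn.commit()  # one transaction per chunk
        total += len(chunk)
    return total

# Hypothetical mini stand-in for the NY Taxi CSV.
raw = io.StringIO("pickup,fare\nA,9.5\nB,12.0\nC,7.25\nD,30.0\nE,5.0\n")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup TEXT, fare REAL)")
n = ingest_chunked(csv.DictReader(raw), conn, chunk_size=2)
print(n)  # 5 rows ingested in chunks of 2
```

Committing per chunk keeps memory flat and means a failure mid-file loses at most one batch, which is why the same pattern appears in the pandas version via its chunksize parameter.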

CI/CD:

  • GitHub Actions: workflows, triggers, jobs, steps, runners, secrets
  • Code quality: ruff, mypy, sqlfluff, pre-commit hooks
  • Automated testing with pytest — unit, integration, data quality checks
  • Docker image builds in CI, image scanning with trivy
  • Deploying DAGs, dbt models, Terraform infra, Docker containers via CD pipelines
  • End-to-end: Python ETL → GitHub Actions → Docker → AWS
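
To make "data quality checks" concrete, here is the kind of check a CI job would run via pytest. The sketch uses plain functions and assert so it is self-contained; the check names and the sample batch are invented for illustration.

```python
# Hypothetical data-quality checks of the kind a CI pipeline would run.
def check_no_nulls(rows, column):
    assert all(r.get(column) is not None for r in rows), f"nulls in {column}"

def check_unique(rows, column):
    values = [r[column] for r in rows]
    assert len(values) == len(set(values)), f"duplicates in {column}"

def check_range(rows, column, lo, hi):
    assert all(lo <= r[column] <= hi for r in rows), f"{column} out of range"

# A tiny sample batch; in CI these rows would come from a staging table.
batch = [
    {"id": 1, "amount": 40.0},
    {"id": 2, "amount": 12.5},
    {"id": 3, "amount": 99.9},
]

check_no_nulls(batch, "id")
check_unique(batch, "id")
check_range(batch, "amount", 0, 100)
print("all checks passed")
```

Wired into GitHub Actions, a failing assertion fails the job, so a pull request introducing bad data or a broken transformation never reaches deployment.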

🤖 Section 5 — Agentic Vibe Engineering (1 week)

What you'll cover:

  • Deep dive into Claude Code (Skills, MCP, Hooks, Subagents, Sandboxes, Orchestrators)
  • Hands-on with Cursor, Codex, Copilot
  • Swarms, Agent Teams, Claude Agent SDK
  • Ralph Loops, GSD, Gas Town, OpenClaw, sprites.dev

Tools: Cursor · Codex · Claude · Copilot

❄️ Section 6 — Snowflake + DBT (4 weeks)

4 projects

What you'll cover:

  • Snowflake architecture: databases, schemas, roles, virtual warehouses
  • Data loading methods: Web UI, SnowSQL CLI, S3 integration, Snowpipe
  • Streams, Tasks, Stored Procedures, Time Travel, query optimization, cost management
  • dbt: models, sources, tests, documentation, snapshots, macros, CI/CD integration
  • SCD Type 1 & Type 2 using Snowflake Streams & Tasks
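
In the course, SCD logic is implemented in Snowflake SQL with Streams and Tasks; this plain-Python sketch only illustrates what Type 2 means: close out the current row and append a new current version, keeping full history. The column names (valid_from, valid_to, is_current) are conventional but hypothetical here.

```python
from datetime import date

# A dimension table as a list of rows; SCD Type 2 keeps full history.
dim = [
    {"id": 1, "city": "Karachi", "valid_from": date(2024, 1, 1),
     "valid_to": None, "is_current": True},
]

def scd2_update(dim, key, new_city, change_date):
    """Close out the current row for `key` and append a new current row."""
    for row in dim:
        if row["id"] == key and row["is_current"]:
            row["valid_to"] = change_date   # close the old version
            row["is_current"] = False
    dim.append({"id": key, "city": new_city, "valid_from": change_date,
                "valid_to": None, "is_current": True})

scd2_update(dim, key=1, new_city="Lahore", change_date=date(2024, 6, 1))
current = [r for r in dim if r["is_current"]]
print(len(dim), current[0]["city"])  # 2 rows of history; current city is Lahore
```

Type 1, by contrast, would simply overwrite the city in place and keep no history; the Streams & Tasks project implements both behaviours as MERGE logic in Snowflake.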

Projects:

  • Snowflake Data Loading — Multiple ingestion methods: Web UI, SnowSQL CLI, AWS S3 with IAM roles & Snowpipe, Time Travel, optimization, and cost management.

  • SCD Data Warehousing — End-to-end pipeline implementing SCD Type 1 & 2. Python (Faker) generates data on EC2 → Apache NiFi moves files to S3 → Snowpipe ingests → Streams & Tasks handle CDC logic. Infrastructure via Terraform.

  • DBT Fundamentals — Ultimate guide to dbt: models, sources, tests, docs, snapshots, macros, and CI/CD integration. From setup to production-grade project structure.

  • End-to-End Banking Data Engineering (Snowflake + dbt + Airflow) — Full ELT pipeline on real-world banking data: raw ingestion into Snowflake, dbt staging/mart layers, data quality tests, and Airflow DAGs for orchestration.

📡 Section 7 — Kafka (2 weeks)

3 classes

Class Topic Duration
Class 1 Installation + Theory + Hands-on 3 hrs
Class 2 Stock Market Kafka Project 3 hrs
Class 3 Kafka CDC Project 3 hrs

What you'll cover:

  • Kafka architecture: topics, partitions, producers, consumers, brokers, offsets
  • Setup via Docker and manual deployment on AWS EC2
  • Python-based producer/consumer implementations
  • Real-time event streaming, Change Data Capture (CDC)
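
No broker is needed to see how partitions and offsets behave. The toy model below (plain Python, deliberately not the kafka-python API) captures two core properties: each partition is an append-only log addressed by offset, and keyed messages always land on the same partition.

```python
class ToyTopic:
    """In-memory stand-in for a Kafka topic: N append-only partition logs."""

    def __init__(self, partitions=2):
        self.logs = [[] for _ in range(partitions)]

    def produce(self, key, value):
        # Keyed messages hash to a fixed partition, preserving per-key order.
        p = hash(key) % len(self.logs)
        self.logs[p].append(value)
        return p, len(self.logs[p]) - 1      # (partition, offset)

    def consume(self, partition, offset):
        # Consumers track their own offset and can re-read from any point.
        return self.logs[partition][offset:]

topic = ToyTopic(partitions=2)
p, off = topic.produce("AAPL", {"price": 189.5})
topic.produce("AAPL", {"price": 190.1})
messages = topic.consume(p, 0)
print(len(messages))  # both AAPL ticks are on the same partition, in order
```

Real Kafka adds replication, consumer groups, and durable storage on top, but reasoning about "which partition, which offset" transfers directly to the stock market and CDC projects.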

Projects:

  • Kafka 101 — Fundamentals & Stock Market Pipeline — Core Kafka concepts with hands-on Python producer/consumer pipeline ingesting live stock market data through Kafka topics on AWS EC2.

  • Smart City Real-Time Streaming (Kafka + AWS) — End-to-end IoT data ingestion and streaming project. Covers Kafka streaming, AWS services integration, and building a production-grade pipeline to process and visualize city-wide sensor data.

☁️ Section 8 — AWS (4 weeks)

3 tracks + 1 capstone

Track 1 — AWS Data Warehousing

(Glue · Crawler · Athena · Redshift · S3)

End-to-end AWS data engineering series: S3 ingestion → Glue Crawler schema discovery → Glue ETL transformations → serverless Athena queries → Redshift analytics → QuickSight dashboarding. Includes Python, SQL, IAM, and real-world project.

Track 2 — Event Driven Architecture

(Lambda · SQS · Step Functions · SNS · EventBridge)

  • S3 + Lambda + CloudWatch (Stock Prices) — Serverless pipeline automating stock price data processing via S3-triggered Lambda. Covers S3 event configuration, Lambda deployment & optimization, and CloudWatch monitoring.

  • Snowflake + S3 + Lambda + EventBridge (Currency Exchange Rates) — Scheduled serverless ETL fetching live exchange rates via Lambda → raw JSON to S3 → structured data loaded into Snowflake via stored procedures. EventBridge for scheduling, Secrets Manager for credentials.

Track 3 — Infrastructure as Code

(ECS · EKS · CodePipeline · Terraform)

  • Provisioning and managing AWS infrastructure as code using Terraform
  • Container orchestration with ECS and EKS
  • Automated deployment pipelines with CodePipeline

AWS Capstone Project

AWS Masterclass for Data Engineers — Full-stack AWS data engineering project tying together S3, Glue, Athena, Redshift, Lambda, EventBridge, SQS, SNS, and Step Functions into a production-grade end-to-end pipeline.

🔷 Section 9 — Azure (3 weeks)

3 tracks

  • Medallion Architecture (ADF + Databricks) — Implement Bronze/Silver/Gold layered data architecture using Azure Data Factory for ingestion and Azure Databricks for transformation.

  • Azure Fabric — End-to-end analytics platform: data integration, real-time intelligence, data warehousing, and Power BI reporting in a unified SaaS environment.

  • Azure Synapse Analytics — Unified analytics service combining big data processing and enterprise data warehousing with dedicated and serverless SQL pools.

🏁 Final Hackathon

The course finishes with a one-day final hackathon: an intensive, hands-on sprint where you implement a case study grounded in everything you have learned so far.

You will work from a realistic brief — similar to what you might see on a data engineering team — and translate requirements into a working solution within the day. Expect to combine skills across the stack: querying and modeling, Python automation, orchestration and quality practices, and cloud services (for example AWS or Azure patterns covered in the program), depending on what the scenario demands.

Why it matters: This is where individual topics come together. Instead of isolated exercises, you practice scoping the problem, making trade-offs, debugging under time pressure, and presenting something concrete you can talk about in interviews or portfolios.

❓ Why These Technologies?

The technologies in this course — Python, SQL, Snowflake, dbt, Airflow, Kafka, AWS, Azure — are the most in-demand in the data engineering industry today.

Each section builds on the previous one, reinforcing both theory and hands-on practice so you are job-ready by the end.

📝 Final Notes

Throughout this course you will engage in hands-on projects, assignments, and real-world case studies that simulate production data engineering challenges.

⚡ Get ready to embark on this exciting journey of becoming a proficient Cloud Data Engineer! 🚀