- Do not teach topics outside the defined scope of this roadmap.
- If you feel a topic is important or valuable for students, discuss and coordinate with Qasim first before introducing it in class.
- Do not modify, rename, or restructure any course content, file paths, or folder structure without prior approval.
- Any suggestions or improvements are welcome — but must go through discussion before being applied.
- Unauthorised changes to course material will not be accepted in Pull Requests.
— Qasim Hassan · Lead Data Engineer Instructor · Saylani Welfare
- How to Use This Repo
- Course Summary
- Week 1 — Orientation
- Section 1 — SQL
- Section 2 — Python
- Section 3 — Airflow
- Section 4 — CI/CD, Docker & Bash Scripting
- Section 5 — Agentic Vibe Engineering
- Section 6 — Snowflake + DBT
- Section 7 — Kafka
- Section 8 — AWS
- Section 9 — Azure
- Final Hackathon
- Why These Technologies?
This repository is your personal workspace for the entire course. Follow the steps below to get started and submit your work.
Click the Fork button at the top-right of this page to create your own copy of the repo under your GitHub account.
https://github.com/aiwithqasim/cloud-data-engineering
↓ click Fork
https://github.com/<your-username>/cloud-data-engineering
git clone https://github.com/<your-username>/cloud-data-engineering.git
cd cloud-data-engineering
Inside the repo, create a folder with your name and batch number. Keep all your class code, notes, and project files inside it throughout the course.
cloud-data-engineering/
└── students/
└── <your-name>-<batch>/ ← your personal folder
├── sql/
├── python/
├── airflow/
├── snowflake-dbt/
├── kafka/
├── aws/
└── azure/
After each class, stage your work and push it to your fork:
git add .
git commit -m "section: add <topic> notes and exercises"
git push origin main
Once you have completed the course (or a major section), open a Pull Request from your fork back to the main repo to submit your work for review.
- Go to your fork on GitHub
- Click Contribute → Open pull request
- Set the title to:
[Batch X] <Your Name> — Course Submission
- In the description, briefly mention: sections completed, projects built, and any highlights
- Submit — your instructor will review and provide feedback
Important — Course Content Changes: Any changes to course content, folder structure, file paths, or the roadmap must be discussed with Qasim first. Do not modify, rename, or restructure any existing content without prior approval. If you have a suggestion or improvement, reach out and discuss it before making any changes. Unauthorised changes to course material will not be accepted in Pull Requests.
Note: Keep your fork up to date with the main repo as new content is added by running:
git remote add upstream https://github.com/aiwithqasim/cloud-data-engineering.git
git fetch upstream
git merge upstream/main
Welcome to the Cloud Data Engineering course — a comprehensive, instructor-led program designed to take you from zero to job-ready as a Cloud Data Engineer.
| Section | Topic | Duration |
|---|---|---|
| Week 1 | Orientation, Setup, GitHub, LinkedIn | 1 week |
| Section 1 | SQL | 4 weeks |
| Section 2 | Python | 4 weeks |
| Section 3 | Apache Airflow | 2 weeks |
| Section 4 | CI/CD, Docker & Bash Scripting | 2 weeks |
| Section 5 | Agentic Vibe Engineering | 1 week |
| Section 6 | Snowflake + DBT | 4 weeks |
| Section 7 | Apache Kafka | 2 weeks |
| Section 8 | AWS | 4 weeks |
| Section 9 | Azure | 3 weeks |
| Total | | ~27 weeks (~7 months) |
- Format: Instructor-led live classes (3 hours each), recorded for replay
- Frequency: 2 classes per week
- Each section includes: Theory + hands-on coding + real-world projects
- Projects: Every major section closes with at least one end-to-end project
- Support: Community forum + office hours for doubt resolution
- Prerequisites: Basic computer literacy; no prior data engineering experience needed
📂 Understanding Data Engineering (PPT)
Duration: 1 week
- Environment setup (VS Code, Git, Python, WSL)
- GitHub account setup & repository basics
- LinkedIn profile optimization for data engineering roles
- Roadmap walkthrough — what to expect from the course
7 classes + 1 capstone project | 3 Snowflake badges
| Class | Topic | Duration |
|---|---|---|
| Class 1 | Querying, Sorting, Filtering & Set Operators | 3 hrs |
| Class 2 | Joins & Views | 3 hrs |
| Class 3 | Grouping, Subqueries & Useful Tips | 3 hrs |
| Class 4 | Modifying Data, DDL, Data Types & Constraints | 3 hrs |
| Class 5 | CTEs, Pivot, Expressions & Window Functions | 3 hrs |
| Class 6 | Indexes & Stored Procedures | 3 hrs |
| Class 7 | Interview Prep + Capstone Project | 3 hrs |
What you'll cover:
- SELECT, filtering, sorting, set operators (UNION, INTERSECT, EXCEPT)
- All JOIN types, Views (including indexed/materialized views)
- GROUP BY, ROLLUP, CUBE, GROUPING SETS, subqueries, EXISTS/ANY/ALL
- DML (INSERT, UPDATE, DELETE, MERGE), DDL, data types, constraints
- CTEs (including recursive), PIVOT/UNPIVOT, CASE expressions
- Window functions: ROW_NUMBER, RANK, LAG, LEAD, FIRST_VALUE, aggregate windows (see the sketch after this list)
- Indexes (clustered, non-clustered, filtered, composite), stored procedures, error handling
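To make the window-function material concrete, here is a minimal sketch using Python's built-in `sqlite3` module (SQLite 3.25+ supports window functions). The `sales` table and its data are hypothetical, and the course may run these queries on a different SQL engine:

```python
# Minimal window-function sketch with Python's built-in sqlite3 module.
# The sales table and its rows are hypothetical illustrations.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 100), ('north', 250), ('south', 80), ('south', 300);
""")

# RANK each sale within its region, highest amount first
query = """
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
"""
for row in conn.execute(query):
    print(row)  # e.g. ('north', 250, 1)
```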
Capstone Project:
- End-to-end project: schema design, data ingestion, analytical queries, views, stored procedures
- Snowflake Badge preparation walkthrough (3 badges)
6 classes + 1 ETL project
| Class | Topic | Duration |
|---|---|---|
| Class 1 | Python Foundations | 3 hrs |
| Class 2 | Dictionaries, Input & String Handling | 3 hrs |
| Class 3 | Functions, Loops & OOP | 3 hrs |
| Class 4 | File Handling, CSV, JSON & Error Handling | 3 hrs |
| Class 5 | NumPy & Matplotlib | 3 hrs |
| Class 6 | Pandas | 3 hrs |
| + | Classes, Web Scraping (video resources) | — |
What you'll cover:
- Variables, control flow, lists, tuples, dictionaries, loops
- Functions (default args, *args, closures), OOP (classes, methods, attributes)
- File I/O, CSV, JSON, exception handling
- NumPy arrays, statistics, random data generation
- Matplotlib: line, scatter, histogram, chart customization
- Pandas: DataFrames, indexing (loc/iloc), filtering, groupby, merging, visualization
Project: ETL pipeline with Python + Pandas + SQL
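A minimal sketch of the extract-transform-load shape this project follows, assuming a hypothetical `orders.csv` input and SQLite as the target store:

```python
# Minimal ETL sketch: extract a CSV, transform with pandas, load into SQLite.
# File, column, and table names are hypothetical placeholders.
import sqlite3

import pandas as pd

# Extract
df = pd.read_csv("orders.csv")

# Transform: fix types and derive a column
df["order_date"] = pd.to_datetime(df["order_date"])
df["total"] = df["quantity"] * df["unit_price"]

# Load
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("orders", conn, if_exists="replace", index=False)
```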
3 classes
| Class | Topic | Duration |
|---|---|---|
| Class 1 | Introduction, Architecture & Setup (Docker + WSL) | 3 hrs |
| Class 2 | Weather ETL Project — End-to-End Airflow Pipeline | 3 hrs |
| Class 3 | FMP Parallel ETL Pipeline on AWS EC2 | 3 hrs |
What you'll cover:
- DAG concept, core components (Scheduler, Executor, Webserver, Metadata DB, XCom)
- Executor types (Local, Celery, Kubernetes), task lifecycle, Connections & Variables
- Airflow 2.x vs 3.0: TaskFlow API (sketched after this list), event-driven scheduling (Assets), React UI
- PythonOperator, HttpSensor, HttpOperator, SQLExecuteQueryOperator, PostgresHook
- TaskGroups for parallel execution, retry policies, backfilling
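As referenced above, here is a minimal TaskFlow-style DAG sketch (Airflow 2.4+); the task bodies are hypothetical stand-ins for the real project logic:

```python
# Minimal TaskFlow DAG sketch (Airflow 2.4+). Task bodies are hypothetical
# placeholders; the course projects pull from real APIs and databases.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def weather_etl():
    @task
    def extract() -> dict:
        # In the real project this would call an API such as Open-Meteo
        return {"temperature_c": 21.5}

    @task
    def load(record: dict) -> None:
        print(f"would write {record} to the target store")

    load(extract())


weather_etl()
```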
Projects:
- Weather ETL Pipeline — Daily pipeline using Open-Meteo API → pandas → SQLite, deployed via Docker Compose
- Parallel ETL on AWS — Production-style parallel pipeline: FMP API + S3 CSV → RDS PostgreSQL → S3 export, using TaskGroups on AWS EC2
2 classes
| Class | Topic | Duration |
|---|---|---|
| Docker Class 1 | Docker Fundamentals + PostgreSQL + Data Ingestion | 3 hrs |
| CI/CD Class 1 | Continuous Integration & Deployment for Data Engineers | 3 hrs |
What you'll cover:
Docker:
- Containers vs VMs, core commands, volumes, networking
- Dockerizing Python pipelines, multi-stage builds with `uv`
- PostgreSQL in Docker, pgAdmin, Docker Compose for multi-container setups
- NY Taxi dataset ingestion with pandas + SQLAlchemy (chunked, CLI-parameterized; see the sketch below)
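A minimal sketch of that chunked ingestion pattern follows; the connection URL, file name, and table name are hypothetical placeholders:

```python
# Chunked CSV-to-Postgres ingestion sketch with pandas + SQLAlchemy.
# Connection URL, file, and table names are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/ny_taxi")

# Stream the file in chunks so large datasets never sit fully in memory
for chunk in pd.read_csv("yellow_tripdata.csv", chunksize=100_000):
    chunk.to_sql("yellow_taxi_data", engine, if_exists="append", index=False)
```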
CI/CD:
- GitHub Actions: workflows, triggers, jobs, steps, runners, secrets
- Code quality: `ruff`, `mypy`, `sqlfluff`, pre-commit hooks
- Automated testing with `pytest`: unit, integration, and data quality checks
- Docker image builds in CI, image scanning with `trivy`
- Deploying DAGs, dbt models, Terraform infra, and Docker containers via CD pipelines
- End-to-end: Python ETL → GitHub Actions → Docker → AWS
What you'll cover:
- Deep dive into Claude Code (Skills, MCP, Hooks, Subagents, Sandboxes, Orchestrators)
- Hands-on with Cursor, Codex, Copilot
- Swarms, Agent Teams, Claude Agent SDK
- Ralph Loops, GSD, Gas Town, OpenClaw, sprites.dev
Tools: Cursor · Codex · Claude · Copilot
4 projects
What you'll cover:
- Snowflake architecture: databases, schemas, roles, virtual warehouses
- Data loading methods: Web UI, SnowSQL CLI, S3 integration, Snowpipe
- Streams, Tasks, Stored Procedures, Time Travel, query optimization, cost management (Streams & Tasks sketched after this list)
- dbt: models, sources, tests, documentation, snapshots, macros, CI/CD integration
- SCD Type 1 & Type 2 using Snowflake Streams & Tasks
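As referenced above, a minimal sketch of Streams & Tasks driven from Python via the `snowflake-connector-python` package; the account details, credentials, and object names are hypothetical placeholders:

```python
# Streams & Tasks sketch via snowflake-connector-python. Credentials and
# object names (orders, orders_history) are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="COMPUTE_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# A stream captures row-level changes (CDC) on the source table
cur.execute("CREATE OR REPLACE STREAM orders_stream ON TABLE orders")

# A task periodically moves captured changes into a history table
# (assumes orders_history has matching columns plus an action column)
cur.execute("""
    CREATE OR REPLACE TASK merge_orders
        WAREHOUSE = COMPUTE_WH
        SCHEDULE = '5 MINUTE'
        WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM')
    AS
        INSERT INTO orders_history
        SELECT order_id, amount, METADATA$ACTION FROM orders_stream
""")
cur.execute("ALTER TASK merge_orders RESUME")  # tasks start suspended
```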
Projects:
- Snowflake Data Loading — Multiple ingestion methods: Web UI, SnowSQL CLI, AWS S3 with IAM roles & Snowpipe, Time Travel, optimization, and cost management.
  - 🔗 Repo
- SCD Data Warehousing — End-to-end pipeline implementing SCD Type 1 & 2. Python (Faker) generates data on EC2 → Apache NiFi moves files to S3 → Snowpipe ingests → Streams & Tasks handle CDC logic. Infrastructure via Terraform.
  - 🔗 Repo
- DBT Fundamentals — Ultimate guide to dbt: models, sources, tests, docs, snapshots, macros, and CI/CD integration. From setup to production-grade project structure.
  - 🎥 Video
- End-to-End Banking Data Engineering (Snowflake + dbt + Airflow) — Full ELT pipeline on real-world banking data: raw ingestion into Snowflake, dbt staging/mart layers, data quality tests, and Airflow DAGs for orchestration.
  - 🎥 Video
3 classes
| Class | Topic | Duration |
|---|---|---|
| Class 1 | Installation + Theory + Hands-on | 3 hrs |
| Class 2 | Stock Market Kafka Project | 3 hrs |
| Class 3 | Kafka CDC Project | 3 hrs |
What you'll cover:
- Kafka architecture: topics, partitions, producers, consumers, brokers, offsets
- Setup via Docker and manual deployment on AWS EC2
- Python-based producer/consumer implementations (sketched after this list)
- Real-time event streaming, Change Data Capture (CDC)
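As referenced above, a minimal producer/consumer sketch using the `kafka-python` package; the broker address, topic, and payload are hypothetical placeholders:

```python
# Producer/consumer sketch with kafka-python. Broker, topic, and payload
# are hypothetical placeholders.
import json

from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("stock-prices", {"symbol": "AAPL", "price": 189.30})
producer.flush()

consumer = KafkaConsumer(
    "stock-prices",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # {'symbol': 'AAPL', 'price': 189.3}
    break
```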
Projects:
- Kafka 101 — Fundamentals & Stock Market Pipeline — Core Kafka concepts with hands-on Python producer/consumer pipeline ingesting live stock market data through Kafka topics on AWS EC2.
  - 🔗 Repo
- Smart City Real-Time Streaming (Kafka + AWS) — End-to-end IoT data ingestion and streaming project. Covers Kafka streaming, AWS services integration, and building a production-grade pipeline to process and visualize city-wide sensor data.
  - 🎥 Video
3 tracks + 1 capstone
(Glue · Crawler · Athena · Redshift · S3)
End-to-end AWS data engineering series: S3 ingestion → Glue Crawler schema discovery → Glue ETL transformations → serverless Athena queries → Redshift analytics → QuickSight dashboarding. Includes Python, SQL, IAM, and real-world project.
- 🎥 Video
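To give a feel for the Athena step in this track, here is a minimal sketch that starts a serverless query with `boto3`; the region, database, table, and results bucket are hypothetical placeholders:

```python
# Serverless Athena query sketch with boto3. Region, database, table, and
# results bucket are hypothetical placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT vendor, COUNT(*) FROM trips GROUP BY vendor",
    QueryExecutionContext={"Database": "taxi_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for status
```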
(Lambda · SQS · Step Functions · SNS · EventBridge)
- S3 + Lambda + CloudWatch (Stock Prices) — Serverless pipeline automating stock price data processing via S3-triggered Lambda. Covers S3 event configuration, Lambda deployment & optimization, and CloudWatch monitoring.
  - 🔗 Repo
- Snowflake + S3 + Lambda + EventBridge (Currency Exchange Rates) — Scheduled serverless ETL fetching live exchange rates via Lambda → raw JSON to S3 → structured data loaded into Snowflake via stored procedures. EventBridge for scheduling, Secrets Manager for credentials.
  - 🔗 Repo
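In the spirit of the stock-prices project above, a minimal sketch of an S3-triggered Lambda handler; the processing logic is a hypothetical placeholder, though the event structure matches standard S3 notifications:

```python
# S3-triggered Lambda handler sketch. The JSON-parsing logic is a
# hypothetical placeholder; the event structure is the standard S3 shape.
import json

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # Each record describes one object that landed in the bucket
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        prices = json.loads(body)
        print(f"processing {len(prices)} records from s3://{bucket}/{key}")
    return {"statusCode": 200}
```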
(ECS · EKS · CodePipeline · Terraform)
- Provisioning and managing AWS infrastructure as code using Terraform
- Container orchestration with ECS and EKS
- Automated deployment pipelines with CodePipeline
AWS Masterclass for Data Engineers — Full-stack AWS data engineering project tying together S3, Glue, Athena, Redshift, Lambda, EventBridge, SQS, SNS, and Step Functions into a production-grade end-to-end pipeline.
- 🎥 Video
3 tracks
- Medallion Architecture (ADF + Databricks) — Implement Bronze/Silver/Gold layered data architecture using Azure Data Factory for ingestion and Azure Databricks for transformation.
- Azure Fabric — End-to-end analytics platform: data integration, real-time intelligence, data warehousing, and Power BI reporting in a unified SaaS environment.
- Azure Synapse Analytics — Unified analytics service combining big data processing and enterprise data warehousing with dedicated and serverless SQL pools.
The course finishes with a one-day final hackathon: an intensive, hands-on sprint where you implement a case study grounded in everything you have learned so far.
You will work from a realistic brief — similar to what you might see on a data engineering team — and translate requirements into a working solution within the day. Expect to combine skills across the stack: querying and modeling, Python automation, orchestration and quality practices, and cloud services (for example AWS or Azure patterns covered in the program), depending on what the scenario demands.
Why it matters: This is where individual topics come together. Instead of isolated exercises, you practice scoping the problem, making trade-offs, debugging under time pressure, and presenting something concrete you can talk about in interviews or portfolios.
The technologies in this course — Python, SQL, Snowflake, dbt, Airflow, Kafka, AWS, Azure — are the most in-demand in the data engineering industry today.
Each section builds on the previous one, reinforcing both theory and hands-on practice so you are job-ready by the end.
Throughout this course you will engage in hands-on projects, assignments, and real-world case studies that simulate production data engineering challenges.
⚡ Get ready to embark on this exciting journey of becoming a proficient Cloud Data Engineer! 🚀