manavgupta26/Ecommerce-Data-Pipeline

🛒 E-Commerce Data Pipeline

A production-grade, end-to-end data engineering project built with Apache Airflow, PostgreSQL, and Docker. The pipeline implements a medallion architecture (Bronze → Silver → Gold) to process e-commerce data and generate business intelligence insights.


🎯 Overview

This project demonstrates a real-world data engineering pipeline that:

  • Ingests data from a REST API (Fake Store API) and generates synthetic but realistic e-commerce transactions
  • Implements data quality checks and transformations
  • Creates business-ready analytics: RFM (recency, frequency, monetary) segmentation, inventory health monitoring, and campaign ROI analysis
  • Tracks historical changes using SCD Type 2 (Slowly Changing Dimensions)
  • Provides self-service BI dashboards with Metabase

Use Case: E-commerce analytics platform for tracking sales, customers, inventory, and marketing campaigns.


🏗️ Architecture

Medallion Architecture

┌─────────────────────────────────────────────────────────┐
│                    DATA SOURCES                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │ Fake Store   │  │    Order     │  │  Inventory   │   │
│  │     API      │  │  Generator   │  │   Generator  │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└────────────────────────┬────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────┐
│              🥉 BRONZE LAYER (Raw Data)                 │
│  • bronze_products      • bronze_orders                 │
│  • bronze_customers     • bronze_inventory              │
│  • bronze_campaigns                                     │
│  ✓ Full audit trail     ✓ Source metadata               │
└────────────────────────┬────────────────────────────────┘
                         │
                         ↓ Data Quality Checks
                         │
┌─────────────────────────────────────────────────────────┐
│           🥈 SILVER LAYER (Cleaned & Validated)         │
│  • silver_products      • silver_orders                 │
│  • silver_customers     • silver_inventory              │
│  • silver_campaigns                                     │
│  ✓ Deduplicated        ✓ Type validated                 │
│  ✓ Enriched fields     ✓ Referential integrity          │
└────────────────────────┬────────────────────────────────┘
                         │
                         ↓ Aggregations & Analytics
                         │
┌─────────────────────────────────────────────────────────┐
│         🥇 GOLD LAYER (Business Analytics)              │
│  • gold_daily_revenue                                   │
│  • gold_product_performance                             │
│  • gold_customer_segments (RFM)                         │
│  • gold_inventory_health                                │
│  • gold_campaign_roi                                    │
│  ✓ Business KPIs       ✓ Ready for BI tools             │
└────────────────────────┬────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────┐
│              📊 VISUALIZATION LAYER                     │
│                    (Metabase)                           │
│  • Revenue Dashboards   • Customer Insights             │
│  • Product Analytics    • Inventory Alerts              │
└─────────────────────────────────────────────────────────┘
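
The Bronze → Silver → Gold flow in the diagram can be sketched in plain Python. This is a minimal illustration rather than the repository's code: the record fields (`order_id`, `customer_email`, `amount`, `order_date`), the email pattern, and the function names `to_silver` and `to_gold_daily_revenue` are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical email pattern; the project's actual validation rule is not shown here.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def to_silver(bronze_rows):
    """Deduplicate on order_id, coerce amount to float, and validate email format."""
    seen, silver = set(), []
    for row in bronze_rows:
        if row["order_id"] in seen:
            continue  # drop repeated order ids, keeping the first valid occurrence
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # reject rows whose amount is not numeric
        email = row.get("customer_email") or ""
        if amount <= 0 or not EMAIL_RE.match(email):
            continue  # reject non-positive amounts and malformed emails
        seen.add(row["order_id"])
        silver.append({**row, "amount": amount})
    return silver


def to_gold_daily_revenue(silver_rows):
    """Aggregate validated orders into total revenue per order date."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["order_date"]] += row["amount"]
    return dict(revenue)


# Raw (bronze) records: one duplicate and one invalid email.
bronze = [
    {"order_id": 1, "customer_email": "a@example.com", "amount": "19.99", "order_date": "2024-01-01"},
    {"order_id": 1, "customer_email": "a@example.com", "amount": "19.99", "order_date": "2024-01-01"},
    {"order_id": 2, "customer_email": "not-an-email", "amount": "5.00", "order_date": "2024-01-01"},
    {"order_id": 3, "customer_email": "b@example.com", "amount": "30.01", "order_date": "2024-01-02"},
]
silver = to_silver(bronze)
gold = to_gold_daily_revenue(silver)
```

In the project itself these steps presumably run as SQL or Airflow tasks against the `bronze_*`, `silver_*`, and `gold_*` tables; the sketch only shows the shape of each layer's responsibility.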

Infrastructure Architecture

┌──────────────────────────────────────────────────────┐
│              DOCKER COMPOSE STACK                    │
├──────────────────────────────────────────────────────┤
│                                                      │
│  ┌────────────┐  ┌────────────┐  ┌─────────────┐     │
│  │  Airflow   │  │  Airflow   │  │  Warehouse  │     │
│  │ Webserver  │  │ Scheduler  │  │  PostgreSQL │     │
│  │ :8080      │  │            │  │  :5433      │     │
│  └────────────┘  └────────────┘  └─────────────┘     │
│                                                      │
│  ┌────────────┐  ┌────────────┐  ┌─────────────┐     │
│  │  Airflow   │  │  Metadata  │  │  Metabase   │     │
│  │   Worker   │  │ PostgreSQL │  │  :3000      │     │
│  │            │  │  :5432     │  │             │     │
│  └────────────┘  └────────────┘  └─────────────┘     │
│                                                      │
└──────────────────────────────────────────────────────┘

🛠️ Tech Stack

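| Component        | Technology     | Version                   |
|------------------|----------------|---------------------------|
| Orchestration    | Apache Airflow | 2.8.1                     |
| Data Warehouse   | PostgreSQL     | 15                        |
| Containerization | Docker Compose | 3.8 (Compose file format) |
| Language         | Python         | 3.11                      |
| Visualization    | Metabase       | Latest                    |
| Object Storage   | MinIO          | Latest                    |
| API Source       | Fake Store API | -                         |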

Python Libraries

  • apache-airflow-providers-postgres - Database connectivity
  • pandas - Data manipulation
  • faker - Synthetic data generation
  • requests - API calls
  • psycopg2-binary - PostgreSQL adapter

✨ Features

🔄 Data Pipeline

  • ✅ Automated daily ingestion from REST APIs
  • ✅ Incremental loading with upsert logic
  • ✅ Data quality validation (email format, price ranges, referential integrity)
  • ✅ Error handling with retries and alerting
  • ✅ Full audit trail (source, timestamp, pipeline_run_id)
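
Incremental loading with upsert logic and audit fields can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: SQLite (from the Python standard library) stands in for PostgreSQL so the example runs anywhere, the column names and the `upsert_products` helper are hypothetical, and only the table name `bronze_products` and the audit fields (source, timestamp, pipeline_run_id) come from this README. PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form, with `%s` placeholders when using psycopg2.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bronze_products (
        product_id      INTEGER PRIMARY KEY,
        title           TEXT,
        price           REAL,
        source          TEXT,   -- audit: where the row came from
        ingested_at     TEXT,   -- audit: load timestamp (UTC, ISO 8601)
        pipeline_run_id TEXT    -- audit: which run wrote the row
    )
    """
)

# Insert new rows; on a primary-key collision, overwrite with the incoming values.
UPSERT_SQL = """
INSERT INTO bronze_products (product_id, title, price, source, ingested_at, pipeline_run_id)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT (product_id) DO UPDATE SET
    title = excluded.title,
    price = excluded.price,
    source = excluded.source,
    ingested_at = excluded.ingested_at,
    pipeline_run_id = excluded.pipeline_run_id
"""


def upsert_products(conn, products, source, run_id):
    """Upsert a batch of products, stamping each row with audit metadata."""
    now = datetime.now(timezone.utc).isoformat()
    rows = [(p["id"], p["title"], p["price"], source, now, run_id) for p in products]
    with conn:  # commit on success, roll back on error
        conn.executemany(UPSERT_SQL, rows)


# First run loads one product; the second run updates it and adds another.
upsert_products(conn, [{"id": 1, "title": "Backpack", "price": 109.95}], "fake_store_api", "run_001")
upsert_products(
    conn,
    [{"id": 1, "title": "Backpack", "price": 99.95}, {"id": 2, "title": "T-Shirt", "price": 22.30}],
    "fake_store_api",
    "run_002",
)
rows = conn.execute(
    "SELECT product_id, price, pipeline_run_id FROM bronze_products ORDER BY product_id"
).fetchall()
```

Re-running the same batch leaves the row count unchanged, which is what makes the load safe to retry.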

📊 Analytics

  • ✅ RFM Customer Segmentation (Recency, Frequency, Monetary)
  • ✅ Product Performance Metrics (revenue, profit margin, rankings)
  • ✅ Inventory Health Monitoring (low stock, overstock, dead stock alerts)
  • ✅ Campaign ROI Analysis (ROAS, cost per order, conversion rates)
  • ✅ Daily Revenue Trends (YoY growth, weekend patterns)
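
As an illustration of the RFM idea, the sketch below computes recency, frequency, and monetary values per customer and maps them to a segment. The thresholds, the 1-3 scoring scheme, the segment names, and the helper names (`rfm_metrics`, `score`, `segment`) are assumptions for the example; the scoring actually used for `gold_customer_segments` is not documented in this README.

```python
from datetime import date


def rfm_metrics(orders, today):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    metrics = {}
    for o in orders:
        last, freq, total = metrics.get(o["customer_id"], (None, 0, 0.0))
        if last is None or o["order_date"] > last:
            last = o["order_date"]  # track the most recent order date
        metrics[o["customer_id"]] = (last, freq + 1, total + o["amount"])
    return {cid: ((today - last).days, f, m) for cid, (last, f, m) in metrics.items()}


def score(value, thresholds, reverse=False):
    """Map a value to a 1-3 score using two ascending thresholds."""
    low, high = thresholds
    s = 1 if value < low else 2 if value < high else 3
    return 4 - s if reverse else s


def segment(recency_days, frequency, monetary):
    """Combine the three scores into a coarse segment label."""
    # Illustrative thresholds; fewer days since the last order is better,
    # so the recency score is reversed.
    r = score(recency_days, (30, 90), reverse=True)
    f = score(frequency, (2, 5))
    m = score(monetary, (100, 500))
    total = r + f + m
    if total >= 8:
        return "Champions"
    if total >= 5:
        return "Regular"
    return "At Risk"


orders = [
    {"customer_id": "c1", "order_date": date(2024, 3, 1), "amount": 80},
    {"customer_id": "c1", "order_date": date(2024, 4, 1), "amount": 120},
    {"customer_id": "c1", "order_date": date(2024, 5, 10), "amount": 100},
    {"customer_id": "c1", "order_date": date(2024, 6, 1), "amount": 150},
    {"customer_id": "c1", "order_date": date(2024, 6, 25), "amount": 200},
    {"customer_id": "c2", "order_date": date(2024, 1, 15), "amount": 40},
]
metrics = rfm_metrics(orders, today=date(2024, 6, 30))
segments = {cid: segment(*values) for cid, values in metrics.items()}
```

Fixed thresholds keep the example readable; a quantile-based scoring (for example, quintiles over all customers) is a common alternative when the value distribution is not known in advance.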

🗂️ Advanced Features

  • ✅ SCD Type 2 - Historical tracking of price changes and customer tier progression
  • ✅ XCom - Inter-task communication for data sharing
  • ✅ External Task Sensors - DAG dependency management
  • ✅ Dynamic Task Generation - Scalable pipeline design
  • ✅ Metabase Integration - Self-service BI dashboards
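
The SCD Type 2 pattern for price changes can be sketched in a few lines. This is a minimal in-memory illustration, not the project's `scd_maintenance` logic: the column names (`valid_from`, `valid_to`, `is_current`) follow a common convention and are assumptions, as is the `apply_scd2` helper.

```python
from datetime import date


def apply_scd2(history, product_id, new_price, change_date):
    """Close the current version if the price changed, then append a new version."""
    current = next(
        (row for row in history if row["product_id"] == product_id and row["is_current"]),
        None,
    )
    if current is not None:
        if current["price"] == new_price:
            return history  # unchanged attribute: no new version is recorded
        # Expire the old version instead of overwriting it.
        current["valid_to"] = change_date
        current["is_current"] = False
    history.append(
        {
            "product_id": product_id,
            "price": new_price,
            "valid_from": change_date,
            "valid_to": None,  # open-ended until the next change
            "is_current": True,
        }
    )
    return history


history = []
apply_scd2(history, 1, 109.95, date(2024, 1, 1))  # first version
apply_scd2(history, 1, 109.95, date(2024, 2, 1))  # same price: nothing happens
apply_scd2(history, 1, 99.95, date(2024, 3, 1))   # price change: new version
```

Because old versions are closed rather than updated in place, a query for "the price on 2024-02-15" can still be answered by filtering on `valid_from` and `valid_to`.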

🚀 Setup Instructions

Prerequisites

  • Docker Desktop (4.0+)
  • Docker Compose (with support for Compose file format 3.8)
  • 8 GB RAM minimum
  • 10 GB free disk space

Installation

  1. Clone the repository
git clone https://github.com/manavgupta26/Ecommerce-Data-Pipeline.git
cd Ecommerce-Data-Pipeline
  2. Create required directories
mkdir -p dags logs plugins data/incoming data/archive sql scripts tests
  3. Build and start services
docker-compose build
docker-compose up -d
  4. Wait for services to initialize (2-3 minutes)
# Check service health
docker-compose ps
  5. Access Airflow UI (the Airflow webserver is exposed on port 8080)
  6. Configure Airflow connection
# Add warehouse database connection
# Replace <airflow-container> with the name of a running Airflow container (see `docker-compose ps`)
docker exec -it <airflow-container> airflow connections add 'warehouse_db' \
    --conn-type 'postgres' \
    --conn-host 'warehouse-db' \
    --conn-schema 'data_warehouse' \
    --conn-login 'warehouse' \
    --conn-password 'warehouse' \
    --conn-port 5432

Or via Airflow UI:

  • Go to Admin → Connections → Add
  • Connection Id: warehouse_db
  • Connection Type: Postgres
  • Host: warehouse-db
  • Schema: data_warehouse
  • Login: warehouse
  • Password: warehouse
  • Port: 5432
  7. Trigger the pipeline
# In Airflow UI, unpause and trigger:
# 1. bronze_ingestion
# 2. silver_transformation (auto-triggers)
# 3. gold_analytics (auto-triggers)
# 4. scd_maintenance
  8. Access Metabase (optional; Metabase is exposed on port 3000)
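
Instead of waiting a fixed 2-3 minutes after starting the stack, a small script can poll the service ports. The sketch below uses only the Python standard library; the helper name `wait_for_port` is hypothetical, the ports 8080 and 3000 come from the infrastructure diagram, and `localhost` as the host is an assumption about how the stack is published.

```python
import socket
import time


def wait_for_port(host, port, timeout=180.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage (host is an assumption; ports are from the infrastructure diagram):
# wait_for_port("localhost", 8080)  # Airflow webserver
# wait_for_port("localhost", 3000)  # Metabase
```

An open port only shows that the process is listening. Airflow and Metabase may still need a few more seconds before their web UIs respond, so `docker-compose ps` remains the more direct health check.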

⭐ Star History

If you find this project helpful, please consider giving it a star!

Built by Manav Gupta
