A production-grade, end-to-end data engineering project built with Apache Airflow, PostgreSQL, and Docker. This pipeline implements a medallion architecture (Bronze β Silver β Gold) to process e-commerce data & generate business intelligence insights.
- Overview
- Architecture
- Tech Stack
- Features
- Project Structure
- Setup Instructions
- DAG Workflows
- Data Model
- Analytics & Visualizations
- Production Features
- Sample Queries
- Future Enhancements
- Contributing
- License
This project demonstrates a real-world data engineering pipeline that:
- Ingests data from REST APIs and generates realistic e-commerce transactions
- Implements data quality checks and transformations
- Creates business-ready analytics with RFM segmentation(recency, frequency, monetary), inventory health monitoring, and campaign ROI analysis
- Tracks historical changes using SCD Type 2 (Slowly Changing Dimensions)
- Provides self-service BI dashboards with Metabase
Use Case: E-commerce analytics platform for tracking sales, customers, inventory, and marketing campaigns.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DATA SOURCES β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β Fake Store β β Order β β Inventory β β
β β API β β Generator β β Generator β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββ
β
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β π₯ BRONZE LAYER (Raw Data) β
β β’ bronze_products β’ bronze_orders β
β β’ bronze_customers β’ bronze_inventory β
β β’ bronze_campaigns β
β β Full audit trail β Source metadata β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββ
β
β Data Quality Checks
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β π₯ SILVER LAYER (Cleaned & Validated) β
β β’ silver_products β’ silver_orders β
β β’ silver_customers β’ silver_inventory β
β β’ silver_campaigns β
β β Deduplicated β Type validated β
β β Enriched fields β Referential integrity β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββ
β
β Aggregations & Analytics
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β π₯ GOLD LAYER (Business Analytics) β
β β’ gold_daily_revenue β
β β’ gold_product_performance β
β β’ gold_customer_segments (RFM) β
β β’ gold_inventory_health β
β β’ gold_campaign_roi β
β β Business KPIs β Ready for BI tools β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββ
β
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β π VISUALIZATION LAYER β
β (Metabase) β
β β’ Revenue Dashboards β’ Customer Insights β
β β’ Product Analytics β’ Inventory Alerts β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DOCKER COMPOSE STACK β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ββββββββββββββ ββββββββββββββ βββββββββββββββ β
β β Airflow β β Airflow β β Warehouse β β
β β Webserver β β Scheduler β β PostgreSQL β β
β β :8080 β β β β :5433 β β
β ββββββββββββββ ββββββββββββββ βββββββββββββββ β
β β
β ββββββββββββββ ββββββββββββββ βββββββββββββββ β
β β Airflow β β Metadata β β Metabase β β
β β Worker β β PostgreSQL β β :3000 β β
β β β β :5432 β β β β
β ββββββββββββββ ββββββββββββββ βββββββββββββββ β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
| Component | Technology | Version |
|---|---|---|
| Orchestration | Apache Airflow | 2.8.1 |
| Data Warehouse | PostgreSQL | 15 |
| Containerization | Docker Compose | 3.8 |
| Language | Python | 3.11 |
| Visualization | Metabase | Latest |
| Object Storage | MinIO | Latest |
| API Source | Fake Store API | - |
apache-airflow-providers-postgres- Database connectivitypandas- Data manipulationfaker- Synthetic data generationrequests- API callspsycopg2-binary- PostgreSQL adapter
- β Automated daily ingestion from REST APIs
- β Incremental loading with upsert logic
- β Data quality validation (email format, price ranges, referential integrity)
- β Error handling with retries and alerting
- β Full audit trail (source, timestamp, pipeline_run_id)
- β RFM Customer Segmentation (Recency, Frequency, Monetary)
- β Product Performance Metrics (revenue, profit margin, rankings)
- β Inventory Health Monitoring (low stock, overstock, dead stock alerts)
- β Campaign ROI Analysis (ROAS, cost per order, conversion rates)
- β Daily Revenue Trends (YoY growth, weekend patterns)
- β SCD Type 2 - Historical tracking of price changes and customer tier progression
- β XCom - Inter-task communication for data sharing
- β External Task Sensors - DAG dependency management
- β Dynamic Task Generation - Scalable pipeline design
- β Metabase Integration - Self-service BI dashboards
- Docker Desktop (4.0+)
- Docker Compose (3.8+)
- 8GB RAM minimum
- 10GB free disk space
- Clone the repository
git clone https://github.com/yourusername/ecommerce-data-pipeline.git
cd ecommerce-data-pipeline- Create required directories
mkdir -p dags logs plugins data/incoming data/archive sql scripts tests- Build and start services
docker-compose build
docker-compose up -d- Wait for services to initialize (2-3 minutes)
# Check service health
docker-compose ps- Access Airflow UI
- URL: http://localhost:8080
- Username:
airflow - Password:
airflow
- Configure Airflow connection
# Add warehouse database connection
docker exec -it airflow connections add 'warehouse_db' \
--conn-type 'postgres' \
--conn-host 'warehouse-db' \
--conn-schema 'data_warehouse' \
--conn-login 'warehouse' \
--conn-password 'warehouse' \
--conn-port 5432Or via Airflow UI:
- Go to Admin β Connections β Add
- Connection Id:
warehouse_db - Connection Type:
Postgres - Host:
warehouse-db - Schema:
data_warehouse - Login:
warehouse - Password:
warehouse - Port:
5432
- Trigger the pipeline
# In Airflow UI, unpause and trigger:
# 1. bronze_ingestion
# 2. silver_transformation (auto-triggers)
# 3. gold_analytics (auto-triggers)
# 4. scd_maintenance- Access Metabase (optional)
- URL: http://localhost:3000
- Setup account and connect to
warehouse-db:5432/data_warehouse
Built by Manav Gupta