# Building a Modern Data Platform with DuckDB, dbt, and MinIO
- 🎯 Overview
- 🛠️ Prerequisites
- 🚀 Setup Instructions
- 📊 Data Ingestion Process
- 📁 Project Structure
- 🔧 Configuration
- 📈 Usage
- 📝 Documentation
## 🎯 Overview

This project implements a modern data lakehouse architecture using:
- Object Storage: MinIO (S3-compatible) for raw data lake storage
- Compute Engine: DuckDB for fast analytical queries
- Transformation: dbt-duckdb for data modeling & transformation
- Data Quality: Soda Core for automated quality checks
- BI Tool: Apache Superset / Metabase for data visualization
The project follows the Medallion Architecture with three layers:
- 🥉 Bronze Layer: Raw, unprocessed data from sources
- 🥈 Silver Layer: Validated, standardized, and enriched data
- 🥇 Gold Layer: Business-ready analytics and reporting
## 🛠️ Prerequisites

- Docker and Docker Compose
- Python 3.8+
- pip package manager
## 🚀 Setup Instructions

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd lakehouse-project
   ```

2. Install the Python dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Start the Docker services:

   ```bash
   docker-compose -f docker/docker-compose.yml up -d
   ```

Wait for MinIO to be ready (approximately 30 seconds), then access the MinIO Console at http://localhost:9001 with the following credentials:
- Access Key: `minioadmin`
- Secret Key: `minioadmin123`
After startup, MinIO will automatically create the following buckets:

- `bronze`: Raw data storage
- `silver`: Cleaned data storage
- `gold`: Analytics data storage
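The repository's actual `docker/docker-compose.yml` is not shown here, but a minimal sketch of this setup could pair a MinIO service with a one-shot `mc` init container that creates the three buckets (service names, image tags, and the init pattern are assumptions):

```yaml
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # Web console
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin123

  createbuckets:
    image: minio/mc
    depends_on:
      - minio
    # -p ignores "bucket already exists" errors on restarts
    entrypoint: >
      /bin/sh -c "
      mc alias set local http://minio:9000 minioadmin minioadmin123 &&
      mc mb -p local/bronze local/silver local/gold
      "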
## 📊 Data Ingestion Process

First, download the dataset:

```bash
python scripts/download_data.py
```

This script will:

- Download the UCI Online Retail II dataset from the official source
- Extract the Excel file
- Display information about the dataset structure
Then ingest it to MinIO:

```bash
python scripts/ingest_data.py
```

This script will:

- Load the Excel file with multiple sheets
- Convert each sheet to Parquet format
- Upload the Parquet files to the `bronze` bucket in MinIO
- Add a timestamp to each filename for versioning
## 📁 Project Structure

```text
lakehouse-project/
├── 🐳 docker/
│   ├── docker-compose.yml
│   └── minio/
├── 💾 data/
│   ├── bronze/              # Raw data
│   ├── silver/              # Cleaned data
│   └── gold/                # Analytics data
├── 🔄 dbt/
│   ├── models/
│   │   ├── staging/         # Bronze → Silver
│   │   ├── intermediate/    # Silver processing
│   │   └── marts/           # Gold layer
│   ├── tests/               # Data quality tests
│   ├── macros/              # Reusable functions
│   └── dbt_project.yml
├── ✅ soda/
│   ├── configuration.yml
│   ├── checks.yml
│   └── data_source.yml
├── 📜 scripts/
│   ├── download_data.py     # Download UCI dataset
│   ├── ingest_data.py       # Load data to MinIO
│   └── generate_data.py     # Mock data generator
├── 📝 documents/
│   └── plan.md              # Project plan
├── 📓 notebooks/            # Exploratory analysis
├── 🎯 airflow/dags/         # Workflow orchestration
└── requirements.txt         # Python dependencies
```
## 🔧 Configuration

The MinIO server is configured with:

- Endpoint: `http://localhost:9000`
- Console: `http://localhost:9001`
- Access Key: `minioadmin`
- Secret Key: `minioadmin123`
- Default buckets: `bronze`, `silver`, `gold`
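For dbt-duckdb to read these buckets, DuckDB needs matching S3 settings. The project's actual profile is not shown; the following `profiles.yml` is only a sketch (the profile name, database path, and specific `settings` values are assumptions based on the MinIO configuration above):

```yaml
lakehouse:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: data/lakehouse.duckdb
      extensions:
        - httpfs      # S3 protocol support
        - parquet
      settings:
        s3_endpoint: localhost:9000
        s3_access_key_id: minioadmin
        s3_secret_access_key: minioadmin123
        s3_use_ssl: false
        s3_url_style: path   # MinIO uses path-style URLs, not virtual-hosted
```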
The ingestion process is configured in `scripts/ingest_data.py`:

- Source: Excel file from the UCI Online Retail II dataset
- Format: Parquet (for efficient storage and query performance)
- Destination: MinIO `bronze` bucket
- Naming convention: `online_retail_ii/{sheet_name}_{timestamp}.parquet`
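The versioned key naming above can be sketched as a small helper. The exact timestamp format and sheet-name normalization are assumptions; the actual script may differ:

```python
from datetime import datetime
from typing import Optional

def bronze_key(sheet_name: str, ts: Optional[datetime] = None) -> str:
    """Build a versioned object key following the convention
    online_retail_ii/{sheet_name}_{timestamp}.parquet."""
    # Assumed timestamp format: sortable and filename-safe.
    stamp = (ts or datetime.now()).strftime("%Y%m%d_%H%M%S")
    # Normalize sheet names like "Year 2010-2011" to "year_2010-2011".
    safe = sheet_name.strip().lower().replace(" ", "_")
    return f"online_retail_ii/{safe}_{stamp}.parquet"
```

For example, `bronze_key("Year 2010-2011", datetime(2024, 1, 2, 3, 4, 5))` returns `"online_retail_ii/year_2010-2011_20240102_030405.parquet"`, so repeated ingestions of the same sheet never overwrite each other.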
## 📈 Usage

1. Download the dataset:

   ```bash
   python scripts/download_data.py
   ```

2. Ingest to MinIO:

   ```bash
   python scripts/ingest_data.py
   ```
The overall data flow:

```text
Raw Data Sources → Bronze Layer (MinIO) → Silver Layer (dbt) → Gold Layer (Analytics)
```
- Access the MinIO Console at `http://localhost:9001` to monitor data storage
- Check logs from the data ingestion scripts for any errors
- Use dbt logs to monitor transformation processes
## 📝 Documentation

- [Project Plan](documents/plan.md): Detailed project architecture and implementation phases
- Script documentation is included in each Python file
After data ingestion, you can proceed with:
- Data Transformation: Use dbt to transform Bronze → Silver layer
- Data Quality: Implement Soda Core checks
- Analytics: Build Gold layer models and dashboards
- Orchestration: Schedule pipelines with Airflow
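For the data quality step, `soda/checks.yml` could contain SodaCL checks along these lines (the dataset name and column names are assumptions based on the Online Retail II schema, not the project's actual checks):

```yaml
checks for online_retail_ii:
  - row_count > 0                    # bronze table must not be empty
  - missing_count(invoice) = 0       # every row needs an invoice id
  - missing_count(stock_code) = 0
  - duplicate_count(invoice, stock_code) = 0
```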
This project follows the principles of a modern data lakehouse architecture, enabling scalable and efficient data processing for analytics and machine learning workloads.