🏗️ Data Lakehouse Project

Building a Modern Data Platform with DuckDB, dbt, and MinIO

📋 Table of Contents

  • 🎯 Overview
  • 🛠️ Prerequisites
  • 🚀 Setup Instructions
  • 📊 Data Ingestion Process
  • 📁 Project Structure
  • 🔧 Configuration
  • 📈 Usage
  • 📝 Documentation
  • 🚀 Next Steps

🎯 Overview

This project implements a modern data lakehouse architecture using:

  • Object Storage: MinIO (S3-compatible) for raw data lake storage
  • Compute Engine: DuckDB for fast analytical queries
  • Transformation: dbt-duckdb for data modeling & transformation
  • Data Quality: Soda Core for automated quality checks
  • BI Tool: Apache Superset / Metabase for data visualization

The project follows the Medallion Architecture with three layers:

  • 🥉 Bronze Layer: Raw, unprocessed data from sources
  • 🥈 Silver Layer: Validated, standardized, and enriched data
  • 🥇 Gold Layer: Business-ready analytics and reporting

🛠️ Prerequisites

  • Docker and Docker Compose
  • Python 3.8+
  • pip package manager

🚀 Setup Instructions

1. Clone the Repository

git clone <repository-url>
cd lakehouse-project

2. Install Python Dependencies

pip install -r requirements.txt

3. Start MinIO Server

docker-compose -f docker/docker-compose.yml up -d

Wait for MinIO to be ready (approximately 30 seconds), then access the MinIO Console at http://localhost:9001 with the following credentials (a small readiness-check sketch follows the list):

  • Access Key: minioadmin
  • Secret Key: minioadmin123
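
If you prefer to script the wait instead of sleeping a fixed 30 seconds, a minimal readiness poll could look like the sketch below. It assumes MinIO's documented liveness endpoint (/minio/health/live) and the requests package, neither of which is part of this repository's scripts; adjust the URL if your port mapping differs.

import time
import requests  # assumed to be installed; not necessarily listed in requirements.txt

MINIO_HEALTH_URL = "http://localhost:9000/minio/health/live"  # MinIO liveness endpoint

def wait_for_minio(timeout_seconds: int = 60) -> bool:
    """Poll MinIO until it reports live or the timeout expires."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            if requests.get(MINIO_HEALTH_URL, timeout=2).status_code == 200:
                return True
        except requests.exceptions.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(2)
    return False

if __name__ == "__main__":
    print("MinIO ready" if wait_for_minio() else "MinIO did not become ready in time")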

4. Verify MinIO Buckets

After startup, MinIO will automatically create the following buckets (a quick verification sketch follows the list):

  • bronze - Raw data storage
  • silver - Cleaned data storage
  • gold - Analytics data storage
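
Besides checking the MinIO Console, the buckets can be verified programmatically. This is a minimal sketch assuming the minio Python SDK and the default credentials listed in this README; it is illustrative and not one of the project's scripts.

from minio import Minio  # MinIO Python SDK

# Connection details taken from the Configuration section of this README
client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin123",
    secure=False,  # the local docker-compose setup serves plain HTTP
)

for bucket in ("bronze", "silver", "gold"):
    status = "exists" if client.bucket_exists(bucket) else "MISSING"
    print(f"{bucket}: {status}")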

📊 Data Ingestion Process

Step 1: Download UCI Online Retail II Dataset

python scripts/download_data.py

This script will:

  • Download the UCI Online Retail II dataset from the official source
  • Extract the Excel file
  • Display information about the dataset structure (a rough sketch of these steps appears below)
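
A hedged sketch of that download step is shown below. The UCI download URL, file paths, and function names are assumptions for illustration and may differ from what scripts/download_data.py actually does; it assumes requests, pandas, and openpyxl are installed.

from pathlib import Path

import pandas as pd
import requests

# Assumed location of the UCI Online Retail II workbook; the real script may
# use a different URL or download a zipped archive instead.
DATASET_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx"
LOCAL_PATH = Path("data/online_retail_II.xlsx")

def download_dataset() -> Path:
    """Download the workbook if it is not already present locally."""
    LOCAL_PATH.parent.mkdir(parents=True, exist_ok=True)
    if not LOCAL_PATH.exists():
        response = requests.get(DATASET_URL, timeout=120)
        response.raise_for_status()
        LOCAL_PATH.write_bytes(response.content)
    return LOCAL_PATH

def describe_dataset(path: Path) -> None:
    """Print each sheet name and its column layout."""
    workbook = pd.ExcelFile(path)  # needs openpyxl for .xlsx files
    for sheet in workbook.sheet_names:
        sample = workbook.parse(sheet, nrows=5)
        print(f"Sheet '{sheet}': columns = {list(sample.columns)}")

if __name__ == "__main__":
    describe_dataset(download_dataset())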

Step 2: Ingest Data into MinIO

python scripts/ingest_data.py

This script will:

  • Load the Excel file with multiple sheets
  • Convert each sheet to Parquet format
  • Upload the Parquet files to the bronze bucket in MinIO
  • Add a timestamp to each filename for versioning (a rough sketch of these steps appears below)
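
The ingestion flow could look roughly like the sketch below. It assumes the minio and pandas packages (with pyarrow) and the bucket, credential, and naming-convention values stated in this README; the function and variable names are illustrative and not the actual contents of scripts/ingest_data.py.

import io
from datetime import datetime

import pandas as pd
from minio import Minio

BUCKET = "bronze"
client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin123", secure=False)

def ingest_workbook(path: str = "data/online_retail_II.xlsx") -> None:
    """Convert each Excel sheet to Parquet and upload it to the bronze bucket."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    workbook = pd.ExcelFile(path)
    for sheet in workbook.sheet_names:
        df = workbook.parse(sheet)
        buffer = io.BytesIO()
        df.to_parquet(buffer, index=False)  # uses the pyarrow engine
        buffer.seek(0)
        # Naming convention from the Configuration section of this README
        key = f"online_retail_ii/{sheet}_{timestamp}.parquet"
        client.put_object(BUCKET, key, buffer, length=buffer.getbuffer().nbytes)
        print(f"Uploaded s3://{BUCKET}/{key} ({len(df)} rows)")

if __name__ == "__main__":
    ingest_workbook()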

📁 Project Structure

lakehouse-project/
├── 🐳 docker/
│   ├── docker-compose.yml
│   └── minio/
├── 💾 data/
│   ├── bronze/          # Raw data
│   ├── silver/          # Cleaned data
│   └── gold/            # Analytics data
├── 🔄 dbt/
│   ├── models/
│   │   ├── staging/     # Bronze → Silver
│   │   ├── intermediate/ # Silver processing
│   │   └── marts/       # Gold layer
│   ├── tests/           # Data quality tests
│   ├── macros/          # Reusable functions
│   └── dbt_project.yml
├── ✅ soda/
│   ├── configuration.yml
│   ├── checks.yml
│   └── data_source.yml
├── 📜 scripts/
│   ├── download_data.py # Download UCI dataset
│   ├── ingest_data.py   # Load data to MinIO
│   └── generate_data.py # Mock data generator
├── 📝 documents/
│   └── plan.md          # Project plan
├── 📓 notebooks/        # Exploratory analysis
├── 🎯 airflow/dags/     # Workflow orchestration
└── requirements.txt     # Python dependencies

🔧 Configuration

MinIO Configuration

The MinIO server is configured with:

  • Endpoint: http://localhost:9000
  • Console: http://localhost:9001
  • Access Key: minioadmin
  • Secret Key: minioadmin123
  • Default buckets: bronze, silver, gold

Data Ingestion Configuration

The ingestion process is configured in scripts/ingest_data.py:

  • Source: Excel file from UCI Online Retail II dataset
  • Format: Parquet (for efficient storage and query performance)
  • Destination: MinIO bronze bucket
  • Naming convention: online_retail_ii/{sheet_name}_{timestamp}.parquet

📈 Usage

Manual Data Ingestion

  1. Download the dataset: python scripts/download_data.py
  2. Ingest to MinIO: python scripts/ingest_data.py

Data Pipeline Flow

Raw Data Sources → Bronze Layer (MinIO) → Silver Layer (dbt) → Gold Layer (Analytics)
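
To illustrate the compute side of this flow, the sketch below shows how DuckDB can query bronze-layer Parquet files directly from MinIO over the S3 API. The endpoint and credentials come from the Configuration section above; the wildcard object path and the Country column are assumptions about how the ingested Online Retail II files are laid out.

import duckdb

con = duckdb.connect()
# httpfs gives DuckDB S3-compatible access; point it at the local MinIO server
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_endpoint='localhost:9000'")
con.execute("SET s3_access_key_id='minioadmin'")
con.execute("SET s3_secret_access_key='minioadmin123'")
con.execute("SET s3_use_ssl=false")
con.execute("SET s3_url_style='path'")  # MinIO uses path-style URLs

# Assumed layout: every ingested sheet lands under online_retail_ii/ in bronze
result = con.execute("""
    SELECT Country, COUNT(*) AS invoice_lines
    FROM read_parquet('s3://bronze/online_retail_ii/*.parquet')
    GROUP BY Country
    ORDER BY invoice_lines DESC
    LIMIT 10
""").fetchdf()
print(result)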

Monitoring

  • Access MinIO Console at http://localhost:9001 to monitor data storage
  • Check logs from data ingestion scripts for any errors
  • Use dbt logs to monitor transformation processes

📝 Documentation

  • Project Plan (documents/plan.md) - Detailed project architecture and implementation phases
  • Script documentation is included in each Python file

🚀 Next Steps

After data ingestion, you can proceed with:

  1. Data Transformation: Use dbt to transform Bronze → Silver layer
  2. Data Quality: Implement Soda Core checks
  3. Analytics: Build Gold layer models and dashboards
  4. Orchestration: Schedule pipelines with Airflow (a minimal DAG sketch follows this list)
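
As a starting point for step 4, a minimal Airflow DAG could chain the existing scripts and a dbt run. This is a sketch assuming Airflow 2.x and a configured dbt-duckdb profile; the dag_id, schedule, and bash commands are illustrative, not an existing DAG in airflow/dags/.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="lakehouse_ingestion",          # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                     # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    download = BashOperator(
        task_id="download_data",
        bash_command="python scripts/download_data.py",
    )
    ingest = BashOperator(
        task_id="ingest_to_bronze",
        bash_command="python scripts/ingest_data.py",
    )
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="cd dbt && dbt run",  # assumes a working dbt-duckdb profile
    )

    download >> ingest >> transform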

This project follows the principles of a modern data lakehouse architecture, enabling scalable and efficient data processing for analytics and machine learning workloads.
