Open-source Data Platform


System Architecture

The platform follows a Lakehouse design pattern, utilizing a decoupled storage and compute stack built entirely on industry-standard open-source technologies.

1. Data Platform (Core)

  • Object Storage: MinIO — High-performance, S3-compatible storage for structured and unstructured data.
  • Catalog: Project Nessie — Provides Git-like version control (branching, tagging, merging) for the data lake.
  • Table Format: Apache Iceberg — High-performance open table format enabling schema evolution and time travel.
  • Query Engine: Trino — Distributed SQL engine for high-concurrency federated queries and data consumption.

2. ELT Pipeline

  • Orchestration: Apache Airflow — Manages workflow scheduling, dependency graphs, and monitoring.
  • Transformation: dbt (data build tool) — Handles SQL-based transformations directly within Trino using the dbt-trino adapter.

3. Streaming Pipeline

  • Event Streaming: Apache Kafka — High-throughput distributed messaging system for real-time data ingestion.
  • Stream Processing: Apache Flink — Stateful computations over data streams for real-time windowing and aggregations.

4. User Interface

  • Discovery & Analysis: JupyterLab — Interactive environment for data exploration, visualization, and SQL-based analysis.

Design Philosophy & Technical Justification

This architecture is built entirely on open-source technologies, making it ideal for on-premise infrastructure while remaining cloud-agnostic. The entire stack is containerized and orchestrated via Kubernetes, ensuring portability and scalability.

1. Storage: MinIO

MinIO was selected as the foundational storage layer. It provides high-performance, S3-compatible object storage for all data types (flat files, unstructured data, etc.). By using MinIO, we gain the cost efficiency of commodity object storage while maintaining full control over the infrastructure, avoiding the egress fees and vendor lock-in of public cloud providers.

2. Catalog: Project Nessie

For the data lakehouse catalog, I chose Nessie. Unlike Polaris or Unity Catalog, Nessie provides Git-like version control over the entire data lake. This enables:

  • WAP (Write-Audit-Publish) patterns: Isolate changes in a branch before merging to production.
  • Atomic Commits: Ensure data consistency across multiple tables.
  • Note: While Nessie excels at versioning, advanced data governance (RBAC/ABAC) would require integration with Open Policy Agent (OPA) or Apache Ranger as the platform matures.
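The WAP flow above can be sketched in plain Python. This is an illustrative model only: `BranchingCatalog` stands in for Nessie, and in the real platform the branch would be created and merged through the Nessie API or Trino session properties.

```python
# Illustrative model of the Write-Audit-Publish (WAP) pattern enabled by
# Nessie-style branching. Names here are hypothetical, not the repo's code.

class BranchingCatalog:
    def __init__(self):
        self.branches = {"main": {}}  # branch name -> {table: rows}

    def create_branch(self, name, from_branch="main"):
        self.branches[name] = dict(self.branches[from_branch])

    def write(self, branch, table, rows):
        self.branches[branch][table] = rows

    def merge(self, source, target="main"):
        # Atomic publish: everything written on the branch lands together.
        self.branches[target].update(self.branches[source])
        del self.branches[source]


def write_audit_publish(catalog, table, rows, audit):
    catalog.create_branch("etl-wap")
    catalog.write("etl-wap", table, rows)               # Write: isolated
    if not audit(catalog.branches["etl-wap"][table]):   # Audit: quality gate
        del catalog.branches["etl-wap"]                 # discard bad data
        return False
    catalog.merge("etl-wap")                            # Publish: atomic merge
    return True
```

Consumers reading `main` never observe the intermediate, unaudited state — that is the core guarantee the branch provides.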

3. Table Format: Apache Iceberg

Apache Iceberg is the core open-table format for this platform. It provides critical features such as schema evolution, time travel, and hidden partitioning. While Delta Lake is a strong alternative, Iceberg was the optimal choice here due to its first-class integration with Trino and its native REST catalog support, which aligns perfectly with the Nessie ecosystem.
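Time travel is easiest to understand through Iceberg's snapshot model: every commit produces a new immutable snapshot, and any past snapshot remains queryable. A toy model for intuition (tracking rows directly, where real Iceberg tracks data files in manifests):

```python
# Minimal model of the snapshot mechanism behind Iceberg time travel.
# Not Iceberg's API — just the bookkeeping idea.

class SnapshotTable:
    def __init__(self):
        self.snapshots = [[]]           # snapshot 0: empty table

    def append(self, rows):
        self.snapshots.append(self.snapshots[-1] + rows)
        return len(self.snapshots) - 1  # id of the new snapshot

    def scan(self, snapshot_id=None):
        # No id -> current state; an id -> the table as of that commit.
        return self.snapshots[-1 if snapshot_id is None else snapshot_id]
```

Schema evolution works analogously: each snapshot also pins a schema version, so old snapshots stay readable with the schema they were written under.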

4. Query Engine: Trino

Trino serves as the primary compute engine because of its massive scale-out capabilities and SQL-first approach, which is familiar to most data consumers.

  • Horizontal Scaling: Utilizing Trino’s coordinator-worker architecture, I implemented KEDA (Kubernetes Event-driven Autoscaling) triggered by Prometheus metrics. This allows the cluster to scale workers dynamically based on query load, optimizing performance while minimizing idle costs.
  • Performance Optimization: To handle trillions of records efficiently, the architecture advocates for Aggregated Materialized Tables. This ensures consumers query summarized datasets rather than raw partitions whenever possible.
  • Modern Features: Trino’s native AI functions allow users to run inference and data enrichment directly within SQL queries.
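The autoscaling behaviour reduces to simple proportional math. The sketch below mirrors what a KEDA Prometheus trigger computes; the metric (queued queries) and the per-worker threshold are illustrative assumptions, not the actual values in keda.yaml.

```python
import math

def desired_workers(queued_queries: int, queries_per_worker: int = 5,
                    min_replicas: int = 1, max_replicas: int = 10) -> int:
    # KEDA scales replicas toward ceil(metric / threshold), clamped to the
    # configured bounds; here the assumed metric is Trino's queued-query count.
    wanted = math.ceil(queued_queries / queries_per_worker)
    return max(min_replicas, min(max_replicas, wanted))
```

So a burst of 12 queued queries would scale the worker deployment to 3 replicas, and an idle cluster falls back to the minimum.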

5. Infrastructure: Kind & Prometheus

For local development and testing, I used Kind (Kubernetes in Docker). While Prometheus is currently leveraged for Trino's autoscaling, it also provides the foundation for full-cluster observability, monitoring resource health across the entire ELT and streaming stack.

Prerequisites

Before deploying the platform, ensure the following dependencies are installed on your local machine or Cloud VPS.

  • Docker & Docker Compose — Container runtime and multi-container orchestration.
  • Kind — Kubernetes-in-Docker for local cluster management.
  • kubectl — Command-line tool for interacting with the Kubernetes cluster.
  • Helm — Package manager for Kubernetes (used for Prometheus, KEDA, and Airflow).
  • Python 3.10+ — Required for local script execution and pipeline management (uv is recommended).

Deployment Guide

This section covers the initialization of the local Kubernetes cluster and the deployment of the core Data Platform components.


1. Infrastructure Setup (Kind)

Bootstrap the local environment by creating a Kubernetes cluster named dp.

# Create the cluster
kind create cluster --name dp --config kind-config

# Verify cluster status
kubectl get nodes
kubectl config current-context

2. Core Storage & Catalog

Deploy MinIO (S3-compatible storage) and Nessie (Iceberg Catalog) via Kubernetes manifests.

cd minio_nessie
kubectl apply -f minio-storage.yaml

3. Distributed Query Engine (Trino)

To support the Iceberg connector and custom configurations, we build a tailored Trino image and sideload it into the Kind nodes.

# Build custom Trino image
cd trino_iceberg/trino_custom
docker build -t trino-custom:v1 .

# Load image into Kind
kind load docker-image trino-custom:v1 --name dp

# Deploy Trino components
cd ../
kubectl apply -f trino-configmap.yaml
kubectl apply -f trino-deployment.yaml

4. Observability & Autoscaling (Prometheus + KEDA)

To implement cost-efficient compute, we use Prometheus to monitor load and KEDA to scale Trino workers dynamically.

Step A: Install Prometheus Operator

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kind-prometheus prometheus-community/kube-prometheus-stack --namespace data-platform

# Apply PodMonitor for Trino workers
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
kubectl apply -f monitor.yaml

Step B: Install KEDA

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace data-platform

# Apply the ScaledObject for Trino
kubectl apply -f keda.yaml

5. Port Forwarding

Since this is a local setup without an Ingress Controller, use port-forwarding to access the services from your host machine.

# Trino Coordinator
kubectl port-forward svc/trino 8080:8080 -n data-platform

# Nessie Catalog
kubectl port-forward svc/catalog 19120:19120 -n data-platform

# MinIO Storage
kubectl port-forward svc/storage 9000:9000 -n data-platform

The core platform is now operational.

ELT Pipeline

This section implements a robust Extract-Load-Transform (ELT) workflow using a modern data engineering stack to process the NYC Yellow Taxi Trips dataset.

Workflow Design

The pipeline is designed to handle the monthly release cycle of the NYC TLC data, running automatically on the 1st of every month:

  1. Extract & Load: Parquet source files are fetched and converted into PyArrow tables for memory-efficient handling.
  2. Ingestion: Data is pushed to the Iceberg Bronze layer via the REST catalog using the PyIceberg library.
  3. Transformation: Using dbt-trino, data is processed through a Medallion Architecture (Bronze → Silver → Gold) to ensure data quality and provide aggregated, analytics-ready tables for consumers.
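Because the TLC publishes each month's file with a delay, a run on the 1st typically has to target an earlier month. A helper in the spirit of the ingestion script (the base URL is the TLC's public CloudFront host, and the one-month lag is an assumption; the repo's actual script may differ):

```python
from datetime import date

# Assumed public host for NYC TLC trip data.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"

def source_url(run_date: date, lag_months: int = 1) -> str:
    # Files for month M appear with a publication delay, so the run on the
    # 1st targets an earlier month (the exact lag is an assumption here).
    months = run_date.year * 12 + run_date.month - 1 - lag_months
    year, month = divmod(months, 12)
    return f"{BASE_URL}/yellow_tripdata_{year}-{month + 1:02d}.parquet"
```

The Airflow DAG passes its logical date into this kind of helper, fetches the Parquet file, and hands the resulting PyArrow table to PyIceberg for the Bronze append.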

Orchestration

Apache Airflow serves as the control plane for this pipeline, managing task dependencies and retries.

  • Ingestion Script: airflow_dbt/dags/scripts/ingest_nyc_yellow_trips.py
  • Airflow DAG: airflow_dbt/dags/nyc_yellow_trips_dag.py
  • dbt Project: airflow_dbt/dags/dbt_trino

Environment Setup

Navigate to the airflow_dbt directory to initialize the pipeline components.

1. Build & Load Airflow Image

We use a custom image to include the necessary Python providers and dbt dependencies.

docker build -t airflow-kind:v1 .
kind load docker-image airflow-kind:v1 --name dp

2. Deploy Airflow via Helm

helm repo add apache-airflow https://airflow.apache.org
helm repo update

helm install airflow-kind apache-airflow/airflow \
  --namespace data-platform \
  --set images.airflow.repository=airflow-kind \
  --set images.airflow.tag=v1

3. Accessing the UI

Port-forward the Airflow webserver to access the management console locally.

kubectl port-forward -n data-platform svc/airflow-kind-api-server 9999:8080

Note: For testing, manually trigger the DAG for any date in 2025 to verify the end-to-end integration.


Streaming Pipeline

This section implements a real-time data stream utilizing Apache Kafka for event ingestion and Apache Flink for stateful stream processing, with Apache Iceberg serving as the final analytical sink.

Workflow Design

The pipeline simulates a typical IoT scenario where temperature sensors emit real-time events:

  1. Ingestion: A Python-based Kafka producer pushes sensor data (ID and Temperature) to the sensor-topic.
  2. Processing: A Flink job consumes the stream and performs a 1-minute tumbling-window aggregation, calculating the average temperature per sensor id.
  3. Sink: The processed window results are committed directly into an Iceberg table on our data platform, making them immediately available for SQL queries.
  • Flink Job: kafka_flink/flink_job.py
  • Producer: kafka_flink/kafka_producer/
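The windowed aggregation in step 2 reduces to the following math. This pure-Python sketch mirrors only the computation; the actual job (kafka_flink/flink_job.py) expresses it with Flink's window API and emits results to Iceberg.

```python
from collections import defaultdict

def tumbling_avg(events, window_ms=60_000):
    # events: iterable of (sensor_id, temperature, epoch_ms).
    # Each event falls into exactly one window: ts // window_ms.
    buckets = defaultdict(list)
    for sensor_id, temp, ts in events:
        buckets[(sensor_id, ts // window_ms)].append(temp)
    # One average per (sensor, window-start) pair.
    return {
        (sensor_id, window * window_ms): sum(temps) / len(temps)
        for (sensor_id, window), temps in buckets.items()
    }
```

Unlike this batch sketch, Flink maintains the per-window state incrementally and fires each window when its minute of event time closes.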

Environment Setup

Navigate to the kafka_flink directory to initialize the streaming infrastructure.

1. Deploy Kafka (via Strimzi)

We use the Strimzi Operator to manage Kafka on Kubernetes efficiently.

# Install Strimzi Operator
kubectl create namespace kafka
kubectl create -f strimzi.yaml -n kafka

# Deploy Kafka Cluster and Topic
kubectl apply -f kafka-deployment.yaml
kubectl create -f kafka-topic.yaml

2. Deploy Flink Operator

# Install cert-manager (prerequisite for Flink Operator)
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.18.2/cert-manager.yaml

# Install Flink Operator via Helm
helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-1.14.0/
helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator --namespace kafka

3. Build & Deploy Flink Job

Since Flink requires specific Iceberg and Kafka connectors, we build a custom image.

# Build and sideload the Flink image (Note: This is a large image)
docker build -t flink-iceberg:v1 .
kind load docker-image flink-iceberg:v1 --name dp

# Submit the Flink deployment
kubectl apply -f flink-deployment.yaml

4. Run IoT Sensor Producer

cd kafka_producer
docker build -t sensor-producer:latest .
kind load docker-image sensor-producer:latest --name dp
kubectl apply -f kafka-producer.yaml
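For reference, a producer event might look like the sketch below; the field names are assumptions, not necessarily those used in kafka_producer/. The real producer serializes JSON of this shape to sensor-topic via a Kafka client.

```python
import json
import random
import time

def make_event(sensor_id: str) -> str:
    # One simulated sensor reading, serialized as the Kafka message value.
    return json.dumps({
        "sensor_id": sensor_id,
        "temperature": round(random.uniform(15.0, 35.0), 2),
        "ts": int(time.time() * 1000),  # event time in epoch millis
    })
```

Keying messages by sensor_id would keep each sensor's readings ordered within a partition, which is what the per-sensor windowing downstream relies on.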

Monitoring

You can monitor the health of the streaming job via the Flink Dashboard.

kubectl port-forward -n kafka svc/flink-job-rest 8081:8081


User Interface & Data Exploration

The primary interface for data consumers is JupyterLab, chosen for its ubiquity and its support for interactive SQL- and Python-based analysis.

Local Environment Setup

Navigate to the _dev folder. We use uv for fast, deterministic Python dependency management.

# Sync project dependencies
uv sync

# Launch Jupyter Lab
uv run --with jupyter jupyter lab

Once the browser opens, navigate to the exploration notebook:

_dev/notebooks/User_Interface.ipynb


Unified Data Discovery

The notebook demonstrates federated access to both batch and streaming datasets stored in our Apache Iceberg tables.

1. NYC Yellow Taxi Exploration (Batch ELT)

Query the "Gold" layer tables processed via Airflow and dbt. You can perform complex analytical queries and visualize trip trends using standard Python libraries.


2. IoT Sensor Analysis (Real-time Streaming)

Monitor the incoming data stream processed by Flink. Since Flink sinks data into Iceberg, you can query real-time windowed averages using standard SQL via the Trino connector.



Resource Cleanup

To tear down the entire infrastructure, including the Kind cluster and all deployed services (MinIO, Nessie, Trino, Kafka, Flink, and Airflow), run:

kind delete cluster --name dp
