The platform follows a Lakehouse design pattern, utilizing a decoupled storage and compute stack built entirely on industry-standard open-source technologies.
- Object Storage: MinIO — High-performance, S3-compatible storage for structured and unstructured data.
- Catalog: Project Nessie — Provides Git-like version control (branching, tagging, merging) for the data lake.
- Table Format: Apache Iceberg — High-performance open table format enabling schema evolution and time travel.
- Query Engine: Trino — Distributed SQL engine for high-concurrency federated queries and data consumption.
- Orchestration: Apache Airflow — Manages workflow scheduling, dependency graphs, and monitoring.
- Transformation: dbt (data build tool) — Handles SQL-based transformations directly within Trino using the `dbt-trino` adapter.
- Event Streaming: Apache Kafka — High-throughput distributed messaging system for real-time data ingestion.
- Stream Processing: Apache Flink — Stateful computations over data streams for real-time windowing and aggregations.
- Discovery & Analysis: Jupyter Notebook — Interactive environment for data exploration, visualization, and SQL-based analysis.
Because every component is open source, the platform is well suited to on-premise infrastructure while remaining cloud-agnostic. The entire stack is containerized and orchestrated via Kubernetes, ensuring portability and scalability.
MinIO was selected as the foundational storage layer. It provides high-performance, S3-compatible object storage for all data types (flat files, unstructured data, etc.). By using MinIO, we leverage the cost-efficiencies of modern storage while maintaining full control over the infrastructure, avoiding the egress fees and lock-in of public cloud providers.
For the data lakehouse catalog, I chose Nessie. Unlike Polaris or Unity Catalog, Nessie offers Git-like version control over the data lake, which enables:
- WAP (Write-Audit-Publish) patterns: Isolate changes in a branch before merging to production.
- Atomic Commits: Ensure data consistency across multiple tables.
- Note: While Nessie excels at versioning, advanced data governance (RBAC/ABAC) would require integration with Open Policy Agent (OPA) or Apache Ranger as the platform matures.
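To make the Write-Audit-Publish idea concrete, here is a toy, in-memory model of branched table state — purely conceptual, since real WAP happens through Nessie branches over Iceberg tables, not Python dictionaries:

```python
# Toy model of Write-Audit-Publish: branches are snapshots of table state.
# Purely illustrative -- real WAP uses Nessie branches over Iceberg tables.

def create_branch(branches, name, from_branch="main"):
    branches[name] = dict(branches[from_branch])  # snapshot of current state

def audit(table):
    # Example audit rule: no null trip distances
    return all(row.get("distance") is not None for row in table)

def merge(branches, source, target="main"):
    branches[target] = dict(branches[source])

branches = {"main": {"trips": [{"distance": 1.2}]}}

# WRITE: land new data on an isolated branch
create_branch(branches, "etl")
branches["etl"]["trips"] = branches["etl"]["trips"] + [{"distance": 3.4}]

# AUDIT: validate on the branch; main is untouched until checks pass
assert audit(branches["etl"]["trips"])
assert len(branches["main"]["trips"]) == 1

# PUBLISH: an atomic merge makes the data visible to consumers
merge(branches, "etl")
print(len(branches["main"]["trips"]))  # 2
```

The key property mirrored here is isolation: until the merge, consumers reading `main` never observe the half-finished write.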
Apache Iceberg is the core open-table format for this platform. It provides critical features such as schema evolution, time travel, and hidden partitioning. While Delta Lake is a strong alternative, Iceberg was the optimal choice here due to its first-class integration with Trino and its native REST catalog support, which aligns perfectly with the Nessie ecosystem.
Trino serves as the primary compute engine because of its massive scale-out capabilities and SQL-first approach, which is familiar to most data consumers.
- Horizontal Scaling: Utilizing Trino’s coordinator-worker architecture, I implemented KEDA (Kubernetes Event-driven Autoscaling) triggered by Prometheus metrics. This allows the cluster to scale workers dynamically based on query load, optimizing performance while minimizing idle costs.
- Performance Optimization: To handle trillions of records efficiently, the architecture advocates for Aggregated Materialized Tables. This ensures consumers query summarized datasets rather than raw partitions whenever possible.
- Modern Features: Trino’s native AI functions allow users to run inference and data enrichment directly within SQL queries.
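The rationale for aggregated materialized tables can be shown with a small standard-library sketch: pre-computing a daily summary once during the ELT run turns every consumer query from a scan over raw records into a lookup over a handful of rows (the row shapes below are invented for illustration):

```python
from collections import defaultdict

# Raw "partition": one record per trip (in reality, billions of rows).
raw_trips = [
    {"day": "2025-01-01", "fare": 12.5},
    {"day": "2025-01-01", "fare": 7.0},
    {"day": "2025-01-02", "fare": 20.0},
]

# Build the aggregated "materialized table" once, during the ELT run.
daily_summary = defaultdict(lambda: {"trips": 0, "total_fare": 0.0})
for trip in raw_trips:
    row = daily_summary[trip["day"]]
    row["trips"] += 1
    row["total_fare"] += trip["fare"]

# Consumers hit the summary (one row per day) instead of the raw trips.
print(daily_summary["2025-01-01"])  # {'trips': 2, 'total_fare': 19.5}
```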
For local development and testing, I used Kind (Kubernetes in Docker). While Prometheus is currently leveraged for Trino's autoscaling, it also provides the foundation for full-cluster observability, monitoring resource health across the entire ELT and streaming stack.
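To make the KEDA wiring concrete, a Prometheus-triggered ScaledObject for the Trino workers might look roughly like the following. This is a hypothetical sketch, not the repo's actual `keda.yaml` — the deployment name, Prometheus service address, metric query, and threshold are all illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: trino-worker-scaler          # hypothetical name
  namespace: data-platform
spec:
  scaleTargetRef:
    name: trino-worker               # assumed worker Deployment name
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.data-platform:9090   # illustrative address
        query: sum(trino_running_queries)                     # illustrative metric
        threshold: "3"
```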
Before deploying the platform, ensure the following dependencies are installed on your local machine or Cloud VPS.
| Tool | Purpose |
|---|---|
| Docker & Docker Compose | Container runtime and multi-container orchestration. |
| Kind | Kubernetes-in-Docker for local cluster management. |
| kubectl | Command-line tool for interacting with the Kubernetes cluster. |
| Helm | Package manager for Kubernetes (used for Prometheus, KEDA, and Airflow). |
| Python 3.10+ | Required for local script execution and pipeline management (recommend using uv). |
This section covers the initialization of the local Kubernetes cluster and the deployment of the core Data Platform components.
Bootstrap the local environment by creating a Kubernetes cluster named dp.
```bash
# Create the cluster
kind create cluster --name dp --config kind-config

# Verify cluster status
kubectl get nodes
kubectl config current-context
```

Deploy MinIO (S3-compatible storage) and Nessie (Iceberg catalog) via Kubernetes manifests.
```bash
cd minio_nessie
kubectl apply -f minio-storage.yaml
```

To support the Iceberg connector and custom configurations, we build a tailored Trino image and sideload it into the Kind nodes.
```bash
# Build custom Trino image
cd trino_iceberg/trino_custom
docker build -t trino-custom:v1 .

# Load image into Kind
kind load docker-image trino-custom:v1 --name dp

# Deploy Trino components
cd ../
kubectl apply -f trino-configmap.yaml
kubectl apply -f trino-deployment.yaml
```

To implement cost-efficient compute, we use Prometheus to monitor load and KEDA to scale Trino workers dynamically.
Step A: Install Prometheus Operator

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kind-prometheus prometheus-community/kube-prometheus-stack --namespace data-platform

# Apply PodMonitor for Trino workers
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
kubectl apply -f monitor.yaml
```

Step B: Install KEDA
```bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace data-platform

# Apply the ScaledObject for Trino
kubectl apply -f keda.yaml
```

Since this is a local setup without an Ingress Controller, use port-forwarding to access the services from your host machine.
```bash
# Trino Coordinator
kubectl port-forward svc/trino 8080:8080 -n data-platform

# Nessie Catalog
kubectl port-forward svc/catalog 19120:19120 -n data-platform

# MinIO Storage
kubectl port-forward svc/storage 9000:9000 -n data-platform
```

The core platform is now operational.
This section implements a robust Extract-Load-Transform (ELT) workflow using a modern data engineering stack to process the NYC Yellow Taxi Trips dataset.
The pipeline is designed to handle the monthly release cycle of the NYC TLC data, running automatically on the 1st of every month:
- Extract & Load: Parquet source files are fetched and converted into `PyArrow` tables for memory-efficient handling.
- Ingestion: Data is pushed to the Iceberg Bronze layer via the REST catalog using the `PyIceberg` library.
- Transformation: Using `dbt-trino`, data is processed through a Medallion Architecture (Bronze → Silver → Gold) to ensure data quality and provide aggregated, analytics-ready tables for consumers.
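As a sketch of the scheduling logic, a run on the 1st of a month would typically load the previous month's file; deriving that partition name from the logical run date needs only the standard library. The `yellow_tripdata_YYYY-MM.parquet` naming pattern is an assumption for illustration, not necessarily what the ingestion script uses:

```python
from datetime import date

def source_file_for_run(run_date: date) -> str:
    """Derive the previous month's Parquet file name for a monthly run.

    The yellow_tripdata_YYYY-MM.parquet pattern is assumed here for
    illustration; see the actual ingestion script for the real logic.
    """
    first_of_month = run_date.replace(day=1)
    # Step back one day from the 1st to land in the previous month.
    prev_month_end = date.fromordinal(first_of_month.toordinal() - 1)
    return f"yellow_tripdata_{prev_month_end:%Y-%m}.parquet"

print(source_file_for_run(date(2025, 3, 1)))  # yellow_tripdata_2025-02.parquet
```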
Apache Airflow serves as the control plane for this pipeline, managing task dependencies and retries.
- Ingestion Script: `airflow_dbt/dags/scripts/ingest_nyc_yellow_trips.py`
- Airflow DAG: `airflow_dbt/dags/nyc_yellow_trips_dag.py`
- dbt Project: `airflow_dbt/dags/dbt_trino`
Navigate to the `airflow_dbt` directory to initialize the pipeline components.
We use a custom image to include the necessary Python providers and dbt dependencies.
```bash
# Build and sideload the custom Airflow image
docker build -t airflow-kind:v1 .
kind load docker-image airflow-kind:v1 --name dp

# Install Airflow via Helm, pointing it at the custom image
helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm install airflow-kind apache-airflow/airflow \
  --namespace data-platform \
  --set images.airflow.repository=airflow-kind \
  --set images.airflow.tag=v1
```

Port-forward the Airflow webserver to access the management console locally.
```bash
kubectl port-forward -n data-platform svc/airflow-kind-api-server 9999:8080
```

- URL: http://localhost:9999/
- Credentials: `admin` / `admin`
Note: For testing, manually trigger the DAG for any date in 2025 to verify the end-to-end integration.
This section implements a real-time data stream utilizing Apache Kafka for event ingestion and Apache Flink for stateful stream processing, with Apache Iceberg serving as the final analytical sink.
The pipeline simulates a typical IoT scenario where temperature sensors emit real-time events:
- Ingestion: A Python-based Kafka producer pushes sensor data (ID and temperature) to the `sensor-topic`.
- Processing: A Flink job consumes the stream and performs a 1-minute Tumbling Window aggregation, calculating the average temperature per `sensor-id`.
- Sink: The processed window results are committed directly into an Iceberg table on our data platform, making them immediately available for SQL queries.
- Flink Job: `kafka_flink/flink_job.py`
- Producer: `kafka_flink/kafka_producer/`
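A minimal sketch of the event payload such a producer might emit, using only the standard library (field names are assumed for illustration; the actual Kafka send via a client library is omitted — see `kafka_flink/kafka_producer/` for the real producer):

```python
import json
import random
import time

def make_sensor_event(sensor_id: str) -> bytes:
    """Build one JSON-encoded sensor event.

    Field names are illustrative assumptions, not necessarily those
    used by the actual producer in kafka_flink/kafka_producer/.
    """
    event = {
        "sensor_id": sensor_id,
        "temperature": round(random.uniform(15.0, 35.0), 2),
        "ts": int(time.time() * 1000),  # epoch milliseconds
    }
    return json.dumps(event).encode("utf-8")

payload = make_sensor_event("s1")
decoded = json.loads(payload)
print(decoded["sensor_id"])  # s1
```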
Navigate to the `kafka_flink` directory to initialize the streaming infrastructure.
We use the Strimzi Operator to manage Kafka on Kubernetes efficiently.
```bash
# Install Strimzi Operator
kubectl create namespace kafka
kubectl create -f strimzi.yaml -n kafka

# Deploy Kafka Cluster and Topic
kubectl apply -f kafka-deployment.yaml
kubectl create -f kafka-topic.yaml
```

```bash
# Install cert-manager (prerequisite for Flink Operator)
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.18.2/cert-manager.yaml

# Install Flink Operator via Helm
helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-1.14.0/
helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator --namespace kafka
```

Since Flink requires specific Iceberg and Kafka connectors, we build a custom image.
```bash
# Build and sideload the Flink image (note: this is a large image)
docker build -t flink-iceberg:v1 .
kind load docker-image flink-iceberg:v1 --name dp

# Submit the Flink deployment
kubectl apply -f flink-deployment.yaml
```

```bash
cd kafka_producer
docker build -t sensor-producer:latest .
kind load docker-image sensor-producer:latest --name dp
kubectl apply -f kafka-producer.yaml
```

You can monitor the health of the streaming job via the Flink Dashboard.

```bash
kubectl port-forward -n kafka svc/flink-job-rest 8081:8081
```

The primary interface for data consumers is Jupyter Lab, chosen for its ubiquity and intuitive support for interactive SQL and Python-based analysis.
Navigate to the `_dev` folder. We use `uv` for fast, deterministic Python dependency management.
```bash
# Sync project dependencies
uv sync

# Launch Jupyter Lab
uv run --with jupyter jupyter lab
```
Once the browser opens, navigate to the exploration notebook:
`_dev/notebooks/User_Interface.ipynb`
The notebook demonstrates federated access to both batch and streaming datasets stored in our Apache Iceberg tables.
Query the "Gold" layer tables processed via Airflow and dbt. You can perform complex analytical queries and visualize trip trends using standard Python libraries.
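The shape of such a Gold-layer query can be sketched with a stand-in SQLite database. On the platform itself this SQL would run against Trino (e.g., via a DB-API client from the notebook); the table and column names below are invented for illustration:

```python
import sqlite3

# Stand-in for the Gold layer: a small, pre-aggregated daily summary table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gold_daily_trips (day TEXT, trips INTEGER, avg_fare REAL)"
)
conn.executemany(
    "INSERT INTO gold_daily_trips VALUES (?, ?, ?)",
    [("2025-01-01", 1200, 18.4), ("2025-01-02", 1350, 17.9)],
)

# The kind of analytical query a consumer would run against the Gold layer:
row = conn.execute(
    "SELECT day, trips FROM gold_daily_trips ORDER BY trips DESC LIMIT 1"
).fetchone()
print(row)  # ('2025-01-02', 1350)
```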
Monitor the incoming data stream processed by Flink. Since Flink sinks data into Iceberg, you can query real-time windowed averages using standard SQL via the Trino connector.
To tear down the entire infrastructure, including the Kind cluster and all deployed services (MinIO, Nessie, Trino, Kafka, Flink, and Airflow), run:
```bash
kind delete cluster --name dp
```




