Modern Open-Source Data Platform

A complete, production-ready data platform built with Infrastructure as Code principles. This platform runs entirely on AWS EC2 instances (no managed services) and provides a comprehensive suite of data tools for ingestion, processing, analytics, and governance.

πŸ—οΈ Architecture Overview

The platform consists of multiple Kubernetes micro-clusters, each dedicated to specific services:

  • Airflow Cluster - Workflow orchestration with dbt integration
  • Airbyte Cluster - Data integration and ELT pipelines
  • ClickHouse Cluster - High-performance analytical database
  • Trino Cluster - Distributed SQL query engine
  • Apache Superset Cluster - Business intelligence and visualization
  • Apache Ranger Cluster - Data governance and access control
  • Central Monitoring VM - Prometheus, Grafana, Loki, and Marquez (OpenLineage)

Key Features

✅ Latest Stable Versions - All services use the most recent stable releases
✅ GitOps Deployment - Managed via Argo CD app-of-apps pattern
✅ Secrets Management - HashiCorp Vault with Kubernetes CSI integration
✅ Comprehensive Monitoring - Metrics, logs, and data lineage tracking
✅ Security First - Network policies, RBAC, resource quotas
✅ Multi-AZ Deployment - High availability across availability zones

🚀 Quick Start

Prerequisites

  • AWS CLI configured with appropriate permissions
  • Terraform/OpenTofu >= 1.6
  • kubectl >= 1.28
  • Helm >= 3.12
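
A quick way to confirm the tooling and credentials are in place before starting (the reported versions should meet the minimums above):

# Verify CLI tooling versions
aws --version
terraform version         # or: tofu version
kubectl version --client
helm version

# Confirm AWS credentials are configured
aws sts get-caller-identity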

Step 1: Infrastructure Deployment

# Clone the repository
git clone <your-repo-url>
cd DataPlatform

# Configure Terraform variables
cd infra/environments/prod
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your AWS configuration

# Deploy infrastructure
terraform init
terraform plan
terraform apply

Step 2: Configure kubectl

# Configure kubectl for each cluster
aws eks update-kubeconfig --region us-east-1 --name airflow-cluster
aws eks update-kubeconfig --region us-east-1 --name airbyte-cluster
aws eks update-kubeconfig --region us-east-1 --name clickhouse-cluster
aws eks update-kubeconfig --region us-east-1 --name trino-cluster
aws eks update-kubeconfig --region us-east-1 --name superset-cluster
aws eks update-kubeconfig --region us-east-1 --name ranger-cluster

Step 3: Deploy Argo CD

# Install Argo CD on the management cluster
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Apply the app-of-apps
kubectl apply -f platform/argocd/app-of-apps.yaml
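
Once the app-of-apps is applied, Argo CD creates one Application per platform service. A quick check that everything registered and is syncing:

# List Argo CD Applications with their sync and health status
kubectl get applications -n argocd

# Watch Argo CD's own pods come up
kubectl get pods -n argocd -w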

Step 4: Access Services

# Get Argo CD admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

# Port forward to access UIs (or configure ingress)
kubectl port-forward svc/argocd-server -n argocd 8080:443
kubectl port-forward svc/airflow-webserver -n airflow 8081:8080
kubectl port-forward svc/airbyte-webapp-svc -n airbyte 8082:80
kubectl port-forward svc/superset -n superset 8083:8088
kubectl port-forward svc/grafana -n monitoring 8084:80

📋 Service Details

Core Data Services

| Service | Version | Purpose | Access |
|---|---|---|---|
| Airflow | 2.8.1+ | Workflow orchestration, dbt execution | :8080 |
| Airbyte | 0.50.33+ | Data integration and ingestion | :80 |
| ClickHouse | 23.12+ | Analytical database | :8123 / :9000 |
| Trino | 435+ | Distributed SQL engine | :8080 |
| Apache Superset | 3.1.0+ | BI and visualization | :8088 |
| Apache Ranger | 2.4.0+ | Data governance | :6080 |

Observability Stack

| Service | Version | Purpose | Access |
|---|---|---|---|
| Prometheus | 2.48.1+ | Metrics collection | :9090 |
| Grafana | 10.2.3+ | Visualization & dashboards | :3000 |
| Loki | 2.9.4+ | Log aggregation | :3100 |
| Marquez | 0.43.0+ | Data lineage (OpenLineage) | :5000 |

Infrastructure Components

| Component | Purpose |
|---|---|
| HashiCorp Vault | Secrets and credential management |
| Argo CD | GitOps continuous deployment |
| Nginx Ingress | Load balancing and TLS termination |

🔧 Resource Sizing

Per-Cluster Configuration

Each micro-cluster follows the 1 master + 2 workers pattern:

Airflow Cluster

  • Master: t3.large (2 vCPU, 8 GB RAM, 50 GB storage)
  • Workers: m5.2xlarge (8 vCPU, 32 GB RAM, 100 GB storage) × 2

Airbyte Cluster

  • Master: t3.large (2 vCPU, 8 GB RAM, 50 GB storage)
  • Workers: m5.2xlarge (8 vCPU, 32 GB RAM, 150 GB storage) × 2

ClickHouse Cluster

  • Master: t3.large (2 vCPU, 8 GB RAM, 50 GB storage)
  • Workers: m5.2xlarge (8 vCPU, 32 GB RAM, 200 GB storage) × 2

Trino Cluster

  • Master: t3.large (2 vCPU, 8 GB RAM, 50 GB storage)
  • Workers: m5.2xlarge (8 vCPU, 32 GB RAM, 100 GB storage) × 2

Superset Cluster

  • Master: t3.large (2 vCPU, 8 GB RAM, 50 GB storage)
  • Workers: m5.large (2 vCPU, 8 GB RAM, 50 GB storage) × 2

Ranger Cluster

  • Master: t3.large (2 vCPU, 8 GB RAM, 50 GB storage)
  • Workers: m5.large (2 vCPU, 8 GB RAM, 50 GB storage) × 2

Central Monitoring VM

  • Instance: m5.4xlarge (16 vCPU, 64 GB RAM, 500 GB storage)

🔐 Security & Access Control

Network Security

  • VPC: Dedicated VPC with public/private subnets across 3 AZs
  • Security Groups: Least-privilege access between services
  • Network Policies: Kubernetes-native traffic control

Authentication & Authorization

  • Vault Integration: All secrets managed via HashiCorp Vault
  • RBAC: Role-based access control for all services
  • TLS Everywhere: End-to-end encryption for all communications
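
As a minimal sketch of how a credential might be inspected once Vault is unsealed (the service name, mount point, and secret path below are illustrative, not necessarily this repository's actual layout):

# Port-forward to Vault and read a secret from a KV v2 mount (names are examples)
kubectl port-forward svc/vault -n vault 8200:8200 &
export VAULT_ADDR=http://127.0.0.1:8200
vault login                                   # token with read access
vault kv get secret/data-platform/airflow    # example path, adjust to your layout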

Data Governance

  • Apache Ranger: Fine-grained access policies for ClickHouse and Trino
  • Audit Logging: Comprehensive audit trails via Loki
  • Resource Quotas: Kubernetes resource limits per namespace

📊 Monitoring & Observability

Metrics (Prometheus + Grafana)

  • Infrastructure metrics (CPU, memory, disk, network)
  • Application metrics (query performance, DAG runs, etc.)
  • Custom dashboards for each service
  • Alerting via AlertManager
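
As a small example, the Prometheus HTTP API can be queried directly once it is reachable (via the monitoring VM's address on :9090, or a local port-forward); the `up` metric is a quick check that all scrape targets are healthy:

# Check scrape-target health through the Prometheus HTTP API
# (replace <prometheus-host> with the monitoring VM address or a forwarded localhost port)
curl -s "http://<prometheus-host>:9090/api/v1/query?query=up" | jq .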

Logging (Loki + Promtail)

  • Centralized log collection from all clusters
  • Structured logging with proper labels
  • Log-based alerting for error patterns
  • 14-day retention with tiered storage
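
Logs can be pulled back out of Loki with a LogQL query over its HTTP API on :3100; the label selector below is illustrative and depends on the labels Promtail attaches in this deployment:

# Fetch recent error lines from the airflow namespace via Loki's query_range API
# (label names and values are examples; adjust to the labels Promtail actually ships)
curl -s -G "http://<loki-host>:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="airflow"} |= "ERROR"' \
  --data-urlencode 'limit=20' | jq .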

Data Lineage (Marquez/OpenLineage)

  • Automatic lineage tracking for dbt transformations
  • Trino query lineage capture
  • Airflow job dependency mapping
  • REST API for lineage queries
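
Lineage metadata is exposed through Marquez's REST API on :5000; for example (the namespace name here is illustrative):

# List lineage namespaces known to Marquez
curl -s "http://<marquez-host>:5000/api/v1/namespaces" | jq .

# List jobs captured in a given namespace (namespace name is an example)
curl -s "http://<marquez-host>:5000/api/v1/namespaces/airflow/jobs" | jq .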

🔄 Data Flow Architecture

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Sources   │ -> │   Airbyte   │ -> │   Airflow   │ -> │ ClickHouse  │ -> │   Trino     │
│   (Various) │    │ (Ingestion) │    │ + dbt (ELT) │    │ (Storage)   │    │ (Query)     │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                                                                   │
┌─────────────┐    ┌─────────────┐                                                 │
│   Ranger    │ <- │  Superset   │ <-----------------------------------------------┘
│(Governance) │    │ (BI/Viz)    │
└─────────────┘    └─────────────┘

Typical Workflow

  1. Data Ingestion: Airbyte extracts data from various sources
  2. Orchestration: Airflow orchestrates ELT pipeline jobs
  3. Transformation: dbt transforms raw data in ClickHouse
  4. Analytics: Trino provides SQL interface across data sources
  5. Visualization: Superset creates dashboards and reports
  6. Governance: Ranger enforces access policies
  7. Monitoring: Full observability across all components
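
To see the flow end to end, an ad-hoc query can be issued against Trino once data has landed in ClickHouse. A minimal sketch, assuming the coordinator runs as a deployment named trino-coordinator and a clickhouse catalog is configured (both names are assumptions):

# Run an ad-hoc query through the Trino CLI inside the coordinator pod
kubectl exec -it deploy/trino-coordinator -n trino -- \
  trino --execute "SHOW SCHEMAS FROM clickhouse"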

🛠️ Development & Customization

Adding New Data Sources

  1. Create the source connection in Airbyte
  2. Add catalog configuration in Trino
  3. Update Ranger policies
  4. Configure monitoring dashboards
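
For step 2, a Trino catalog is a small properties file; a hedged sketch of a ClickHouse catalog is shown below. In this repository the catalog would more likely be added through the Helm values under services/trino/ and synced by Argo CD, and the hostname and credentials here are placeholders:

# Sketch of a ClickHouse catalog definition for Trino (all values are placeholders)
cat > clickhouse.properties <<'EOF'
connector.name=clickhouse
connection-url=jdbc:clickhouse://<clickhouse-host>:8123
connection-user=<user>
connection-password=<password>
EOF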

Scaling Considerations

  • Horizontal: Add more worker nodes to existing clusters
  • Vertical: Increase instance sizes in infra/sizing.yaml
  • Cross-Region: Extend Terraform modules for multi-region
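
A scaling change is applied the same way as the initial deployment; for example, after editing the worker definitions in infra/sizing.yaml:

# Re-plan and apply after changing node counts or instance sizes
cd infra/environments/prod
terraform plan     # review the nodes being added or resized
terraform apply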

Backup & Recovery

  • Database Backups: Automated daily backups to S3
  • Configuration Backups: GitOps ensures configuration is version-controlled
  • Disaster Recovery: Multi-AZ deployment provides high availability
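
As one illustration of the database backup path, recent ClickHouse releases can write backups directly to S3. The pod, table, bucket, and credentials below are placeholders, and the repository's actual backup jobs may be wired differently:

# Ad-hoc ClickHouse backup of a table to S3 (names and credentials are placeholders)
kubectl exec -it clickhouse-0 -n clickhouse -- clickhouse-client --query \
  "BACKUP TABLE analytics.events TO S3('https://<bucket>.s3.amazonaws.com/backups/events', '<access-key>', '<secret-key>')"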

📁 Repository Structure

DataPlatform/
├── infra/                         # Terraform infrastructure code
│   ├── modules/
│   │   ├── k8s-cluster/           # Kubernetes cluster module
│   │   └── monitoring-vm/         # Central monitoring VM module
│   ├── environments/prod/         # Production environment
│   └── sizing.yaml                # Resource sizing configuration
├── k8s/                           # Kubernetes base configurations
│   ├── base/                      # Base security configurations
│   └── overlays/                  # Environment-specific overlays
├── platform/                      # Platform services
│   ├── argocd/                    # GitOps configuration
│   └── vault/                     # Secrets management
├── services/                      # Application configurations
│   ├── airflow/                   # Airflow + dbt configuration
│   ├── clickhouse/                # ClickHouse configuration
│   ├── trino/                     # Trino configuration
│   ├── superset/                  # Superset configuration
│   └── ranger/                    # Ranger configuration
└── observability/                 # Monitoring stack
    ├── monitoring/                # Prometheus + Grafana
    ├── logging/                   # Loki + Promtail
    └── lineage/                   # Marquez (OpenLineage)

🚨 Troubleshooting

Common Issues

1. Terraform Apply Fails

# Check AWS credentials
aws sts get-caller-identity

# Verify quotas
aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A

2. Pods Not Starting

# Check node resources
kubectl top nodes

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Check resource quotas
kubectl describe quota -n <namespace>

3. Services Not Accessible

# Check service endpoints
kubectl get endpoints -n <namespace>

# Check network policies
kubectl get networkpolicy -n <namespace>

# Test connectivity
kubectl run debug --image=nicolaka/netshoot -it --rm

Health Checks

# Infrastructure health
terraform plan -detailed-exitcode

# Application health  
kubectl get pods --all-namespaces
helm list --all-namespaces

# Service connectivity
curl -f http://<service-endpoint>/health

📚 Additional Resources

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make changes and test thoroughly
  4. Submit a pull request with detailed description

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


⚠️ Important Notes:

  • All service versions are set to latest stable releases available at deployment time
  • Update version tags in Helm values before deployment for newest releases
  • Monitor resource usage and scale accordingly for production workloads
  • Ensure proper backup procedures are in place before production use
  • Review and customize security policies based on your organization's requirements
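
Because deployment is GitOps-driven, pinning or bumping a service version is a matter of editing the relevant Helm values in Git and letting Argo CD sync the change. A minimal sketch, assuming a values file under services/airflow/ with an image-tag field (the file name and YAML key depend on the chart in use):

# Pin a service version in Git; Argo CD rolls out the change on sync
# (services/airflow/ exists in this repo; the values.yaml name and key are assumptions)
$EDITOR services/airflow/values.yaml   # set the desired image/chart tag
git add services/airflow/values.yaml
git commit -m "Pin Airflow image tag"
git push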
