A production-grade stream processing system inspired by Apache Flink, implementing exactly-once semantics, fault tolerance, and high-throughput data processing with Python.
Uditanshu Tomar (Uditanshu.tomar@colorado.edu), Ishneet Chadha (Ishneet.chadha@colorado.edu)
- Docker & Docker Compose
- Python 3.9+
- Google Cloud SDK (only for GCP deployment)
- kubectl (only for GCP deployment)
The easiest way to run the platform is using Docker Compose.
- Navigate to the deployment directory:
cd deployment
- Start the cluster:
docker-compose up -d
- Access the Dashboard: open http://localhost:5000 in your browser.
- Verify cluster health:
curl http://localhost:8081/cluster/metrics
- Stop the cluster:
docker-compose down
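After `docker-compose up -d`, the services take a moment to come up. The health-check endpoint above can be polled from Python until the JobManager answers; a small sketch using only the standard library (the URL matches the endpoint above, the retry counts are illustrative):

```python
import time
import urllib.error
import urllib.request


def cluster_is_up(url: str = "http://localhost:8081/cluster/metrics",
                  timeout: float = 2.0) -> bool:
    """Return True if the JobManager metrics endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_cluster(url: str = "http://localhost:8081/cluster/metrics",
                     retries: int = 30, delay: float = 2.0) -> bool:
    """Poll the endpoint until it responds, or give up after `retries` attempts."""
    for _ in range(retries):
        if cluster_is_up(url):
            return True
        time.sleep(delay)
    return False
```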
Deploy the platform to a Google Kubernetes Engine cluster.
- Configure your GCP project:
export GCP_PROJECT_ID="your-project-id"
gcloud config set project $GCP_PROJECT_ID
- Run the deployment script. It sets up GKE, builds the images, and deploys all services:
./deploy_to_gcp.sh
- Access the services:
# Get the external IP of the GUI
kubectl get svc -n stream-processing gui
- Go to the Dashboard (http://localhost:5000).
- Click "Start Demo" in the "Control Panel".
- Watch real-time metrics update as the DemoWeatherProcessing job runs.
- See data flowing in the "Live Data Stream" panel.
You can submit custom jobs written in Python.
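The operator logic inside such a job is ordinary Python. As a rough sketch of what the word-count example computes (independent of the platform's actual job API, which `examples/word_count.py` uses), the pipeline amounts to a flat-map followed by a keyed aggregation:

```python
from collections import Counter
from typing import Iterable


def tokenize(line: str) -> Iterable[str]:
    """Flat-map step: split a line into lowercase words."""
    return line.lower().split()


def word_count(lines: Iterable[str]) -> Counter:
    """Keyed aggregation step: count occurrences per word."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


counts = word_count(["to be or not to be", "that is the question"])
# counts["to"] == 2, counts["be"] == 2
```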
Example: Word Count
# 1. Generate the job file
python examples/word_count.py
# 2. Submit to the cluster
curl -X POST http://localhost:8081/jobs/submit \
  -F "job_file=@word_count_job.pkl"
Monitor the job:
# Check status
curl http://localhost:8081/jobs/{job_id}/status
- JobManager (Master): Coordinates execution, manages resources, and handles checkpoints.
- TaskManager (Worker): Executes tasks in parallel slots.
- Kafka: Handles data ingestion and inter-operator communication.
- gRPC: Used for internal control plane communication.
- RocksDB: Embedded state backend for stateful operations.
- GCS/S3: Distributed storage for fault-tolerance checkpoints.
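The TaskManager's slot model can be approximated as a bounded worker pool: each slot executes one task at a time, so at most `TASK_SLOTS` tasks run concurrently. A minimal sketch (the slot count mirrors the default configuration; the task body is a stand-in, not the actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

TASK_SLOTS = 4  # mirrors the TASK_SLOTS default in deployment/docker-compose.yml


def run_task(record: int) -> int:
    """Stand-in for an operator task: square the incoming record."""
    return record * record


def execute(records):
    """Run tasks with at most TASK_SLOTS executing concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=TASK_SLOTS) as pool:
        return list(pool.map(run_task, records))
```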
- Exactly-Once Processing: Distributed snapshots (Chandy-Lamport).
- Fault Tolerance: Automatic failure recovery.
- High Throughput: Operator chaining & flow control.
- Stateful Operations: Windowing, Aggregations, Joins.
- Observability: Prometheus metrics & Grafana dashboards.
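Exactly-once via distributed snapshots rests on barrier alignment: an operator with several input channels holds back records from channels whose checkpoint barrier has already arrived, and only snapshots its state once the barrier has been seen on every input. A simplified single-operator sketch of that mechanism (class and method names are illustrative, not the platform's API):

```python
class AlignedOperator:
    """Aligns checkpoint barriers across input channels (Chandy-Lamport style)."""

    def __init__(self, channels):
        self.channels = set(channels)
        self.blocked = set()   # channels whose barrier has already arrived
        self.buffered = []     # records held back during alignment
        self.state = 0         # toy operator state: a running sum
        self.snapshots = []    # completed (checkpoint_id, state) snapshots

    def on_record(self, channel, value):
        if channel in self.blocked:
            self.buffered.append((channel, value))  # hold until the snapshot completes
        else:
            self.state += value

    def on_barrier(self, channel, checkpoint_id):
        self.blocked.add(channel)
        if self.blocked == self.channels:
            # Barrier seen on every input: snapshot, then replay buffered records.
            self.snapshots.append((checkpoint_id, self.state))
            self.blocked.clear()
            pending, self.buffered = self.buffered, []
            for ch, value in pending:
                self.on_record(ch, value)
```

For example, a record arriving on an already-blocked channel is excluded from the snapshot and applied only afterwards, which is what gives each checkpoint a consistent cut of the stream.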
stream-processing-platform/
├── jobmanager/ # Control Plane (Scheduler, API)
├── taskmanager/ # Data Plane (Execution, State)
├── common/ # Shared Utils (Proto, Config)
├── gui/ # Web Dashboard
├── examples/ # Example Jobs
├── deployment/ # Docker & K8s Configs
└── scripts/ # Deployment Scripts
Key environment variables in deployment/docker-compose.yml:
- TASK_SLOTS: Number of concurrent tasks per TaskManager (Default: 4).
- CHECKPOINT_INTERVAL: Frequency of checkpoints in ms (Default: 10000).
- STATE_BACKEND: rocksdb or memory.
- GCS_CHECKPOINT_PATH: GCS bucket for checkpoints.
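Inside the containers, settings like these would typically be read from the environment with the documented defaults applied. A sketch of that pattern (the variable names and defaults come from the list above; the function itself is illustrative):

```python
import os


def load_config(env=os.environ):
    """Read platform settings from the environment, falling back to the documented defaults."""
    return {
        "task_slots": int(env.get("TASK_SLOTS", "4")),
        "checkpoint_interval_ms": int(env.get("CHECKPOINT_INTERVAL", "10000")),
        "state_backend": env.get("STATE_BACKEND", "rocksdb"),
        "gcs_checkpoint_path": env.get("GCS_CHECKPOINT_PATH"),  # no default; needed only for GCS checkpoints
    }
```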
- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
Built with: Python, FastAPI, gRPC, Kafka, RocksDB, Docker, Kubernetes.