Apache Flink stream processing jobs for the Sunbird Lern platform. Each job consumes events from Kafka and processes user lifecycle operations — certificate generation, user deletion cleanup, ownership transfer, notifications, and ML workflows — writing results to YugabyteDB, Elasticsearch, and external APIs.
- Modules
- Prerequisites
- Local Development Setup
- Redis (optional)
- Cloud Storage Configuration
- Building the Docker Image
- CI/CD — GitHub Actions
## Modules

| Module | Description |
|---|---|
| `jobs-core` | Shared Flink utilities, Kafka connectors, Redis cache, serde, base config |
| `collection-certificate-generator` | Generates and registers certificates for course completions |
| `collection-cert-pre-processor` | Pre-processes certificate generation requests (integrated into certificate-generator) |
| `legacy-certificate-migrator` | Migrates legacy certificates to the new registry |
| `notification-job` | Sends notifications via email, SMS, and push |
| `notification-sdk` | Shared notification SDK (pure Java, no Flink dependency) |
| `user-deletion-cleanup` | Cleans up user data across services on account deletion |
| `user-ownership-transfer` | Transfers ownership of content and assets between users |
| `program-user-info` | Syncs program user information for ML workflows |
| `ml-user-delete` | Handles user deletion in ML services |
| `ml-transfer-ownership` | Transfers ownership in ML services |
| `jobs-distribution` | Packages all jobs into a single deployable Docker image |
## Prerequisites

Make sure these are installed before you begin:

- Java 11 — verify with `java -version`
- Maven 3.8+ — verify with `mvn -version`
- Docker Desktop — verify with `docker --version`. Allocate at least 6 GB RAM to Docker Desktop (Settings > Resources > Memory); the default 3.8 GB is not enough.
- Git — verify with `git --version`
- Registry service — must be running on port `8000`. This is required by the certificate generation jobs for certificate registry operations.
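A quick way to check that something is already listening on port 8000 (any HTTP status code means the port is in use; the exact endpoint the registry serves is not assumed here):

```sh
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000   # "000" means nothing answered on port 8000
```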
## Local Development Setup

Follow these steps in order. The full setup takes about 5 minutes.
```sh
git clone https://github.com/Sunbird-Lern/data-pipeline.git
cd data-pipeline
```

```sh
cd docker
docker compose up -d
```

This starts Elasticsearch, YugabyteDB, and Kafka.
Wait about 60 seconds for YugabyteDB to initialize. You can check progress with:

```sh
docker compose ps   # all containers should show "Up"
```
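You can also confirm Elasticsearch is responding on port 9200 (the port listed in the services table further down):

```sh
curl http://localhost:9200   # returns a small JSON document with the cluster name and version
```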
Still inside the docker/ directory, run the migration script to create the required keyspaces and tables:

```sh
./init-yugabyte.sh
```

This downloads CQL migration files from sunbird-spark-installer and executes them. By default it uses `dev` as the keyspace prefix (e.g. `dev_sunbird_courses`) and the `develop` branch.
```sh
./init-yugabyte.sh sb         # use 'sb' as keyspace prefix instead
./init-yugabyte.sh dev main   # use a different branch
```

You only need to run this once. Run it again after `docker compose down -v` (which deletes volumes).
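If you want to confirm the keyspaces exist, you can open a YCQL shell inside the YugabyteDB container. The container name and the `ycqlsh` location below are assumptions; check `docker compose ps` and adjust:

```sh
# Container name ("yugabyte") and ycqlsh being on PATH are assumptions; adjust to your compose file
docker exec -it yugabyte ycqlsh -e "DESCRIBE KEYSPACES"
```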
Create the Kafka topics for the job you plan to run:

```sh
docker exec -it kafka sh
# Inside the container:
kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic sunbirddev.issue.certificate.request
kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic sunbirddev.lms.notification.job.request
kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic sunbirddev.delete.user.feed
kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic sunbirddev.user.ownership.transfer
# Add more topics as needed for the job you are running
exit
```
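To double-check that the topics were created, list them (this assumes `kafka-topics.sh` is on the container's PATH, as the commands above suggest):

```sh
docker exec -it kafka kafka-topics.sh --list --bootstrap-server localhost:9092
```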
Go back to the repository root and build:

```sh
cd ..
mvn clean install -DskipTests
```

This takes a few minutes the first time (Maven downloads dependencies). A successful build ends with `BUILD SUCCESS`. All job jars will be in their respective `target/` directories.
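If you only need one job, you can build just that module plus the modules it depends on. A sketch using standard Maven flags, with the certificate generator (used in the examples below) as the target:

```sh
# -pl selects the module, -am ("also make") builds its dependencies such as jobs-core first
mvn clean install -DskipTests -pl lms-jobs/credential-generator/collection-certificate-generator -am
```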
To build for a specific cloud provider:

```sh
mvn clean install -DskipTests -Paws      # AWS S3
mvn clean install -DskipTests -Pgcloud   # Google Cloud Storage
```

If no profile is specified, the default build targets Azure.
### Option 1: Run on a local Flink cluster

Use this option to run the job the same way it runs in production.
1. Download and extract Flink 1.18.1:

   ```sh
   wget https://dlcdn.apache.org/flink/flink-1.18.1/flink-1.18.1-bin-scala_2.12.tgz
   tar xzf flink-1.18.1-bin-scala_2.12.tgz
   ```

2. Start the Flink cluster:

   ```sh
   cd flink-1.18.1
   ./bin/start-cluster.sh
   ```

   Verify: open http://localhost:8081 — you should see the Flink dashboard with 1 TaskManager.

3. Set the cloud storage environment variables.

4. Submit the job. Example for `collection-certificate-generator`:

   ```sh
   ./bin/flink run -m localhost:8081 \
     ../lms-jobs/credential-generator/collection-certificate-generator/target/collection-certificate-generator-1.0.0.jar
   ```

   Verify: the job should appear in the Flink dashboard at http://localhost:8081 with status `RUNNING`.

5. Produce a test event:

   ```sh
   docker exec -it kafka sh
   # Inside the container:
   kafka-console-producer.sh --bootstrap-server localhost:9092 --topic sunbirddev.issue.certificate.request
   # Type a JSON event and press Enter
   ```

   Watch the Flink task logs in the dashboard (Job > Task Managers > Logs).
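When you are done testing, you can cancel the job and stop the local cluster with the standard Flink CLI (run from the flink-1.18.1 directory):

```sh
./bin/flink list -m localhost:8081            # shows running jobs and their job IDs
./bin/flink cancel -m localhost:8081 <job-id> # replace <job-id> with the ID from the list output
./bin/stop-cluster.sh
```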
### Option 2: Run from IntelliJ

Use this option when you want to step through code with a debugger.
1. Open the project in IntelliJ (`File > Open` > select the root `pom.xml`).

2. In the job's `pom.xml` (e.g. `lms-jobs/credential-generator/collection-certificate-generator/pom.xml`), make these temporary changes. Do not commit these changes; revert them before raising a PR.

   Add `flink-clients` as a dependency:

   ```xml
   <dependency>
     <groupId>org.apache.flink</groupId>
     <artifactId>flink-clients</artifactId>
     <version>${flink.version}</version>
   </dependency>
   ```

   Comment out the `provided` scope on `flink-streaming-scala`:

   ```xml
   <dependency>
     <groupId>org.apache.flink</groupId>
     <artifactId>flink-streaming-scala_${scala.version}</artifactId>
     <version>${flink.version}</version>
     <!-- <scope>provided</scope> -->
   </dependency>
   ```

3. In the job's StreamTask file (e.g. `CertificateGeneratorStreamTask.scala`), switch to a local execution environment. Do not commit this change either.

   ```scala
   // implicit val env: StreamExecutionEnvironment = FlinkUtil.getExecutionContext(config)
   implicit val env: StreamExecutionEnvironment = StreamExecutionEnvironment.createLocalEnvironment()
   ```

4. Set the cloud storage environment variables in IntelliJ's run configuration (`Run > Edit Configurations > Environment variables`).

5. Right-click the StreamTask file > `Run` or `Debug`.

6. Produce a test event to trigger the job:

   ```sh
   docker exec -it kafka sh
   # Inside the container:
   kafka-console-producer.sh --bootstrap-server localhost:9092 --topic sunbirddev.issue.certificate.request
   # Type a JSON event and press Enter
   ```

   Watch the IntelliJ console for output.
The docker compose stack exposes these services:

| Service | URL |
|---|---|
| Elasticsearch | http://localhost:9200 |
| YugabyteDB UI | http://localhost:9001 |
| YugabyteDB (CQL) | localhost:9042 |
| Kafka | localhost:9092 |
To stop the local services:

```sh
cd docker
docker compose down      # stop containers, keep data
docker compose down -v   # stop containers and delete all data
```

## Redis (optional)

Redis is disabled by default (`redis.enabled = false` in `jobs-core/src/main/resources/base-config.conf`). Only start it if the job you are running explicitly enables it.
```sh
cd docker
docker compose --profile redis up -d
```
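To verify Redis is up, ping it through the container. The container name `redis` is an assumption; check `docker compose ps` for the actual name:

```sh
docker exec -it redis redis-cli ping   # container name is an assumption; expect PONG in reply
```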
## Cloud Storage Configuration

Cloud storage is needed for jobs that upload/download content artifacts (e.g. certificate generation). If you are only testing event processing that doesn't involve file uploads, you can skip this.

Set these environment variables before running a job:
**Azure**

```sh
export cloud_storage_type=azure
export cloud_storage_auth_type=ACCESS_KEY
export azure_storage_key=your-account-name
export azure_storage_secret=your-account-key
export azure_storage_container=your-container-name
```

**AWS**

```sh
export cloud_storage_type=aws
export cloud_storage_auth_type=ACCESS_KEY
export aws_storage_key=your-access-key-id
export aws_storage_secret=your-secret-access-key
export aws_storage_container=your-s3-bucket-name
```

**GCP**

```sh
export cloud_storage_type=gcloud
export cloud_storage_auth_type=ACCESS_KEY
export gcloud_storage_key=your-client-email
export gcloud_storage_secret=/path/to/key.json
export gcloud_storage_container=your-gcs-bucket-name
```

## Building the Docker Image

The `jobs-distribution` module packages all jobs into a single Docker image. The build is split by cloud provider — only the plugins required for that cloud are included.
**Azure (default)**

```sh
mvn clean install -DskipTests -Pazure
cd jobs-distribution && mvn package -DskipTests -Pazure && cd ..
docker build --target azure -t data-pipeline:azure jobs-distribution/
```

**GCP**

```sh
mvn clean install -DskipTests -Pgcloud
cd jobs-distribution && mvn package -DskipTests -Pgcloud && cd ..
docker build --target gcloud -t data-pipeline:gcloud jobs-distribution/
```

**AWS**

```sh
mvn clean install -DskipTests -Paws
cd jobs-distribution && mvn package -DskipTests -Paws && cd ..
docker build --target aws -t data-pipeline:aws jobs-distribution/
```

## CI/CD — GitHub Actions

The `build.yml` workflow runs on every Git tag push. It builds all modules, packages the distribution, and pushes the Docker image to a container registry.
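For example, assuming a tag named `release-1.0.0` (the actual tag naming convention depends on your release process):

```sh
git tag release-1.0.0            # hypothetical tag name
git push origin release-1.0.0    # pushing the tag triggers build.yml
```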
| Variable | Description |
|---|---|
| `CSP` | Cloud provider: `azure` (default), `gcloud`, or `aws` |
| `REGISTRY_PROVIDER` | Registry type: `azure`, `gcp`, `dockerhub`, or leave unset for GHCR |
**GitHub Container Registry (GHCR)** — default, no setup needed. Uses the built-in `GITHUB_TOKEN`.

**DockerHub**
| Secret | Example |
|---|---|
| `REGISTRY_USERNAME` | myusername |
| `REGISTRY_PASSWORD` | DockerHub password or access token |
| `REGISTRY_NAME` | docker.io |
| `REGISTRY_URL` | docker.io/myusername |
**Azure Container Registry**
| Secret | Example |
|---|---|
| `REGISTRY_USERNAME` | ACR username |
| `REGISTRY_PASSWORD` | ACR password |
| `REGISTRY_NAME` | myregistry.azurecr.io |
| `REGISTRY_URL` | myregistry.azurecr.io |
**GCP Artifact Registry**
| Secret | Example |
|---|---|
| `GCP_SERVICE_ACCOUNT_KEY` | Base64-encoded service account JSON key |
| `REGISTRY_NAME` | asia-south1-docker.pkg.dev |
| `REGISTRY_URL` | `asia-south1-docker.pkg.dev/<project>/<repo>` |
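If you use the GitHub CLI, these secrets can be set from a terminal; the values below are placeholders:

```sh
gh secret set REGISTRY_USERNAME --body "myusername"
gh secret set REGISTRY_PASSWORD --body "my-password-or-token"
```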