Product Recall Streaming Pipeline

Kafka Airflow PySpark Docker

This project combines Kafka, Airflow, Spark, PostgreSQL, and Docker to stream, process, and store product recall data.


Overview

The data pipeline consists of three main stages:

  1. Data Streaming:
    Data is initially streamed from an external API into a Kafka topic. This simulates real-time data ingestion into the system.
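The streaming task could look roughly like the sketch below, assuming the kafka-python client. The API URL, topic name, and broker address are placeholders rather than the project's actual values:

```python
# Illustrative producer for step 1: fetch records from an external API
# and publish them to a Kafka topic. Topic and broker are placeholders.
import json
from urllib.request import urlopen

TOPIC = "product_recalls"

def stream_recalls(api_url: str, bootstrap_servers: str = "kafka:9092") -> int:
    """Fetch recall records from the API and publish each one to Kafka."""
    # Imported lazily so the sketch only needs kafka-python when actually run.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    with urlopen(api_url) as resp:
        records = json.loads(resp.read())
    for record in records:
        producer.send(TOPIC, record)
    producer.flush()
    return len(records)
```

In a real deployment this would run continuously or on a schedule, polling the API and producing each new record as it arrives.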

  2. Data Processing:
    A Spark job consumes the data from the Kafka topic and processes it before saving the results into a PostgreSQL database.

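A minimal sketch of this processing stage using Spark Structured Streaming; the topic name, event schema, and PostgreSQL connection details below are assumptions, not the project's real configuration:

```python
# Illustrative Spark job for step 2: consume the Kafka topic, parse the
# JSON events, and append each micro-batch to PostgreSQL over JDBC.
def run_processing_job() -> None:
    # Imported lazily so the sketch only needs PySpark when actually run.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("recall-processor").getOrCreate()

    # Assumed shape of the JSON events on the Kafka topic.
    schema = StructType([
        StructField("product_id", StringType()),
        StructField("reason", StringType()),
        StructField("recall_date", StringType()),
    ])

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "product_recalls")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    def write_batch(batch_df, batch_id):
        # Append each micro-batch to the PostgreSQL table over JDBC.
        (batch_df.write.format("jdbc")
         .option("url", "jdbc:postgresql://postgres:5432/recalls")
         .option("dbtable", "recall_events")
         .option("user", "postgres")
         .option("password", "postgres")
         .mode("append")
         .save())

    events.writeStream.foreachBatch(write_batch).start().awaitTermination()
```

Running this requires a Spark installation with the Kafka connector and the PostgreSQL JDBC driver on the classpath (e.g. via `spark-submit --packages`).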

  3. Orchestration with Airflow:
    The entire workflow — including the Kafka streaming task and the Spark processing job — is orchestrated using Apache Airflow.

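The orchestration layer could be expressed as a DAG along these lines; the `dag_id`, task ids, script paths, and schedule are illustrative, not taken from the project:

```python
# Hypothetical Airflow DAG wiring the Kafka streaming task before the
# Spark processing job. All names and paths are placeholders.
def build_dag():
    """Return a DAG that runs the streaming task, then the Spark job."""
    # Imported lazily so the sketch only needs Airflow when actually run.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="product_recall_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # Stage 1: push API data into the Kafka topic.
        stream_to_kafka = BashOperator(
            task_id="stream_to_kafka",
            bash_command="python /opt/pipeline/stream_recalls.py",
        )
        # Stage 2: submit the Spark job that writes to PostgreSQL.
        process_with_spark = BashOperator(
            task_id="process_with_spark",
            bash_command="spark-submit /opt/pipeline/process_recalls.py",
        )
        stream_to_kafka >> process_with_spark
    return dag

# A real DAG file would expose the object at module level so the
# scheduler can discover it, e.g.: dag = build_dag()
```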

Deployment

All components are containerized and managed with Docker Compose, so the whole stack can be set up and torn down reproducibly on any machine with Docker installed.
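A compose file for this stack might look roughly like the fragment below; the image tags, ports, and credentials are placeholders, not the project's actual configuration:

```yaml
# Illustrative docker-compose.yml: one service per component of the pipeline.
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: recalls
  spark:
    image: bitnami/spark:3.5
  airflow:
    image: apache/airflow:2.9.0
    depends_on: [kafka, postgres]
    ports:
      - "8080:8080"
```

With a file like this, `docker compose up -d` brings up the brokers, database, Spark, and the Airflow webserver together on one Docker network.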