Product Recall Streaming Pipeline

Kafka Airflow PySpark Docker

This project combines Kafka, Airflow, Spark, PostgreSQL, and Docker to stream, process, and store product recall data.


Overview

The data pipeline consists of three main stages:

  1. Data Streaming:
    Data is initially streamed from an external API into a Kafka topic. This simulates real-time data ingestion into the system.
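The streaming task could look roughly like the sketch below, assuming the kafka-python client. The API URL, topic name, and broker address are placeholders rather than the project's actual values:

```python
# Illustrative producer for step 1: fetch records from an external API
# and publish them to a Kafka topic. Topic and broker are placeholders.
import json
from urllib.request import urlopen

TOPIC = "product_recalls"

def stream_recalls(api_url: str, bootstrap_servers: str = "kafka:9092") -> int:
    """Fetch recall records from the API and publish each one to Kafka."""
    # Imported lazily so the sketch only needs kafka-python when actually run.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    with urlopen(api_url) as resp:
        records = json.loads(resp.read())
    for record in records:
        producer.send(TOPIC, record)
    producer.flush()
    return len(records)
```

In a real deployment this would run continuously or on a schedule, polling the API and producing each new record as it arrives.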

  2. Data Processing:
    A Spark job consumes the data from the Kafka topic and processes it before saving the results into a PostgreSQL database.

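A minimal sketch of this processing stage using Spark Structured Streaming; the topic name, event schema, and PostgreSQL connection details below are assumptions, not the project's real configuration:

```python
# Illustrative Spark job for step 2: consume the Kafka topic, parse the
# JSON events, and append each micro-batch to PostgreSQL over JDBC.
def run_processing_job() -> None:
    # Imported lazily so the sketch only needs PySpark when actually run.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("recall-processor").getOrCreate()

    # Assumed shape of the JSON events on the Kafka topic.
    schema = StructType([
        StructField("product_id", StringType()),
        StructField("reason", StringType()),
        StructField("recall_date", StringType()),
    ])

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "product_recalls")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    def write_batch(batch_df, batch_id):
        # Append each micro-batch to the PostgreSQL table over JDBC.
        (batch_df.write.format("jdbc")
         .option("url", "jdbc:postgresql://postgres:5432/recalls")
         .option("dbtable", "recall_events")
         .option("user", "postgres")
         .option("password", "postgres")
         .mode("append")
         .save())

    events.writeStream.foreachBatch(write_batch).start().awaitTermination()
```

Running this requires a Spark installation with the Kafka connector and the PostgreSQL JDBC driver on the classpath (e.g. via `spark-submit --packages`).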

  3. Orchestration with Airflow:
    The entire workflow — including the Kafka streaming task and the Spark processing job — is orchestrated using Apache Airflow.

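The orchestration layer could be expressed as a DAG along these lines; the `dag_id`, task ids, script paths, and schedule are illustrative, not taken from the project:

```python
# Hypothetical Airflow DAG wiring the Kafka streaming task before the
# Spark processing job. All names and paths are placeholders.
def build_dag():
    """Return a DAG that runs the streaming task, then the Spark job."""
    # Imported lazily so the sketch only needs Airflow when actually run.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="product_recall_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # Stage 1: push API data into the Kafka topic.
        stream_to_kafka = BashOperator(
            task_id="stream_to_kafka",
            bash_command="python /opt/pipeline/stream_recalls.py",
        )
        # Stage 2: submit the Spark job that writes to PostgreSQL.
        process_with_spark = BashOperator(
            task_id="process_with_spark",
            bash_command="spark-submit /opt/pipeline/process_recalls.py",
        )
        stream_to_kafka >> process_with_spark
    return dag

# A real DAG file would expose the object at module level so the
# scheduler can discover it, e.g.: dag = build_dag()
```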

Deployment

All components are containerized and managed with Docker Compose, so the whole stack can be set up and torn down reproducibly on any machine with Docker installed.
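A compose file for this stack might look roughly like the fragment below; the image tags, ports, and credentials are placeholders, not the project's actual configuration:

```yaml
# Illustrative docker-compose.yml: one service per component of the pipeline.
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: recalls
  spark:
    image: bitnami/spark:3.5
  airflow:
    image: apache/airflow:2.9.0
    depends_on: [kafka, postgres]
    ports:
      - "8080:8080"
```

With a file like this, `docker compose up -d` brings up the brokers, database, Spark, and the Airflow webserver together on one Docker network.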