Skip to content

Latest commit

 

History

History
116 lines (89 loc) · 6.23 KB

File metadata and controls

116 lines (89 loc) · 6.23 KB

PyroCluster: Real-Time Big Data Analytics Platform

PyroCluster is a powerful, scalable, and flexible real-time analytics platform built entirely on a modern open-source data stack. It's designed to ingest, process, and visualize massive streams of data with very low latency, providing immediate insights on a live dashboard.

This project serves as a reference architecture for integrating best-in-class technologies to solve complex big data challenges.


🚀 Core Components

The cluster is built on a "best-of-breed" philosophy, where each component is chosen for its specific strengths in the data lifecycle.

  • Foundation:
    • Apache Hadoop (HDFS): Provides a robust, distributed data lake for long-term storage and batch processing.
    • Apache YARN: Acts as the cluster's brain, efficiently managing resources for all processing jobs.
  • Central Message Bus:
    • Apache Kafka: Serves as the high-throughput, fault-tolerant central nervous system, decoupling data ingestion from data processing.
  • Data Ingestion: (Choose the right tool for the source)
    • Apache NiFi: For visual, flow-based programming to manage complex data routing and transformation from diverse sources.
    • Apache Flume: For high-volume, reliable log data collection and aggregation into HDFS or Kafka.
    • Logstash: For advanced parsing and enrichment of logs and machine-generated data.
  • Real-Time Processing:
    • Apache Spark: The core processing engine. Using Spark Streaming, it performs real-time analytics, aggregations, and transformations on data from Kafka.
  • Visualization & Dashboarding:
    • Apache Superset: A modern BI tool used to create interactive, real-time dashboards by directly querying the processed data from Spark.

🏝️ System Architecture

The architecture is designed around a central Kafka bus, which enables a clean separation of concerns and massive scalability.

Data Flow: Data Sources ➡️ Ingestion (NiFi / Flume / Logstash) ➡️ Apache Kafka (Message Bus) ➡️ Apache Spark (Real-Time Processing on YARN) ➡️ Apache Superset (Live Dashboard)

                                                   ┌────────────────┐
                                                   │ Apache Superset  │
                                                   │  (Dashboard)     │
                                                   └─────────┬──────┘
                                                             │ (JDBC/ODBC)
┌─────────┐   ┌─────────────────────┐   ┌───────────┬───────┐
│           │   │  Ingestion Layer           │   │      Processing Layer      │
│ Data        ├➡️ │  (NiFi / Flume / Logstash) ├➡️ │      (on YARN Cluster)   │
│ Sources    │   │                         │   │                          │
└─────────┘   └───────────┬───────┘   └─────────┬───────┘
                                    │                                    │
                                    ┼                                    ┼
                             ┌───────└──────┘               ┌───────└──────┘
                             │ Apache Kafka  │               │  Apache Spark │
                             │ (Data Bus)    ├────────────➡️│  (Streaming)  │
                             └────────────┘               └────────────┘
                                     │
                                     │ (Cold Storage for Batch Analytics)
                                     ┼
                             ┌───────└──────┘
                             │  Hadoop HDFS  │
                             └────────────┘

✨ Features

  • Real-Time Analytics: Process and analyze data streams with latency in seconds.
  • Scalable: Built on technologies designed for horizontal scaling, capable of handling petabytes of data.
  • Decoupled: The Kafka-centric design allows components to be independently managed, updated, and scaled.
  • Flexible Ingestion: Use a variety of tools to collect data from any source, including logs, APIs, IoT devices, and databases.
  • Unified Batch & Real-Time: The architecture supports both real-time streaming (hot path) and deep batch analytics (cold path) on HDFS.
  • Interactive Visualization: Go beyond simple reports with drill-down, interactive dashboards in Apache Superset.

🚀 Getting Started

This repository contains the configuration and reference architecture for building the PyroCluster.

(This section would be populated with specific setup instructions.)

Prerequisites

  • A multi-node server environment (Cloud or On-Premise)
  • Docker and Docker Compose (for containerized deployment)
  • Ansible or similar configuration management tool (recommended for production)

Installation

  1. Clone the repository:
    git clone https://github.com/chmj/pyrocluster.git
    cd pyrocluster
  2. Configure Environment Variables:
    • Set the necessary hostnames, IP addresses, and service configurations in the provided .env file.
  3. Deploy the Cluster:
    • Follow the detailed guides in the /docs directory to deploy Hadoop, Kafka, Spark, and other services.

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue for bugs, feature requests, or improvements to the architecture.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📜 License

Distributed under the MIT License. See LICENSE for more information.


Author

  • Charles Majola
  • GitHub: @chmj