Stream Analytics

Streaming data is now everywhere, and processing it has become a very important issue for big data:

  • We need to ingest a lot of streaming data into big data platforms for later analytics. Big data databases and data lakes treat streaming data ingestion as an important feature, for example Hudi Delta Streamer and Druid Kafka Ingestion.
  • We need to analyze streaming data on the fly, that is, analytics of big data in motion. Examples include analyzing IoT data, real-time logs, and customers' e-commerce transactions.

Although streaming analytics has a long history, dealing with big streaming data is challenging, especially when analytics must complete in under a second for millions of requests.

Key concepts

Key concepts in streaming analytics include:

  • Data connectors: how can we obtain streaming data? Will we use connectors/libraries via standard protocols like MQTT and AMQP? Or will we use powerful, advanced, sometimes all-inclusive message brokers/pubsub systems like Apache Kafka, Apache Pulsar, or Amazon Kinesis?
  • Window analytics: streaming analytics is often based on a window of data. A window can be defined by length, by time, or in other ways. Furthermore, data can be selected through keys. Which types of windows are suitable? How do we define them? Given a window of data, what kind of analytics can we apply to it? The answers very often depend on specific requirements and many experiments.
  • Which engines can we use for executing stream analytics? How do such engines work with existing distributed computing resources to enable fast, reliable stream analytics?
  • How do we deal with message delays and with faults of processing components? How do we ensure that we do not process the same message twice?
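
To make the window and duplicate-handling concepts above more concrete, here is a minimal sketch in plain Python, not tied to any particular engine. It groups events by key into fixed-size (tumbling) time windows and skips messages whose id has already been seen. The `Event` type, its field names, and the `tumbling_window_sums` function are hypothetical names introduced only for this illustration; a real stream processing engine would manage windows, state, and fault tolerance for you.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Event(NamedTuple):
    """Hypothetical message shape, assumed only for this sketch."""
    event_id: str     # unique message id, used to detect redelivery
    key: str          # e.g. a sensor id or a customer id
    timestamp: float  # event time in seconds
    value: float


def tumbling_window_sums(
    events: Iterable[Event], window_seconds: float
) -> Dict[Tuple[str, float], float]:
    """Sum values per (key, window start), ignoring redelivered events."""
    # Assumption: the set of seen ids fits in memory. On an unbounded
    # stream this state would have to be expired or stored externally.
    seen_ids = set()
    sums = defaultdict(float)
    for event in events:
        if event.event_id in seen_ids:
            continue  # duplicate delivery: do not process it twice
        seen_ids.add(event.event_id)
        # Align the event time to the start of its tumbling window.
        window_start = (event.timestamp // window_seconds) * window_seconds
        sums[(event.key, window_start)] += event.value
    return dict(sums)


if __name__ == "__main__":
    sample = [
        Event("e1", "sensor-a", 0.5, 1.0),
        Event("e2", "sensor-a", 3.0, 2.0),
        Event("e3", "sensor-b", 4.9, 10.0),
        Event("e2", "sensor-a", 3.0, 2.0),  # redelivered message
        Event("e4", "sensor-a", 5.1, 4.0),
    ]
    for (key, start), total in sorted(tumbling_window_sums(sample, 5.0).items()):
        print(key, start, total)
```

With a 5-second window, the redelivered `e2` is counted only once, and `e4` falls into the next window for `sensor-a`. This sketch assumes events are assigned to windows purely by their own timestamps and never closes a window, so it sidesteps the harder question of how long to wait for delayed messages.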

Some paths for study

Understanding data from messaging systems

Large-scale messaging systems for big data are complex. To integrate streaming analytics with them, one should be familiar with several such systems: