Post on 22-May-2020

Distributed systems for stream processing

Apache Kafka and Spark Structured Streaming

Alena Hall - @lenadroid

✓ High Scale & Big Data

✓ Distributed Systems

✓ Functional Programming

✓ Data Science & Machine Learning

Ever-increasing data

Direct result of some action

Produced as a side effect

Continuous metrics

Reaction

urgent: real-time, ~ sub-milliseconds

not-so-urgent: near-real-time, ~ seconds

flexible: batch, ~ minutes, hours, days, weeks

Event → Ingestion → Processing & Reaction

real-time, micro-batch, batch

Are data workflows flexible enough?

Challenges

Simplicity. Scalability. Reliability

Meet Apache Kafka

Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java.

Kafka Brokers

Inside of a Kafka Topic

0 1 2 3 4

0 1 2 3

0 1 2 3 4 5 6 7 8

Kafka Topic Partition

0 1 2 3 4 5 6 7 8

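A partition can be pictured as an append-only sequence in which each record gets the next offset. The following is a minimal in-memory sketch of that idea, not Kafka's API: `PartitionLog`, `append`, and `read` are hypothetical names chosen for illustration, and real partitions are persisted and replicated by brokers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PartitionLog:
    """Toy model of one topic partition: an append-only list of records."""

    records: List[str] = field(default_factory=list)

    def append(self, value: str) -> int:
        """Append a record and return the offset it was written at."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[str]:
        """Return up to max_records records starting at the given offset."""
        return self.records[offset:offset + max_records]


# Nine appends produce offsets 0 through 8, in write order.
log = PartitionLog()
offsets = [log.append(f"event-{i}") for i in range(9)]
latest = log.read(7)  # records stored at offsets 7 and 8
```

Reading never removes records in this model; a reader only chooses the offset to start from.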

Create a Kafka topic

Kafka Producers and Consumers

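The division of labour between producers and consumers can be sketched with a toy in-memory topic. This is an illustration under assumptions, not the Kafka client API: `ToyTopic` and `ToyConsumer` are hypothetical classes, and CRC32 is used here only as a deterministic stand-in for a key hash.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


class ToyTopic:
    """In-memory stand-in for a topic with a fixed number of partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Tuple[str, str]]] = [
            [] for _ in range(num_partitions)
        ]

    def send(self, key: str, value: str) -> Tuple[int, int]:
        """Producer side: route by key hash, append, return (partition, offset)."""
        # Assumption: CRC32 stands in for the hash a real partitioner would use.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


class ToyConsumer:
    """Consumer side: remembers the next offset to read for each partition."""

    def __init__(self, topic: ToyTopic) -> None:
        self.topic = topic
        self.positions: Dict[int, int] = defaultdict(int)

    def poll(self) -> List[Tuple[str, str]]:
        """Return all records not yet seen and advance the stored positions."""
        batch: List[Tuple[str, str]] = []
        for index, records in enumerate(self.topic.partitions):
            start = self.positions[index]
            batch.extend(records[start:])
            self.positions[index] = len(records)
        return batch


topic = ToyTopic(num_partitions=3)
consumer = ToyConsumer(topic)
topic.send("slack", "hello")
topic.send("slack", "again")
first_batch = consumer.poll()   # both records, in send order (same key, same partition)
second_batch = consumer.poll()  # nothing new since the last poll
```

Because the consumer keeps its own position, a second consumer over the same topic would read the same records independently.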

Systems for stream processing

Kafka

Storm

Spark

Flink

Meet Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing: batch, streaming, machine learning, and graph computation, with access to data in hundreds of sources.

✓ Spark SQL and batch processing

✓ Stream processing with Spark Streaming and Structured Streaming

✓ Continuous processing (experimental)

✓ Machine learning with MLlib

✓ Graph computations with GraphX

Spark program

How does Spark work?

Spark application (Driver)

Cluster Manager

Tasks (nine shown in the diagram)

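The driver-and-tasks picture can be approximated on one machine. In this sketch a thread pool stands in for the resources a cluster manager would grant, and each task works on one slice of the data; the function names, the nine partitions, and the three workers are assumptions made for illustration, not Spark's API or defaults.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    """Split data into num_partitions slices by taking every n-th element."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def task(partition: List[int]) -> int:
    """Work done by one task: sum the squares of its own partition."""
    return sum(x * x for x in partition)


def run_job(data: List[int], num_partitions: int = 9, workers: int = 3) -> int:
    """Driver side: create one task per partition, run them, combine results."""
    partitions = split_into_partitions(data, num_partitions)
    # Assumption: the thread pool stands in for workers provided by a cluster manager.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(task, partitions))
    return sum(partial_sums)


total = run_job(list(range(100)))
```

The driver only plans and combines; the per-partition work is independent, which is what lets it run in parallel.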

Apache Kafka + Apache Spark

Existing infrastructure and resources

✓ Kafka cluster (HDInsight or other)

✓ Spark cluster (Azure Databricks workspace, or other)

✓ Peered Kafka and Spark Virtual Networks

✓ Sources of data: Twitter & Slack

Databricks: Interactive Environment

Example

Processing streams of events from multiple sources with Apache Kafka and Spark

Data sources: external, internal, ...

• A large number of data sources

• Most of the data sources are independent

• Data sources are used for many processing tasks and end goals

Feedback from Slack

✓ Sending messages to Slack

Listener for new Slack messages

✓ Messages under specific channels

✓ Focused on a particular topic

✓ Sent to a specific Kafka topic

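The listener's filtering step might look roughly like the sketch below. It does not use the Slack or Kafka client libraries: `SlackMessage`, `forward_relevant`, the channel name `#feedback`, the keyword `spark`, and the Kafka topic name `slack-feedback` are all hypothetical, and `send` is any callable that accepts a topic and a value.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class SlackMessage:
    channel: str
    text: str


def forward_relevant(
    messages: Iterable[SlackMessage],
    channels: Set[str],
    keyword: str,
    send: Callable[[str, str], None],
    kafka_topic: str,
) -> int:
    """Forward messages from watched channels that mention the keyword.

    Returns the number of messages forwarded.
    """
    forwarded = 0
    for message in messages:
        in_channel = message.channel in channels
        on_topic = keyword.lower() in message.text.lower()
        if in_channel and on_topic:
            send(kafka_topic, message.text)
            forwarded += 1
    return forwarded


# A list stands in for the Kafka producer so the sketch runs without a broker.
sent: List[Tuple[str, str]] = []
incoming = [
    SlackMessage("#feedback", "Spark job finished faster today"),
    SlackMessage("#random", "Spark plugs for sale"),
    SlackMessage("#feedback", "Lunch anyone?"),
]
count = forward_relevant(
    incoming,
    channels={"#feedback"},
    keyword="spark",
    send=lambda topic, value: sent.append((topic, value)),
    kafka_topic="slack-feedback",
)
```

Only the first message passes both checks: the second is in an unwatched channel and the third does not mention the keyword.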

Receiving events in Kafka topic

✓ Spark consumer for Kafka topics

✓ Sending only topic-related messages to Kafka

Sending Twitter feedback to Kafka

✓ Sending the latest tweets about a specific topic to Kafka

✓ Receiving those events from Kafka in Spark

Analyzing feedback in real-time

✓ Kafka is receiving events from many sources

✓ Sentiment analysis on incoming Kafka events

✓ Sentiment <= 0.3 → #negative-feedback for review

✓ Sentiment >= 0.9 → #positive-feedback channel

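The routing rule on this slide is small enough to write down directly. The thresholds and channel names come from the slide; the function name `route_feedback` and the choice to return `None` for scores between the thresholds are assumptions for illustration, and how the sentiment score is computed is outside this sketch.

```python
from typing import Optional

NEGATIVE_THRESHOLD = 0.3  # from the slide: sentiment <= 0.3
POSITIVE_THRESHOLD = 0.9  # from the slide: sentiment >= 0.9


def route_feedback(sentiment: float) -> Optional[str]:
    """Return the Slack channel for a sentiment score, or None if unrouted."""
    if sentiment <= NEGATIVE_THRESHOLD:
        return "#negative-feedback"
    if sentiment >= POSITIVE_THRESHOLD:
        return "#positive-feedback"
    # Assumption: scores between the thresholds are not forwarded anywhere.
    return None


channels = [route_feedback(score) for score in (0.1, 0.5, 0.95)]
```

Both comparisons are inclusive, so a score of exactly 0.3 or 0.9 is routed.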

Kafka + Spark = Reliable, scalable, durable event ingestion and efficient stream processing

trigger(Trigger.Continuous("1 second"))

Low (~1 ms) end-to-end latency

At-least-once fault-tolerance guarantees

Not nearly all operations are supported yet

No automatic retries of failed tasks

Needs enough cluster power to operate

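At-least-once delivery means that no event is lost after a failure, but some events may be written to the sink more than once. The toy simulation below illustrates why: progress is checkpointed periodically, and a restart replays everything after the last checkpoint. The function name, the parameters, and the single simulated crash are assumptions for illustration; this is not how Spark implements checkpointing internally.

```python
from typing import List, Optional


def run_with_checkpoints(
    events: List[str],
    checkpoint_every: int,
    fail_at: Optional[int] = None,
) -> List[str]:
    """Write events to a sink, restarting once from the last checkpoint.

    fail_at is the index at which a simulated crash happens, before that
    event is written; None means the run never crashes.
    """
    sink: List[str] = []
    checkpoint = 0   # first offset not yet covered by a checkpoint
    position = 0     # next offset to process
    crashed = False
    while position < len(events):
        if fail_at is not None and position == fail_at and not crashed:
            crashed = True
            position = checkpoint  # restart: replay from the last checkpoint
            continue
        sink.append(events[position])
        position += 1
        if position % checkpoint_every == 0:
            checkpoint = position
    return sink


events = [f"e{i}" for i in range(6)]
clean_run = run_with_checkpoints(events, checkpoint_every=3)
crashed_run = run_with_checkpoints(events, checkpoint_every=3, fail_at=5)
```

In the crashed run, `e3` and `e4` were written after the checkpoint at offset 3 and are written again after the restart, so the sink sees them twice while every event still appears at least once.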

Check-pointing epoch: every X seconds

When event is at source

When event is processed to sink

~1 ms

aka.ms/eventhubs-kafka

Operator

Operators

Thank you!

Apache Kafka: aka.ms/apache-kafka

Apache Spark: aka.ms/apache-spark

Event stream processing architecture on Azure with Apache Kafka and Spark: aka.ms/kafka-spark-azure

Create HDInsight Kafka cluster using ARM: aka.ms/hdi-kafka-arm

Create Kafka topics in HDInsight: aka.ms/hdi-kafka-topic

Alena Hall - lenadroid

✓ Works on Azure

✓ Lives in Seattle

✓ F# Software Foundation Board of Trustees

✓ Organizes @ML4ALL

✓ Program Committee for Lambda World

✓ Has a channel: /c/AlenaHall