Kafka pours and Spark resolves

Posted on 21-Apr-2017

Kafka pours and Spark resolves!

Alexey Zinovyev, Java/BigData Trainer at EPAM

About

With IT since 2007

With Java since 2009

With Hadoop since 2012

With Spark since 2014

With EPAM since 2015


Contacts

E-mail : Alexey_Zinovyev@epam.com

Twitter : @zaleslaw @BigDataRussia

Facebook: https://www.facebook.com/zaleslaw

vk.com/big_data_russia Big Data Russia

vk.com/java_jvm Java & JVM langs


Spark Family

Pre-summary

• Before RealTime

• Spark + Cassandra

• Sending messages with Kafka

• DStream Kafka Consumer

• Structured Streaming in Spark 2.1

• Kafka Writer in Spark 2.2


< REAL-TIME


Modern Java in 2016
Big Data in 2014


Batch jobs produce reports. More and more of them…


But the customer can wait forever (OK, one hour)


Big Data in 2017


Machine Learning EVERYWHERE


Data Lake in promotional brochure


Data Lake in production


Simple Flow in Reporting/BI systems


Let’s use Spark. It’s fast!


MapReduce vs Spark

Simple Flow in Reporting/BI systems with Spark


Spark handles last year's logs with ease


Where can we store events?


Let’s use Cassandra to store events!


Let’s use Cassandra to read events!


Cassandra

CREATE KEYSPACE mySpace WITH replication =
  { 'class': 'SimpleStrategy', 'replication_factor': 1 };

USE mySpace;

CREATE TABLE logs (
  application TEXT,
  time TIMESTAMP,
  message TEXT,
  PRIMARY KEY (application, time)
);


Cassandra to Spark

val dataSet = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "logs", "keyspace" -> "mySpace"))
  .load()

dataSet
  .filter("message = 'Log message'")
  .show()
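
The reverse direction works the same way through the connector. A minimal sketch, assuming the same logs table in the mySpace keyspace (newEvents is a hypothetical DataFrame with the application, time and message columns):

// Write a DataFrame back to Cassandra with the spark-cassandra-connector
newEvents
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "logs", "keyspace" -> "mySpace"))
  .mode("append")   // append, since the table already exists
  .save()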


Simple Flow in Pre-Real-Time systems


Spark cluster over Cassandra Cluster


More events every second!


SENDING MESSAGES


Your Father’s Messaging System

// Classic JMS: look up a connection factory via JNDI and open a queue connection
InitialContext ctx = new InitialContext();
QueueConnectionFactory factory =
    (QueueConnectionFactory) ctx.lookup("qCFactory");
QueueConnection con = factory.createQueueConnection();
con.start();


KAFKA


Kafka

• messaging system

• distributed

• supports the Publish-Subscribe model

• persists messages on disk

• replicates partitions within the cluster (coordination via Zookeeper)

The main benefits of Kafka

Scalability with zero downtime

Zero data loss due to replication


Kafka Cluster consists of …

• brokers (leader or follower)

• topics (each with >= 1 partition)

• partitions

• partition offsets

• replicas of partitions

• producers/consumers (see the producer sketch below)
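
To make the producer side concrete, a minimal sketch with the plain Kafka client; the broker address and the "messages" topic are assumptions:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

// Send one record to the "messages" topic, then close (which also flushes)
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("messages", "key-1", "Log message"))
producer.close()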


Kafka Components with topic “messages” #1

[diagram: Producer Threads #1–#3 send Data to Topic: Messages; Zookeeper coordinates the cluster]

Kafka Components with topic “messages” #2

[diagram: Topic: Messages is split into Part #1 and Part #2 on Broker #1, each partition with a Leader]

Kafka Components with topic “messages” #3

[diagram: Part #1 and Part #2 are replicated across Broker #1 and Broker #2; Zookeeper tracks the cluster]

Why do we need Zookeeper?


Kafka Demo


REAL TIME WITH DSTREAMS


RDD Factory


From socket to console with DStreams


DStream

// Word count over a socket stream; one micro-batch (RDD) per second
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]")
  .setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

wordCounts.print()

ssc.start()
ssc.awaitTermination()

Kafka as the main entry point for Spark
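
A minimal sketch of a DStream Kafka consumer with the spark-streaming-kafka-0-10 integration; the broker address, group id and "messages" topic are assumptions, and ssc is the StreamingContext from the previous slide:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-demo",
  "auto.offset.reset" -> "earliest"
)

// Each record is a ConsumerRecord[String, String]; count the words in its value
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("messages"), kafkaParams))

stream.map(_.value)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()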


DStreams Demo


How can we avoid DStreams and their RDD-like API?


SPARK 2.2 DISCUSSION


Continuous Applications


Continuous Applications cases

• Updating data that will be served in real time

• Extract, transform and load (ETL)

• Creating a real-time version of an existing batch job

• Online machine learning


The main concept of Structured Streaming

You can express your streaming computation the same way you would express a batch computation on static data.


Batch (Spark 2.2)

// Read JSON once from S3
val logsDF = spark.read.json("s3://logs")

// Transform with the DataFrame API and save
logsDF.select("user", "url", "date")
  .write.parquet("s3://out")


Real Time (Spark 2.2)

// Read JSON continuously from S3
val logsDF = spark.readStream.json("s3://logs")

// Transform with the DataFrame API and write as a stream
logsDF.select("user", "url", "date")
  .writeStream
  .format("parquet")
  .option("path", "s3://out")
  .option("checkpointLocation", "s3://out/_checkpoints")   // required by the file sink; path is an example
  .start()


WordCount from Socket

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))

val wordCounts = words.groupBy("value").count()

Don’t forget to start the streaming query
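
What that missing step looks like: a sketch with the console sink and complete output mode, as in the standard word-count example:

val query = wordCounts.writeStream
  .outputMode("complete")   // re-emit the full counts table on every trigger
  .format("console")
  .start()

query.awaitTermination()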


Unlimited Table


WordCount with Structured Streaming [Complete Mode]


Kafka -> Structured Streaming -> Console
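
A sketch of that demo pipeline with the Structured Streaming Kafka source (it needs the spark-sql-kafka-0-10 package; the broker address and topic name are assumptions):

import spark.implicits._

val messages = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "messages")
  .load()

// Kafka rows expose key/value as binary columns; cast the value to a string
val values = messages.selectExpr("CAST(value AS STRING)").as[String]

values.flatMap(_.split(" "))
  .groupBy("value")
  .count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()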


Kafka To Console Demo


OPERATIONS


You can …

• filter

• sort

• aggregate

• join

• foreach

• explain
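
For instance, a few of these chained together on a streaming Dataset (a sketch; values is the Kafka value stream assumed in the earlier example):

val errorCounts = values
  .filter(_.contains("ERROR"))   // filter
  .groupBy("value")              // aggregate
  .count()

errorCounts.explain()            // inspect the physical plan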


Operators Demo


How does it work?


Deep Diving in Spark Internals

[diagram, built up over several slides: Dataset → Logical Plan → Optimized Logical Plan → Physical Plans → the Planner selects a Physical Plan → executed as Incremental #1, #2, #3]

Incremental Execution: Planner polls

[diagram: the Planner polls the source for new data]

Incremental Execution: Planner runs

[diagram: the Planner runs Incremental #1 over Offset: 1 - 5 → Count: 5]

Incremental Execution: Planner runs #2

[diagram: Incremental #2 runs over Offset: 6 - 9 → Count: 4]

Aggregation with State

[diagram: the state (5) from Incremental #1 feeds Incremental #2, so the running count is 5 + 4 = 9]

Dataset.explain()

== Physical Plan ==
Project [avg(price)#43, carat#45]
+- SortMergeJoin [color#21], [color#47]
   :- Sort [color#21 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(color#21,200), None
   :     +- Project [avg(price)#43, color#21]
   :        +- TungstenAggregate(key=[cut#20,color#21],
   :              functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)],
   :              output=[color#21,avg(price)#43])
   :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
   :              +- TungstenAggregate(key=[cut#20,color#21],
   :                    functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)],
   :                    output=[cut#20,color#21,sum#58,count#59L])
   :                 +- Scan CsvRelation(-----)
   +- Sort [color#47 ASC], false, 0
      +- TungstenExchange hashpartitioning(color#47,200), None
         +- ConvertToUnsafe
            +- Scan CsvRelation(----)


What’s the difference between Complete and Append output modes?


COMPLETE, APPEND & UPDATE


There are two main modes today and one more to come

• append (default): only the rows added since the last trigger are written to the sink

• complete: the whole updated result table is written on every trigger

• update [in dreams]: only the rows that changed since the last trigger
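
In code the mode is a single setting on the writer; a sketch with the word-count stream from earlier:

// Complete: the whole aggregated result table is re-emitted on every trigger
wordCounts.writeStream.outputMode("complete").format("console").start()

// Append (the default): only newly added rows are emitted
words.writeStream.outputMode("append").format("console").start()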

Aggregation with watermarks
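
A sketch of an event-time aggregation with a watermark; the events stream and its time/word columns are assumptions:

import org.apache.spark.sql.functions.{col, window}

val windowedCounts = events
  .withWatermark("time", "10 minutes")   // drop state for data arriving more than 10 minutes late
  .groupBy(window(col("time"), "5 minutes"), col("word"))
  .count()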


SOURCES & SINKS


Spark Streaming is a brick in the Big Data Wall


Let's save to Parquet files
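
A minimal sketch of the file sink; the output and checkpoint paths are example values, and the non-aggregated words stream from the word-count slide is reused because the file sink works in append mode:

words.writeStream
  .format("parquet")
  .option("path", "/tmp/out/parquet")            // example output path
  .option("checkpointLocation", "/tmp/out/ckpt") // required for the file sink
  .start()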


Can we write to Kafka?


Nightly Build


Kafka-to-Kafka
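
A sketch of what Kafka-to-Kafka looks like once the Kafka writer is available (a Spark 2.2 nightly build at the time of the talk); the broker address, topic names and checkpoint path are assumptions:

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "messages")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")   // the Kafka sink expects key/value columns
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "messages-out")
  .option("checkpointLocation", "/tmp/kafka-to-kafka-ckpt")
  .start()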


J-K-S-K-S-C


Pipeline Demo


I didn’t find a sink/source for XXX…


Console Foreach Sink

import org.apache.spark.sql.ForeachWriter

// A custom sink that prints every record of the stream to the console
val customWriter = new ForeachWriter[String] {
  override def open(partitionId: Long, version: Long) = true
  override def process(value: String) = println(value)
  override def close(errorOrNull: Throwable) = {}
}

stream.writeStream
  .queryName("ForeachOnConsole")
  .foreach(customWriter)
  .start


Pinch of wisdom

• check checkpointLocation

• don’t use MemoryStream

• think about GC pauses

• be careful with nightly builds

• use .groupBy().count() instead of .count()

• use the console sink instead of the .show() function


We can’t yet…

• join two streams

• use the update output mode

• do a full outer join

• take the first N rows

• sort without pre-aggregation


Roadmap 2.2

• Support other data sources (not only S3 + HDFS)

• Transactional updates

• Dataset as the single DSL for all operations

• GraphFrames + Structured MLLib

• KafkaWriter

• TensorFrames


IN CONCLUSION


Final point

A scalable, fault-tolerant, real-time pipeline with Spark & Kafka is ready for use


A few articles about Spark Streaming and Kafka

Introduction to Spark + Kafka

http://bit.ly/2mJjE4i


Source code from the demos

Spark Tutorial

https://github.com/zaleslaw/Spark-Tutorial


Contacts

E-mail : Alexey_Zinovyev@epam.com

Twitter : @zaleslaw @BigDataRussia

Facebook: https://www.facebook.com/zaleslaw

vk.com/big_data_russia Big Data Russia

vk.com/java_jvm Java & JVM langs


Any questions?