
CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2017

Doc 19 Kafka, Spark Streaming Nov 28, 2017

Copyright ©, All rights reserved. 2017 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA. OpenContent (http://www.opencontent.org/openpub/) license defines the copyright on this document.

Kafka Performance

2

On three cheap machines

Setup

6 machines

Intel Xeon 2.5 GHz processor with six cores
Six 7200 RPM SATA drives
32 GB of RAM
1 Gb Ethernet

Kafka cluster - 3 machines
Zookeeper - 1 machine
Generating load - 3 machines

Test uses 100 byte messages

http://tinyurl.com/ng2h9uh

6 partitions

Producer Throughput

3

Configuration                          Records/Second   MB/sec
1 producer, no replication                  821,557       78.3
1 producer, 3x async replication            786,980       75.1
1 producer, 3x sync replication             421,823       40.2
3 producers, 3x async replication         2,025,032      193.0

3 producers with 3x async replication = 7,290,115,200 records/hour

Producer Throughput Versus Stored Data

4

Does the amount of stored data affect performance?

Persisting messages is O(1)

Consumer Throughput

5

Configuration                Records/Second   MB/sec
1 consumer                        940,521       89.7
3 consumers, same topic         2,615,968      249.5

6 partitions, 3x async replicated

End-to-end Latency

6

[Diagram: Producer -> Kafka -> Consumer]

End-to-end latency:
2 ms - median
3 ms - 99th percentile
14 ms - 99.9th percentile

What Makes Kafka Fast - Partitions

7

Allows concurrent writes & reads on same topic

What Makes Kafka Fast - Messages

8

Messages are binary, batched, and compressed

Compression codec (attribute bits 0-2): 0 = none, 1 = gzip, 2 = snappy, 3 = lz4

The producer converts each message into binary

Kafka treats the message as opaque bytes

The consumer needs to know how to unpack the message

The producer supplies a schema and adds the schema to ZooKeeper
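Batching and compression are producer-side settings. A minimal sketch in Scala (using the Java client; the values shown are illustrative, not recommendations):

import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy")  // none, gzip, snappy, or lz4
props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384")         // batch up to 16 KB per partition
props.put(ProducerConfig.LINGER_MS_CONFIG, "5")              // wait up to 5 ms to fill a batch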

What Makes Kafka Fast - Use the Filesystem

9

Linear reads/writes are very fast on hard drives: a JBOD configuration of six 7200 RPM SATA drives can sustain about 600 MB/sec

A modern OS (Linux) heavily optimizes read-ahead and write-behind, and will use all free memory for the disk cache

On the JVM, the memory overhead of objects is high, and garbage collection slows down as heap data grows

So: write data to disk, use as little heap as possible, and let the OS use nearly all memory as disk cache

(28-30 GB of cache on a 32 GB machine)

What Makes Kafka Fast - sendfile

10

Normal path to send a file over the network:

1. OS copies the file: disk -> page cache (kernel space)

2. Application copies data: page cache -> user space

3. Application copies data: user space -> socket buffer (kernel space)

4. OS copies data: socket buffer -> NIC buffer

Using sendfile to send a file over the network:

1. OS copies file: disk -> NIC buffer

Using the OS page cache plus sendfile means that when consumers are mostly caught up, files are served entirely from cache
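This is not Kafka's own code, but java.nio exposes the same zero-copy path through FileChannel.transferTo. A sketch (host, port, and file path are hypothetical):

import java.io.FileInputStream
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

// transferTo lets the OS move file bytes straight from the page cache to the
// socket (sendfile on Linux), skipping the two user-space copies listed above
def sendFileZeroCopy(path: String, host: String, port: Int): Unit = {
  val fileChannel = new FileInputStream(path).getChannel
  val socket = SocketChannel.open(new InetSocketAddress(host, port))
  try {
    var position = 0L
    val size = fileChannel.size()
    while (position < size)
      position += fileChannel.transferTo(position, size - position, socket)
  } finally {
    fileChannel.close()
    socket.close()
  }
}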

Message Delivery Semantics

11

How to guarantee delivery producer -> Kafka

[Diagram: a producer sends message M1 to the Partition 1 leader on Kafka Server 1; Kafka Server 2 holds the Partition 1 follower, plus Partitions 3 and 4]

When do we consider message delivered?

When the leader gets the message, or when all replicated partitions get the message

Message Delivery Semantics

12

How to guarantee delivery producer -> Kafka


If no acknowledgement arrives, the producer does not know whether the message was delivered, so it must resend the message

Prior to Kafka 0.11.0, the resend would create a duplicate message in the log

Message Delivery Semantics

13

How to guarantee delivery producer -> Kafka

Idempotent Delivery - Kafka 0.11.0

Each producer has an ID and adds a sequence number to its messages

The Kafka server checks for duplicates

Transactions - Kafka 0.11.0: a producer can send messages to multiple topics, and they all either succeed or all fail
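A minimal sketch of the producer settings involved (Scala, Kafka 0.11+ client; the transactional id is the same placeholder used in the full transaction example later in these notes):

import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
// Idempotent delivery: the broker de-duplicates retried sends using the producer id + sequence number
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")
// Transactions additionally need a transactional id (which implies idempotence)
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-transactional-id")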

Message Delivery Semantics

14

How to guarantee delivery producer -> Kafka

Producer can specify:
Be notified when the leader and all followers have the message
Be notified when the leader has the message
Provide a timeout
Receive no notifications

Replication

15

Each partition has a single leader and zero or more followers (slaves)

Followers are consumers of the leader partition, which allows for batch reads

A live (in-sync) node maintains its session with ZooKeeper

A follower must not be too far behind the leader (replica.lag.time.max.ms)

A failed node is one that is not in-sync

When a Leader Dies

16

In-sync replicas (ISR): follower partitions that are caught up with the leader

ZooKeeper maintains the ISR set for each partition

When a partition leader fails, nodes in the ISR vote on a new leader

A machine may handle thousands of topics and even more partitions

Message Delivery Semantics

17

How to guarantee delivery producer -> Kafka

Producer can specify:
Be notified when the leader and all followers have the message
Be notified when the leader has the message
Provide a timeout
Receive no notifications

Producer specifies the acks parameter:
0 - no acknowledgement
1 - acknowledgement when the leader has the message
all (-1) - acknowledgement when all ISR partitions have received the message
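A sketch of how these choices map onto the producer configuration (Scala; values are illustrative):

import java.util.Properties

val props = new Properties()
// "0"   - no acknowledgement (fire and forget)
// "1"   - acknowledged once the partition leader has the record
// "all" - acknowledged once all in-sync replicas have the record (same as "-1")
props.put("acks", "all")
// how long the producer waits for an acknowledgement before failing the request
props.put("request.timeout.ms", "30000")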

Zookeeper

18

Because coordinating distributed systems is a Zoo

Distributed hierarchical key-value store

For large distributed systems: distributed configuration service, synchronization service, naming registry

Zookeeper

19

Running Kafka

20

Download and unpack Kafka

Start Zookeeper (it comes with Kafka)

bin/zookeeper-server-start.sh config/zookeeper.properties

In Kafka directory

Start Kafka Server

bin/kafka-server-start.sh config/server.properties

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

Create a topic

Kafka can be configured to auto-create topics when a producer publishes to a new topic
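Topics can also be created programmatically. A sketch using the AdminClient API added in Kafka 0.11 (assumes a broker on localhost:9092 and the same topic settings as the shell command above):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
val admin = AdminClient.create(props)
// topic "test" with 1 partition and replication factor 1
val topic = new NewTopic("test", 1, 1.toShort)
admin.createTopics(Collections.singleton(topic)).all().get()
admin.close()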

Running Kafka

21

bin/kafka-topics.sh --list --zookeeper localhost:2181

List of topics

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

Send some messages

>this is a test
>this is a message
>Hi mom

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning

Start a client

this is a test
this is a message
Hi mom


Running Kafka

23

Setting up a multi-broker cluster

cp config/server.properties config/server-1.properties

cp config/server.properties config/server-2.properties

Edit each configuration (broker id, port, log directory) so the brokers can run on the same machine

bin/kafka-server-start.sh config/server-1.properties &

Start the servers

bin/kafka-server-start.sh config/server-2.properties &

Running Kafka

24

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic

Create a new topic with a replication factor of three

bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic

Topic:my-replicated-topic  PartitionCount:1  ReplicationFactor:3  Configs:
    Topic: my-replicated-topic  Partition: 0  Leader: 1  Replicas: 1,2,0  Isr: 1,2,0

Get information about topic

Isr lists the in-sync replicas

Send Records in Code

25

Records have a topic, a key, and a value

Records are serialized

Kafka has serializers for the base types

If you want the key or value to be a compound type, you need to provide a serializer
The consumer needs access to the corresponding deserializer
Either:

Attach the schema to each message, or register it with ZooKeeper
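A sketch of a custom value serializer for a hypothetical compound type (Scala; the consumer side would need a matching Deserializer that reverses this encoding):

import java.util
import org.apache.kafka.common.serialization.Serializer

// Hypothetical compound type used as a record value
case class Person(name: String, age: Int)

// Minimal serializer: encodes a Person as "name,age" in UTF-8
class PersonSerializer extends Serializer[Person] {
  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ()
  override def serialize(topic: String, data: Person): Array[Byte] =
    if (data == null) null else s"${data.name},${data.age}".getBytes("UTF-8")
  override def close(): Unit = ()
}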

KafkaProducer Java Methods

26

initTransactions()

beginTransaction()

abortTransaction()

commitTransaction()

flush()

send(ProducerRecord<K,V> record)

send(ProducerRecord<K,V> record, Callback callback)

KafkaProducer(java.util.Properties properties)

KafkaProducer(java.util.Properties properties, Serializer<K> keySerializer, Serializer<V> valueSerializer)

Simple Producer

27

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++)
    producer.send(new ProducerRecord<String, String>("my-topic", Integer.toString(i), Integer.toString(i)));

producer.close();

With Transactions

28

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("transactional.id", "my-transactional-id");
Producer<String, String> producer = new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());
producer.initTransactions();

try {
    producer.beginTransaction();
    for (int i = 0; i < 100; i++)
        producer.send(new ProducerRecord<>("my-topic", Integer.toString(i), Integer.toString(i)));
    producer.commitTransaction();
} catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
    // Fatal errors: the producer cannot recover, so close it
    producer.close();
} catch (KafkaException e) {
    // Recoverable error: abort the transaction and retry if desired
    producer.abortTransaction();
}
producer.close();

Consumer

29

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "my-topic"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset = %d, key = %s, value = %s%n",
            record.offset(), record.key(), record.value());
}

Spark Streaming & Spark Structured Streaming

30

Spark Streaming - streams data as a sequence of RDDs (DStreams)

Spark Structured Streaming - streams data to DataFrames/Datasets

Spark Streaming

31

Spark Structured Streaming

32

Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming

Express a streaming computation the same way as a batch computation on static data

The Spark SQL engine runs the computation incrementally and continuously, updating the final result as streaming data continues to arrive

Spark Structured Streaming

33

Spark Structured Streaming - Example

34

[Diagram: a stream of words flows into Spark, which maintains a running word count]

Input over time: T1: cat dog dog dog; T2: owl cat; T3: dog owl

Word Count Example

35

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.StreamingQuery

object NetworkExample {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("StructuredNetworkWordCount")
      .getOrCreate()

Word Count Example

36

    val lines: DataFrame = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split the lines into words
    import spark.implicits._
    val words = lines.as[String].flatMap(_.split(" "))

    // Generate running word count
    val wordCounts = words.groupBy("value").count()

    val query: StreamingQuery = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}

import spark.implicits._

37

import spark.implicits._
val words = lines.as[String].flatMap(_.split(" "))

Includes encoders for

Primitive types (Int, String, etc) and Product types (case classes)
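For example, a user-defined case class (a product type) also gets an encoder automatically. A sketch, reusing spark and lines from the word count example; Line is a hypothetical case class:

// With spark.implicits._ in scope, the Line case class gets an encoder,
// so the streaming data can be handled as a typed Dataset[Line]
case class Line(text: String)

import spark.implicits._
val typedLines: Dataset[Line] = lines.as[String].map(Line(_))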

Output Modes

38

.outputMode(OutputMode.Complete())

val query: StreamingQuery = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

Append - only new rows in the result DataFrame are written to the output

Complete - all rows are written to the output

Update - only rows that were updated since the last trigger are written to the output
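The mode is chosen per query when the sink is configured. A sketch using the wordCounts DataFrame from the word count example:

// Update mode: only rows whose counts changed since the last trigger are emitted
// ("append" would require a watermark when aggregations are involved)
val updateQuery = wordCounts.writeStream
  .outputMode("update")
  .format("console")
  .start()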

query.awaitTermination()

39

Runs until query.stop() is called or an exception occurs

Output

40

Complete

41

Input:
T1: cat dog dog dog
T2: owl cat
T3: dog owl

Output:
-------------------------------------------
Batch: 1
-------------------------------------------
+-----+-----+
|value|count|
+-----+-----+
|  dog|    3|
|  cat|    2|
|  owl|    1|
+-----+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+-----+-----+
|value|count|
+-----+-----+
|  dog|    3|
|  cat|    2|
|  owl|    1|
+-----+-----+

-------------------------------------------
Batch: 3
-------------------------------------------
+-----+-----+
|value|count|
+-----+-----+
|  dog|    4|
|  cat|    2|
|  owl|    2|
+-----+-----+


Input Sources

44

File source - reads files written in a directory as a stream of data; supported formats are text, csv, json, parquet

Kafka source - polls data from Kafka; requires Kafka version 0.10.0 or higher

Socket source (for testing) - reads UTF-8 text data from a socket connection
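A sketch of the file source (the directory path and schema are hypothetical; file sources require the schema to be specified up front):

import org.apache.spark.sql.types._

val schema = new StructType()
  .add("word", StringType)
  .add("timestamp", TimestampType)

// Each new JSON file dropped into the directory becomes new rows in the stream
val fileStream = spark.readStream
  .format("json")
  .schema(schema)
  .load("/tmp/streaming-input")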

Template for Kafka as Source

45

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

46

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()

val lines: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()

import spark.implicits._
val words = lines.selectExpr("CAST(value AS STRING)").as[String]
val wordCounts: DataFrame = words.groupBy("value").count()

// An aggregation without a watermark cannot use Append mode; use Complete (or Update)
val query: StreamingQuery = wordCounts.writeStream
  .outputMode(OutputMode.Complete())
  .format("console")
  .start()
query.awaitTermination()

Event-time and Late Data

47

Event-time: messages contain a timestamp, which can be used as the event time

Messages may arrive out of order with respect to event time

Window Operations on Event Time

48

Can compute aggregations over a sliding event-time window

Specify the window size and how often the window slides (the update interval)

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()

Window size: 10 minutes; update (slide) interval: 5 minutes


With Late Data

50

Watermarking

51

val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()

Can specify how late data can be and still be used

Unsupported Operations

52

Multiple chained streaming aggregations are not supported

Limit and take first N rows are not supported on streaming Datasets.

Distinct operations on streaming Datasets are not supported.

Sorting operations are supported on streaming Datasets only after an aggregation and in Complete Output Mode.

Joins between two streaming Datasets are not yet supported.

count() - use ds.groupBy().count(), which returns a streaming Dataset containing a running count.

foreach() - Use ds.writeStream.foreach(...)

show() - Use the console sink
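A sketch of the foreach() workaround using a ForeachWriter (assumes the words Dataset[String] from the word count example; a real sink would open a connection in open() and write records in process()):

import org.apache.spark.sql.ForeachWriter

// Trivial writer that just prints each row
val printWriter = new ForeachWriter[String] {
  def open(partitionId: Long, version: Long): Boolean = true
  def process(value: String): Unit = println(value)
  def close(errorOrNull: Throwable): Unit = ()
}

val foreachQuery = words.writeStream.foreach(printWriter).start()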