Trivento summercamp fast data 9/9/2016

What is FastData & How the SMACK stack plays a major role by implementing a Fast Data strategy

Stavros Kontopoulos, Senior Software Engineer @ Lightbend, M.Sc.

De oude Prodentfabriek, Trivento Summercamp 2016, Amersfoort


Introduction


Introduction: Who Am I?


Agenda

A bit of history: Big Data Processing

What is Fast Data?

Batch Systems vs Streaming Systems & Streaming Evolution

Event Log, Message Queues

A Fast Data Architecture

Example Application


Last warning...


Data Processing

Batch processing: processing done on a bounded dataset.

Stream processing (streaming): processing done on an unbounded dataset. Data items are pushed or pulled.

Two categories of systems: batch vs streaming systems.
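The difference between the two definitions can be illustrated with a minimal sketch in plain Scala (no streaming engine involved). The object and method names are hypothetical and only serve the illustration: a bounded dataset yields one final result, while an unbounded one can only yield a result per element seen so far.

```scala
// Hypothetical names, for illustration only.
object BoundedVsUnbounded {
  // Batch: the whole (bounded) dataset is available, so one final answer exists.
  def batchSum(data: Seq[Int]): Int = data.sum

  // Streaming: the input may never end, so we emit a running result per element.
  // Iterator.scanLeft is lazy, so this also works on an infinite iterator.
  def runningSums(stream: Iterator[Int]): Iterator[Int] =
    stream.scanLeft(0)(_ + _).drop(1) // drop the initial 0 seed
}
```

For example, `BoundedVsUnbounded.runningSums(Iterator.from(1)).take(3).toList` consumes only the first three elements of an infinite input.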


Big Data - The story

Internet-scale apps moved data size from gigabytes to petabytes.

Once upon a time there were traditional RDBMSs like Oracle and data warehouses, but volume, velocity and variety changed the game.


Big Data - The story

MapReduce was a major breakthrough (Google published the seminal paper in 2004).

The Nutch project already had an implementation in 2005.

In 2006 it becomes a subproject of Lucene under the name Hadoop.

In 2008 Yahoo brings Hadoop to production with a 10K cluster. The same year it becomes a top-level Apache project.

Hadoop is good for batch processing.


Big Data - The story

Word Count example - Inverted Index.

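[Diagram: MapReduce word count. The input is divided into splits (Split 1 ... Split N), each holding documents (doc1, doc2, ... doc300, doc100). The Map phase emits pairs such as (w1,1)…(w20,1) and (w41,1)…(w1,1). The Shuffle phase groups the pairs by word, e.g. (w1, (1,1,1…)) and (w41, (1,1,…)). The Reduce phase sums each group into a count per word, e.g. (w1, 13).]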
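The three phases of the word count can be sketched in plain Scala on in-memory collections. This is a single-machine illustration of the data flow, not Hadoop code; `WordCountSketch` and its method names are hypothetical.

```scala
// Hypothetical single-machine sketch of the map / shuffle / reduce data flow.
object WordCountSketch {
  // Map phase: each document emits a (word, 1) pair per word occurrence.
  def mapPhase(doc: String): Seq[(String, Int)] =
    doc.toLowerCase.split("\\W+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle phase: group all emitted counts by word.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2)) }

  // Reduce phase: sum the grouped counts per word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (w, ones) => (w, ones.sum) }

  // Each inner Seq plays the role of one input split holding documents.
  def wordCount(splits: Seq[Seq[String]]): Map[String, Int] =
    reducePhase(shuffle(splits.flatten.flatMap(mapPhase)))
}
```

In a real cluster the map calls run in parallel per split and the shuffle moves data across machines; here everything happens in one process.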


Big Data - The story

Giuseppe DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store", changed the database world in 2007.

NoSQL databases, along with general-purpose systems like Hadoop, solve problems that cannot be solved with traditional RDBMSs.

Technology facts: cheap memory, SSDs, HDDs are the new tape, more CPUs rather than more powerful CPUs.


Big Data - The story

There is a major shift in the industry as batch processing is not enough any more.

Batch jobs usually take hours if not days to complete; in many applications that is not acceptable.


Big Data - The story

The trend now is near-real-time computation, which implies streaming algorithms and needs new semantics. Fast Data (data in motion) and Big Data (data at rest) at the same time.

The enterprise needs to get smarter: all major players across industries use ML on top of massive datasets to make better decisions.

Images: https://www.tesla.com/sites/default/files/pictures/thumbs/model_s/red_models.jpg?201501121530 https://i.ytimg.com/vi/cj83dL72cvg/maxresdefault.jpg

Big Data - The story

OpsClarity report:

92% plan to increase their investment in stream processing applications in the next year

79% plan to reduce or eliminate investment in batch processing

32% use real time analysis to power core customer-facing applications

44% agreed that it is tedious to correlate issues across the pipeline

68% identified lack of experience and underlying complexity of new data frameworks as their barrier to adoption

http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html


Big Data - The story

Image: http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html

Big Data - The story


In OpsClarity report:

● Apache Kafka is the most popular broker technology (ingestion queue).

● HDFS is the most used data sink.

● Apache Spark is the most popular data processing tool.


Big Data Landscape

Image: http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png

Big Data System

A Big Data System must have at least the following components at its core:

A Distributed File System (DFS) such as S3 or HDFS, or a distributed database system (DDS).

A distributed data processing tool such as Spark or MapReduce.

Tools and services to manage the previous systems.


Batch Systems - The Hadoop Ecosystem

YARN (Yet Another Resource Negotiator) was deployed in production at Yahoo in March 2013.

The same year Cloudera, the dominant Hadoop vendor, embraced Spark as the next-generation replacement for MapReduce.

Image: Lightbend Inc.


Batch Systems - The Hadoop Ecosystem

Hadoop clusters have been the gold standard for big data from ~2008 to the present (Hadoop started back in 2005).

Strengths:

Lowest CapEx system for Big Data.

Excellent for ingesting and integrating diverse datasets.

Flexible: from classic analytics (aggregations and data warehousing) to machine learning.


Batch Systems - The Hadoop Ecosystem

Weaknesses:

Complex administration.

YARN can’t manage all distributed services.

MapReduce has poor performance, a difficult programming model, and doesn’t support stream processing.


Analyzing Infinite Data Streams

What does it mean to run a SQL query on an unbounded data set?

How should I deal with late data?

What kind of time measurement should I use? Event-time, Processing time or Ingestion time?

Accuracy of computations on bounded datasets vs on unbounded datasets

Algorithms for streaming computations?


Analyzing Infinite Data Streams

Two cases for processing:

Single event processing: transforming an event, or triggering an alarm on an error event.

Event aggregations: summary statistics, group-by, join and similar queries. For example compute the average temperature for the last 5 minutes from a sensor data stream.
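The sensor example above can be sketched without any streaming engine. The code below is a minimal illustration over an in-memory sequence; `Reading`, `WindowMs` and `average` are hypothetical names, and timestamps are assumed to be event times in milliseconds.

```scala
// Hypothetical sketch: average temperature over the last 5 minutes.
object LastFiveMinutesAverage {
  final case class Reading(timestampMs: Long, temperature: Double)

  val WindowMs: Long = 5 * 60 * 1000L

  // Averages the readings whose timestamp lies in (nowMs - 5 min, nowMs].
  // Returns None when no reading falls inside that interval.
  def average(readings: Seq[Reading], nowMs: Long): Option[Double] = {
    val recent = readings.filter { r =>
      r.timestampMs > nowMs - WindowMs && r.timestampMs <= nowMs
    }
    if (recent.isEmpty) None
    else Some(recent.map(_.temperature).sum / recent.size)
  }
}
```

A streaming engine would maintain this aggregate incrementally instead of rescanning all readings, but the result it computes is the same kind of windowed summary.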


Analyzing Infinite Data Streams

Event aggregation introduces the concept of windowing with respect to the notion of time selected:

Event time (the time that events happen): Important for most use cases where context and correctness matter at the same time. Example: billing applications, anomaly detection.

Processing time (the time they are observed during processing): Use cases where I only care about what I process in a window. Example: accumulated clicks on a page per second.

System Arrival or Ingestion time (the time that events arrived at the streaming system).

Ideally event time = processing time. In reality there is skew between the two.


Analyzing Infinite Data Streams


Windows come in different flavors:

Tumbling windows discretize a stream into non-overlapping windows.

Sliding windows slide over the stream of data, so consecutive windows may overlap.
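How an element is assigned to windows in each flavor can be sketched with plain arithmetic. The names below are hypothetical, and the sketch assumes non-negative timestamps and windows aligned to time 0.

```scala
// Hypothetical sketch of window assignment; assumes timestamps >= 0.
object WindowAssignment {
  // Tumbling: a timestamp belongs to exactly one window [start, start + size).
  def tumblingWindowStart(ts: Long, size: Long): Long = ts - (ts % size)

  // Sliding: a window of length `size` starts every `slide` time units,
  // so a timestamp belongs to every window that covers it.
  // Returns the start of each such window, latest first.
  def slidingWindowStarts(ts: Long, size: Long, slide: Long): List[Long] = {
    val lastStart = ts - (ts % slide)
    Iterator.iterate(lastStart)(_ - slide).takeWhile(_ > ts - size).toList
  }
}
```

With `size = 10` and `slide = 5`, timestamp 7 falls into the windows starting at 5 and at 0, whereas a tumbling window of size 10 places it only in the window starting at 0.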


Analyzing Infinite Data Streams

Watermarks: a watermark indicates that no more elements with a timestamp older than or equal to the watermark timestamp should arrive for the specific window of data.

Triggers: decide when the window is evaluated or purged.
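The interplay of watermarks and triggers can be sketched with a toy tumbling windower. This is not Flink's implementation: `TumblingWindower` is a hypothetical class, it keeps the maximum value per window, it fires a window once the watermark reaches the window's last timestamp, and it simply rejects elements that arrive behind the watermark (real engines offer more refined lateness handling).

```scala
import scala.collection.mutable

// Hypothetical toy windower illustrating watermarks and a fire-and-purge trigger.
final class TumblingWindower(size: Long) {
  // window start -> values buffered for the window [start, start + size)
  private val buffers = mutable.Map.empty[Long, Vector[Double]]
  private var currentWatermark = Long.MinValue

  // Returns false for a late element (timestamp already covered by the watermark).
  def onElement(ts: Long, value: Double): Boolean =
    if (ts <= currentWatermark) false
    else {
      val start = ts - (ts % size)
      buffers(start) = buffers.getOrElse(start, Vector.empty[Double]) :+ value
      true
    }

  // Trigger: evaluate and purge every window whose last timestamp
  // (start + size - 1) is at or before the watermark.
  def onWatermark(watermark: Long): List[(Long, Double)] = {
    currentWatermark = math.max(currentWatermark, watermark)
    val ready = buffers.keys.filter(_ + size - 1 <= currentWatermark).toList.sorted
    ready.map(start => (start, buffers.remove(start).get.max))
  }
}
```

The design choice worth noting is that results are produced by watermark progress, not by wall-clock time, which is what lets event-time windows give the same answer regardless of processing speed.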


Analyzing Infinite Data Streams

Given the advances in streaming we can:

Trade off latency against cost and accuracy

In certain use-cases replace batch processing with streaming


Analyzing Infinite Data Streams

Recent advances in streaming are a result of the following pioneering work:

MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp. 1792-1803


Analyzing Infinite Data Streams

Apache Beam is the open source successor of Google’s DataFlow.

It is becoming the standard API for streaming and provides the advanced semantics that current streaming applications need.


Streaming Systems Architecture

The user provides a graph of computations through a high-level API, where data flows on the edges of this graph. Each vertex is an operator that executes a user-defined computation. For example: stream.map().keyBy()...

Operators can run in multiple instances and preserve state (unlike batch processing where we have immutable datasets).

State can be persisted and restored in the presence of failures.
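A stateful operator with snapshot and restore can be sketched as follows. `RunningCountOperator` and its methods are hypothetical and not part of any engine's API; a real engine would take the snapshot as part of a coordinated checkpoint and store it durably.

```scala
// Hypothetical stateful operator: counts elements per key and can be checkpointed.
final class RunningCountOperator {
  private var counts = Map.empty[String, Long]

  // Processes one element and returns the updated count for its key.
  def process(key: String): Long = {
    val updated = counts.getOrElse(key, 0L) + 1
    counts = counts.updated(key, updated)
    updated
  }

  // Snapshot: an immutable copy of the state that could be written to durable storage.
  def snapshot(): Map[String, Long] = counts

  // Restore: replace the state after a failure with a previously taken snapshot.
  def restore(saved: Map[String, Long]): Unit = counts = saved
}
```

After a failure, a fresh operator instance restored from the last snapshot continues counting from where the snapshot was taken; elements processed after the snapshot must be replayed from the source.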


Analyzing Infinite Data Streams - Flink Example

sealed trait SensorType { def stype: String }
case object TemperatureSensor extends SensorType { val stype = "TEMP" }
case object HumiditySensor extends SensorType { val stype = "HUM" }

case class SensorData(var sensorId: String, var value: Double, var sensorType: SensorType, timestamp: Long)

https://github.com/skonto/trivento-summercamp-2016


Analyzing Infinite Data Streams - Flink Example

import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
import org.apache.flink.streaming.api.watermark.Watermark

import scala.util.Random

class SensorDataSource(val sensorType: SensorType, val numberOfSensors: Int,
                       val watermarkTag: Int, val numberOfElements: Int = -1)
  extends SourceFunction[SensorData] {

  final val serialVersionUID = 1L
  @volatile var isRunning = true
  var counter = 1
  var timestamp = 0
  val randomGen = Random

  require(numberOfSensors > 0)
  require(numberOfElements >= -1)

  // Base value around which the simulated readings fluctuate.
  lazy val initialReading: Double = {
    sensorType match {
      case TemperatureSensor => 27.0
      case HumiditySensor => 0.75
    }
  }

  override def run(ctx: SourceContext[SensorData]): Unit = {

    // numberOfElements == -1 means: run until cancelled.
    val counterCondition = {
      if (numberOfElements == -1) {
        x: Int => isRunning
      } else {
        x: Int => isRunning && counter <= x
      }
    }

    while (counterCondition(numberOfElements)) {
      Thread.sleep(10) // send sensor data every 10 milliseconds

      val dataId = randomGen.nextInt(numberOfSensors) + 1
      val data = SensorData(dataId.toString,
        initialReading + Random.nextGaussian() / initialReading, sensorType, timestamp)
      ctx.collectWithTimestamp(data, timestamp) // time starts at 0 in millisecs
      timestamp = timestamp + 1

      // emit a watermark whenever the timestamp is a multiple of watermarkTag
      if (timestamp % watermarkTag == 0) {
        ctx.emitWatermark(new Watermark(timestamp)) // watermark in milliseconds
      }
      counter = counter + 1
    }
  }

  override def cancel(): Unit = {
    // No cleanup needed
    isRunning = false
  }
}

The Source


Analyzing Infinite Data Streams - Flink Example

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object SensorSimple {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // set default env parallelism for all operators
    env.setParallelism(2)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val numberOfSensors = 2
    val watermarkTag = 10
    val numberOfElements = 1000

    val sensorDataStream =
      env.addSource(new SensorDataSource(TemperatureSensor, numberOfSensors, watermarkTag, numberOfElements))

    sensorDataStream.writeAsText("inputData.txt")

    // key by sensor and group into event-time windows of 10 milliseconds
    val windowedKeyed = sensorDataStream
      .keyBy(data => data.sensorId)
      .timeWindow(Time.milliseconds(10))

    windowedKeyed.max("value")
      .writeAsText("outputMaxValue.txt")

    windowedKeyed.apply(new SensorAverage())
      .writeAsText("outputAverage.txt")

    env.execute("Sensor Data Simple Statistics")
  }
}

// Emits one record per key and window carrying the average value of the window.
class SensorAverage extends WindowFunction[SensorData, SensorData, String, TimeWindow] {
  def apply(key: String, window: TimeWindow, input: Iterable[SensorData], out: Collector[SensorData]): Unit = {
    if (input.nonEmpty) {
      val average = input.map(_.value).sum / input.size
      out.collect(input.head.copy(value = average))
    }
  }
}

The Job


Analyzing Infinite Data Streams - Flink Example

Operators run the operations defined by the graph of the streaming computation. Example operators: KeyBy, Map, FlatMap, etc.

Two instances of the same operator with parallelism 2 (previous example).

[Diagram labels: Operator 1, Operator 2; Watermark 1 (10) ... Watermark N (10*N); a time axis 0, 1, 2, ..., 22, ...; window 1, window 2; file1, file2.]


Streaming vs Batch Systems

Metric                                   | Batch                 | Streaming
Data size per job                        | TB to PB              | MB to TB (in flight)
Time between data arrival and processing | Many minutes to hours | Microseconds to minutes
Job execution times                      | Minutes to hours      | Microseconds to minutes


Event Log as the Core Abstraction


Logging is everywhere:

Write-ahead log (WAL) in databases for durability.

The distributed log can be seen as the data structure that models the problem of consensus: making multiple machines all do the same thing reduces to implementing a distributed, consistent log. The log feeds the processes their input (state-machine replication model).

Real-time streaming implementations use logs as a natural means of recording events as they are processed according to the computation graph. This assists in implementing consistency algorithms like ABS (Asynchronous Barrier Snapshotting) in Apache Flink, and other functionality in all major streaming engines such as Google DataFlow, Apache Storm and Apache Samza.
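The state-machine replication idea can be sketched in a few lines: if every replica applies the same log in the same order with a deterministic function, all replicas end up in the same state. `ReplicatedLog` and its command types are hypothetical names; agreeing on the log itself (the consensus part) is not shown.

```scala
// Hypothetical sketch of state-machine replication over a shared log.
object ReplicatedLog {
  sealed trait Command
  final case class Put(key: String, value: Int) extends Command
  final case class Delete(key: String) extends Command

  // Deterministic state machine: the state is a pure function of the log.
  def applyLog(log: Seq[Command]): Map[String, Int] =
    log.foldLeft(Map.empty[String, Int]) {
      case (state, Put(k, v)) => state.updated(k, v)
      case (state, Delete(k)) => state - k
    }
}
```

Because the state depends only on the log, a replica that crashed can catch up by replaying the entries it missed.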


Event Log as the Core Abstraction

The event log enables these architecture patterns:

Event Sourcing (ES)

Command Query Responsibility Segregation (CQRS)
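A minimal sketch of how the two patterns build on an event log, using a hypothetical bank account: the command side validates a request and appends an event, the query side derives its view by replaying events. All names are illustrative and not taken from any framework.

```scala
// Hypothetical sketch of event sourcing with separate command and query sides.
object EventSourcingSketch {
  sealed trait Event
  final case class Deposited(amount: Long) extends Event
  final case class Withdrawn(amount: Long) extends Event

  // Query side: the balance is never stored directly, it is derived from the events.
  def balance(events: Seq[Event]): Long = events.foldLeft(0L) {
    case (acc, Deposited(a)) => acc + a
    case (acc, Withdrawn(a)) => acc - a
  }

  // Command side: validate against the current state, then append an event or reject.
  def withdraw(events: Vector[Event], amount: Long): Either[String, Vector[Event]] =
    if (amount <= 0) Left("amount must be positive")
    else if (balance(events) < amount) Left("insufficient funds")
    else Right(events :+ Withdrawn(amount))
}
```

In a full CQRS setup the query side would typically maintain its own read-optimized store, updated asynchronously from the log, rather than folding over the events on every query.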


Message Queues as the Integration Tool


FIFO data structures, the natural way to process logs.

Organise user data in topics, each topic has its own queue.

Benefits

Decouples consumers from producers

Arbitrary number of producers and consumers are supported

Easy to use

Kafka is the most popular implementation for Big Data Systems.


Message Queues - Kafka


Kafka is a “...distributed, partitioned, replicated commit log service”

No deletion of messages on read, allows replay of data.

Production tested at LinkedIn at scale.
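The "no deletion on read" property can be illustrated with a toy in-memory log. This is not Kafka's API; `InMemoryTopic` is a hypothetical class that only shows why keeping messages and letting each consumer track its own offset makes replay possible.

```scala
// Hypothetical toy log: append-only, reads never remove messages.
final class InMemoryTopic[A] {
  private var log = Vector.empty[A]

  // Appends a message and returns its offset (position in the log).
  def append(message: A): Long = {
    log = log :+ message
    log.size - 1L
  }

  // Returns up to `max` messages starting at `fromOffset`; the log is left untouched,
  // so any consumer can re-read (replay) from an earlier offset.
  def read(fromOffset: Long, max: Int): Vector[A] =
    log.slice(fromOffset.toInt, fromOffset.toInt + max)
}
```

A consumer that has already read up to offset 2 can start again from offset 0 and receive exactly the same messages, which is what enables reprocessing after a bug fix or a failure.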


Delivery/Processing Semantics

In distributed systems failure is part of the game. What semantics can I achieve for message delivery?

at-most-once delivery: for each message sent, that message is delivered zero or one times.

at-least-once delivery: for each message sent potentially multiple attempts are made at delivering it, such that at least one succeeds; messages may be duplicated but not lost.

exactly-once delivery: for each message sent exactly one delivery is made to the recipient; the message can neither be lost nor duplicated.

In theory it is impossible to have exactly once delivery.

In practice we might care more about exactly-once state changes combined with at-least-once delivery. Example: keeping state at some operator of the streaming graph.
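One common way to get exactly-once state changes on top of at-least-once delivery is to make the state update idempotent, for instance by remembering which message ids were already applied. The sketch below assumes every message carries a unique id; `IdempotentCounter` is a hypothetical name, and this deduplication approach is one option among several rather than what any particular engine does.

```scala
// Hypothetical sketch: duplicates may be delivered, but each id changes the state once.
final class IdempotentCounter {
  private var seen = Set.empty[Long]
  private var total = 0L

  // Applies the message only if its id has not been seen before.
  def deliver(messageId: Long, amount: Long): Unit =
    if (!seen.contains(messageId)) {
      seen += messageId
      total += amount
    }

  def value: Long = total
}
```

The set of seen ids is itself state and would have to be persisted together with the total for the guarantee to survive a restart.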


The SMACK Stack

Technologies that, combined, deliver high-performing streaming systems:

Spark

Mesos

Akka

Cassandra

Kafka


A Fast Data Architecture using the SMACK stack

Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016

Example IoT Application


Adding ML support to the IoT Application

Anomaly detection

Voice interface

Image classification

Recommendations

Automatic tuning of the IoT environment


What about Lambda Architecture?

Eventually we want to replace it; it is more of a traditional model.

Problems

Hard to maintain

Duplication of code & systems

Special systems for unifying views

In certain cases we can replace it with streaming based architectures.


Streaming Implementations Status

Apache Spark: Structured Streaming in v2 starts the improvement of the streaming engine. It is still based on micro-batches, but event-time support was added.

Apache Flink: SQL API supported from v0.9 onwards. Important features are still on the roadmap: scaling streaming jobs, Mesos support, dynamic allocation.


Picking the Right Tool for Streaming

Criteria to choose:

Processing semantics (strong consistency is needed for correctness)

Latency guarantees

Deployment / Operation

Ecosystem built around it

Complex event processing (CEP)

Batch & Streaming API support

Community & Support


Picking the Right Tool for Streaming

Some tips:

Pick Flink if you need sub-second latency and Beam support.

Pick Spark Streaming for its integration with the Spark ML libraries; its micro-batch mode is ideal for training models, and it has mature deployment capabilities.

Pick Gearpump for materializing Akka Streams in a distributed fashion.

Pick Kafka Streams for low-level, simple transformations of Kafka messages (it is a distributed solution out of the box). Check Confluent Platform for many useful tools around Kafka.


Questions?

Thank you!


References

Watermarks: Time and progress in streaming dataflow and beyond: Big Data Conference - Strata + Hadoop World, May 31 - June 3, 2016, London, UK

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

MillWheel: Fault-Tolerant Stream Processing at Internet Scale

Executive Summary: Data Growth, Business Opportunities, and the IT Imperatives | The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things

Asynchronous Distributed Snapshots for Distributed Dataflows | the morning paper

State machine replication - Wikipedia, the free encyclopedia

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing | the morning paper

The Log: What every software engineer should know about real-time data's unifying abstraction | LinkedIn Engineering

Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016

How Apache Flink™ enables new streaming applications – data Artisans