Trivento summercamp fast data 9/9/2016

What is Fast Data & how the SMACK stack plays a major role in implementing a Fast Data strategy

Stavros Kontopoulos

Senior Software Engineer @ Lightbend, M.Sc.

De oude Prodentfabriek

Trivento Summercamp 2016 Amersfoort

Introduction


Introduction: Who Am I?

Agenda

A bit of history: Big Data Processing

What is Fast Data?

Batch Systems vs Streaming Systems & Streaming Evolution

Event Log, Message Queues

A Fast Data Architecture

Example Application


Last warning...


Data Processing

Batch processing: processing done on a bounded dataset.

Stream Processing (Streaming): processing done on an unbounded dataset. Data items are pushed or pulled.

Two categories of systems: batch vs streaming systems.
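
As a minimal sketch of the difference (plain Scala collections, not a real engine; all names here are illustrative assumptions): a batch computation sees the whole bounded dataset and returns one result, while a streaming computation keeps running state and emits an updated result per element.

```scala
object BatchVsStreaming {
  // Batch: the dataset is bounded, so the result is computed once over all of it.
  def batchAverage(readings: Seq[Double]): Double =
    readings.sum / readings.size

  // Streaming: the dataset may be unbounded, so we keep running state and
  // emit an updated average for every element that arrives.
  def runningAverages(readings: Iterator[Double]): Iterator[Double] = {
    var sum = 0.0
    var count = 0L
    readings.map { r =>
      sum += r
      count += 1
      sum / count
    }
  }
}
```

The streaming version also works on an endless source such as Iterator.continually(...), because it never needs to see the end of the data.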


Big Data - The story

Internet scale apps moved data size from Gigabytes to Petabytes.

Once upon a time there were traditional RDBMS like Oracle and Data Warehouses but volume, velocity and variety changed the game.



MapReduce was a major breakthrough (Google published the seminal paper in 2004).

The Nutch project already had an implementation in 2005.

In 2006 it becomes a subproject of Lucene with the name Hadoop.

In 2008 Yahoo brings Hadoop to production with a 10K-core cluster. The same year it becomes a top-level Apache project.

Hadoop is good for batch processing.


Word Count example - Inverted Index.


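[Figure: MapReduce word count data flow. Input splits (Split 1 ... Split N, holding doc1, doc2 ... doc300, doc100) feed the MAP step, which emits pairs such as (w1,1)…(w20,1) and (w41,1)…(w1,1). The Shuffle step groups them by word into (w1, (1,1,1…))... and (w41, (1,1,…))... The REDUCE step outputs counts such as (w1, 13)... and (w1, 3)...]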
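
The map, shuffle and reduce steps of word count can be sketched with plain Scala collections. This is only an in-memory illustration, assuming each input split is a string of whitespace-separated words; it does not use Hadoop, and the object and function names are made up for the example.

```scala
object WordCountSketch {
  // Map phase: each document emits a (word, 1) pair per word occurrence.
  def mapPhase(doc: String): Seq[(String, Int)] =
    doc.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle phase: group all emitted pairs by key (the word).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2)) }

  // Reduce phase: sum the grouped ones to get the count per word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (w, ones) => (w, ones.sum) }

  def wordCount(splits: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(splits.flatMap(mapPhase)))
}
```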


Giuseppe DeCandia et al., “Dynamo: Amazon's Highly Available Key-value Store”, changed the database world in 2007.

NoSQL databases, along with general-purpose systems like Hadoop, solve problems that cannot be solved with traditional RDBMSs.

Technology facts: cheap memory, SSDs, HDDs are the new tape, more CPUs over more powerful CPUs.



There is a major shift in the industry as batch processing is not enough anymore.

Batch jobs usually take hours if not days to complete; in many applications that is not acceptable.


The trend now is near-real-time computation, which implies streaming algorithms and needs new semantics. Fast Data (data in motion) & Big Data (data at rest) at the same time.

The enterprise needs to get smarter: all major players across industries use ML on top of massive datasets to make better decisions.

Images:

https://www.tesla.com/sites/default/files/pictures/thumbs/model_s/red_models.jpg?201501121530

https://i.ytimg.com/vi/cj83dL72cvg/maxresdefault.jpg


OpsClarity report:

● 92% plan to increase their investment in stream processing applications in the next year

● 79% plan to reduce or eliminate investment in batch processing

● 32% use real time analysis to power core customer-facing applications

● 44% agreed that it is tedious to correlate issues across the pipeline

● 68% identified lack of experience and underlying complexity of new data frameworks as their barrier to adoption

http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html



Image: http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html



In the OpsClarity report:

● Apache Kafka is the most popular broker technology (ingestion queue).

● HDFS is the most used data sink.

● Apache Spark is the most popular data processing tool.

Big Data Landscape

Image: http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png

Big Data System

A Big Data System must have at least the following components at its core:

DFS - a Distributed File System (like S3 or HDFS) or a distributed database system (DDS).

A distributed data processing tool like Spark, MapReduce, etc.

Tools and services to manage the previous systems.


Batch Systems - The Hadoop Ecosystem

YARN (Yet Another Resource Negotiator) was deployed in production at Yahoo in March 2013.

The same year Cloudera, the dominant Hadoop vendor, embraced Spark as the next-generation replacement for MapReduce.

Image: Lightbend Inc.


Hadoop clusters, the gold standard for big data from ~2008 to the present (started back in 2005).

Strengths:

Lowest CapEx system for Big Data.

Excellent for ingesting and integrating diverse datasets.

Flexible: from classic analytics (aggregations and data warehousing) to machine learning.



Weaknesses:

Complex administration.

YARN can’t manage all distributed services.

MapReduce has poor performance, a difficult programming model, and doesn’t support stream processing.

Analyzing Infinite Data Streams

What does it mean to run a SQL query on an unbounded data set?

How should I deal with late data?

What kind of time measurement should I use? Event-time, Processing time or Ingestion time?

Accuracy of computations on bounded datasets vs on unbounded datasets

Algorithms for streaming computations?



Two cases for processing:

Single event processing: event transformation, trigger an alarm on an error event

Event aggregations: summary statistics, group-by, join and similar queries. For example compute the average temperature for the last 5 minutes from a sensor data stream.
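
The two cases can be sketched as follows. This is a plain Scala illustration, not engine code: the Reading type, the threshold and the function names are assumptions made for the example.

```scala
object ProcessingCases {
  final case class Reading(sensorId: String, value: Double, timestampMs: Long)

  // Single event processing: each event is handled on its own, no state needed.
  def alarm(r: Reading, threshold: Double): Option[String] =
    if (r.value > threshold) Some(s"ALARM sensor=${r.sensorId} value=${r.value}")
    else None

  // Event aggregation: average of the readings that fall in the last
  // `windowMs` milliseconds before `nowMs` (None when there are none).
  def averageOverLast(readings: Seq[Reading], nowMs: Long, windowMs: Long): Option[Double] = {
    val inWindow = readings.filter { r =>
      r.timestampMs > nowMs - windowMs && r.timestampMs <= nowMs
    }
    if (inWindow.isEmpty) None
    else Some(inWindow.map(_.value).sum / inWindow.size)
  }
}
```

For the 5-minute example in the text, windowMs would be 5 * 60 * 1000.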


Event aggregation introduces the concept of windowing with respect to the notion of time selected:

Event time (the time that events happen): Important for most use cases where context and correctness matter at the same time. Example: billing applications, anomaly detection.

Processing time (the time they are observed during processing): Use cases where I only care about what I process in a window. Example: accumulated clicks on a page per second.

System Arrival or Ingestion time (the time that events arrived at the streaming system).

Ideally event time = processing time. Reality is: there is skew.
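
A small sketch of why the choice matters, assuming a hypothetical Event type that carries both timestamps: the same events land in different one-second buckets depending on which notion of time is used for grouping.

```scala
object TimeNotions {
  // eventTimeMs: when the event happened; processingTimeMs: when it was observed.
  final case class Event(id: String, eventTimeMs: Long, processingTimeMs: Long)

  // Skew: how far processing lags behind event time for one event.
  def skewMs(e: Event): Long = e.processingTimeMs - e.eventTimeMs

  // Count events per 1-second bucket, using the chosen notion of time.
  def countsPerSecond(events: Seq[Event], time: Event => Long): Map[Long, Int] =
    events.groupBy(e => time(e) / 1000).map { case (sec, es) => (sec, es.size) }
}
```

Calling countsPerSecond(events, _.eventTimeMs) and countsPerSecond(events, _.processingTimeMs) on the same input can give different counts per bucket whenever there is skew.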



Windows come in different flavors:

Tumbling windows discretize a stream into non-overlapping windows.

Sliding windows: slide over the stream of data, so consecutive windows can overlap.
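
Window assignment for both flavors can be sketched like this, assuming non-negative millisecond timestamps and windows of the form [start, end). The names are illustrative, not an engine API.

```scala
object WindowAssigners {
  // A window covers [start, end) in milliseconds.
  final case class Window(start: Long, end: Long)

  // Tumbling: every timestamp belongs to exactly one window of length `size`.
  def tumbling(ts: Long, size: Long): Window = {
    val start = ts - (ts % size)
    Window(start, start + size)
  }

  // Sliding: a window of length `size` starts every `slide` ms, so a timestamp
  // belongs to every window whose start lies in (ts - size, ts].
  def sliding(ts: Long, size: Long, slide: Long): Seq[Window] = {
    val lastStart = ts - (ts % slide)
    (lastStart until (ts - size) by -slide).map(s => Window(s, s + size))
  }
}
```

With size 10 and slide 5, timestamp 12 falls into one tumbling window, [10, 20), but into two sliding windows, [10, 20) and [5, 15).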


Watermarks: indicate that no more elements with a timestamp older than or equal to the watermark timestamp should arrive for the specific window of data.

Triggers: decide when the window is evaluated or purged.
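
How a watermark can drive a trigger is sketched below with a hypothetical in-memory operator that sums values per tumbling event-time window. It fires and purges a window once the watermark covers the window's last timestamp, and rejects elements that arrive after that. This is an assumption-laden illustration, not how any specific engine implements it.

```scala
import scala.collection.mutable

object WatermarkSketch {
  // Sums values per tumbling event-time window of `size` ms.
  final class TumblingSum(size: Long) {
    private val open = mutable.Map.empty[Long, Double] // window start -> running sum
    private var watermark = Long.MinValue

    // Returns false when the element is late, i.e. its window was already fired.
    def onElement(ts: Long, value: Double): Boolean = {
      val start = ts - (ts % size)
      if (start + size - 1 <= watermark) false
      else {
        open(start) = open.getOrElse(start, 0.0) + value
        true
      }
    }

    // Trigger: fire and purge every window whose last timestamp is covered
    // by the watermark. Returns (window start, sum) pairs ordered by start.
    def onWatermark(wm: Long): List[(Long, Double)] = {
      watermark = math.max(watermark, wm)
      val ready = open.keys.filter(start => start + size - 1 <= watermark).toList.sorted
      ready.map(start => (start, open.remove(start).get))
    }
  }
}
```

Dropping late elements is only one possible policy; an engine could also keep the window around for some allowed lateness and re-fire it.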



Given the advances in streaming we can:

Trade off latency against cost and accuracy

In certain use-cases replace batch processing with streaming


Recent advances in streaming are a result of pioneering work:

MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp. 1792-1803


Apache Beam is the open source successor of Google’s Dataflow.

It is becoming the standard API for streaming. It provides the advanced semantics that current streaming applications need.

Streaming Systems Architecture

The user provides a graph of computations through a high-level API, where data flows on the edges of this graph. Each vertex is an operator which executes a user-defined operation (computation). For example: stream.map().keyBy()...

Operators can run in multiple instances and preserve state (unlike batch processing where we have immutable datasets).

State can be persisted and restored in the presence of failures.
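
A toy version of such a graph, with one stateless and one stateful operator, might look like the sketch below. It runs in a single process on an Iterator; the CountPerKey class and its snapshot/restore methods are hypothetical names used only to show the idea of operator state that survives a failure.

```scala
object OperatorSketch {
  // A stateful operator: keeps a running count per key.
  final class CountPerKey {
    private var counts = Map.empty[String, Long]

    def process(key: String): (String, Long) = {
      val n = counts.getOrElse(key, 0L) + 1
      counts = counts.updated(key, n)
      (key, n)
    }

    // State can be persisted (snapshot) and restored after a failure.
    def snapshot(): Map[String, Long] = counts
    def restore(state: Map[String, Long]): Unit = { counts = state }
  }

  // The "graph": source -> map (stateless) -> keyed count (stateful).
  def run(source: Iterator[String], counter: CountPerKey): List[(String, Long)] =
    source.map(_.trim.toLowerCase).map(counter.process).toList
}
```

Restoring a fresh CountPerKey from a snapshot lets processing continue with the counts it had before the failure.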

Analyzing Infinite Data Streams - Flink Example

sealed trait SensorType { def stype: String }
case object TemperatureSensor extends SensorType { val stype = "TEMP" }
case object HumiditySensor extends SensorType { val stype = "HUM" }

case class SensorData(var sensorId: String, var value: Double, var sensorType: SensorType, timestamp: Long)

https://github.com/skonto/trivento-summercamp-2016


The Source

import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
import org.apache.flink.streaming.api.watermark.Watermark
import scala.util.Random

class SensorDataSource(val sensorType: SensorType, val numberOfSensors: Int,
                       val watermarkTag: Int, val numberOfElements: Int = -1)
    extends SourceFunction[SensorData] {
  final val serialVersionUID = 1L
  @volatile var isRunning = true
  var counter = 1
  var timestamp = 0
  val randomGen = Random

  require(numberOfSensors > 0)
  require(numberOfElements >= -1)

  lazy val initialReading: Double = {
    sensorType match {
      case TemperatureSensor => 27.0
      case HumiditySensor => 0.75
    }
  }

  override def run(ctx: SourceContext[SensorData]): Unit = {

    // numberOfElements == -1 means: run until cancelled
    val counterCondition = {
      if (numberOfElements == -1) {
        x: Int => isRunning
      } else {
        x: Int => isRunning && counter <= x
      }
    }

    while (counterCondition(numberOfElements)) {
      Thread.sleep(10) // send sensor data every 10 milliseconds

      val dataId = randomGen.nextInt(numberOfSensors) + 1
      val data = SensorData(dataId.toString, initialReading + Random.nextGaussian()/initialReading, sensorType, timestamp)
      ctx.collectWithTimestamp(data, timestamp) // time starts at 0 in millisecs
      timestamp = timestamp + 1

      if (timestamp % watermarkTag == 0) { // watermark should be mod 0
        ctx.emitWatermark(new Watermark(timestamp)) // watermark in milliseconds
      }
      counter = counter + 1
    }
  }

  override def cancel(): Unit = {
    // No cleanup needed
    isRunning = false
  }
}



The Job

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object SensorSimple {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // set default env parallelism for all operators
    env.setParallelism(2)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val numberOfSensors = 2
    val watermarkTag = 10
    val numberOfElements = 1000

    val sensorDataStream =
      env.addSource(new SensorDataSource(TemperatureSensor, numberOfSensors, watermarkTag, numberOfElements))

    sensorDataStream.writeAsText("inputData.txt")

    // key by sensor and group into 10 ms event-time windows
    val windowedKeyed = sensorDataStream
      .keyBy(data => data.sensorId)
      .timeWindow(Time.milliseconds(10))

    windowedKeyed.max("value")
      .writeAsText("outputMaxValue.txt")

    windowedKeyed.apply(new SensorAverage())
      .writeAsText("outputAverage.txt")
    env.execute("Sensor Data Simple Statistics")
  }
}

// Emits one record per key and window, carrying the average value
class SensorAverage extends WindowFunction[SensorData, SensorData, String, TimeWindow] {
  def apply(key: String, window: TimeWindow, input: Iterable[SensorData], out: Collector[SensorData]): Unit = {
    if (input.nonEmpty) {
      val average = input.map(_.value).sum / input.size
      out.collect(input.head.copy(value = average))
    }
  }
}



Operators run the operations defined by the graph of the streaming computation. Example operators: KeyBy, Map, FlatMap, etc.

Two instances of the same operator with parallelism 2 (previous example).

[Figure: Operator 1 and Operator 2 receive out-of-order events (0 3 6 2 7 5 8 4 9 ...) along a time axis running 0 to 22... Watermark 1 is at time 10 and Watermark N at time 10*N, delimiting window 1 and window 2. Outputs are labeled file1 and file2.]

Streaming vs Batch Systems

Metric | Batch | Streaming

Data size per job | TB to PB | MB to TB (in flight)

Time between data arrival and processing | Many minutes to hours | Microseconds to minutes

Job execution times | Minutes to hours | Microseconds to minutes

Event Log as the Core Abstraction


Logging is everywhere:

Write-ahead log (WAL) in databases for durability.

The distributed log can be seen as the data structure which models the problem of consensus. The problem of making multiple machines all do the same thing reduces to the problem of implementing a distributed consistent log. The log feeds the processes their input (state-machine replication model).

Real-time streaming implementations use logs as a natural means of recording events as they are processed according to the computation graph. This assists in implementing consistency algorithms like ABS (Asynchronous Barrier Snapshotting) in Apache Flink, and other functionality in all major streaming engines like Google Dataflow, Apache Storm and Apache Samza.
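
The state-machine replication idea can be sketched in a few lines, assuming a hypothetical bank-balance state machine: if the transition function is deterministic, every replica that applies the same log in the same order ends in the same state, and a replica that fell behind can catch up by replaying the rest of the log.

```scala
object ReplicatedLogSketch {
  sealed trait Command
  final case class Deposit(amount: Long) extends Command
  final case class Withdraw(amount: Long) extends Command

  // Deterministic state transition: same state + same command => same result.
  def applyCommand(balance: Long, cmd: Command): Long = cmd match {
    case Deposit(a)  => balance + a
    case Withdraw(a) => if (a <= balance) balance - a else balance // rejected
  }

  // A replica is a fold over the log, starting from any previously reached state.
  def replay(log: Seq[Command], initial: Long = 0L): Long =
    log.foldLeft(initial)(applyCommand)
}
```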

Event Log as the Core Abstraction

Enabler of architecture patterns:

Event Sourcing (ES)

Command Query Responsibility Segregation (CQRS)

Message Queues as the Integration Tool


FIFO data structures, the natural way to process logs.

Organise user data in topics; each topic has its own queue.

Benefits

Decouples consumers from producers

Arbitrary number of producers and consumers are supported

Easy to use

Kafka is the most popular implementation for Big Data Systems.
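
The topic-per-queue idea and the decoupling of producers from consumers can be sketched with an in-memory broker. This is a single-process toy with made-up names, not the Kafka API.

```scala
import scala.collection.mutable

object TopicQueuesSketch {
  final class Broker {
    private val topics = mutable.Map.empty[String, mutable.Queue[String]]

    // Producers only know the topic name, not who will consume.
    def publish(topic: String, msg: String): Unit = {
      topics.getOrElseUpdate(topic, mutable.Queue.empty[String]).enqueue(msg)
    }

    // Consumers pull the oldest message of a topic (FIFO), if any.
    def poll(topic: String): Option[String] =
      topics.get(topic).filter(_.nonEmpty).map(_.dequeue())
  }
}
```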

Message Queues - Kafka


Kafka is a “...distributed, partitioned, replicated commit log service”

No deletion of messages on read, allows replay of data.

Production tested at LinkedIn at scale.
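
The "no deletion on read" property can be illustrated with an append-only log where each consumer supplies its own offset. This is a conceptual sketch with assumed names, not Kafka's client API: reading never removes anything, so rewinding the offset replays the data.

```scala
object CommitLogSketch {
  // Append-only log for one partition; reading never removes messages.
  final class PartitionLog {
    private var messages = Vector.empty[String]

    // Returns the offset assigned to the appended message.
    def append(msg: String): Long = {
      messages = messages :+ msg
      messages.size - 1L
    }

    // Each consumer tracks its own offset and may rewind it to replay.
    def read(fromOffset: Long, max: Int): Vector[String] =
      messages.slice(fromOffset.toInt, fromOffset.toInt + max)
  }
}
```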

Delivery/Processing Semantics

In distributed systems failure is part of the game. What semantics can I achieve for message delivery?

at-most-once delivery: for each message sent, that message is delivered zero or one times.

at-least-once delivery: for each message sent potentially multiple attempts are made at delivering it, such that at least one succeeds; messages may be duplicated but not lost.

exactly-once delivery: for each message sent exactly one delivery is made to the recipient; the message can neither be lost nor duplicated.

In theory it is impossible to have exactly-once delivery.

In practice we might care more for exactly-once state changes and at-least-once delivery. Example: Keeping state at some operator of the streaming graph.
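
One common way to get exactly-once state changes on top of at-least-once delivery is to make the state update idempotent. The sketch below assumes every message carries a unique id (an assumption, as are all names here) and remembers which ids were already applied, so a redelivered message changes nothing.

```scala
object ExactlyOnceStateSketch {
  final case class Message(id: Long, amount: Long)

  // Operator state: the running total plus the ids already applied.
  final case class State(total: Long, seen: Set[Long])

  // Idempotent update: a redelivered message (same id) leaves the state unchanged.
  def applyOnce(state: State, m: Message): State =
    if (state.seen.contains(m.id)) state
    else State(state.total + m.amount, state.seen + m.id)
}
```

In a real system the set of seen ids has to be bounded and stored together with the state it protects; the sketch ignores both concerns.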

The SMACK Stack

Technologies which, combined, deliver high-performing streaming systems:

Spark

Mesos

Akka

Cassandra

Kafka

A Fast Data Architecture using the SMACK stack

Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016

Example IoT Application


Adding ML support to the IoT Application

Anomaly detection

Voice interface

Image classification

Recommendations

Automatic tuning of the IoT environment

What about Lambda Architecture?

Eventually we want to replace it; it is more of a traditional model.

Problems

Hard to maintain

Duplication of code & systems

Special systems for unifying views

In certain cases we can replace it with streaming based architectures.

Streaming Implementations Status

Apache Spark: Structured Streaming in v2 starts the improvement of the streaming engine. It is still based on micro-batches, but event-time support was added.

Apache Flink: SQL API supported from v0.9 onwards. Important features are still on the roadmap: scaling streaming jobs, Mesos support, dynamic allocation.

Picking the Right Tool for Streaming

Criteria to choose:

Processing semantics (strong consistency is needed for correctness)

Latency guarantees

Deployment / Operation

Ecosystem built around it

Complex event processing (CEP)

Batch & Streaming API support

Community & Support

Picking the Right Tool for Streaming

Some tips

Pick Flink if you need sub-second latency and Beam support.

Pick Spark Streaming for its integration with Spark ML libraries; its micro-batch mode is ideal for training models, and it has mature deployment capabilities.

Pick Gearpump for materializing Akka Streams in a distributed fashion.

Pick Kafka Streams for low-level simple transformations of Kafka messages (it is a distributed solution out of the box). Check the Confluent Platform for many useful tools around Kafka.

Questions?

Thank you!


References

Watermarks: Time and progress in streaming dataflow and beyond: Big Data Conference - Strata + Hadoop World, May 31 - June 3, 2016, London, UK. http://conferences.oreilly.com/strata/hadoop-big-data-eu/public/schedule/detail/49605

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. http://research.google.com/pubs/pub43864.html

MillWheel: Fault-Tolerant Stream Processing at Internet Scale.

Executive Summary: Data Growth, Business Opportunities, and the IT Imperatives | The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things. http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm

Asynchronous Distributed Snapshots for Distributed Dataflows | the morning paper. https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/

State machine replication - Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/State_machine_replication

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing | the morning paper. https://blog.acolyer.org/2015/08/18/the-dataflow-model-a-practical-approach-to-balancing-correctness-latency-and-cost-in-massive-scale-unbounded-out-of-order-data-processing/

The Log: What every software engineer should know about real-time data's unifying abstraction | LinkedIn Engineering. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016.

How Apache Flink™ enables new streaming applications – data Artisans. http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
