STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame •...

34
WHAT’S NEW IN SPARK 2.0: STRUCTURED STREAMING AND DATASETS Andrew Ray StampedeCon 2016

Transcript of STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame •...

Page 1: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

WHAT’S NEW IN SPARK 2.0:STRUCTURED STREAMING AND DATASETS

Andrew Ray

StampedeCon 2016

Page 2: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

2 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

Silicon Valley Data Science is a boutique consulting firm focused on transforming your business through data science and engineering.

Page 3: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

3 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

ANDREW RAY

• Contributor to Apache Spark• Hadoop ecosystem guru• Senior Data Engineer @ SVDS• Prev. Data Sci. @ Walmart• PhD Mathematics @ UNL• From St. Louis

@collegeisfun

Page 4: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

44 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.

• Spark Refresher• RDD• DataFrame• Streaming

• What’s New in 2.0• Datasets• Structured Streaming

AGENDA

Page 5: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

@SVDataScience

REFRESHERApache Spark

Page 6: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

6 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

RDD

• Redundant Distibuted Dataset• Collection of typed objects• Low level API

• map• flatMap• reduceByKey• Etc …

• Lazy evaluation

Page 7: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

7 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

val lines = sc.textFile("/a/dir/of/files/")

val counts = lines.flatMap(_.split(" "))

.map(x => (x, 1))

.reduceByKey(_ + _)

counts.take(10)

WORD COUNT: RDD

Page 8: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

8 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

DATAFRAME

• Collection of tabular data• Named columns with specified data types• Higher level API• Mix with SQL• Operations not checked until analysis.

Page 9: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

9 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

val lines = sqlCtx.read.text("/a/dir/of/files/”)

val counts = lines.select(

explode(split($"value"," ")).as("word")

)

.groupBy("word")

.count()

counts.show()

WORD COUNT: DATAFRAME

Page 10: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

10 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

SPARK STREAMING

• Micro batch• RDD for values in each iteration

Page 11: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

11 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

val ssc = new StreamingContext(sc, Seconds(5))val lines = ssc.textFileStream("/a/dir/of/files/")

val counts = lines.flatMap(_.split(" "))

.map(x => (x, 1))

.reduceByKey(_ + _)

counts.print()ssc.start()

WORD COUNT: STREAMING

Page 12: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

12 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

val ssc = new StreamingContext(sc, Seconds(5))ssc.checkpoint("/somewhere/durable/")val lines = ssc.textFileStream("/a/dir/of/files/")val counts = lines.flatMap(_.split(" "))

.map(x => (x, 1))

.updateStateByKey {

(values: Seq[Int], state: Option[Int]) =>Some(values.sum + state.getOrElse(0))

}counts.print()ssc.start()

WORD COUNT: STREAMING (FIXED)

Page 13: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

@SVDataScience

WHAT’S NEWSpark 2.0

Page 14: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

14 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

NEW IN 2.0

• Stable Dataset API• Alpha of Structured Streaming• Some other stuff• 1000’s of commits

Page 15: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

15 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

SPARK 2.0 MIGRATION

• New entry point – SparkSession• Replaces SparkContext and SQLContext• In shell: spark

• type DataFrame = Dataset[Row]

• Java code change: DataFrame => Dataset<Row>• Default build is with Scala 2.11

Page 16: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

16 @SVDataScience

PART I

Dataset

Page 17: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

17 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

DATASET

• Collection of typed objects• Primitives• Scala case class• Java bean class

• Use Spark Encoders• Operate on without deserializing to objects

• Compile time correctness checks• Optimized by Catalyst

Page 18: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

18 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

val lines = spark.read.textFile("/a/dir/of/files/")

val counts = lines.flatMap(_.split(" "))

.groupByKey(identity)

.count()

counts.show()

WORD COUNT: DATASET

Page 19: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

19 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

DATASET NOTES

• Two sets of methods• Typed (return Dataset)

• groupByKey• Untyped (return DataFrame)

• groupBy• Easy to convert DataFrame to Dataset of your object

• df.as[Person]• df.as(Encoders.bean(Person.class))

• Python and R only have DataFrame

Page 20: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

20 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

case class Person(name: String, age: Option[Long])

val path = "examples/src/main/resources/people.json"val people = spark.read.json(path).as[Person]

def toId(p: Person): String = p.name + p.age.getOrElse(99)

val ids = people.map(toId)

ids.show()

DATASET SECOND EXAMPLE

Page 21: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

21 @SVDataScience

PART II

Structured Streaming

Page 22: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

22 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

STRUCTURED STREAMING

• Extension of DataFrame/Dataset to streaming• Input is unbounded append only table

T=3

T=2T=1

Page 23: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

23 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

val df = spark.readStream.text("/a/dir/of/files/")

val counts = df.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream.outputMode("complete").format("console").start()

WORD COUNT: STRUCTURED STREAMING

Page 24: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

24 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

val df = spark.readStream.text("/a/dir/of/files/")

val counts = df.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream.outputMode("complete").format("console").start()

counts.show()

WORD COUNT: BATCH

Page 25: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

25 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

NOTES

• Streams are DataFrames/Datasets• Enables code reuse between batch and streaming• No schema inferance• Limitations enforced at Analysis

• Aggregation chains• Distinct operations• Some outer joins to static datasets• Joins of streams

• Batch duration optional

Page 26: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

26 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

HOW IT WORKS

T=1

AggregationBuffers Result

T=2

T=3

Page 27: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

27 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

WINDOWS

• Don’t need to be a multiple of the batch duration• Not just processing time• Possible to do event time windows• Just another column

Page 28: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

28 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

import org.apache.spark.sql.types.StructTypeval schema = new StructType().add("url", "string")

.add("event_time", "timestamp")val events = spark.readStream.schema(schema).json("events/")

val counts = events.groupBy($"url", window($"event_time", "1 hour").as("w")).count().orderBy($"w", $"count".desc)

val query = counts.writeStream.outputMode("complete").format("console").start()

EVENT TIME WINDOW

Page 29: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

29 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

{"url":"google.com", "event_time":"2016-07-21 23:38:04"}{"url":"google.com", "event_time":"2016-07-21 23:44:04"}{"url":"google.com", "event_time":"2016-07-21 22:27:04"}{"url":"yahoo.com", "event_time":"2016-07-21 23:10:04"}...

INPUT

Page 30: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

30 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

DISCARD DELAY

(not implemented yet)• Discard highly delayed events• Will help limit active state to bounded size

Event time windows:

Processing time:

Time

Page 31: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

31 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

SOURCE & SINK OPTIONS

Currently very limited• Source

• Files• Socket

• Sink• Parquet – append output mode only• For each – custom code• Console• Memory

Page 32: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

32 @SVDataScience

Don’t use structured streaming in production

Page 33: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

33 © 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience

RESOURCES

Spark Docs• spark.apache.org/docs/latest/Spark Examples• github.com/apache/spark/tree/master/examplesStructured Streaming Umbrella JIRA• issues.apache.org/jira/browse/SPARK-8360

Page 34: STRUCTURED STREAMING AND DATASETS - Home ... Spark Refresher • RDD • DataFrame • Streaming • What’s New in 2.0 • Datasets • Structured Streaming AGENDA @SVDataScience

THANK YOU

Yes, we’re [email protected]

Andrew Ray@collegeisfun

[email protected]