Spark Summit 2016
Who am I?
Looking for a Machine Learning Summer Intern!
bit.ly/nzzml
Trend #1: Spark 2.0
Trend #2: RDDs, DFs, DSs
RDDs, DFs, DSs ... Why?
Prefer DFs & DSs over RDDs!
Demo ...
Trend #3: Streaming 2.0
“The simplest way to do streaming analytics, is when you don’t have to worry about streaming.”
Streaming 2.0
Demo ...
// Written against the Spark 2.0 preview API (stream / startStream)
val StructuredStream = sqlContext.read.format("json").stream(src_path)
StructuredStream
  .select($"constant_Value")
  .groupBy($"constant_Value")
  .count
  .write
  .format("parquet")
  .startStream("/tmp/out/value.parquet")
Trend #4: GraphFrames
http://graphframes.github.io/
Demo ...
Trend #5: SparkR is catching up
Trend #6: Deep Learning
DNNs are coming: Watch it closely!
Insight #1: Big Players ...
… big community
Insight #2: Same issues everywhere ...
The user mailing list is your best friend!
Insight #3: Stream, Compute, Dump
Use Spark (Streaming) for what it’s meant for: real-time computation, not serving!
Insight #4: 4 Best practices
GroupByKey! GroupByKey?
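The slide does not spell out the point, so this is an assumption: it most likely refers to the common advice to prefer `reduceByKey` over `groupByKey` for aggregations, because `reduceByKey` combines values inside each partition before the shuffle. The sketch below is plain Scala, not Spark; it only counts how many records would cross the shuffle under each strategy. `ShuffleSketch` and all of its members are hypothetical names.

```scala
object ShuffleSketch {
  // A partition is modelled as a plain in-memory sequence (assumption, no Spark involved).
  type Partition = Seq[(String, Int)]

  // Sum the values per key inside a single partition (the "map-side combine").
  def combineLocally(p: Partition): Map[String, Int] =
    p.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // groupByKey-style: every record is sent across the shuffle unchanged.
  def recordsShuffledByGroup(partitions: Seq[Partition]): Int =
    partitions.map(_.size).sum

  // reduceByKey-style: at most one record per key and partition is sent.
  def recordsShuffledByReduce(partitions: Seq[Partition]): Int =
    partitions.map(p => combineLocally(p).size).sum

  // Final per-key totals; both strategies must agree on this result.
  def totals(partitions: Seq[Partition]): Map[String, Int] =
    partitions
      .flatMap(p => combineLocally(p).toSeq)
      .groupBy(_._1)
      .map { case (k, vs) => (k, vs.map(_._2).sum) }
}
```

The gap between the two counts grows with the number of repeated keys per partition, which is where the pre-combining step pays off.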
Circumvent Skew by “Salting”
Key: Foo
Salted Key: Foo + random(1,saltDim)
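A minimal sketch of the salting idea, in plain Scala collections rather than Spark: a hot key such as `Foo` is spread over `saltDim` salted keys, aggregated per salted key, and then aggregated again per original key. The object `SaltingSketch`, its helpers, the `_` separator, and the fixed seed are all assumptions made for illustration.

```scala
import scala.util.Random

object SaltingSketch {
  // Append a random salt in [1, saltDim] to the key, e.g. "Foo" -> "Foo_3".
  def saltKey(key: String, saltDim: Int, rng: Random): String =
    s"${key}_${rng.nextInt(saltDim) + 1}"

  // Drop the salt again to recover the original key.
  def unsaltKey(salted: String): String =
    salted.substring(0, salted.lastIndexOf('_'))

  // Two-stage sum: first per salted key, then per original key.
  def saltedSum(records: Seq[(String, Int)], saltDim: Int, seed: Long): Map[String, Int] = {
    val rng = new Random(seed)

    // Stage 1: the hot key is split into up to saltDim smaller groups.
    val partial: Map[String, Int] = records
      .map { case (k, v) => (saltKey(k, saltDim, rng), v) }
      .groupBy(_._1)
      .map { case (saltedKey, vs) => (saltedKey, vs.map(_._2).sum) }

    // Stage 2: only the few partial sums per original key are merged.
    partial.toSeq
      .map { case (saltedKey, v) => (unsaltKey(saltedKey), v) }
      .groupBy(_._1)
      .map { case (k, vs) => (k, vs.map(_._2).sum) }
  }
}
```

In a real job the first stage would be the expensive, skewed aggregation; salting lets several tasks share the work for one key, at the cost of a second, much smaller aggregation.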
Think about resource allocation!
--num-executors
--executor-cores
--executor-memory
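One way to reason about these three flags is to derive them from the cluster size. The sketch below is a rule-of-thumb calculation and not something the talk prescribes: the reserved core and gigabyte per node, the executor reserved for the driver, and the overhead fraction are all assumptions, and `ExecutorSizingSketch` is a hypothetical helper.

```scala
object ExecutorSizingSketch {
  final case class Sizing(numExecutors: Int, executorCores: Int, executorMemoryGb: Int)

  def size(nodes: Int, coresPerNode: Int, memPerNodeGb: Int,
           coresPerExecutor: Int, overheadFraction: Double): Sizing = {
    // Assumption: leave 1 core and 1 GB per node to the OS and daemons.
    val usableCores = coresPerNode - 1
    val usableMemGb = memPerNodeGb - 1

    val executorsPerNode = usableCores / coresPerExecutor
    require(executorsPerNode > 0, "not enough cores per node for one executor")

    // Assumption: a fraction of each executor's share goes to off-heap overhead.
    val shareGb = usableMemGb.toDouble / executorsPerNode
    val heapGb = math.floor(shareGb * (1.0 - overheadFraction)).toInt

    // Assumption: one executor slot is given up for the driver.
    Sizing(nodes * executorsPerNode - 1, coresPerExecutor, heapGb)
  }

  // Render the result as spark-submit flags.
  def toFlags(s: Sizing): String =
    s"--num-executors ${s.numExecutors} --executor-cores ${s.executorCores} " +
      s"--executor-memory ${s.executorMemoryGb}g"
}
```

For a hypothetical cluster of 6 nodes with 16 cores and 64 GB each, `size(6, 16, 64, 5, 0.10)` yields 17 executors with 5 cores and 18 GB of heap each. Treat the numbers as a starting point to tune, not as a recommendation.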
You know, window functions ...
first value, last value, rank, ...
Check out TechTuesday!
meetup.com/Tech-Tuesday-Zurich