The State of Spark
![Page 1: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/1.jpg)
The State of Spark
And Where We’re Going Next
Matei Zaharia
![Page 2: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/2.jpg)
Community Growth
![Page 3: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/3.jpg)
Project History
Spark started as research project in 2009
Open sourced in 2010
» 1st version was 1600 LOC, could run Wikipedia demo
Growing community since
Entered Apache Incubator in June 2013
![Page 4: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/4.jpg)
Development Community
With over 100 developers and 25 companies, one of the most active communities in big data
Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)
Past 6 months: more active devs than Hadoop MapReduce!
![Page 5: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/5.jpg)
Development Community
Healthy across the whole ecosystem
![Page 6: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/6.jpg)
Release Growth
Spark 0.6 (Oct ‘12):
- Java API, Maven, standalone mode
- 17 contributors
Spark 0.7 (Feb ‘13):
- Python API, Spark Streaming
- 31 contributors
Spark 0.8 (Sept ‘13):
- YARN, MLlib, monitoring UI
- 67 contributors
![Page 7: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/7.jpg)
Some Community Contributions
YARN support (Yahoo!)
Columnar compression in Shark (Yahoo!)
Fair scheduling (Intel)
Metrics reporting (Intel, Quantifind)
New RDD operators (Bizo, ClearStory)
Scala 2.10 support (Imaginea)
![Page 8: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/8.jpg)
Conferences
[Bar chart: attendees (axis 0 to 500) at AMP Camp 1 (Aug 2012), AMP Camp 2 (Aug 2013), and Spark Summit (Nov 2013)]
![Page 9: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/9.jpg)
Projects Built on Spark
[Diagram: Spark, with the following projects built on it]
Spark Streaming (real-time)
GraphX (graph)
Shark (SQL)
MLbase (machine learning)
BlinkDB
…
![Page 10: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/10.jpg)
What’s Next?
![Page 11: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/11.jpg)
Our View
While big data tools have advanced a lot, they are still far too difficult to tune and use
Goal: design big data systems that are as powerful & seamless as those for small data
![Page 12: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/12.jpg)
Current Priorities
Standard libraries
Deployment
Out-of-the-box usability
Enterprise use +
![Page 13: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/13.jpg)
Standard Libraries
While writing K-means in 30 lines is great, it’s even better to call it from a library!
Spark’s MLlib and GraphX will be standard libraries supported by core developers
» MLlib in Spark 0.8 with 7 algorithms
» GraphX coming soon
» Both operate directly on RDDs
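To give a feel for the "K-means in 30 lines" the slide alludes to, here is a minimal sketch written against plain Scala collections rather than RDDs. It is not MLlib's implementation; the object and function names (`KMeansSketch`, `sqDist`, `closest`, `mean`, `train`) and the naive "first k points" initialization are assumptions made for illustration only.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center nearest to p (first one wins on ties).
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    ps.transpose.map(col => col.sum / col.size).toVector

  // Lloyd's algorithm: assign each point to its nearest center, then move
  // each center to the mean of its points. A center with no points stays put.
  def train(points: Seq[Point], k: Int, iters: Int): Seq[Point] = {
    var centers: Seq[Point] = points.take(k) // hypothetical naive initialization
    for (_ <- 1 to iters) {
      val groups = points.groupBy(p => closest(p, centers))
      centers = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val points = Seq(Vector(0.0, 0.0), Vector(10.0, 10.0), Vector(0.0, 1.0), Vector(10.0, 11.0))
    train(points, k = 2, iters = 5).foreach(println)
  }
}
```

In a distributed setting the `groupBy` and `mean` steps would run over partitions of an RDD instead of a local `Seq`, which is the part a library call hides from the user.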
![Page 14: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/14.jpg)
Standard Libraries
val rdd: RDD[Array[Double]] = ...
val model = KMeans.train(rdd, k = 10)
val graph = Graph(vertexRDD, edgeRDD)
val ranks = PageRank.run(graph, iters = 10)
![Page 15: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/15.jpg)
Standard Libraries
Beyond these libraries, Databricks is investing heavily in higher-level projects
Spark Streaming: easier 24/7 operation and optimizations coming in 0.9
Shark: calling Spark libs (e.g. MLlib), optimizer, Hive 0.11 & 0.12
Goal: a complete and interoperable stack
![Page 16: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/16.jpg)
Deployment
Want Spark to easily run anywhere
Spark 0.8: much improved YARN, EC2 support
Spark 0.8.1: support for YARN 2.2
SIMR: launch Spark in MapReduce clusters as a Hadoop job (no installation needed!)
» For experimenting; see talk by Ahir
![Page 17: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/17.jpg)
Ease of Use
Monitoring and metrics (0.8)
Better support for large # of tasks (0.8.1)
High availability for standalone mode (0.8.1)
External hashing & sorting (0.9)
Long-term: remove need to tune beyond defaults
![Page 18: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/18.jpg)
Next Releases
Spark 0.8.1 (this month)
» YARN 2.2, standalone mode HA, optimized shuffle, broadcast & result fetching
Spark 0.9 (Jan 2014)
» Scala 2.10 support, configuration system, Spark Streaming improvements
![Page 19: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/19.jpg)
What Makes Spark Unique?
![Page 20: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/20.jpg)
Big Data Systems Today
General batch processing: MapReduce
Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …
![Page 21: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/21.jpg)
Spark’s Approach
Instead of specializing, generalize MapReduce to support new apps in same engine
Two changes (general task DAG & data sharing) are enough to express previous models!
Unification has big benefits
» For the engine
» For users
[Diagram: Spark Streaming, GraphX, Shark, MLbase, … on top of Spark]
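The two changes named on this slide can be illustrated loosely with plain Scala collections. The sketch below is not Spark's engine or API: `UnifiedEngineSketch`, `mapReduce`, `wordCount`, and `countsAndLongest` are hypothetical names. It first writes MapReduce as one fixed map/group/reduce shape, then shows a small DAG in which one in-memory intermediate dataset is shared by two downstream steps.

```scala
object UnifiedEngineSketch {
  // MapReduce as one fixed pipeline: map each record to key/value pairs,
  // group the pairs by key, then reduce the values of each key.
  def mapReduce[A, K, V](input: Seq[A])(mapper: A => Seq[(K, V)])(reducer: (V, V) => V): Map[K, V] =
    input
      .flatMap(mapper)
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2).reduce(reducer) }

  // Word count expressed in that fixed shape.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    mapReduce(lines)(line => line.split(" ").toSeq.map(word => (word, 1)))(_ + _)

  // A small DAG: one intermediate dataset feeds two downstream steps and
  // stays in memory between them instead of being written out and re-read.
  def countsAndLongest(lines: Seq[String]): (Map[String, Int], String) = {
    val words = lines.flatMap(_.split(" ").toSeq) // shared intermediate
    val counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size }
    val longest = words.maxBy(_.length)
    (counts, longest)
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("spark streaming", "spark sql")
    println(wordCount(lines))
    println(countsAndLongest(lines))
  }
}
```

With separate frameworks, the `words` dataset would typically be written to storage by one job and read back by the next; keeping it as a shared value is the local analogue of the data sharing the slide refers to.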
![Page 22: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/22.jpg)
Code Size
[Bar chart: non-test, non-example source lines (axis 0 to 140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark]
![Page 23: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/23.jpg)
Code Size
[Bar chart: non-test, non-example source lines (axis 0 to 140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark, with Streaming added]
![Page 24: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/24.jpg)
Code Size
[Bar chart: non-test, non-example source lines (axis 0 to 140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark, with Streaming and Shark* added]
* also calls into Hive
![Page 25: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/25.jpg)
Code Size
[Bar chart: non-test, non-example source lines (axis 0 to 140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark, with Streaming, GraphX and Shark* added]
* also calls into Hive
![Page 26: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/26.jpg)
Performance
[Bar chart, SQL: response time (s), axis 0 to 45, for Hive, Impala (disk), Impala (mem), Shark (disk), Shark (mem)]
[Bar chart, Graph: response time (min), axis 0 to 30, for Hadoop, Giraph, GraphX]
[Bar chart, Streaming: throughput (MB/s/node), axis 0 to 35, for Storm, Spark]
![Page 27: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/27.jpg)
What it Means for Users
Separate frameworks: HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → …
Spark: HDFS → HDFS read → ETL → train → query
Interactive analysis
![Page 28: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/28.jpg)
Combining Processing Types
From Scala:
val points = sc.runSql[Double, Double](
  "select latitude, longitude from historic_tweets")
val model = KMeans.train(points, 10)
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)
![Page 29: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/29.jpg)
Combining Processing Types
From SQL (in Shark 0.8.1):
GENERATE KMeans(tweet_locations) AS TABLE tweet_clusters
// Scala table generating function (TGF):
object KMeans {
  @Schema(spec = "x double, y double, cluster int")
  def apply(points: RDD[(Double, Double)]) = {
    ...
  }
}
![Page 30: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/30.jpg)
Conclusion
Next challenge in big data will be complex and low-latency applications
Spark offers a unified engine to tackle and combine these apps
Best strength is the community: enjoy Spark Summit!
![Page 31: The State of Spark](https://reader036.fdocuments.us/reader036/viewer/2022081507/56816468550346895dd65300/html5/thumbnails/31.jpg)
Contributors
Aaron Davidson, Alexander Pivovarov, Ali Ghodsi, Ameet Talwalkar, Andre Shumacher, Andrew Ash, Andrew Psaltis, Andrew Xia, Andrey Kouznetsov, Andy Feng, Andy Konwinski, Ankur Dave, Antonio Lupher, Benjamin Hindman, Bill Zhao, Charles Reiss, Chris Mattmann, Christoph Grothaus, Christopher Nguyen, Chu Tong, Cliff Engle, Cody Koeninger, David McCauley, Denny Britz, Dmitriy Lyubimov, Edison Tung, Eric Zhang, Erik van Oosten,
Ethan Jewett, Evan Chan, Evan Sparks, Ewen Cheslack-Postava, Fabrizio Milo, Fernand Pajot, Frank Dai, Gavin Li, Ginger Smith, Giovanni Delussu, Grace Huang, Haitao Yao, Haoyuan Li, Harold Lim, Harvey Feng, Henry Milner, Henry Saputra, Hiral Patel, Holden Karau, Ian Buss, Imran Rashid, Ismael Juma, James Phillpotts, Jason Dai, Jerry Shao, Jey Kottalam, Joseph E. Gonzalez, Josh Rosen,
Justin Ma, Kalpit Shah, Karen Feng, Karthik Tunga, Kay Ousterhout, Kody Koeniger, Konstantin Boudnik, Lee Moon Soo, Lian Cheng, Liang-Chi Hsieh, Marc Mercer, Marek Kolodziej, Mark Hamstra, Matei Zaharia, Matthew Taylor, Michael Heuer, Mike Potts, Mikhail Bautin, Mingfei Shi, Mosharaf Chowdhury, Mridul Muralidharan, Nathan Howell, Neal Wiggins, Nick Pentreath, Olivier Grisel, Patrick Wendell, Paul Cavallaro, Paul Ruan,
Peter Sankauskas, Pierre Borckmans, Prabeesh K., Prashant Sharma, Ram Sriharsha, Ravi Pandya, Ray Racine, Reynold Xin, Richard Benkovsky, Richard McKinley, Rohit Rai, Roman Tkalenko, Ryan LeCompte, S. Kumar, Sean McNamara, Shane Huang, Shivaram Venkataraman, Stephen Haberman, Tathagata Das, Thomas Dudziak, Thomas Graves, Timothy Hunter, Tyson Hamilton, Vadim Chekan, Wu Zeming, Xinghao Pan