The State of Spark
And Where We’re Going Next

Matei Zaharia

Community Growth

Project History
Spark started as a research project in 2009
Open sourced in 2010

»1st version was 1600 LOC, could run Wikipedia demo

Growing community since
Entered Apache Incubator in June 2013

Development Community
With over 100 developers and 25 companies, one of the most active communities in big data

Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)
Past 6 months: more active devs than Hadoop MapReduce!

Development Community
Healthy across the whole ecosystem

Release Growth

Spark 0.6 (Oct ‘12):
- Java API, Maven, standalone mode
- 17 contributors

Spark 0.7 (Feb ‘13):
- Python API, Spark Streaming
- 31 contributors

Spark 0.8 (Sept ‘13):
- YARN, MLlib, monitoring UI
- 67 contributors

Some Community Contributions
YARN support (Yahoo!)
Columnar compression in Shark (Yahoo!)
Fair scheduling (Intel)
Metrics reporting (Intel, Quantifind)
New RDD operators (Bizo, ClearStory)
Scala 2.10 support (Imaginea)

Conferences

AMP Camp 1 (Aug 2012)

AMP Camp 2 (Aug 2013)

Spark Summit (Nov 2013)

Projects Built on Spark

Spark Streaming (real-time)

GraphX (graph)

Shark (SQL)

MLbase (machine learning)

BlinkDB

…

What’s Next?

Our View
While big data tools have advanced a lot, they are still far too difficult to tune and use

Goal: design big data systems that are as powerful & seamless as those for small data

Current Priorities
Standard libraries

Deployment

Out-of-the-box usability

Enterprise use +

Standard Libraries
While writing K-means in 30 lines is great, it’s even better to call it from a library!
Spark’s MLlib and GraphX will be standard libraries supported by core developers

»MLlib in Spark 0.8 with 7 algorithms
»GraphX coming soon
»Both operate directly on RDDs

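Standard Libraries
// MLlib: train a K-means model directly on an RDD
val rdd: RDD[Array[Double]] = ...
val model = KMeans.train(rdd, k = 10)

// GraphX: build a graph from RDDs and run PageRank on it
val graph = Graph(vertexRDD, edgeRDD)
val ranks = PageRank.run(graph, iters = 10)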
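To make the "K-means in 30 lines" remark concrete, here is a minimal sketch of Lloyd's K-means written against local Scala collections. It is not MLlib code and does not use Spark; `KMeansSketch` and its method names are hypothetical, and the choice of the first k points as initial centers is an assumption made only to keep the example deterministic.

```scala
// Minimal K-means (Lloyd's algorithm) on local Scala collections.
// Illustrative only: NOT MLlib; KMeansSketch and its methods are hypothetical names.
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center nearest to p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point = {
    val dim = ps.head.length
    Array.tabulate(dim)(d => ps.map(_(d)).sum / ps.size)
  }

  // Runs a fixed number of iterations, starting from the first k points
  // (an assumption for determinism; real libraries use smarter seeding).
  def train(points: Seq[Point], k: Int, iters: Int): Seq[Point] = {
    var centers: Seq[Point] = points.take(k)
    for (_ <- 1 to iters) {
      // Assignment step: group points by their nearest center.
      val groups = points.groupBy(p => closest(p, centers))
      // Update step: move each center to the mean of its group,
      // keeping the old center when no point was assigned to it.
      centers = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }
    centers
  }
}
```

Calling `KMeansSketch.train(points, k = 2, iters = 5)` on a small `Seq[Array[Double]]` returns the two fitted centers. A library call hides the same assignment and update loop behind a single function.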

Standard Libraries
Beyond these libraries, Databricks is investing heavily in higher-level projects

Spark Streaming: easier 24/7 operation and optimizations coming in 0.9

Shark: calling Spark libs (e.g. MLlib), optimizer, Hive 0.11 & 0.12
Goal: a complete and interoperable stack

Deployment
Want Spark to easily run anywhere

Spark 0.8: much improved YARN, EC2 support
Spark 0.8.1: support for YARN 2.2
SIMR: launch Spark in MapReduce clusters as a Hadoop job (no installation needed!)

»For experimenting; see talk by Ahir

Ease of Use
Monitoring and metrics (0.8)
Better support for large # of tasks (0.8.1)
High availability for standalone mode (0.8.1)
External hashing & sorting (0.9)

Long-term: remove need to tune beyond defaults

Next Releases
Spark 0.8.1 (this month)

»YARN 2.2, standalone mode HA, optimized shuffle, broadcast & result fetching

Spark 0.9 (Jan 2014)

»Scala 2.10 support, configuration system, Spark Streaming improvements

What Makes Spark Unique?

Big Data Systems Today

General batch processing: MapReduce

Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …

Spark’s Approach
Instead of specializing, generalize MapReduce to support new apps in same engine
Two changes (general task DAG & data sharing) are enough to express previous models!
Unification has big benefits

»For the engine
»For users
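As a rough illustration of the data-sharing point, the sketch below writes an iterative graph computation (a simplified PageRank) as repeated map and reduce steps over a link table that stays in memory between iterations. Local Scala collections stand in for distributed datasets; this is not Spark's API, and `PageRankSketch`, its parameters, and the damping value of 0.85 are assumptions made for the example.

```scala
// Simplified PageRank as repeated map/reduce steps over in-memory data.
// Illustrative only: local collections stand in for distributed datasets,
// and PageRankSketch is a hypothetical name, not part of Spark or GraphX.
object PageRankSketch {
  // links: page -> outgoing neighbors. Every page is assumed to appear as a key.
  def run(links: Map[String, Seq[String]],
          iters: Int,
          damping: Double = 0.85): Map[String, Double] = {
    // The link table is built once and reused by every iteration (data sharing).
    var ranks: Map[String, Double] = links.keys.map(p => p -> 1.0).toMap
    for (_ <- 1 to iters) {
      // "Map" step: each page sends rank / outDegree to each of its neighbors.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (page, outs) =>
        outs.map(dest => dest -> (ranks(page) / outs.size))
      }
      // "Reduce" step: sum the contributions received by each page.
      val summed: Map[String, Double] =
        contribs.groupBy(_._1).map { case (p, cs) => p -> cs.map(_._2).sum }
      // Combine with the damping term to get the next iteration's ranks.
      ranks = links.keys.map { p =>
        p -> ((1 - damping) + damping * summed.getOrElse(p, 0.0))
      }.toMap
    }
    ranks
  }
}
```

In a MapReduce-style pipeline each iteration would typically reread its input from storage; keeping `links` in memory across the loop is the part that the data-sharing change is meant to make cheap.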

Code Size

[Figure: bar chart of non-test, non-example source lines (axis ticks from 20,000 to 140,000). Surviving bar labels include MapReduce, Impala (SQL) and Giraph, compared against Spark with Streaming, Shark* and GraphX added in successive builds of the slide.]

* also calls into Hive

Performance

[Figure: performance comparison charts. Surviving panel labels: Graph, Streaming. Bar values are not recoverable.]

What it Means for Users

Separate frameworks: HDFS read, ETL, HDFS write; HDFS read, train, HDFS write; HDFS read, query, HDFS write; …

Spark: HDFS read, then ETL, train and query in one engine, plus interactive analysis

Combining Processing Types

From Scala:

val points = sc.runSql[Double, Double](
  "select latitude, longitude from historic_tweets")

val model = KMeans.train(points, 10)

sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)
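The streaming half of this pipeline (assign each incoming location to its nearest cluster center, then count per time window) can be sketched on local collections as follows. This is only an illustration under stated assumptions: it uses non-overlapping (tumbling) windows rather than whatever windowing the slide's API implies, timestamps are non-negative milliseconds, and `WindowedClusterCounts` and its methods are hypothetical names rather than Spark Streaming calls.

```scala
// Per-window cluster counts on local collections.
// Illustrative only: WindowedClusterCounts is a hypothetical name, and the
// tumbling (non-overlapping) windows are a simplifying assumption.
object WindowedClusterCounts {
  type Point = (Double, Double)

  // Index of the center nearest to p (squared Euclidean distance).
  def closestCenter(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy { i =>
      val (cx, cy) = centers(i)
      val dx = p._1 - cx
      val dy = p._2 - cy
      dx * dx + dy * dy
    }

  // events: (timestampMillis, location), with non-negative timestamps.
  // Returns (windowStartMillis, clusterIndex) -> number of events.
  def countByWindow(events: Seq[(Long, Point)],
                    centers: Seq[Point],
                    windowMs: Long): Map[(Long, Int), Int] =
    events
      .map { case (ts, loc) => ((ts / windowMs) * windowMs, closestCenter(loc, centers)) }
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }
}
```

The centers passed in here would come from the batch step (the trained model), which is the combination of processing types the slide is pointing at.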

Combining Processing Types

From SQL (in Shark 0.8.1):

GENERATE KMeans(tweet_locations) AS TABLE tweet_clusters

// Scala table generating function (TGF):
object KMeans {
  @Schema(spec = "x double, y double, cluster int")
  def apply(points: RDD[(Double, Double)]) = {
    ...
  }
}

Conclusion
The next challenge in big data will be complex and low-latency applications
Spark offers a unified engine to tackle and combine these apps
Its greatest strength is the community: enjoy Spark Summit!

Contributors
Aaron Davidson, Alexander Pivovarov, Ali Ghodsi, Ameet Talwalkar, Andre Shumacher, Andrew Ash, Andrew Psaltis, Andrew Xia, Andrey Kouznetsov, Andy Feng, Andy Konwinski, Ankur Dave, Antonio Lupher, Benjamin Hindman, Bill Zhao, Charles Reiss, Chris Mattmann, Christoph Grothaus, Christopher Nguyen, Chu Tong, Cliff Engle, Cody Koeninger, David McCauley, Denny Britz, Dmitriy Lyubimov, Edison Tung, Eric Zhang, Erik van Oosten

Ethan Jewett, Evan Chan, Evan Sparks, Ewen Cheslack-Postava, Fabrizio Milo, Fernand Pajot, Frank Dai, Gavin Li, Ginger Smith, Giovanni Delussu, Grace Huang, Haitao Yao, Haoyuan Li, Harold Lim, Harvey Feng, Henry Milner, Henry Saputra, Hiral Patel, Holden Karau, Ian Buss, Imran Rashid, Ismael Juma, James Phillpotts, Jason Dai, Jerry Shao, Jey Kottalam, Joseph E. Gonzalez, Josh Rosen

Justin Ma, Kalpit Shah, Karen Feng, Karthik Tunga, Kay Ousterhout, Kody Koeniger, Konstantin Boudnik, Lee Moon Soo, Lian Cheng, Liang-Chi Hsieh, Marc Mercer, Marek Kolodziej, Mark Hamstra, Matei Zaharia, Matthew Taylor, Michael Heuer, Mike Potts, Mikhail Bautin, Mingfei Shi, Mosharaf Chowdhury, Mridul Muralidharan, Nathan Howell, Neal Wiggins, Nick Pentreath, Olivier Grisel, Patrick Wendell, Paul Cavallaro, Paul Ruan

Peter Sankauskas, Pierre Borckmans, Prabeesh K., Prashant Sharma, Ram Sriharsha, Ravi Pandya, Ray Racine, Reynold Xin, Richard Benkovsky, Richard McKinley, Rohit Rai, Roman Tkalenko, Ryan LeCompte, S. Kumar, Sean McNamara, Shane Huang, Shivaram Venkataraman, Stephen Haberman, Tathagata Das, Thomas Dudziak, Thomas Graves, Timothy Hunter, Tyson Hamilton, Vadim Chekan, Wu Zeming, Xinghao Pan
