Spark: Fast, Interactive, Language-Integrated Cluster Computing
spark.incubator.apache.org/talks/overview.pdf


Transcript of "Spark: Fast, Interactive, Language-Integrated Cluster Computing"

Page 1:

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

Spark: Fast, Interactive, Language-Integrated Cluster Computing

UC BERKELEY
www.spark-project.org

Page 2:

Project Goals

Extend the MapReduce model to better support two common classes of analytics apps:
» Iterative algorithms (machine learning, graphs)
» Interactive data mining

Enhance programmability:
» Integrate into Scala programming language
» Allow interactive use from Scala interpreter

Page 3:

Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

[Diagram: data read from stable storage (Input) flows through map tasks into reduce tasks and is written back to stable storage (Output)]

Page 4:

Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

[Same diagram as the previous slide: Input, map tasks, reduce tasks, Output]

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures

Page 5:

Motivation

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
» Iterative algorithms (machine learning, graphs)
» Interactive data mining tools (R, Excel, Python)

With current frameworks, apps reload data from stable storage on each query

Page 6:

Solution: Resilient Distributed Datasets (RDDs)

Allow apps to keep working sets in memory for efficient reuse

Retain the attractive properties of MapReduce:
» Fault tolerance, data locality, scalability

Support a wide range of applications

Page 7:

Outline

Spark programming model

Implementation

Demo

User applications

Page 8:

Programming Model

Resilient distributed datasets (RDDs):
» Immutable, partitioned collections of objects
» Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
» Can be cached for efficient reuse

Actions on RDDs:
» count, reduce, collect, save, …
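A minimal shell-style sketch of these bullets (the names and HDFS paths are made up; spark is the SparkContext variable, as in the examples on the following slides; join and groupByKey are shown here because the later examples do not use them):

// RDDs are created by parallel transformations on data in stable storage
val visits = spark.textFile("hdfs://.../visits").map(line => { val f = line.split('\t'); (f(0), f(1)) })  // (userId, url)
val users = spark.textFile("hdfs://.../users").map(line => { val f = line.split('\t'); (f(0), f(2)) })    // (userId, country)

// join and groupByKey are also transformations: they only define new RDDs
val joined = visits.join(users)                                                    // (userId, (url, country))
val urlsByCountry = joined.map { case (_, (url, country)) => (country, url) }.groupByKey().cache()

// actions such as count and collect launch the actual computation and return results to the driver
println(urlsByCountry.count())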

Page 9:

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")              // base RDD backed by an HDFS file
errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                     // keep the working set in memory

[Diagram: the driver program ships tasks to workers; each worker reads one HDFS block (Block 1-3), keeps its slice of cachedMsgs in memory (Cache 1-3), and returns results to the driver]

cachedMsgs.filter(_.contains("foo")).count        // action: runs on the cached data
cachedMsgs.filter(_.contains("bar")).count
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 10:

RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions

Ex:

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Lineage diagram: HDFS File -> filter (func = _.contains(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD]
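Conceptually, rebuilding a lost partition just re-applies the recorded functions to the corresponding block of the input. A hypothetical sketch of that idea in plain Scala (this is not Spark's internal API, only the mechanism the lineage records):

// lineage recorded above: HDFS file --filter--> Filtered RDD --map--> Mapped RDD
val filterFunc: String => Boolean = _.startsWith("ERROR")
val mapFunc: String => String = _.split('\t')(2)

// if the messages partition derived from one HDFS block is lost, it can be
// rebuilt by re-reading just that block and replaying the two functions
def recomputePartition(hdfsBlock: Iterator[String]): Iterator[String] =
  hdfsBlock.filter(filterFunc).map(mapFunc)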

Page 11:

Example: Logistic Regression

Goal: find best line separating two sets of points

[Figure: a scatter of + and - points; a random initial line is iteratively adjusted toward the target separating line]

Page 12:

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()   // load points into memory once

var w = Vector.random(D)                                // random initial parameter vector

for (i <- 1 to ITERATIONS) {
  // each iteration is a parallel map + reduce over the cached points (one gradient descent step)
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
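The snippet assumes a few pieces the slides do not define: a point type with fields x and y, a readPoint parser, and a small Vector class with dot product, scaling, and subtraction. A hypothetical sketch of those helpers (roughly what Spark's bundled examples provided at the time, but the exact definitions below are assumptions):

import scala.math.exp
import scala.util.Random

// a labeled point: feature vector x and label y in {-1, +1}
case class DataPoint(x: Vector, y: Double)

// a tiny dense vector with just the operations the example needs
// (deliberately shadows Scala's Vector, matching the slide's usage)
case class Vector(elements: Array[Double]) {
  def dot(other: Vector): Double = elements.zip(other.elements).map { case (a, b) => a * b }.sum
  def *(scale: Double): Vector = Vector(elements.map(_ * scale))
  def +(other: Vector): Vector = Vector(elements.zip(other.elements).map { case (a, b) => a + b })
  def -(other: Vector): Vector = this + (other * -1.0)
  override def toString = elements.mkString("(", ", ", ")")
}

object Vector {
  def random(d: Int): Vector = Vector(Array.fill(d)(2 * Random.nextDouble - 1))
}

// lets the gradient expression write "scalar * vector"
class Scalar(d: Double) { def *(v: Vector): Vector = v * d }
implicit def doubleToScalar(d: Double): Scalar = new Scalar(d)

// parse one line of "label feature1 ... featureD" (input format assumed)
def readPoint(line: String): DataPoint = {
  val tokens = line.split(' ').map(_.toDouble)
  DataPoint(Vector(tokens.tail), tokens.head)
}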

Page 13:

Logistic Regression Performance

[Chart: running time (s), 0-4500, vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]

Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s

Page 14:

Spark Applications

In-memory data mining on Hive data (Conviva)

Predictive analytics (Quantifind)

City traffic prediction (Mobile Millennium)

Twitter spam classification (Monarch)

Collaborative filtering via matrix factorization

…

Page 15:

Conviva GeoReport

Aggregations on many keys with the same WHERE clause

40× gain comes from:
» Not re-reading unused columns or filtered records
» Avoiding repeated decompression
» In-memory storage of deserialized objects
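The speedup pattern is essentially "parse and filter once, cache the result, then run many aggregations over it". A hypothetical sketch of that pattern (the field layout, names, and dates are made up, not Conviva's actual query; spark is the SparkContext, and reduceByKey needs the pair-RDD implicits that the shell imports automatically):

// parse the log once, keep only the rows and columns the report needs,
// and cache the deserialized objects in memory
case class View(country: String, city: String, bufferingMs: Long)

val views = spark.textFile("hdfs://.../views").map(_.split('\t'))
  .filter(f => f(3) == "2012-02-01")                         // the shared WHERE clause
  .map(f => View(f(0), f(1), f(7).toLong))
  .cache()

// aggregations on different keys now run against the in-memory working set
val byCountry = views.map(v => (v.country, v.bufferingMs)).reduceByKey(_ + _).collect()
val byCity = views.map(v => (v.city, v.bufferingMs)).reduceByKey(_ + _).collect()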

[Chart: query time in hours; Hive 20, Spark 0.5]

Page 16:

Frameworks Built on Spark

Pregel on Spark (Bagel):
» Google's message-passing model for graph computation
» 200 lines of code

Hive on Spark (Shark):
» 3000 lines of code
» Compatible with Apache Hive
» ML operators in Scala

Page 17:

Implementation

Runs on Apache Mesos to share resources with Hadoop & other apps

Can read from any Hadoop input source (e.g. HDFS)

[Diagram: Spark, Hadoop, and MPI frameworks running side by side on Mesos across the cluster's nodes]

No changes to Scala compiler
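A minimal sketch of what this looks like from a user program, assuming the early two-argument (master URL, job name) SparkContext constructor and the pre-Apache "spark" package name; the Mesos address and input path are placeholders:

import spark.SparkContext

object ReportJob {
  def main(args: Array[String]) {
    // connect to a Mesos master; "local[4]" would run the same program on one machine
    val sc = new SparkContext("mesos://master.example.com:5050", "ReportJob")

    // any Hadoop input source works; an HDFS path is shown here
    val lines = sc.textFile("hdfs://namenode:9000/data/input")
    println(lines.count())
  }
}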

Page 18:

Spark Scheduler

Dryad-like DAGs

Pipelines functions within a stage

Cache-aware work reuse & locality

Partitioning-aware to avoid shuffles (see the sketch after the diagram note below)

[DAG diagram: RDDs A-G combined by map, union, groupBy, and join operations and grouped into Stages 1-3; a legend marks the cached data partitions]
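For the last point, the usual pattern is to hash-partition a dataset once so that later joins against it can reuse that partitioning instead of shuffling it again. A hypothetical sketch, assuming the partitionBy and HashPartitioner API of early Spark releases (names, paths, and the partition count are illustrative; sc is the SparkContext, and a compiled program would also need the pair-RDD implicits from import spark.SparkContext._):

// hash-partition the large side once and cache it
val users = sc.textFile("hdfs://.../users")
  .map(line => { val f = line.split('\t'); (f(0), f(1)) })   // (userId, profile)
  .partitionBy(new spark.HashPartitioner(100))
  .cache()

// joins against users can now be scheduled on its cached partitions;
// only the other side of the join has to move across the network
val events = sc.textFile("hdfs://.../events")
  .map(line => { val f = line.split('\t'); (f(0), f(2)) })   // (userId, eventType)

val joined = users.join(events)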

Page 19:

Interactive Spark

Modified the Scala interpreter to allow Spark to be used interactively from the command line

Required two changes:
» Modified wrapper code generation so that each line typed has references to the objects it depends on
» Distribute generated classes over the network

Page 20:

Demo  

Page 21:

Conclusion

Spark provides a simple, efficient, and powerful programming model for a wide range of apps

Download our open source release:

www.spark-project.org

[email protected]  

Page 22:

Related Work

DryadLINQ, FlumeJava:
» Similar “distributed collection” API, but cannot reuse datasets efficiently across queries

Relational databases:
» Lineage/provenance, logical logging, materialized views

GraphLab, Piccolo, BigTable, RAMCloud:
» Fine-grained writes similar to distributed shared memory

Iterative MapReduce (e.g. Twister, HaLoop):
» Implicit data sharing for a fixed computation pattern

Caching systems (e.g. Nectar):
» Store data in files, no explicit control over what is cached

Page 23:

Behavior with Not Enough RAM

[Chart: iteration time (s) vs. % of working set in memory; cache disabled 68.8 s, 25% cached 58.1 s, 50% 40.7 s, 75% 29.7 s, fully cached 11.5 s]

Page 24:

Fault Recovery Results

[Chart: iteration time (s) over iterations 1-10, comparing a run with no failure to one with a failure in the 6th iteration; labeled times are 119, 57, 56, 58, 58, 81, 57, 59, 57, 59 s]

Page 25:

Spark Operations

Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to driver program):
collect, reduce, count, save, lookupKey
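Several of these compose into the usual word count, shown here as a generic illustration (not from the slides; spark is the SparkContext, the paths are placeholders, and saveAsTextFile stands in for the save action listed above):

// transformations: flatMap, map, reduceByKey
val counts = spark.textFile("hdfs://.../docs")
  .flatMap(line => line.split(' '))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// actions: save the result to stable storage and return a count to the driver
counts.saveAsTextFile("hdfs://.../wordcounts")
println(counts.count())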