Transcript of "Apache Spark 101" - Duke University

Page 1:

Apache Spark 101

Lance Co Ting Keh
Senior Software Engineer, Machine Learning @ Box

Page 2:

Outline
I. About me
II. Distributed Computing at a High Level
III. Disk versus Memory based Systems
IV. Spark Core
  I. Brief background
  II. Benchmarks and Comparisons
  III. What is an RDD
  IV. RDD Actions and Transformations
  V. Caching and Serialization
  VI. Anatomy of a Program
  VII. The Spark Family

 

 

• Our Approach
  ‒ Semi-supervised learning (multi-edged similarity graph)
  ‒ Independent of training set (very few labels)
  ‒ Can be reused for different use cases

Page 3:

Why Distributed Computing? Divide and Conquer

Problem: Single machine cannot complete the computation at hand.
Solution: Parallelize the job and distribute work among a network of machines.

Page 4:

Issues Arise in Distributed Computing
View the world from the eyes of a single worker

•  How do I distribute an algorithm?
•  How do I partition my dataset?
•  How do I maintain a single consistent view of a shared state?
•  How do I recover from machine failures?
•  How do I allocate cluster resources?
•  …

 

Page 5:

Finding the majority element on a single machine

List(20, 18, 20, 18, 20)
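A minimal single-machine sketch of one way to do this in Scala (assuming a strict majority element exists): count how often each value occurs and keep the most frequent one. The object and method names are illustrative, not from the slides.

object MajoritySingleMachine {
  // Group equal elements, then pick the group with the largest size.
  def majority(xs: List[Int]): Int =
    xs.groupBy(identity).maxBy(_._2.size)._1

  def main(args: Array[String]): Unit =
    println(majority(List(20, 18, 20, 18, 20)))  // prints 20
}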

Think distributed

Page 6:

Finding the majority element in a distributed dataset
Think distributed

List(1, 18, 1, 18, 1)
List(2, 18, 2, 18, 2)
List(3, 18, 3, 18, 3)
List(4, 18, 4, 18, 4)
List(5, 18, 5, 18, 5)

Page 7:

Finding the majority element in a distributed dataset
Think distributed

List(1, 18, 1, 18, 1)
List(2, 18, 2, 18, 2)
List(3, 18, 3, 18, 3)
List(4, 18, 4, 18, 4)
List(5, 18, 5, 18, 5)

18
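A minimal Spark sketch of the same idea on the distributed dataset above: each partition is counted locally, reduceByKey merges the partial counts across workers, and the driver keeps the globally most frequent element (18). The local[*] master and the parallelize call are placeholders standing in for a real cluster and data source.

import org.apache.spark.{SparkConf, SparkContext}

object MajorityDistributed {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("majority").setMaster("local[*]"))

    // Five partitions, mirroring the five lists on the slide.
    val data = sc.parallelize(
      List(1, 18, 1, 18, 1) ++ List(2, 18, 2, 18, 2) ++ List(3, 18, 3, 18, 3) ++
      List(4, 18, 4, 18, 4) ++ List(5, 18, 5, 18, 5), numSlices = 5)

    val majority = data
      .map(x => (x, 1))                              // (element, 1) pairs
      .reduceByKey(_ + _)                            // per-element counts, merged across partitions
      .reduce((a, b) => if (a._2 >= b._2) a else b)  // keep the most frequent pair
      ._1

    println(majority)  // prints 18
    sc.stop()
  }
}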

Page 8:

Disk Based vs Memory Based Frameworks

• Disk Based Frameworks
  ‒ Persists intermediate results to disk
  ‒ Data is reloaded from disk with every query
  ‒ Easy failure recovery
  ‒ Best for ETL-like workloads
  ‒ Examples: Hadoop, Dryad

[Diagram: acyclic data flow, Input → Map → Reduce → Output]

Image courtesy of Matei Zaharia, Introduction to Spark

Page 9:

Disk Based vs Memory Based Frameworks

• Memory Based Frameworks
  ‒ Circumvents heavy cost of I/O by keeping intermediate results in memory
  ‒ Sensitive to availability of memory
  ‒ Remembers operations applied to dataset
  ‒ Best for iterative workloads (see the sketch below)
  ‒ Examples: Spark, Flink

[Diagram: reuse the working data set in memory across iter. 1, iter. 2, ...]

Image courtesy of Matei Zaharia, Introduction to Spark
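A minimal sketch of that iterative-reuse pattern in Spark: cache() keeps the working data set in memory, so each pass of the loop reads it from RAM instead of reloading it from disk. The file path, the parsing, and the toy update rule are placeholders for illustration.

import org.apache.spark.{SparkConf, SparkContext}

object IterativeReuse {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative").setMaster("local[*]"))

    // Load once, keep the parsed records in memory for all later iterations.
    val points = sc.textFile("hdfs://...")
      .map(_.split(",").map(_.toDouble))
      .cache()

    var weight = 0.0
    for (i <- 1 to 10) {   // iter. 1, iter. 2, ...
      // Each pass scans the cached RDD; only the small 'weight' value changes.
      weight += points.map(p => p.sum - weight).reduce(_ + _) / points.count()
    }
    println(weight)
    sc.stop()
  }
}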

Page 10:

The rest of the talk

I. Spark Core
  I. Brief background
  II. Benchmarks and Comparisons
  III. What is an RDD
  IV. RDD Actions and Transformations
  V. Spark Cluster
  VI. Anatomy of a Program
  VII. The Spark Family

Page 11:

Spark Background
Arose from an academic setting

•  AMPLab, UC Berkeley
•  Project Lead: Dr. Matei Zaharia
•  First paper published on RDDs was in 2012
•  Open sourced from day one, growing number of contributors
•  Released its 1.0 version in May 2014. Currently at 1.2.1
•  Databricks was established to support Spark and all its related technologies. Matei currently sits as its CTO
•  Amazon, Alibaba, Baidu, eBay, Groupon, Ooyala, OpenTable, Box, Shopify, TechBase, Yahoo!

Page 12:

Spark versus Scalding (Hadoop)
Clear win for iterative applications

Page 13:

Ad-hoc batch queries

Page 14:

Resilient Distributed Datasets (RDDs)

• Main object in Spark's universe
• Think of it as representing the data at that stage in the operation
• Allows for coarse-grained transformations (e.g. map, group-by, join)
• Allows for efficient fault recovery using lineage
  ‒ Log one operation to apply to many elements
  ‒ Recompute lost partitions of dataset on failure
  ‒ No cost if nothing fails
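A small sketch of what that logged lineage looks like, assuming an existing SparkContext named sc (e.g. in spark-shell): toDebugString prints the chain of coarse-grained transformations Spark recorded, which is exactly what it replays to recompute a lost partition.

// Assumes sc: SparkContext is already available.
val lines   = sc.textFile("hdfs://...")                   // base RDD
val lengths = lines.flatMap(_.split(" ")).map(_.length)   // coarse-grained transformations
println(lengths.toDebugString)                             // prints the RDD lineage used for recovery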

Page 15:

RDD Actions and Transformations

• Transformations
  ‒ Lazy operations applied on an RDD
  ‒ Creates a new RDD from an existing RDD
  ‒ Allows Spark to perform optimizations
  ‒ e.g. map, filter, flatMap, union, intersection, distinct, reduceByKey, groupByKey

• Actions
  ‒ Returns a value to the driver program after computation
  ‒ e.g. reduce, collect, count, first, take, saveAsTextFile

Transformations are realized when an action is called
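A minimal sketch of that laziness, assuming an existing SparkContext named sc: the transformations below only record lineage, and nothing actually runs until the action at the end.

// Assumes sc: SparkContext is already available.
val nums   = sc.parallelize(1 to 1000000)
val evens  = nums.filter(_ % 2 == 0)   // transformation: lazy, just returns a new RDD
val scaled = evens.map(_ * 10)         // transformation: still nothing executed
val n      = scaled.count()            // action: triggers the whole pipeline, result returns to the driver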

Page 16:

RDD Representation

• Simple common interface:
  – Set of partitions
  – Preferred locations for each partition
  – List of parent RDDs
  – Function to compute a partition given parents
  – Optional partitioning info

• Allows capturing wide range of transformations

 

Slide  courtesy  of  Matei  Zaharia,  Introduction  to  Spark  
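A conceptual Scala sketch of that interface. This is only an illustration of the five pieces listed above, not the actual org.apache.spark.rdd.RDD class, which exposes the same ideas through getPartitions, getPreferredLocations, dependencies, compute, and partitioner.

// Placeholder types standing in for Spark's Partition and Partitioner.
trait Partition { def index: Int }
trait Partitioner { def numPartitions: Int; def getPartition(key: Any): Int }

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                      // set of partitions
  def preferredLocations(p: Partition): Seq[String]   // preferred locations for each partition
  def parents: Seq[SimpleRDD[_]]                      // list of parent RDDs (lineage)
  def compute(p: Partition): Iterator[T]              // function to compute a partition given parents
  def partitioner: Option[Partitioner]                // optional partitioning info
}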

Page 17:

Spark Cluster

Driver
•  Entry point of the Spark application
•  The main Spark application runs here
•  Results of "reduce" operations are aggregated here

[Diagram: Driver]

Page 18:

Spark Cluster

Master
•  Distributed coordination of Spark workers, including:
  §  Health checking workers
  §  Reassignment of failed tasks
  §  Entry point for job and cluster metrics

[Diagram: Driver, Master]

Page 19:

Spark Cluster

Worker
•  Spawns executors to perform tasks on partitions of data

[Diagram: Driver, Master, Workers]
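A minimal sketch of how those pieces fit together from the application's side (the master URL and host name are placeholders): the driver runs main(), the SparkContext registers the application with the master, and the master has executors spawned on the workers to run the tasks.

import org.apache.spark.{SparkConf, SparkContext}

object ClusterApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("spark-101-demo")
      .setMaster("spark://master-host:7077")  // standalone master URL; placeholder host

    val sc = new SparkContext(conf)           // created on the driver

    // The result of this "reduce" is aggregated back on the driver.
    val total = sc.parallelize(1 to 100).reduce(_ + _)
    println(total)  // prints 5050
    sc.stop()
  }
}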

Page 20:

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")           // base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
val messages = errors.map(_.split('\t')(2))
messages.persist()

[Diagram: Driver and three Workers holding Block 1, Block 2, Block 3]

messages.filter(_.contains("foo")).count()   // action
messages.filter(_.contains("bar")).count()   // reuses the cached messages
. . .

[Diagram: driver sends tasks to the workers and collects results; cached partitions Msgs. 1, Msgs. 2, Msgs. 3 stay on the workers]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Slide  courtesy  of  Matei  Zaharia,  Introduction  to  Spark  
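The same example assembled into one self-contained sketch (the HDFS path, the tab-separated error format, and the local[*] master are placeholders, as on the slide): persist() keeps the filtered messages in memory, so each interactive query scans cached data instead of re-reading the log from disk.

import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    val spark = new SparkContext(new SparkConf().setAppName("log-mining").setMaster("local[*]"))

    val lines    = spark.textFile("hdfs://...")            // base RDD
    val errors   = lines.filter(_.startsWith("ERROR"))     // transformed RDD
    val messages = errors.map(_.split('\t')(2))
    messages.persist()                                      // cache the working set across queries

    println(messages.filter(_.contains("foo")).count())    // action: runs the pipeline
    println(messages.filter(_.contains("bar")).count())    // reuses the cached messages

    spark.stop()
  }
}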

Page 21:

The Spark Family
Cheaper by the dozen

•  Aside from its performance and API, the diverse tool set available in Spark is the reason for its wide adoption
  1.  Spark Streaming
  2.  Spark SQL
  3.  MLlib
  4.  GraphX

Page 22:

Lambda Architecture

Page 23:

Lambda Architecture
Unified Framework

[Diagram: Spark Core, Spark Streaming, Spark SQL as one unified framework]