Towards a Big Data Debugger in Apache Spark - HPTS - … · 2015-10-02

Towards a Big Data Debugger in Apache Spark Tyson Condie, UCLA


Page 1:

Towards a Big Data Debugger in Apache Spark

Tyson Condie, UCLA

Page 2:

Tuning Spark Applications

• Commonly through visualization tools

Page 3:

Tuning Spark Applications
• Commonly through visualization tools
– Timeline view of Spark events

Taken from https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html

Page 4:

Tuning Spark Applications
• Commonly through visualization tools
– Execution DAG

Taken from https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html

Page 5:

Tuning Spark Applications
• Commonly through visualization tools
– Visualization of Spark Streaming statistics

Taken from https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html

Page 6:

“I would like to understand the flow of control through the Spark source code on the worker nodes when I submit my application … I am assuming I should setup Spark on Eclipse … to enable stepping through Spark source code on the worker nodes.”

Trying to debug a Spark application on a cluster…

Page 7:

After 5 months, still no good answers!

Page 8:

Towards Interactive Debugging for Apache Spark

Goal: Develop “debugging toolkits” on Apache Spark where features operate at scale and impose minimal overheads on (normal) program execution
• Data provenance similar to databases but without going offline
• Traditional debugging features: Breakpoints; Watchpoints; Stepping
• Record-level error/exception handling: Crash culprit; Outlier identification
• Execution replay: On input or intermediate data; Leading to a given result, e.g., outlier, crash culprit
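To make the "crash culprit" goal above concrete, here is a minimal sketch in plain Python (not Spark, and not the toolkit described in this talk). It assumes a map-style transformation and records which input record caused an exception instead of failing the whole job. The names `Culprit`, `map_with_culprits`, and `parse_temperature`, as well as the sample data, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class Culprit:
    """One input record that made the user function raise (hypothetical type)."""
    index: int         # position of the record in the input
    record: Any        # the offending record itself
    error: Exception   # the exception that was raised


def map_with_culprits(
    fn: Callable[[Any], Any], records: Iterable[Any]
) -> Tuple[List[Any], List[Culprit]]:
    """Apply fn to every record; collect failures per record instead of crashing."""
    results: List[Any] = []
    culprits: List[Culprit] = []
    for index, record in enumerate(records):
        try:
            results.append(fn(record))
        except Exception as exc:  # record-level handling: keep going
            culprits.append(Culprit(index, record, exc))
    return results, culprits


def parse_temperature(line: str) -> float:
    """Hypothetical user function: parse 'station,reading' into a float."""
    station, reading = line.split(",")
    return float(reading)


# Assumed sample input: two well-formed lines and two malformed ones.
lines = ["s1,21.5", "s2,n/a", "s3,19.0", "s4"]
parsed, culprits = map_with_culprits(parse_temperature, lines)

for c in culprits:
    print(f"record {c.index} ({c.record!r}) failed: {c.error}")
```

A real toolkit would have to do this across distributed tasks and keep the overhead low on the normal execution path; the sketch only shows the record-level bookkeeping.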

Titian Library: Provides capture and interactive analysis of data provenance
• Data provenance supported by enabling record-level tracing in Spark’s dataflow
Ø Provenance recorded as data records are pipelined through transformations
Ø Provenance exposed to the programmer as Resilient Distributed Datasets (RDDs) for analysis
• To appear in PVLDB Volume 9, Issue 3.
• Follow-on work on breakpoints, watchpoints, and stepping is under submission to ICSE 2016.

Supported through
• BD2K (Big Data to Knowledge) NIH Grant
• Industry grants from IBM Research and Intel

Collaborators: Matteo Interlandi (post-doc), Muhammad Ali Gulzar, Sai Deep Tetali, and UCLA faculty Miryung Kim and Todd Millstein
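The record-level tracing described for Titian above can be illustrated with a small plain-Python sketch. It does not use Titian's or Spark's API. It assumes each record carries the set of input record ids it was derived from, and that each transformation propagates those ids. All function names here (`from_input`, `traced_map`, `traced_filter`, `traced_reduce_by_key`, `trace_back`) and the sample data are hypothetical.

```python
from typing import Any, Callable, Dict, FrozenSet, List, Tuple

# A traced record: (value, ids of the input records it was derived from).
Traced = Tuple[Any, FrozenSet[int]]


def from_input(values: List[Any]) -> List[Traced]:
    """Tag every input record with its own id (its position in the input)."""
    return [(v, frozenset([i])) for i, v in enumerate(values)]


def traced_map(fn: Callable[[Any], Any], records: List[Traced]) -> List[Traced]:
    """One-to-one transformation: the output inherits the input's lineage."""
    return [(fn(v), ids) for v, ids in records]


def traced_filter(pred: Callable[[Any], bool], records: List[Traced]) -> List[Traced]:
    """Dropped records simply vanish from downstream lineage."""
    return [(v, ids) for v, ids in records if pred(v)]


def traced_reduce_by_key(
    fn: Callable[[Any, Any], Any], records: List[Traced]
) -> List[Traced]:
    """Many-to-one transformation: the output lineage is the union of its inputs'."""
    acc: Dict[Any, Traced] = {}
    for (key, value), ids in records:
        if key in acc:
            prev_value, prev_ids = acc[key]
            acc[key] = (fn(prev_value, value), prev_ids | ids)
        else:
            acc[key] = (value, ids)
    return [((k, v), ids) for k, (v, ids) in acc.items()]


def trace_back(record: Traced, inputs: List[Any]) -> List[Any]:
    """Return the original input records that contributed to one output record."""
    _, ids = record
    return [inputs[i] for i in sorted(ids)]


# Assumed sample input: "key value" lines.
lines = ["a 1", "b 2", "a 30", "b 4", "a 5"]
pairs = traced_map(lambda s: (s.split()[0], int(s.split()[1])), from_input(lines))
kept = traced_filter(lambda kv: kv[1] > 1, pairs)          # drops "a 1"
totals = traced_reduce_by_key(lambda x, y: x + y, kept)

# Pick the output with the largest total and ask which inputs produced it.
outlier = max(totals, key=lambda rec: rec[0][1])
outlier_inputs = trace_back(outlier, lines)
print(outlier[0], "came from", outlier_inputs)
```

In this sketch the lineage travels inline with each record, which keeps the example short; a system operating at scale would more plausibly store provenance separately and join it on demand. That choice is an assumption of the sketch, not a claim about how Titian is built.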