Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Weaving Dataflows with Silk. Taro L. Saito, Treasure Data, Inc. September 6th, 2014, ScalaMatsuri @ Tokyo. xerial.org/silk

description

Silk is a framework for building dataflows in Scala. In Silk, users write data processing code with collection operators (e.g., map, filter, reduce, and join). Silk uses Scala Macros to construct a DAG of dataflows whose nodes are annotated with the variable names in the program. By using these variable names as markers in the DAG, Silk can support interrupting and resuming dataflows and querying the intermediate data. By separating dataflow descriptions from their computation, Silk enables us to switch executors, called weavers, for in-memory or cluster computing without modifying the code. In this talk, we will show how Silk helps you run data-processing pipelines as you write the code.

Transcript of Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Page 1: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Weaving Dataflows with Silk

Taro L. Saito
Treasure Data, Inc.

September 6th, 2014, ScalaMatsuri @ Tokyo

xerial.org/silk

Page 2: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

About Me

Page 3: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Treasure Data Console

Page 4: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Processing Job Table

Page 5: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Functional Style Writing

Page 6: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Need an Optimization?

Page 7: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Procedural Style Writing

•  Describes how to process data.

Page 8: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Declarative Style Writing

•  Less programming
    •  System decides how to optimize the code
•  Hash joins, Bloom filters, and various other optimization techniques are now available.

Page 9: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Weaving Silk

•  Making data processing code independent of the execution method!

[Diagram: Silk[A] (operation DAG) -> Weave (Execute) -> Silk Product -> Result. Weavers shown: in-memory weaver, cluster weaver (Spark?), MapReduce weaver, your own weaver (using TD?)]
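The separation between a dataflow description and its execution can be sketched in a few lines of Scala. This is only a minimal illustration under stated assumptions: `Flow`, `Source`, `MapFlow`, `Weaver`, `InMemoryWeaver`, and `ChunkedWeaver` are hypothetical names invented here and are not Silk's actual classes. The point is that the same description can be handed to different weavers without changing it.

```scala
// Illustrative sketch only: Flow, Source, MapFlow, Weaver, InMemoryWeaver and
// ChunkedWeaver are hypothetical names, not Silk's actual classes.
object WeaverSketch {
  // A dataflow is just a description (a DAG node); building it runs nothing.
  sealed trait Flow[A] {
    def map[B](f: A => B): Flow[B] = MapFlow(this, f)
  }
  final case class Source[A](data: Seq[A]) extends Flow[A]
  final case class MapFlow[A, B](in: Flow[A], f: A => B) extends Flow[B]

  // A weaver decides how a described dataflow is executed.
  trait Weaver {
    def weave[A](flow: Flow[A]): Seq[A]
  }

  // Evaluates the whole dataflow in one pass on the local machine.
  object InMemoryWeaver extends Weaver {
    def weave[A](flow: Flow[A]): Seq[A] = flow match {
      case s: Source[A]     => s.data
      case m: MapFlow[_, A] => runMap(m)
    }
    private def runMap[X, A](m: MapFlow[X, A]): Seq[A] = weave(m.in).map(m.f)
  }

  // Splits the input into chunks and maps each chunk separately,
  // standing in for a weaver that would send chunks to cluster nodes.
  final class ChunkedWeaver(chunkSize: Int) extends Weaver {
    def weave[A](flow: Flow[A]): Seq[A] = chunks(flow).flatten
    private def chunks[A](flow: Flow[A]): Seq[Seq[A]] = flow match {
      case s: Source[A]     => s.data.grouped(chunkSize).toSeq
      case m: MapFlow[_, A] => mapChunks(m)
    }
    private def mapChunks[X, A](m: MapFlow[X, A]): Seq[Seq[A]] =
      chunks(m.in).map(_.map(m.f))
  }

  // The same description can be handed to either weaver.
  val doubled: Flow[Int] = Source(Seq(1, 2, 3)).map(_ * 2)
}
```

Calling `WeaverSketch.InMemoryWeaver.weave(WeaverSketch.doubled)` and `new WeaverSketch.ChunkedWeaver(2).weave(WeaverSketch.doubled)` should yield the same elements; only the execution strategy differs.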

Page 10: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Cluster Weaver: Logical Plan to Physical Plan on Cluster

•  Physical plan on cluster

[Diagram: Silk[people] -> Scatter -> inputs I1-I3 -> Partition (hashing) into P1-P3 -> serialization, shuffle, deserialization -> S1-S3 -> merge sort -> R1-R3]
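The partition (hashing) and shuffle steps named in the diagram can be sketched locally with plain Scala collections. This is a hedged illustration, not Silk's cluster weaver: `partitionOf`, `scatter`, and `shuffle` are hypothetical helpers, serialization is omitted, and an ordinary sort stands in for the merge sort step.

```scala
// Illustrative sketch only: partitionOf, scatter and shuffle are hypothetical
// helpers that mimic the diagram on one machine; no serialization is done.
object ShuffleSketch {
  // Assign a record to one of n partitions by hashing its key.
  def partitionOf[K](key: K, n: Int): Int = Math.floorMod(key.##, n)

  // Each input split scatters its records into up to n buckets.
  def scatter[K, V](split: Seq[(K, V)], n: Int): Map[Int, Seq[(K, V)]] =
    split.groupBy { case (k, _) => partitionOf(k, n) }

  // Shuffle: gather bucket i from every split, then sort the gathered
  // records by key (a plain sort stands in for the merge sort step).
  def shuffle[K: Ordering, V](splits: Seq[Seq[(K, V)]], n: Int): Seq[Seq[(K, V)]] = {
    val scattered = splits.map(scatter(_, n))
    (0 until n).map { i =>
      scattered.flatMap(_.getOrElse(i, Seq.empty)).sortBy(_._1)
    }
  }
}
```

With integer keys and two partitions, even keys from all splits end up together in partition 0 and odd keys in partition 1, each sorted by key.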

Page 11: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

DAG-based Data Processing Engines

•  Spark
    •  Creates a task schedule for distributed processing
•  Summingbird
    •  Integrates stream and batch data processing
    •  e.g., running Scalding and Storm at the same time
•  Apache Tez
    •  Creates a DAG schedule for optimizing MapReduce pipelines
•  GNU Makefile
    •  Describes a pipeline of UNIX commands

Why do we need another framework?

Page 12: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Challenge: Isolate Code Writing and Its Execution

•  Why can't we run the program until we finish writing it?
•  How can we depart from the compile-then-run paradigm?

[Diagram: Silk[A] (operation DAG) -> Weave (Execute) -> Silk Product; the weaver produces multiple Results]

Page 13: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

•  W

Page 14: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Genome Science is a Big Data Science

•  By sequencing, we can find 3 million SNPs for each person
    •  To find the cause of a disease (one or a few SNPs), we need to sequence as many samples as possible to narrow down the candidate SNPs
•  Input: FASTQ file(s), 500GB (50x coverage, 200 million entries)
    •  DNA sequencer (Illumina, PacBio, etc.)
•  f: An alignment program
•  Output: Alignment results, 750GB (sequence + alignment data)
•  Total storage space required: 1.2TB

[Diagram: Input -> f -> Output]

University of Tokyo Genome Browser (UTGB)

Page 15: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Human Genome Data Processing Workflows in Silk

•  c"(UNIX Command)"

Page 16: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Human Genome Data Processing Workflows

•  Makefile: The result ($@) is stored in a file
•  Silk: The result is stored in a variable
•  Computation of each command may take 1 or more hours

Page 17: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

SBT: A Good Hint

•  SBT
    •  Supports incremental compilation and testing
•  > sbt ~test-only
    •  Monitors source code changes
    •  Runs specific tests
•  > sbt ~test-quick
    •  Runs failed tests only
•  How do we compute the not-yet-started part of a Scala program?
•  We need to know:
    •  A->B and D->E are running
    •  If B is finished, we can start B->C

[Diagram: dataflow graph with nodes A, B, C, D, E, F, G and functions f, g]
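Deciding which not-yet-started steps may begin comes down to tracking which outputs are finished and which are running. A minimal sketch of that bookkeeping follows; `Task` and `ready` are hypothetical names, and the dependency edges used in the usage note (A->B, B->C, D->E) are taken from the bullets above rather than from the full diagram.

```scala
// Illustrative sketch only: Task and ready are hypothetical names used to
// show the bookkeeping, not Silk's scheduler.
object ReadySketch {
  // Each task produces `output` from `inputs`; names mirror program variables.
  final case class Task(output: String, inputs: Set[String])

  // Tasks that can start now: every input is finished, and the task itself
  // is neither running nor already finished.
  def ready(tasks: Seq[Task], finished: Set[String], running: Set[String]): Seq[Task] =
    tasks.filter { t =>
      !finished(t.output) && !running(t.output) && t.inputs.subsetOf(finished)
    }
}
```

For example, with tasks B (from A), C (from B), and E (from D), nothing is ready while B and E are running; once B moves to the finished set, C becomes ready.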

Page 18: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Writing A Dataflow

•  Apply function f to the input A, then produce the output B
    •  This step may take more than 1 hour in big data analysis

[Diagram: Program v1: A -> f -> B]

val B = A.map(f)

Page 19: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Distribution and Recovery

•  Resume only B2 = A2.map(f)

[Diagram: Program v1 (A -> f -> B) is split into partitions A0, A1, A2 -> f -> B0, B1, B2; a failure occurs in one partition, and only that partition is retried]
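Retrying only the failed partition can be sketched as follows. This is a local illustration under stated assumptions: `runAll` and `retryFailed` are hypothetical helpers, partitions are plain strict sequences, and a thrown exception stands in for a node failure.

```scala
import scala.util.{Failure, Success, Try}

// Illustrative sketch only: runAll and retryFailed are hypothetical helpers;
// an exception thrown by f stands in for a node failure.
object RecoverySketch {
  // Apply f to each partition independently and record each outcome.
  def runAll[A, B](parts: Vector[Seq[A]], f: A => B): Vector[Try[Seq[B]]] =
    parts.map(p => Try(p.map(f)))

  // Keep successful partitions as they are; recompute only the failed ones.
  def retryFailed[A, B](
      parts: Vector[Seq[A]],
      results: Vector[Try[Seq[B]]],
      f: A => B
  ): Vector[Try[Seq[B]]] =
    parts.zip(results).map {
      case (_, ok @ Success(_)) => ok
      case (p, Failure(_))      => Try(p.map(f))
    }
}
```

If B0 and B1 succeeded and B2 failed, `retryFailed` applies f only to the elements of A2, which corresponds to resuming only B2 = A2.map(f).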

Page 20: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Extending Dataflows

•  While running program v1, we may want to add more code (program v2)
•  We need to know that variable B is now being processed

[Diagram: Program v1: A -> f -> B; Program v2 adds B -> g -> C]

Page 21: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Labeling Program with Variable Names

•  Storing intermediate results using variable names
    •  variable names := program markers
•  But we lose the variable names after compilation
•  Solution: Extract variable names from the AST at compile time
    •  Using Scala Macros (since Scala 2.10)

val B = A.map(f)
val C = B.map(g)

[Diagram: Program v1: A -> f -> B; Program v2: B -> g -> C]

Page 22: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Scala Program (AST) to DAG Schedule (Logical Plan)

•  Translate a program into a set of Silk operation objects
    •  val B = MapOp(input:A, output:"B", function:f)
    •  val C = MapOp(input:B, output:"C", function:g)
•  Operations in Silk form a DAG
    •  val C = MapOp(input:MapOp(input:A, output:"B", function:f), output:"C", function:g)

[Diagram: Program v1: A -> f -> B; Program v2: B -> g -> C]
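The operation objects on this slide are written as pseudocode. A runnable approximation might look like the sketch below, with the caveat that `Op`, `Input`, `MapOp`, and `schedule` are hypothetical definitions that only mirror the MapOp(input, output, function) shape from the slide; the output names are passed by hand here, whereas Silk obtains them with macros.

```scala
// Illustrative sketch only: Op, Input, MapOp and schedule are hypothetical and
// merely mirror the MapOp(input, output, function) shape shown on the slide.
object OpDagSketch {
  sealed trait Op {
    def output: String   // the variable name that labels this node
    def inputs: Seq[Op]  // upstream nodes this operation depends on
  }
  final case class Input(output: String) extends Op {
    def inputs: Seq[Op] = Nil
  }
  final case class MapOp(input: Op, output: String, function: Any => Any) extends Op {
    def inputs: Seq[Op] = Seq(input)
  }

  // Walk the DAG from its inputs to the requested node, listing each label once.
  def schedule(op: Op): Seq[String] =
    (op.inputs.flatMap(schedule) :+ op.output).distinct

  // Placeholder functions; the names f and g follow the slide.
  val f: Any => Any = (x: Any) => x
  val g: Any => Any = (x: Any) => x

  val A: Op = Input("A")
  val B: Op = MapOp(input = A, output = "B", function = f)
  val C: Op = MapOp(input = B, output = "C", function = g)
}
```

Because C holds a reference to B, which holds a reference to A, `OpDagSketch.schedule(OpDagSketch.C)` walks the nested structure and lists the labels in dependency order.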

Page 23: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Using Scala Macros

•  Produce operation objects with Scala Macros
•  map(f:A=>B) produces MapOp[A, B](…)
•  Why do we need to use a macro here?
    •  To extract FContext (target variable name, enclosing method, class, etc.) from the AST.

Page 24: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

•  s

Page 25: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Extract target variable name and enclosing method

Page 26: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Finding Target Variable

Page 27: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

•  Translate a program into a set of Silk operation objects
    •  val B = MapOp(input:A, output:"B", function:f)
    •  val C = MapOp(input:B, output:"C", function:g)
•  Silk uses these variable names to store the intermediate data

[Diagram: Program v1: A -> f -> B; Program v2: B -> g -> C]
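Storing intermediate data under variable names is what lets program v2 reuse what program v1 already computed. The sketch below illustrates the idea with an in-memory map; `CachingWeaver`, `Input`, and `MapOp` are hypothetical definitions, restricted to `Int` data for brevity, and are not Silk's implementation.

```scala
import scala.collection.mutable

// Illustrative sketch only: Op, Input, MapOp and CachingWeaver are hypothetical,
// and the data is restricted to Int to keep the example short.
object NamedResultsSketch {
  sealed trait Op { def output: String }
  final case class Input(output: String, data: Seq[Int]) extends Op
  final case class MapOp(input: Op, output: String, function: Int => Int) extends Op

  final class CachingWeaver {
    // Intermediate results, keyed by the variable name of the operation.
    private val store = mutable.Map.empty[String, Seq[Int]]
    // Names that were actually computed, in the order they were computed.
    var evaluated: Vector[String] = Vector.empty

    def weave(op: Op): Seq[Int] = store.get(op.output) match {
      case Some(result) => result // already stored under this name: reuse it
      case None =>
        val result = op match {
          case Input(_, data)  => data
          case MapOp(in, _, f) => weave(in).map(f)
        }
        evaluated = evaluated :+ op.output
        store(op.output) = result
        result
    }
  }
}
```

Weaving B first (program v1) and then C (program v2) with the same weaver computes A and B only once; the second call looks B up by name and evaluates only C.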

Page 28: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

•  Silk defines various types of operations

Page 29: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Object-Oriented Dataflow Programming

•  Reusing and overriding dataflows

Page 30: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Summary

•  Declarative-style coding is necessary for creating a DAG schedule
    •  DAG schedules are labeled with variable names using Scala Macros
•  Weaver: An abstraction of how to execute the code
    •  Weaver manages the running and finished parts of the code

[Diagram: Silk[A] (operation DAG) -> Weave (Execute) -> Silk Product; a weaver (e.g., a cluster weaver) produces multiple Results]

Page 31: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

http://xerial.org/silk

Page 32: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Copyright ©2014 Treasure Data. All Rights Reserved.

WE ARE HIRING!

www.treasuredata.com

Page 33: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

[Architecture diagram: Silk[A] -> Static optimization -> DAG Schedule -> weave -> Silk Master -> dispatch -> Silk Clients, coordinated through ZooKeeper]

User program builds workflows (Silk[A])
•  read file, toSilk
•  map, reduce, join, groupBy
•  UNIX commands
•  etc.

weave
•  Register ClassBox
•  Submit schedule

Local ClassBox: classpaths & local jar files

Silk Master (Resource Table (CPU, memory))
•  Dispatches tasks to clients
•  Manages master resource table
•  Authorizes resource allocation
•  Automatic recovery by leader election in ZK

Silk Client (Task Scheduler, Task Status, Resource Monitor, Task Executor, Data Server)
•  Submits tasks
•  Run-time optimization
•  Resource allocation
•  Monitoring resource usage
•  Launches Web UI
•  Manages assigned task status
•  Object serialization/deserialization
•  Serves slice data

ZooKeeper (Node Table, Slice Table, ClassBox Table), ensemble mode (at least 3 ZK instances)
•  Leader election
•  Collects locations of slices and ClassBox jars
•  Watches active nodes
•  Watches available resources

Silk[A]
•  SilkSeq[A] -> weave -> Seq[A] (sequence of objects)
•  SilkSingle[A] -> weave -> A (a single object)

Weaving Silk materializes objects (local machine or cluster)

Page 34: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Integrating Varieties of Data Sources

•  WormTSS: http://wormtss.utgenome.org/
    •  Integrating various data sources

Page 35: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Varieties of Data Analysis

Using R, JFreeChart, etc. We need an automated pipeline to redo the entire analysis in order to answer the paper review within a month.

Page 36: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Makefile

•  Describes dependencies of commands through files
    •  Good: We can resume and update the data flow processing
    •  Bad: The Makefile of the WormTSS analysis exceeds 1,000 lines

Page 37: Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Splitting Data Analysis Into Command Modules

•  Added a new command each time we needed a new analysis or data processing step
•  The result:
    •  hundreds of commands!
    •  # of files limits the parallelism