Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo
Weaving Dataflows with Silk
Taro L. Saito
Treasure Data, [email protected]
September 6th, 2014 ScalaMatsuri @ Tokyo
xerial.org/silk
About Me
Treasure Data Console
Processing Job Table
Functional Style Writing
Need an Optimization?
Procedural Style Writing
• Describes how to process data.
Declarative Style Writing
• Less programming
• System decides how to optimize the code
• Hash joins, bloom filters, and various other optimization techniques are now available.
Weaving Silk
• Making data processing code independent of the execution method!
[Figure: a weaver weaves (executes) Silk[A] (operation DAG) into a result (Silk product). Weavers shown: in-memory weaver, cluster weaver (Spark?), MapReduce weaver, your own weaver (using TD?).]
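The separation between the operation DAG and its execution can be sketched in plain Scala. This is a minimal illustration, not the actual Silk API: `Weaver`, `Source`, `MapOp`, `FilterOp`, `InMemoryWeaver`, and `LoggingWeaver` are names assumed for this sketch.

```scala
// Hypothetical sketch; none of these definitions are taken from the Silk code base.
trait Weaver {
  // Execute a dataflow and materialize its result.
  def weave[A](silk: Silk[A]): Seq[A]
}

// A node of the operation DAG. Building nodes runs nothing.
sealed trait Silk[A] {
  def label: String
  def map[B](f: A => B): Silk[B] = MapOp(this, f)
  def filter(p: A => Boolean): Silk[A] = FilterOp(this, p)
  // How this node produces its output; inputs are obtained through the weaver.
  def compute(w: Weaver): Seq[A]
}
final case class Source[A](data: Seq[A]) extends Silk[A] {
  val label = "Source"
  def compute(w: Weaver): Seq[A] = data
}
final case class MapOp[A, B](input: Silk[A], f: A => B) extends Silk[B] {
  val label = "MapOp"
  def compute(w: Weaver): Seq[B] = w.weave(input).map(f)
}
final case class FilterOp[A](input: Silk[A], p: A => Boolean) extends Silk[A] {
  val label = "FilterOp"
  def compute(w: Weaver): Seq[A] = w.weave(input).filter(p)
}

// Weaver 1: evaluates everything in the current JVM.
object InMemoryWeaver extends Weaver {
  def weave[A](silk: Silk[A]): Seq[A] = silk.compute(this)
}

// Weaver 2: same results, but records each operation as it completes.
// It stands in for a weaver that would send work somewhere else.
final class LoggingWeaver extends Weaver {
  val log = scala.collection.mutable.ArrayBuffer.empty[String]
  def weave[A](silk: Silk[A]): Seq[A] = {
    val out = silk.compute(this)
    log += silk.label
    out
  }
}

object WeaverDemo {
  // The dataflow is written once, with no reference to any weaver.
  val dataflow: Silk[Int] = Source(Seq(1, 2, 3, 4)).map(_ * 10).filter(_ > 10)

  def main(args: Array[String]): Unit = {
    println(InMemoryWeaver.weave(dataflow))
    val lw = new LoggingWeaver
    println(lw.weave(dataflow))
    println(lw.log.mkString(" -> "))
  }
}
```

The same `dataflow` value is handed to both weavers unchanged; only the weaver decides how (and where) each node is computed.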
Cluster Weaver: Logical Plan to Physical Plan on Cluster
• Physical plan on cluster
[Figure: physical plan for Silk[people] on a cluster. Node labels: I1 to I3, P1 to P3, S1 to S3, R1 to R3. Stage labels: scatter, partition (hashing), serialization, shuffle, deserialization, merge sort.]
DAG-based Data Processing Engines
• Spark
  • Creates a task schedule for distributed processing
• Summingbird
  • Integrates stream and batch data processing
  • e.g. Running Scalding and Storm at the same time
• Apache Tez
  • Creates a DAG schedule for optimizing MapReduce pipelines
• GNU Makefile
  • Describes a pipeline of UNIX commands
Why do we need another framework?
Challenge: Isolate Code Writing and Its Execution
• Why can't we run the program until we finish writing it?
• How can we depart from the compile-then-run paradigm?
[Figure: a weaver weaves (executes) Silk[A] (operation DAG) into results (Silk products).]
Genome Science is A Big Data Science
• By sequencing, we can find 3 million SNPs for each person
• To find the cause of a disease (one or a few SNPs), we need to sequence as many samples as possible to narrow down the candidate SNPs
• Input: FASTQ file(s), 500GB (50x coverage, 200 million entries)
  • DNA sequencer (Illumina, PacBio, etc.)
• f: An alignment program
• Output: Alignment results, 750GB (sequence + alignment data)
• Total storage space required: 1.2TB
[Figure: Input -> f -> Output. University of Tokyo Genome Browser (UTGB)]
Human Genome Data Processing Workflows in Silk
• c"(UNIX command)"
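A `c"..."` form like the one above can be provided by a custom string interpolator. The following is a minimal sketch of how such an interpolator might be defined; `Command`, `CommandSyntax`, and the file names are assumptions for illustration and are not taken from Silk.

```scala
// Hypothetical sketch: `Command` and this `c` interpolator are illustrative only.
final case class Command(line: String, inputs: Seq[Any])

object CommandSyntax {
  implicit class CommandInterpolator(sc: StringContext) {
    // c"..." builds a description of the command; it does not run it.
    // The interpolated arguments are kept, so a scheduler could treat them as inputs.
    def c(args: Any*): Command = Command(sc.s(args: _*), args)
  }
}

object CommandDemo {
  import CommandSyntax._

  // Placeholder file names, assumed for this example.
  val ref = "hg19.fa"
  val fastq = "sample.fastq"
  val align: Command = c"bwa mem $ref $fastq"

  def main(args: Array[String]): Unit = {
    println(align.line)
    println(align.inputs)
  }
}
```

Because the interpolator returns a value rather than executing anything, the command becomes one more node that can be scheduled later.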
Human Genome Data Processing Workflows
• Makefile: The result ($@) is stored in a file
• Silk: The result is stored in a variable
• Computation of each command may take an hour or more
SBT: A Good Hint
• SBT
  • Supports incremental compilation and testing
• > sbt ~test-only
  • Monitors source code changes
  • Runs specific tests
• > sbt ~test-quick
  • Runs failed tests only
• How do we compute the not-yet-started part of a Scala program?
• We need to know:
  • A->B and D->E are running
  • If B is finished, we can start B->C
[Figure: dataflow graph over nodes A to G, with functions f and g]
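The bookkeeping described above (which steps are running, which outputs are finished, what can start next) can be sketched as a small function. `Step` and `ready` are hypothetical names, and the example uses only the three steps named on the slide.

```scala
// Hypothetical sketch of "what can start now?" over a dataflow graph.
object ReadySteps {
  // Step("A", "B") stands for the step A->B: it consumes A and produces B.
  final case class Step(from: String, to: String)

  // A step can start when its input is available, its output has not been
  // produced yet, and it is not already running.
  def ready(steps: Set[Step], available: Set[String], running: Set[Step]): Set[Step] =
    steps.filter(s => available(s.from) && !available(s.to) && !running(s))

  val steps: Set[Step] = Set(Step("A", "B"), Step("B", "C"), Step("D", "E"))

  def main(args: Array[String]): Unit = {
    // A->B and D->E are running: nothing new can start.
    println(ready(steps, Set("A", "D"), Set(Step("A", "B"), Step("D", "E"))))
    // B has finished: B->C becomes ready while D->E keeps running.
    println(ready(steps, Set("A", "B", "D"), Set(Step("D", "E"))))
  }
}
```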
Writing A Dataflow
• Apply function f to the input A, then produce the output B
• This step may take more than an hour in big data analysis
val B = A.map(f)
[Figure: a program (v1) in which f maps A to B]
Distribution and Recovery
• Resume only B2 = A2.map(f)
[Figure: in program v1, A is split into A0, A1, A2 and f produces B0, B1, B2; one partition fails and is retried]
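Resuming only the failed partition can be sketched with `scala.util.Try`. This is an in-process illustration under assumed names (`runAll`, `retryFailed`); in a real cluster the partitions would live on different nodes.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical sketch: per-partition results, retrying only the failures.
object PartitionRetry {
  // Apply f to every partition and keep one outcome per partition.
  def runAll[A, B](parts: Vector[Seq[A]], f: A => B): Vector[Try[Seq[B]]] =
    parts.map(p => Try(p.map(f)))

  // Re-run only the partitions that failed; successful results are reused.
  def retryFailed[A, B](parts: Vector[Seq[A]],
                        prev: Vector[Try[Seq[B]]],
                        f: A => B): Vector[Try[Seq[B]]] =
    parts.zip(prev).map {
      case (_, ok @ Success(_)) => ok
      case (p, Failure(_))      => Try(p.map(f))
    }
}

object RetryDemo {
  import PartitionRetry._

  var calls = 0                 // counts how often f is invoked
  private var failOnce = true   // simulates a single transient failure

  val f: Int => Int = { x =>
    calls += 1
    if (x == 5 && failOnce) {
      failOnce = false
      throw new RuntimeException("node lost")
    }
    x * 2
  }

  val a = Vector(Seq(1, 2), Seq(3, 4), Seq(5, 6))   // A0, A1, A2
  val first = runAll(a, f)                          // B0 and B1 succeed, B2 fails
  val second = retryFailed(a, first, f)             // only B2 = A2.map(f) is re-run
}
```

In the demo, `f` is not applied again to A0 or A1 during the retry; only the elements of A2 are reprocessed.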
Extending Dataflows
• While program v1 is running, we may want to add more code (program v2)
• We need to know that variable B is now being processed
[Figure: program v1 (A -> B via f) extended by program v2 (B -> C via g)]
Labeling Program with Variable Names
• Storing intermediate results using variable names
  • variable names := program markers
• But we lose the variable names after compilation
• Solution: Extract variable names from the AST at compile time
  • Using Scala Macros (since Scala 2.10)
val B = A.map(f)
val C = B.map(g)
[Figure: program v1 (A -> B via f) extended by program v2 (B -> C via g)]
Scala Program (AST) to DAG Schedule (Logical Plan)
• Translate a program into a set of Silk operation objects
  • val B = MapOp(input:A, output:"B", function:f)
  • val C = MapOp(input:B, output:"C", function:g)
• Operations in Silk form a DAG
  • val C = MapOp(input:MapOp(input:A, output:"B", function:f), output:"C", function:g)
[Figure: program v1 (A -> B via f) extended by program v2 (B -> C via g)]
Using Scala Macros
• Produce operation objects with Scala Macros
• map(f:A=>B) produces MapOp[A, B](…)
• Why do we need a macro here?
  • To extract FContext (target variable name, enclosing method, class, etc.) from the AST.
Extract target variable name and enclosing method
Finding Target Variable
• Translate a program into a set of Silk operation objects
  • val B = MapOp(input:A, output:"B", function:f)
  • val C = MapOp(input:B, output:"C", function:g)
• Silk uses these variable names to store the intermediate data
[Figure: program v1 (A -> B via f) extended by program v2 (B -> C via g)]
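Storing intermediate data under variable names can be sketched as follows. In this illustration the names are passed by hand, whereas the slides describe extracting them with a macro; `CachingWeaver` and the other definitions are assumptions, not the Silk implementation.

```scala
import scala.collection.mutable

// Hypothetical sketch: operations labeled with names, results stored by name.
object NamedDataflow {
  sealed trait Silk[A] {
    def name: String
    def map[B](name: String)(f: A => B): Silk[B] = MapOp(name, this, f)
    def compute(w: CachingWeaver): Seq[A]
  }
  final case class Source[A](name: String, data: Seq[A]) extends Silk[A] {
    def compute(w: CachingWeaver): Seq[A] = data
  }
  final case class MapOp[A, B](name: String, input: Silk[A], f: A => B) extends Silk[B] {
    def compute(w: CachingWeaver): Seq[B] = w.weave(input).map(f)
  }

  // Keeps every intermediate result under the name of its operation.
  final class CachingWeaver {
    private val store = mutable.Map.empty[String, Seq[Any]]
    // Names actually computed (not served from the store), in order.
    val computed = mutable.ArrayBuffer.empty[String]

    def weave[A](silk: Silk[A]): Seq[A] =
      store.get(silk.name) match {
        // Safe only under this sketch's assumption that names are unique.
        case Some(cached) => cached.asInstanceOf[Seq[A]]
        case None =>
          computed += silk.name
          val out = silk.compute(this)
          store(silk.name) = out
          out
      }
  }
}

object NamedDemo {
  import NamedDataflow._
  val weaver = new CachingWeaver

  // Program v1
  val A = Source("A", Seq(1, 2, 3))
  val B = A.map("B")(_ + 1)
  val v1 = weaver.weave(B)

  // Program v2 extends v1; B is looked up by name instead of recomputed.
  val C = B.map("C")(_ * 2)
  val v2 = weaver.weave(C)
}
```

Weaving `C` after `B` computes only the new operation, because the result named "B" is already in the store.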
• Silk defines various types of operations
Object-Oriented Dataflow Programming
• Reusing and overriding dataflows
Summary
• Declarative-style coding is necessary for creating a DAG schedule
• DAG schedules are labeled with variable names using Scala Macros
• Weaver: An abstraction of how to execute the code
• Weaver manages the running and finished parts of the code
[Figure: a weaver (e.g. the cluster weaver) weaves (executes) Silk[A] (operation DAG) into results (Silk products).]
http://xerial.org/silk
Copyright ©2014 Treasure Data. All Rights Reserved.
WE ARE HIRING!
www.treasuredata.com
[Figure: Silk cluster architecture. Components: user program (local machine), Silk Master, Silk Clients (each with a Task Scheduler, Resource Monitor, Task Executor, and Data Server), and ZooKeeper (cluster). Tables: Resource Table (CPU, memory), Node Table, Slice Table, Task Status, ClassBox Table. Edge labels: weave, dispatch.]
User program builds workflows (Silk[A])
• read file, toSilk
• map, reduce, join, groupBy
• UNIX commands
• etc.
Static optimization, DAG schedule
• Register ClassBox (local ClassBox classpaths & local jar files)
• Submit schedule
Silk Master
• Dispatches tasks to clients
• Manages master resource table
• Authorizes resource allocation
• Automatic recovery by leader election in ZK
Silk Client
• Submits tasks
• Run-time optimization
• Resource allocation
• Monitoring resource usage
• Launches Web UI
• Manages assigned task status
• Object serialization/deserialization
• Serves slice data
ZooKeeper, ensemble mode (at least 3 ZK instances)
• Leader election
• Collects locations of slices and ClassBox jars
• Watches active nodes
• Watches available resources
Weaving Silk materializes objects
• SilkSeq[A]: weave yields Seq[A], a sequence of objects
• SilkSingle[A]: weave yields a single object
Integrating Varieties of Data Sources
• WormTSS: http://wormtss.utgenome.org/
• Integrating various data sources
Varieties of Data Analysis
Using R, JFreeChart, etc.
We need an automated pipeline to redo the entire analysis so that we can answer the paper review within a month.
Makefile
• Describes dependencies of commands through files
• Good: We can resume and update the data flow processing
• Bad: Makefile of WormTSS analysis exceeds 1,000 lines
Splitting Data Analysis Into Command Modules
• We added a new command whenever we needed a new analysis or data processing step
• The result:
  • Hundreds of commands!
  • The number of files limits the parallelism