Streaming Distributed Data Processing with Silk #deim2014

Taro L. Saito, University of Tokyo. March 3rd, 2014, DEIM2014. xerial.org/silk, Twitter @taroleo

description

A framework written in Scala for describing distributed data processing programs.

Transcript of Streaming Distributed Data Processing with Silk #deim2014

Page 1: Streaming Distributed Data Processing with Silk #deim2014

Streaming Distributed Data Processing with Silk

Taro L. Saito, University of Tokyo

[email protected]

March 3rd, 2014 DEIM2014

xerial.org/silk Twitter @taroleo

Page 2: Streaming Distributed Data Processing with Silk #deim2014

Distributed Data Processing

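Translate this data processing program into a cluster computing program.

[Figure: the dataflow A → f → B → g → C, and its cluster form in which A is split into A0, A1, A2, f runs as a map producing B0, B1, B2, and g runs as a reduce producing C.]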
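The translation on this slide can be sketched on a single machine as follows. This is only an illustration with plain Scala collections, not Silk code: the object name `PartitionedMapReduce` and the helper `split` are hypothetical, and the reduce function `g` is assumed to be associative so that per-partition results can be combined in any grouping.

```scala
object PartitionedMapReduce {
  // Hypothetical helper: cut the input into at most numPartitions blocks (A0, A1, A2, ...).
  def split[A](data: Seq[A], numPartitions: Int): Seq[Seq[A]] = {
    val size = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    data.grouped(size).toSeq
  }

  // Map f over every partition independently, then reduce with g.
  // Assumes g is associative and data is non-empty (reduce fails on empty input).
  def run[A, B](data: Seq[A], numPartitions: Int)(f: A => B)(g: (B, B) => B): B = {
    val partitionsA = split(data, numPartitions)        // A0, A1, A2
    val partitionsB = partitionsA.map(_.map(f))         // B0, B1, B2
    val partials    = partitionsB.map(_.reduce(g))      // one partial result per partition
    partials.reduce(g)                                  // C
  }
}
```

On a cluster, each `partitionsA(i).map(f)` would run on a different node; here the partitions are simply processed one after another.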

Page 3: Streaming Distributed Data Processing with Silk #deim2014

Streaming Distributed Data Processing

What is streaming?   

Silk: A framework for building and running complex workflows of distributed data processing

[Figure: a workflow DAG over datasets A through G, with edges labeled f and g.]

Page 4: Streaming Distributed Data Processing with Silk #deim2014

Problem Definition

How do we keep the distributed data processing running while we extend the program?

[Figure: the same workflow DAG over datasets A through G, with edges labeled f and g.]

Page 5: Streaming Distributed Data Processing with Silk #deim2014

Silk

Describing dataflows in Scala

A dataflow in Silk is a sequence of function calls.

Type-safe and concise syntax, easy to learn.

Silk[A]: a set of objects of type A
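A rough illustration of "a dataflow is a sequence of function calls" is shown below. It uses an ordinary `Seq[Person]` as a stand-in for `Silk[Person]`, so it is not Silk code; the `Person` type and the function `adultNamesByDept` are made up for the example.

```scala
object DataflowSketch {
  // Hypothetical record type for the example.
  final case class Person(name: String, dept: String, age: Int)

  // Each call consumes the previous result; the element types are checked at compile time.
  // Seq stands in for Silk here, so everything runs locally and eagerly.
  def adultNamesByDept(people: Seq[Person]): Map[String, Seq[String]] =
    people
      .filter(_.age >= 20)                 // keep adults only
      .groupBy(_.dept)                     // group by department name
      .map { case (dept, members) => dept -> members.map(_.name).sorted }
}
```

Passing a function of the wrong type to any step (for example `_.age.length`) is rejected by the compiler, which is the type safety the slide refers to.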

Page 6: Streaming Distributed Data Processing with Silk #deim2014

Object-Oriented Dataflow Programming

Reusing and overriding dataflow programs
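One way to picture reuse and overriding is a dataflow defined in a trait whose steps can be replaced by a mix-in. The sketch below uses plain Scala collections rather than Silk, and the names `WordCount`, `LowerCase`, `Basic`, and `CaseInsensitive` are invented for the example.

```scala
object OverridableDataflow {
  // A small dataflow whose steps are ordinary methods.
  trait WordCount {
    def input: Seq[String]
    def tokenize(line: String): Seq[String] =
      line.split("\\s+").toSeq.filter(_.nonEmpty)
    def counts: Map[String, Int] =
      input.flatMap(tokenize).groupBy(identity).map { case (w, ws) => w -> ws.size }
  }

  // A mix-in that overrides one step and reuses the rest of the dataflow.
  trait LowerCase extends WordCount {
    override def tokenize(line: String): Seq[String] =
      super.tokenize(line.toLowerCase)
  }

  class Basic(val input: Seq[String]) extends WordCount
  class CaseInsensitive(val input: Seq[String]) extends WordCount with LowerCase
}
```

`CaseInsensitive` changes only the tokenizing step; `counts` is inherited unchanged.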

Page 7: Streaming Distributed Data Processing with Silk #deim2014

Big Data Volumes in Human Genome Analysis

Input: FASTQ file(s), 500GB (50x coverage, 200 million entries), from a DNA sequencer (Illumina, PacBio, etc.)

f: an alignment program

Output: alignment results, 750GB (sequence + alignment data)

Total storage space required: 1.2TB

Computational time required: 1 day (using hundreds of CPUs)

[Figure: Input → f → Output]

University of Tokyo Genome Browser (UTGB)

Page 8: Streaming Distributed Data Processing with Silk #deim2014

Varieties of Scientific Data and Analysis

WormTSS: http://wormtss.utgenome.org/

Integrating various data sources, hundreds of data analysis…

Page 9: Streaming Distributed Data Processing with Silk #deim2014

Produced Thousands of Data Analysis Charts

Using R, JFreeChart, etc.

We need an automated pipeline that can redo the entire analysis within a month, in order to respond to the paper review.

Page 10: Streaming Distributed Data Processing with Silk #deim2014

Writing A Dataflow

Apply function f to the input A to produce the output B.

In big data analysis, this step alone may take more than an hour.

[Figure: A → f → B]

val B = A.map(f)

Program v1

Page 11: Streaming Distributed Data Processing with Silk #deim2014

Distribution and Fault Tolerance

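When a task fails, resume only the failed part, e.g. B2 = A2.map(f).

[Figure: Program v1 (A → f → B) runs as three tasks A0 → B0, A1 → B1, A2 → B2; one task fails and only that task is retried.]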
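The idea of retrying only the failed partition can be sketched as below. This is a local simulation and not Silk's recovery mechanism: `PartitionRetry`, `mapWithRetry`, and the `maxAttempts` parameter are assumptions made for the example, and failures are modeled as exceptions thrown by `f`.

```scala
import scala.util.Try

object PartitionRetry {
  // Apply f to each partition. Results of successful partitions are kept,
  // and only the partitions that threw an exception are attempted again.
  def mapWithRetry[A, B](partitions: Vector[Seq[A]], maxAttempts: Int)(f: A => B): Vector[Seq[B]] = {
    @annotation.tailrec
    def attempt(done: Map[Int, Seq[B]], remaining: Int): Map[Int, Seq[B]] = {
      val pending = partitions.indices.filterNot(done.contains)
      if (pending.isEmpty) done
      else if (remaining == 0)
        throw new RuntimeException(s"partitions still failing: ${pending.mkString(", ")}")
      else {
        // Try every pending partition once; failures are dropped and stay pending.
        val succeeded = pending.flatMap { i =>
          Try(partitions(i).map(f)).toOption.map(result => i -> result)
        }
        attempt(done ++ succeeded, remaining - 1)
      }
    }
    val results = attempt(Map.empty, maxAttempts)
    partitions.indices.map(results).toVector
  }
}
```

If the task for A2 fails once, B0 and B1 are computed a single time and only A2.map(f) is run again.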

Page 12: Streaming Distributed Data Processing with Silk #deim2014

Extending Dataflows

While program v1 is running, we add more code (program v2).

How do we reuse the already computed result (B) to generate C?

[Figure: Program v1 computes A → f → B; Program v2 adds B → g → C.]

Page 13: Streaming Distributed Data Processing with Silk #deim2014

Marking a Program

Store intermediate results under their variable names: variable names := program markers!

But variable names are lost after compilation.

Silk therefore extracts the AST and variable names at compile time, using Scala macros (available since Scala 2.10).

val B = A.map(f)

val C = B.map(g)

[Figure: Program v1 computes A → f → B; Program v2 adds B → g → C.]
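The role of variable names as markers can be sketched without macros by passing the name explicitly. In the sketch below the names "B" and "C" are supplied by hand, whereas the slide describes capturing them at compile time; `ResultStore`, `getOrCompute`, and the concrete `f` and `g` are hypothetical.

```scala
import scala.collection.mutable

object NamedResults {
  // Hypothetical store: materialized results keyed by the variable name that marks them.
  final class ResultStore {
    private val cache = mutable.Map.empty[String, Any]
    var computations = 0 // counts how many results were actually computed

    def getOrCompute[A](name: String)(body: => A): A =
      cache.get(name) match {
        case Some(v) => v.asInstanceOf[A] // reuse the stored result
        case None =>
          computations += 1
          val v = body
          cache(name) = v
          v
      }
  }

  val f: Int => Int = _ + 1
  val g: Int => Int = _ * 2

  // Program v1: val B = A.map(f)
  def programV1(store: ResultStore, a: Seq[Int]): Seq[Int] =
    store.getOrCompute("B")(a.map(f))

  // Program v2: val C = B.map(g), reusing B if it is already stored
  def programV2(store: ResultStore, a: Seq[Int]): Seq[Int] =
    store.getOrCompute("C")(programV1(store, a).map(g))
}
```

Running program v1 and then program v2 against the same store computes B once; v2 only adds the work for C.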

Page 14: Streaming Distributed Data Processing with Silk #deim2014

Scala Program (AST) to DAG Schedule (Logical Plan)

Translating a program (AST) into a set of Silk operations (DAG):

val B = MapOp(input:A, output:B, function:f)

val C = MapOp(input:B, output:C, function:g)

Operations in Silk can be nested:

val C = MapOp(input:MapOp(input:A, output:B, function:f), output:C, function:g)

[Figure: Program v1 computes A → f → B; Program v2 adds B → g → C.]
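A nested operation tree of this shape can be modeled with case classes and flattened into a schedule. The sketch below is an assumption about how such a plan might look, not Silk's actual classes: `Source` is invented, functions are represented by their names as strings, and `schedule` is a hypothetical helper.

```scala
object LogicalPlan {
  sealed trait Op { def name: String }

  // A named input dataset (hypothetical; the slide only shows MapOp).
  final case class Source(name: String) extends Op

  // Mirrors the slide's MapOp(input, output, function); function is just a label here.
  final case class MapOp(input: Op, output: String, function: String) extends Op {
    def name: String = output
  }

  // Walk the nested inputs first, so that dependencies appear before their users.
  def schedule(op: Op): List[String] = op match {
    case Source(n)          => List(s"load $n")
    case MapOp(in, out, fn) => schedule(in) :+ s"$out = ${in.name}.map($fn)"
  }
}
```

For the nested plan on this slide, `schedule` lists the step for B before the step for C.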

Page 15: Streaming Distributed Data Processing with Silk #deim2014

Weaving Silks

Data analysis code is independent of the weavers.

[Figure: a Silk[A] (operation DAG) is woven by an in-memory weaver, a cluster weaver, or a Hadoop weaver to produce the result.]
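The separation between the operation DAG and the weaver that executes it can be sketched as follows. The types `Plan`, `Lit`, `Mapped`, `Weaver`, `InMemoryWeaver`, and `ChunkedWeaver` are all invented for this illustration and do not reflect Silk's real interfaces; the chunked weaver only imitates partitioned execution on one machine.

```scala
object Weaving {
  // A tiny operation DAG: the analysis code only builds values of this type.
  sealed trait Plan[A]
  final case class Lit[A](data: Seq[A]) extends Plan[A]
  final case class Mapped[A, B](input: Plan[A], f: A => B) extends Plan[B]

  // A weaver decides how a plan is executed.
  trait Weaver {
    def weave[A](plan: Plan[A]): Seq[A]
  }

  // Evaluates everything eagerly in local memory.
  object InMemoryWeaver extends Weaver {
    def weave[A](plan: Plan[A]): Seq[A] = plan match {
      case Lit(data)        => data
      case Mapped(input, f) => weave(input).map(f)
    }
  }

  // Processes map steps chunk by chunk, imitating partitioned execution.
  // chunkSize must be at least 1.
  final class ChunkedWeaver(chunkSize: Int) extends Weaver {
    def weave[A](plan: Plan[A]): Seq[A] = plan match {
      case Lit(data)        => data
      case Mapped(input, f) => weave(input).grouped(chunkSize).flatMap(_.map(f)).toSeq
    }
  }
}
```

The same plan value can be handed to either weaver and yields the same result, which is the independence the slide describes.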

Page 16: Streaming Distributed Data Processing with Silk #deim2014

Cluster Weaver: Logical Plan to Physical Plan on Cluster

Logical plan: GroupByOp(in:people, out:g, key: {_.dept.id})

Physical plan:

[Figure: Silk[people] is scattered into inputs I1, I2, I3; each input is partitioned by hashing into P1, P2, P3; the partitions are serialized (S1, S2, S3), shuffled, deserialized, and merge-sorted into results R1, R2, R3.]
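The partition and shuffle steps of this physical plan can be simulated locally as below. Serialization and merge sort are left out, so this is only a sketch of the hash partitioning idea; `distributedGroupBy` and `partitionOf` are hypothetical names, and `Person` and `Dept` are modeled after the key `_.dept.id` on the slide.

```scala
object HashShuffleGroupBy {
  final case class Dept(id: Int)
  final case class Person(name: String, dept: Dept)

  // Non-negative partition index for a key.
  def partitionOf[K](key: K, numPartitions: Int): Int =
    java.lang.Math.floorMod(key.##, numPartitions)

  // inputs: the scattered input blocks (I1, I2, I3, ...).
  // Returns one grouped result per reducer (R1, R2, R3, ...).
  def distributedGroupBy[A, K](inputs: Seq[Seq[A]], numReducers: Int)(key: A => K): Seq[Map[K, Seq[A]]] = {
    // Partition: every input block is split by the hash of the key.
    val partitioned: Seq[Map[Int, Seq[A]]] =
      inputs.map(_.groupBy(a => partitionOf(key(a), numReducers)))
    // Shuffle and merge: reducer r collects partition r from every input block.
    (0 until numReducers).map { r =>
      val received = partitioned.flatMap(_.getOrElse(r, Seq.empty))
      received.groupBy(key)
    }
  }
}
```

Because the partition index depends only on the key, all records with the same key reach the same reducer, so the per-reducer groups never overlap.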

Page 17: Streaming Distributed Data Processing with Silk #deim2014

[Architecture diagram, reconstructed from its labels]

Local machine

• User program builds workflows as Silk[A]: read file, toSilk, map, reduce, join, groupBy, UNIX commands, etc.
• Static optimization produces a DAG schedule
• Register ClassBox
• Submit schedule
• Local ClassBox: classpaths & local jar files

Cluster

Silk Master (dispatch)

• Dispatches tasks to clients
• Manages master resource table (CPU, memory)
• Authorizes resource allocation
• Automatic recovery by leader election in ZK

Silk Client (Task Scheduler, Resource Monitor, Task Executor, Data Server)

• Submits tasks
• Run-time optimization
• Resource allocation
• Monitoring resource usage
• Launches Web UI
• Manages assigned task status
• Object serialization/deserialization
• Serves slice data

ZooKeeper, ensemble mode (at least 3 ZK instances): Node Table, Slice Table, Task Status, ClassBox Table

• Leader election
• Collects locations of slices and ClassBox jars
• Watches active nodes
• Watches available resources

Weaving Silk materializes objects

• SilkSingle[A] weaves to A, a single object
• SilkSeq[A] weaves to Seq[A], a sequence of objects

Page 18: Streaming Distributed Data Processing with Silk #deim2014

Static Optimization

Tree transformation

• map(f).map(g) => map(g ∘ f) (function composition)
• map(f).filter(p) => mapWithFilter(f, p) (reduces intermediate data)

Pushing down selection

• Retrieves only the accessed fields of an object
• Analyzes the byte code of functions with ASM

Rewriting logical plans using pattern matching in Scala

• Easy to add optimization rules
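The two tree transformations above can be written as pattern-matching rewrite rules. The sketch below is a simplified, untyped model and not Silk's optimizer: the `Op` classes are invented (only the name mapWithFilter comes from the slide), functions are typed as `Any => Any` to keep the example short, and `run` is a hypothetical local evaluator used to check that a rewrite preserves the result.

```scala
object PlanOptimizer {
  sealed trait Op
  final case class Source(name: String) extends Op
  final case class MapOp(input: Op, f: Any => Any) extends Op
  final case class FilterOp(input: Op, p: Any => Boolean) extends Op
  final case class MapWithFilterOp(input: Op, f: Any => Any, p: Any => Boolean) extends Op

  // Optimize the input first, then apply a rewrite rule at the current node.
  def optimize(op: Op): Op = op match {
    case MapOp(in, g) =>
      optimize(in) match {
        case MapOp(in2, f) => MapOp(in2, f.andThen(g)) // map(f).map(g) => map(g ∘ f)
        case other         => MapOp(other, g)
      }
    case FilterOp(in, p) =>
      optimize(in) match {
        case MapOp(in2, f) => MapWithFilterOp(in2, f, p) // map(f).filter(p) => mapWithFilter(f, p)
        case other         => FilterOp(other, p)
      }
    case MapWithFilterOp(in, f, p) => MapWithFilterOp(optimize(in), f, p)
    case s: Source                 => s
  }

  // Hypothetical local evaluator over named in-memory inputs.
  def run(op: Op, data: Map[String, Seq[Any]]): Seq[Any] = op match {
    case Source(n)       => data(n)
    case MapOp(in, f)    => run(in, data).map(f)
    case FilterOp(in, p) => run(in, data).filter(p)
    case MapWithFilterOp(in, f, p) =>
      run(in, data).flatMap { x => val y = f(x); if (p(y)) Some(y) else None }
  }
}
```

Adding a rule means adding one more `case` to `optimize`, which is what makes the rule set easy to extend.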

Page 19: Streaming Distributed Data Processing with Silk #deim2014

Run-time Optimization

Adjusting the number of data splits according to the available cluster resources

Multi-core execution

Omega-based task scheduler

• Shares the cluster resource table between nodes
• Each node determines how to use the resources
• Monitors actual CPU/memory resources periodically

Page 20: Streaming Distributed Data Processing with Silk #deim2014

UNIX Command Workflows in Silk

c"(UNIX command)"
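A prefix such as `c"..."` can be provided in Scala through a custom string interpolator. The sketch below shows only that language mechanism; `UnixCommand` and `CommandInterpolator` are invented names, the command is built as a plain string and never executed, and Silk's own definition of `c` may differ.

```scala
object CommandSyntax {
  // Hypothetical wrapper for a command line; nothing is executed here.
  final case class UnixCommand(commandLine: String)

  // Adds the c"..." prefix to string literals by extending StringContext.
  implicit class CommandInterpolator(val sc: StringContext) {
    def c(args: Any*): UnixCommand =
      UnixCommand(sc.s(args: _*)) // substitute $variables like the standard s"..." interpolator
  }
}
```

After `import CommandSyntax._`, an expression like `c"wc -l $file"` yields a `UnixCommand` value that a workflow could schedule like any other operation.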

Page 21: Streaming Distributed Data Processing with Silk #deim2014

Buffer Management

Silk frequently uses distributed memory (like Spark)

LArray[A]

• Allocates off-heap memory (outside the JVM heap) using sun.misc.Unsafe
• GitHub: https://github.com/xerial/larray
• Immediate memory deallocation (free), to eliminate OutOfMemoryError and GC stalls
• Fast memory allocation: skips zero-filling

Object serialization

• Extending msgpack
• Scala Pickling: injects ser/deser code

Off-heap objects
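To show what "off-heap" means in practice, the sketch below stores integers in a direct `java.nio.ByteBuffer`. This is a swapped-in technique, not LArray and not `sun.misc.Unsafe`: a direct buffer is zero-filled on allocation and released by the garbage collector rather than freed immediately, so it demonstrates only that the data lives outside the JVM heap. `OffHeapIntArray` is a hypothetical class.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object OffHeapInts {
  // A fixed-size Int array backed by memory outside the JVM heap.
  final class OffHeapIntArray(val size: Int) {
    private val buf: ByteBuffer =
      ByteBuffer.allocateDirect(size * 4).order(ByteOrder.nativeOrder())

    def apply(i: Int): Int = buf.getInt(i * 4)          // absolute read at a byte offset

    def update(i: Int, v: Int): Unit = {
      buf.putInt(i * 4, v)                              // absolute write at a byte offset
      ()
    }

    def isOffHeap: Boolean = buf.isDirect
  }
}
```

Because `apply` and `update` are defined, the class can be used with array syntax, e.g. `arr(0) = 42` followed by `arr(0)`.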

Page 22: Streaming Distributed Data Processing with Silk #deim2014

Summary

Silk

• A framework for distributed data processing for all data scientists, including non-experts in distributed data processing (e.g., biologists)
• Object-oriented data processing programming: reuse, override, and mix-in
• Optimizing dataflow programs, similar to query optimization in a DBMS

Analyze data as you write programs!

• Database research now enters program optimization.

Future work

• Workflow queries: making queries against dataflow programs, monitoring intermediate results
• Multi-user program execution

Page 23: Streaming Distributed Data Processing with Silk #deim2014

http://xerial.org/silk
