Abstractions for Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

Post on 25-Feb-2016

35 views 0 download

description

Abstractions for Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame. I want to complete big workloads. 3.6B Hamming distance computations, .02 seconds each on a 2GHz dual-core desktop computer, each creating 1 real number output: 2.3 CPUYrs and 29GB of output. - PowerPoint PPT Presentation

Transcript of Abstractions for Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

1Christopher Moretti – University of Notre Dame4/21/2009

Abstractions for Data-Intensive Computing

on Condor

Christopher Moretti

University of Notre Dame

3Christopher Moretti – University of Notre Dame4/21/2009

I want to complete big workloads3.6B Hamming distance computations, .02 seconds each

on a 2GHz dual-core desktop computer, each creating 1 real number output: 2.3 CPUYrs and 29GB of output.

85M 1000x1000 dynamic programming tables, .04 seconds each on a 3GHz quad-core Xeon server: 39 CPUDays requiring a total of 77GB of input data.

500x500 recurrence matrix, recurrence functions 7 seconds each on the desktop: 22 CPUDays, not completely independently parallelizable.

… can these be run this week? How about this afternoon? Can we even turn these into “lunchtime” or “coffee refill” problems?

4Christopher Moretti – University of Notre Dame4/21/2009

How? On my workstation.

Write my program, make sure to make it partitionable, because it takes a really long time and might crash, debug it. Now run it for 39 days – 2.3 years.

On my department’s 128-node research cluster Learn MPI, determine how I want to move many GBs of data

around, re-write my program and re-debug, wait until the cluster can give me 8-128 homogeneous nodes at once, or go buy my own. Now run it.

BlueGene Get $$$ or access, learn custom MPI-like computation and

communication working language, determine how I want to handle communication and data movement, re-write my program, wait for configuration or access, re-debug my program, re-run.

5Christopher Moretti – University of Notre Dame4/21/2009

So? Serially Cluster Supercomputer

So I can either take my program as-is and it’ll take forever, or I can do a new custom implementation to a certain particular architecture and re-write and re-debug it every time we upgrade (assuming I’m lucky enough to have a BlueGene in the first place)?

Well what about Condor?

6Christopher Moretti – University of Notre Dame4/21/2009

Yes, what about Condor?

How do I fit my

workload into jobs?

Which resources

?

What happens when things

fail?

How Many?What is Condor?

What do I do with the results?

How can I measure job

stats?

What about job input

data?

How long will it take?

7Christopher Moretti – University of Notre Dame4/21/2009

FFFFFF

Abstractions Fill in the Gap

1CPU Multicore Cluster Condor Supercomputer

Here is my function: F(x,y)Here is a folder of files: set SHere is a static library I need.

set S of files lib.a

binary function F

F

F FFFFFF

FFFF exec exec rfork condor_submit submit

F FF F

FFF

FFFFFF

FFFFFF

FFFFFF

8Christopher Moretti – University of Notre Dame4/21/2009

What is an abstraction?

Abstraction: a declarative specification of the computation and data of a workload.

A restricted pattern, not meant to be a general purpose programming language.

Uses data structures instead of files. Regular structure makes it tractable to model and

predict performance.

Allows a user to repeat the same pattern of work many times, making slight changes to the data and algorithms

9Christopher Moretti – University of Notre Dame4/21/2009

Abstractions Example Approaches

Data management: distribute data only to nodes where it is necessary for computation. Distribute broadcasted data via efficient algorithms. Access data in a memory-efficient order. Use data structures instead of flat files.

Task management: assign appropriate amounts of data per discrete task. Adapt to the environment by choosing nodes showing good performance. Submit/manage tasks that do not overwhelm the batch system.

10Christopher Moretti – University of Notre Dame4/21/2009

The All-Pairs ProblemAll-Pairs(Set S1,Set S2,Function F)yields a matrix M:Mij = F(S1i,S2j)

60K 20KB images >1GB3.6B comparisons

@ 50/s = 2.3 CPUYrsx 8B output = 29GB

1 .8 .1 0 0 .1

1 0 .1 .1 0

1 0 .1 .7

1 0

1 .1

1

F

11Christopher Moretti – University of Notre Dame4/21/2009

All Pairs Abstraction

set S of filesbinary function F

F

M = AllPairs(F,S)

invocation

13Christopher Moretti – University of Notre Dame4/21/2009

Biometrics All-Pairs at Notre Dame

14Christopher Moretti – University of Notre Dame4/21/2009

R[4,2]

R[3,2] R[4,3]

R[4,4]R[3,4]R[2,4]

R[4,0]R[3,0]R[2,0]R[1,0]R[0,0]

R[0,1]

R[0,2]

R[0,3]

R[0,4]

Fx

yd

Fx

yd

Fx

yd

Fx

yd

Fx

yd

Fx

yd

F

F

y

y

x

x

d

d

x F Fx

yd yd

Wavefront Recurrence ( R[x,0], R[0,y], F(x,y,d) )

15Christopher Moretti – University of Notre Dame4/21/2009

Implementing Wavefront

CompleteInput

CompleteOutput

Master

Worker

Worker FInput

Output

Input

Output

F

17Christopher Moretti – University of Notre Dame4/21/2009

Genome Assembly

Bioinformatics sequencers can only extract DNA from samples 50-1000 basepairs (A,C,G,T) at a time.

Biologists need the DNA together in genome profiles of millions of contiguous basepairs.

Genome assembly is the process of putting the pieces of the puzzle back together again in the right configuration. A principal step is “overlapping”.

One way would be a huge All-Pairs problem, but this isn’t necessary. Algorithms exist to extract a sparse matrix of possible candidate pairs (two sequences that might overlap in the right answer). So we must only compute the overlaps for the candidates.

18Christopher Moretti – University of Notre Dame4/21/2009

Seq1 Seq2Seq1 Seq3Seq2 Seq3Seq4 Seq5

>Seq1ATG*CTAG>Seq2A*G*CTGA…

Input Sequence Data

Candidate (Work) List

Master

Worker

WorkQueue:“Align”“>Seq1\nATG*CTAG\n…”

Output Alignment Results(raw format)

Inputdata Align

20Christopher Moretti – University of Notre Dame4/21/2009

Purpose of a Suite of Abstractions

Engineering: Big systems to solve cool problems. Science:

What are common elements of abstractions? What is an intuitive interface for users to adapt their existing serial

solutions to larger problems?data computation

F M = AllPairs(F,S)

invocation

Set S

F R = Wavefront(F,R)Initial State

F O = Overlap(F,C,S)CandsSeqs

21Christopher Moretti – University of Notre Dame4/21/2009

Challenges Remaining with Abstractions

Exploiting a regular pattern doesn’t mean that nothing can go wrong … It’s not just domain scientists who can make mistakes that

lead to disastrous consequences. Sometimes good solutions to problems beget new and

interesting problems. A motivating example to finish …

22Christopher Moretti – University of Notre Dame4/21/2009

starter

Our tasks are done, we don’t need workers anymore!

condor_rm;exit();

starter

starter

starter

master

Universe=vanilla…TransferFiles=always…

23Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

starter

starter

starter

Okay, I’ll send back the data the workers

generated.

Okay, I’ll send back the data the workers

generated.

.

.

.

.

.

24Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

starter

starter

starter

Here’s the data the worker generated!

25Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

starter

starter

starter

Here’s the data the worker generated!

26Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

starter

starter

starterHere’s the data the worker

generated!

27Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

starter

starter

starter

Here’s the data the worker generated!

ENOSPACE? What?

28Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

starter

starter

starter

Ack! We didn’t want those files back, they were

temporary. “TransferFiles=Always”

was a mistake! Now we’re out of space!

29Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

Hrm, I can’t transfer back their files. I guess I’ll hang out and won’t

remove the jobs until I can.

30Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

I’m waiting to finish my condor_rm until I can

transfer files back to the submitting node.

We’re waiting to clean up local state until you kill

your jobs.

31Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

I’m waiting to finish my condor_rm until I can

transfer files back to the submitting node.

We’re waiting to clean up local state until you kill

your jobs.ARGH!

32Christopher Moretti – University of Notre Dame4/21/2009

Moral of the Story

Abstractions are a way to give domain scientists tools that don’t require drastically changing their already-complete solutions, but still allow for efficient HPC/HTC.

Exploiting a regular pattern doesn’t mean that nothing can go wrong.

Interesting challenges remain, among these: Predicting performance on an unpredictable system Monitoring and adapting to fit a changing system Dealing with entangling relationships

e.g. Remote state requires local state Tying together the lessons learned from these abstractions

What do they have in common? Why is one solution right for one problem and wrong for another?

33Christopher Moretti – University of Notre Dame4/21/2009

For More Information Christopher Moretti

cmoretti@cse.nd.edu

Douglas Thain dthain@cse.nd.edu

Cooperative Computing Lab http://cse.nd.edu/~ccl