Abstractions for Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

32
1 Christopher Moretti – University of Notre Dame 4/21/2009 Abstractions for Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

description

Abstractions for Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame. I want to complete big workloads. 3.6B Hamming distance computations, .02 seconds each on a 2GHz dual-core desktop computer, each creating 1 real number output: 2.3 CPUYrs and 29GB of output. - PowerPoint PPT Presentation

Transcript of Abstractions for Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

Page 1: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

1Christopher Moretti – University of Notre Dame4/21/2009

Abstractions for Data-Intensive Computing

on Condor

Christopher Moretti

University of Notre Dame

Page 2: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

3Christopher Moretti – University of Notre Dame4/21/2009

I want to complete big workloads3.6B Hamming distance computations, .02 seconds each

on a 2GHz dual-core desktop computer, each creating 1 real number output: 2.3 CPUYrs and 29GB of output.

85M 1000x1000 dynamic programming tables, .04 seconds each on a 3GHz quad-core Xeon server: 39 CPUDays requiring a total of 77GB of input data.

500x500 recurrence matrix, recurrence functions 7 seconds each on the desktop: 22 CPUDays, not completely independently parallelizable.

… can these be run this week? How about this afternoon? Can we even turn these into “lunchtime” or “coffee refill” problems?

Page 3: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

4Christopher Moretti – University of Notre Dame4/21/2009

How? On my workstation.

Write my program, make sure to make it partitionable, because it takes a really long time and might crash, debug it. Now run it for 39 days – 2.3 years.

On my department’s 128-node research cluster Learn MPI, determine how I want to move many GBs of data

around, re-write my program and re-debug, wait until the cluster can give me 8-128 homogeneous nodes at once, or go buy my own. Now run it.

BlueGene Get $$$ or access, learn custom MPI-like computation and

communication working language, determine how I want to handle communication and data movement, re-write my program, wait for configuration or access, re-debug my program, re-run.

Page 4: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

5Christopher Moretti – University of Notre Dame4/21/2009

So? Serially Cluster Supercomputer

So I can either take my program as-is and it’ll take forever, or I can do a new custom implementation to a certain particular architecture and re-write and re-debug it every time we upgrade (assuming I’m lucky enough to have a BlueGene in the first place)?

Well what about Condor?

Page 5: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

6Christopher Moretti – University of Notre Dame4/21/2009

Yes, what about Condor?

How do I fit my

workload into jobs?

Which resources

?

What happens when things

fail?

How Many?What is Condor?

What do I do with the results?

How can I measure job

stats?

What about job input

data?

How long will it take?

Page 6: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

7Christopher Moretti – University of Notre Dame4/21/2009

FFFFFF

Abstractions Fill in the Gap

1CPU Multicore Cluster Condor Supercomputer

Here is my function: F(x,y)Here is a folder of files: set SHere is a static library I need.

set S of files lib.a

binary function F

F

F FFFFFF

FFFF exec exec rfork condor_submit submit

F FF F

FFF

FFFFFF

FFFFFF

FFFFFF

Page 7: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

8Christopher Moretti – University of Notre Dame4/21/2009

What is an abstraction?

Abstraction: a declarative specification of the computation and data of a workload.

A restricted pattern, not meant to be a general purpose programming language.

Uses data structures instead of files. Regular structure makes it tractable to model and

predict performance.

Allows a user to repeat the same pattern of work many times, making slight changes to the data and algorithms

Page 8: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

9Christopher Moretti – University of Notre Dame4/21/2009

Abstractions Example Approaches

Data management: distribute data only to nodes where it is necessary for computation. Distribute broadcasted data via efficient algorithms. Access data in a memory-efficient order. Use data structures instead of flat files.

Task management: assign appropriate amounts of data per discrete task. Adapt to the environment by choosing nodes showing good performance. Submit/manage tasks that do not overwhelm the batch system.

Page 9: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

10Christopher Moretti – University of Notre Dame4/21/2009

The All-Pairs ProblemAll-Pairs(Set S1,Set S2,Function F)yields a matrix M:Mij = F(S1i,S2j)

60K 20KB images >1GB3.6B comparisons

@ 50/s = 2.3 CPUYrsx 8B output = 29GB

1 .8 .1 0 0 .1

1 0 .1 .1 0

1 0 .1 .7

1 0

1 .1

1

F

Page 10: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

11Christopher Moretti – University of Notre Dame4/21/2009

All Pairs Abstraction

set S of filesbinary function F

F

M = AllPairs(F,S)

invocation

Page 11: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame
Page 12: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

13Christopher Moretti – University of Notre Dame4/21/2009

Biometrics All-Pairs at Notre Dame

Page 13: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

14Christopher Moretti – University of Notre Dame4/21/2009

R[4,2]

R[3,2] R[4,3]

R[4,4]R[3,4]R[2,4]

R[4,0]R[3,0]R[2,0]R[1,0]R[0,0]

R[0,1]

R[0,2]

R[0,3]

R[0,4]

Fx

yd

Fx

yd

Fx

yd

Fx

yd

Fx

yd

Fx

yd

F

F

y

y

x

x

d

d

x F Fx

yd yd

Wavefront Recurrence ( R[x,0], R[0,y], F(x,y,d) )

Page 14: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

15Christopher Moretti – University of Notre Dame4/21/2009

Implementing Wavefront

CompleteInput

CompleteOutput

Master

Worker

Worker FInput

Output

Input

Output

F

Page 15: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame
Page 16: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

17Christopher Moretti – University of Notre Dame4/21/2009

Genome Assembly

Bioinformatics sequencers can only extract DNA from samples 50-1000 basepairs (A,C,G,T) at a time.

Biologists need the DNA together in genome profiles of millions of contiguous basepairs.

Genome assembly is the process of putting the pieces of the puzzle back together again in the right configuration. A principal step is “overlapping”.

One way would be a huge All-Pairs problem, but this isn’t necessary. Algorithms exist to extract a sparse matrix of possible candidate pairs (two sequences that might overlap in the right answer). So we must only compute the overlaps for the candidates.

Page 17: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

18Christopher Moretti – University of Notre Dame4/21/2009

Seq1 Seq2Seq1 Seq3Seq2 Seq3Seq4 Seq5

>Seq1ATG*CTAG>Seq2A*G*CTGA…

Input Sequence Data

Candidate (Work) List

Master

Worker

WorkQueue:“Align”“>Seq1\nATG*CTAG\n…”

Output Alignment Results(raw format)

Inputdata Align

Page 18: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame
Page 19: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

20Christopher Moretti – University of Notre Dame4/21/2009

Purpose of a Suite of Abstractions

Engineering: Big systems to solve cool problems. Science:

What are common elements of abstractions? What is an intuitive interface for users to adapt their existing serial

solutions to larger problems?data computation

F M = AllPairs(F,S)

invocation

Set S

F R = Wavefront(F,R)Initial State

F O = Overlap(F,C,S)CandsSeqs

Page 20: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

21Christopher Moretti – University of Notre Dame4/21/2009

Challenges Remaining with Abstractions

Exploiting a regular pattern doesn’t mean that nothing can go wrong … It’s not just domain scientists who can make mistakes that

lead to disastrous consequences. Sometimes good solutions to problems beget new and

interesting problems. A motivating example to finish …

Page 21: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

22Christopher Moretti – University of Notre Dame4/21/2009

starter

Our tasks are done, we don’t need workers anymore!

condor_rm;exit();

starter

starter

starter

master

Universe=vanilla…TransferFiles=always…

Page 22: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

23Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

starter

starter

starter

Okay, I’ll send back the data the workers

generated.

Okay, I’ll send back the data the workers

generated.

.

.

.

.

.

Page 23: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

24Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

starter

starter

starter

Here’s the data the worker generated!

Page 24: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

25Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

starter

starter

starter

Here’s the data the worker generated!

Page 25: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

26Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

starter

starter

starterHere’s the data the worker

generated!

Page 26: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

27Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

starter

starter

starter

Here’s the data the worker generated!

ENOSPACE? What?

Page 27: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

28Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

starter

starter

starter

Ack! We didn’t want those files back, they were

temporary. “TransferFiles=Always”

was a mistake! Now we’re out of space!

Page 28: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

29Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

Hrm, I can’t transfer back their files. I guess I’ll hang out and won’t

remove the jobs until I can.

Page 29: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

30Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

I’m waiting to finish my condor_rm until I can

transfer files back to the submitting node.

We’re waiting to clean up local state until you kill

your jobs.

Page 30: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

31Christopher Moretti – University of Notre Dame4/21/2009

shadow

schedd

starter

I’m waiting to finish my condor_rm until I can

transfer files back to the submitting node.

We’re waiting to clean up local state until you kill

your jobs.ARGH!

Page 31: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

32Christopher Moretti – University of Notre Dame4/21/2009

Moral of the Story

Abstractions are a way to give domain scientists tools that don’t require drastically changing their already-complete solutions, but still allow for efficient HPC/HTC.

Exploiting a regular pattern doesn’t mean that nothing can go wrong.

Interesting challenges remain, among these: Predicting performance on an unpredictable system Monitoring and adapting to fit a changing system Dealing with entangling relationships

e.g. Remote state requires local state Tying together the lessons learned from these abstractions

What do they have in common? Why is one solution right for one problem and wrong for another?

Page 32: Abstractions for  Data-Intensive Computing on Condor Christopher Moretti University of Notre Dame

33Christopher Moretti – University of Notre Dame4/21/2009

For More Information Christopher Moretti

[email protected]

Douglas Thain [email protected]

Cooperative Computing Lab http://cse.nd.edu/~ccl