Cooperative Computing for Data Intensive Science

Douglas Thain, University of Notre Dame

NSF Bridges to Engineering 2020 Conference, 12 March 2008

What is Cooperative Computing?

• By combining our computing and storage resources together, we can attack problems larger than we could alone.

• I can use your computer when it is idle, and vice versa. (Most computers are idle about 90 percent of the day.)

• Also known as…
  – Grid computing, distributed computing, metacomputing, volunteer computing, etc.

Who Needs Coop Computing?

• Many fields of study rely on simulation and data processing to conduct science.
  – Physics, chemistry, biology, engineering, finance, sociology, computer science.

• More Computing == Better Results
  – NOT High Performance: Speed up one program.
  – High Throughput: Produce as many results as possible over the next day / week / year.

Cooperative Computing Lab

• We design and build distributed systems that help people attack BIG problems.

• Work directly with end users to make sure that our solutions affect the real world.

• Operate a modest computing system as both a production service and a research testbed.
  – Currently about 500 CPUs and 300 disks.

• CS Research challenges: scalability, robustness, usability, debugging, and performance.

http://www.nd.edu/~ccl

What Makes this Challenging?

• The Programming Model
  – I want to process 10 TB of data on 100 machines, then distribute it across 20 disks, then view the best results on my workstation.

• Fault Tolerance
  – Something is always broken!

• Performance Robustness
  – There is always one slowpoke.

• Debugging
  – My job runs correctly here but not there...!?

An Example Collaboration:

Biometrics Research and Distributed Systems

A Common Pattern in Biometrics

[Figure: a function F compares pairs of images, filling the upper triangle of a similarity matrix]

1   .8   .1    0    0   .1
     1    0   .1   .1    0
          1    0   .1   .3
               1    0    0
                    1   .1
                         1

Sample Workload: 4000 images, 256 KB each, 1 s per F, 185 CPU-days

Future Workload: 60000 images, 1 MB each, 0.1 s per F, 4166 CPU-days
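The CPU-day figures above can be reproduced with a short calculation. This is a minimal sketch, assuming that F is evaluated once for every ordered pair of images (n × n calls); that assumption is not stated on the slide, but it matches the quoted totals. The function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def all_pairs_cpu_days(num_images: int, seconds_per_f: float) -> float:
    """CPU-days needed if F runs once per ordered pair of images (assumed n * n calls)."""
    comparisons = num_images * num_images
    return comparisons * seconds_per_f / SECONDS_PER_DAY


def dataset_megabytes(num_images: int, kilobytes_per_image: int) -> float:
    """Total input size in MB, taking 1 MB = 1024 KB."""
    return num_images * kilobytes_per_image / 1024


sample_days = all_pairs_cpu_days(4000, 1.0)    # slide quotes 185 CPU-days
future_days = all_pairs_cpu_days(60000, 0.1)   # slide quotes 4166 CPU-days
sample_mb = dataset_megabytes(4000, 256)       # roughly 1 GB of images

print(f"sample workload: {sample_days:.1f} CPU-days, {sample_mb:.0f} MB of input")
print(f"future workload: {future_days:.1f} CPU-days")
```

Under the same assumption, the sample data set is about 1 GB in total, which is the amount every CPU would have to load if all files were shipped as one package.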

Non-Expert User Using 500 CPUs

Try 1: Each F is a batch job.
Failure: Dispatch latency >> F runtime.

[Figure: HN dispatching a single F to each CPU]

Try 2: Each row is a batch job.
Failure: Too many small ops on FS.

[Figure: HN and CPUs, each CPU running a batch of F]

Try 3: Bundle all files into one package.
Failure: Everyone loads 1GB at once.

[Figure: HN and CPUs, each CPU running a batch of F]

Try 4: User gives up and attempts to solve an easier or smaller problem.
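The failure in Try 1 can be illustrated with a rough efficiency estimate: if every batch job pays a fixed dispatch cost before doing useful work, the fraction of CPU time spent on F depends on how many calls to F each job carries. This is a sketch only; the 30-second dispatch latency is a hypothetical value chosen for illustration, not a measurement from the talk, and the model ignores the file-system and data-loading costs behind Try 2 and Try 3.

```python
def cpu_efficiency(tasks_per_job: int, task_seconds: float, dispatch_seconds: float) -> float:
    """Fraction of a job's wall time spent running F, given a fixed dispatch cost per job."""
    useful = tasks_per_job * task_seconds
    return useful / (useful + dispatch_seconds)


ASSUMED_DISPATCH_SECONDS = 30.0  # hypothetical batch-system latency, for illustration only

one_f_per_job = cpu_efficiency(1, 1.0, ASSUMED_DISPATCH_SECONDS)       # Try 1
one_row_per_job = cpu_efficiency(4000, 1.0, ASSUMED_DISPATCH_SECONDS)  # Try 2

print(f"one F per job:   {one_f_per_job:.1%} of time is useful work")
print(f"one row per job: {one_row_per_job:.1%} of time is useful work")
```

Batching amortizes the dispatch cost, which is why Try 2 moves to one row per job, but it shifts the bottleneck to data access.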

All Pairs Production System

[Diagram: a Web Portal (showing F, G, H and S, T) in front of the All-Pairs Engine, backed by 300 active storage units, 500 CPUs, 40 TB disk]

1 - Upload F and S into web portal.

2 - AllPairs(F,S)

3 - O(log n) distribution by spanning tree.

4 - Choose optimal partitioning and submit batch jobs.

5 - Collect and assemble results.

6 - Return result matrix to user.
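The steps above can be sketched in a few lines. This is not the production engine: it assumes F is a comparison function and S a list of items, it stands in for batch jobs with a plain loop over fixed-size blocks (the real system chooses an optimal partitioning, which is not reproduced here), and it models the spanning-tree distribution as a simple doubling scheme in which every node that already holds the data forwards it to one new node per round. All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def partition(n: int, block: int) -> List[Tuple[int, int, int, int]]:
    """Split an n x n result matrix into (row_start, row_end, col_start, col_end) blocks."""
    return [
        (r, min(r + block, n), c, min(c + block, n))
        for r in range(0, n, block)
        for c in range(0, n, block)
    ]


def all_pairs(F: Callable[[T, T], float], S: Sequence[T], block: int = 2) -> List[List[float]]:
    """Compute M[i][j] = F(S[i], S[j]); each block stands in for one batch job."""
    n = len(S)
    result = [[0.0] * n for _ in range(n)]
    for r0, r1, c0, c1 in partition(n, block):
        for i in range(r0, r1):
            for j in range(c0, c1):
                result[i][j] = F(S[i], S[j])
    return result


def distribution_rounds(n_targets: int) -> int:
    """Rounds needed to reach n_targets nodes from one source when holders double each round."""
    holders, rounds = 1, 0
    while holders - 1 < n_targets:
        holders *= 2
        rounds += 1
    return rounds


matrix = all_pairs(lambda a, b: abs(a - b), [1, 2, 3])
rounds_for_300 = distribution_rounds(300)
print(matrix)
print(f"rounds to reach 300 storage units: {rounds_for_300}")
```

Under the doubling assumption, the number of rounds grows with log n, so copying the data to 300 storage units takes a handful of rounds rather than 300 sequential transfers.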

Some Results on Real Workload

Collaboration is Where the Interesting Problems Are!

(Cooperative Computing Provides the Resources)

What Makes a Collaboration Work?

• Like a marriage? (old joke.)

• First, a show of commitment: go after some low-hanging fruit, and publish it.

• A proposal for funding only succeeds if you have already started working together.

• Need very concrete goals: your partner may not share your idea of an interesting tangent.

• Students sometimes need a big push to leave their comfort zone and work together.

For more information…

• Douglas Thain
  – [email protected]

• Cooperative Computing Lab
  – http://www.nd.edu/~ccl

• Apply for Summer 2008 REU:
  – http://www.nd.edu/~ccl/reu

Supported by NSF Grants CCF-0621434 and CNS-0643229.