Cooperative Computing for Data Intensive Science
Douglas Thain
University of Notre Dame
NSF Bridges to Engineering 2020 Conference
12 March 2008
What is Cooperative Computing?
• By combining our computing and storage resources together, we can attack problems larger than we could alone.
• I can use your computer when it is idle, and vice versa. (Most computers are idle about 90 percent of the day.)
• Also known as…
– Grid computing, distributed computing, metacomputing, volunteer computing, etc.
Who Needs Coop Computing?
• Many fields of study rely on simulation and data processing to conduct science.
– Physics, chemistry, biology, engineering, finance, sociology, computer science.
• More Computing == Better Results
– NOT High Performance: Speed up one program.
– High Throughput: Produce as many results as possible over the next day / week / year.
Cooperative Computing Lab
• We design and build distributed systems that help people attack BIG problems.
• Work directly with end users to make sure that our solutions affect the real world.
• Operate a modest computing system as both a production service and a research testbed.
– Currently about 500 CPUs and 300 disks.
• CS Research challenges: scalability, robustness, usability, debugging, and performance.
http://www.nd.edu/~ccl
What Makes this Challenging?
• The Programming Model
– I want to process 10 TB of data on 100 machines, then distribute it across 20 disks, then view the best results on my workstation.
• Fault Tolerance
– Something is always broken!
• Performance Robustness
– There is always one slowpoke.
• Debugging
– My job runs correctly here but not there...!?
A Common Pattern in Biometrics
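[Figure: upper-triangular result matrix; each entry is computed by the function F on one pair of images.]
1  .8  .1   0   0  .1
    1   0  .1  .1   0
        1   0  .1  .3
            1   0   0
                1  .1
                    1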
Sample Workload: 4000 images, 256 KB each, 1 s per F, 185 CPU-days
Future Workload: 60000 images, 1 MB each, 0.1 s per F, 4166 CPU-days
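The CPU-day figures above follow from counting one evaluation of F for every ordered pair of images. A minimal sketch of that arithmetic, assuming the full n x n matrix is computed (the function name `all_pairs_cpu_days` is illustrative, not from the slides):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def all_pairs_cpu_days(num_images: int, seconds_per_comparison: float) -> float:
    """CPU-days needed to evaluate F once for every ordered pair of images."""
    comparisons = num_images * num_images  # assumption: full n x n matrix
    return comparisons * seconds_per_comparison / SECONDS_PER_DAY


sample = all_pairs_cpu_days(4000, 1.0)    # sample workload from the slide
future = all_pairs_cpu_days(60000, 0.1)   # future workload from the slide

print(f"sample: {sample:.1f} CPU-days")   # sample: 185.2 CPU-days
print(f"future: {future:.1f} CPU-days")   # future: 4166.7 CPU-days
```

Even though each comparison becomes ten times faster in the future workload, the quadratic growth in the number of pairs dominates the total cost.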
Non-Expert User Using 500 CPUs
Try 1: Each F is a batch job.
Failure: Dispatch latency >> F runtime.
[Figure: HN connected to a row of CPUs, each CPU running a single F.]
Try 2: Each row is a batch job.
Failure: Too many small ops on FS.
[Figure: HN connected to a row of CPUs, each CPU running many Fs.]
Try 3: Bundle all files into one package.
Failure: Everyone loads 1GB at once.
[Figure: HN connected to a row of CPUs, each CPU running many Fs.]
Try 4: User gives up and attempts to solve an easier or smaller problem.
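The failure in Try 1 can be made concrete with a back-of-the-envelope model. This is only a sketch: the 1 s runtime and the 4000-image row come from the sample workload, but the 30 s dispatch latency is an assumed, hypothetical value, since the slides give no number for it.

```python
def cpu_efficiency(work_seconds: float, dispatch_seconds: float) -> float:
    """Fraction of a job's wall-clock time that is spent on useful work."""
    return work_seconds / (work_seconds + dispatch_seconds)


F_RUNTIME = 1.0          # seconds per F, from the sample workload
ROW_LENGTH = 4000        # comparisons in one row of the sample matrix
DISPATCH_LATENCY = 30.0  # ASSUMPTION: hypothetical seconds to dispatch one batch job

per_f = cpu_efficiency(F_RUNTIME, DISPATCH_LATENCY)
per_row = cpu_efficiency(ROW_LENGTH * F_RUNTIME, DISPATCH_LATENCY)

print(f"one F per job:   {per_f:.1%} useful work")    # 3.2%
print(f"one row per job: {per_row:.1%} useful work")  # 99.3%
```

Under these assumptions, batching a whole row removes the dispatch overhead, but as Try 2 and Try 3 show, the bottleneck then moves to the file system and to data distribution.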
All Pairs Production System
[Figure: a Web Portal in front of a cluster of 300 active storage units (500 CPUs, 40 TB disk); inputs labeled F, G, H and S, T; an All-Pairs Engine that places copies of F on the storage units.]
1 - Upload F and S into web portal.
2 - AllPairs(F,S)
3 - O(log n) distribution by spanning tree.
4 - Choose optimal partitioning and submit batch jobs.
5 - Collect and assemble results.
6 - Return result matrix to user.
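Steps 2 to 4 can be sketched in a few lines. This is a minimal, single-process illustration, not the lab's engine: `all_pairs`, `partition`, and `broadcast_rounds` are hypothetical names, the block size is picked by hand rather than optimally, and the "doubling" broadcast is just one assumed tree shape that gives O(log n) rounds.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def partition(n: int, block: int) -> List[Tuple[int, int]]:
    """Split range(n) into contiguous (start, stop) blocks of at most `block` items."""
    return [(start, min(start + block, n)) for start in range(0, n, block)]


def all_pairs(f: Callable, items: Sequence, block: int) -> List[list]:
    """Compute result[i][j] = f(items[i], items[j]) tile by tile."""
    n = len(items)
    result = [[None] * n for _ in range(n)]
    blocks = partition(n, block)
    # Each (row block, column block) tile stands in for one batch job.
    for (r0, r1), (c0, c1) in product(blocks, blocks):
        for i in range(r0, r1):
            for j in range(c0, c1):
                result[i][j] = f(items[i], items[j])
    return result


def broadcast_rounds(num_workers: int) -> int:
    """Rounds needed to copy data from one source to num_workers nodes when,
    in each round, every node that already holds the data sends it to one
    node that does not (ASSUMED doubling scheme)."""
    holders, rounds = 1, 0  # the source holds the data before round 1
    while holders - 1 < num_workers:
        holders *= 2
        rounds += 1
    return rounds


matrix = all_pairs(lambda a, b: abs(a - b), [1, 4, 6], block=2)
print(matrix)                 # [[0, 3, 5], [3, 0, 2], [5, 2, 0]]
print(broadcast_rounds(500))  # 9
```

Tiling keeps each job long enough to amortize dispatch cost while bounding how much data a single job touches, and the doubling broadcast avoids every CPU pulling the full data set from one place at the same time.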
What Makes a Collaboration Work?
• Like a marriage? (old joke.)
• First, a show of commitment: go after some low hanging fruit, and publish it.
• A proposal for funding only succeeds if you have already started working together.
• Need very concrete goals: your partner may not share your idea of an interesting tangent.
• Students sometimes need a big push to leave their comfort zone and work together.
For more information…
• Douglas Thain
– [email protected]
• Cooperative Computing Lab
– http://www.nd.edu/~ccl
• Apply for Summer 2008 REU:
– http://www.nd.edu/~ccl/reu
Supported by NSF Grants CCF-0621434 and CNS-0643229.