Computing Outside The Box June 2009

Ian Foster, Computation Institute, Argonne National Lab & University of Chicago


Keynote talk at the International Conference on Supercomputing 2009, at IBM Yorktown in New York. This is a major update of a talk first given in New Zealand last January. The abstract follows.

The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs "outside the box" can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think Facebook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible.

Transcript of Computing Outside The Box June 2009

Page 1: Computing Outside The Box June 2009

1

Ian Foster, Computation Institute

Argonne National Lab & University of Chicago

Page 2: Computing Outside The Box June 2009

3

Page 3: Computing Outside The Box June 2009

4

“I’ve been doing cloud computing since before it was called grid.”

Page 4: Computing Outside The Box June 2009

5

1890

Page 5: Computing Outside The Box June 2009

6

1953

Page 6: Computing Outside The Box June 2009

7

“Computation may someday be organized as a public utility … The computing utility could become the basis for a new and important industry.”

John McCarthy (1961)

Page 7: Computing Outside The Box June 2009

8

Page 8: Computing Outside The Box June 2009

9

[Figure: connectivity (on log scale) versus time; era marked: Science]

“When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances”

(George Gilder, 2001)

Grid

Page 9: Computing Outside The Box June 2009

10

Application

Infrastructure

Page 10: Computing Outside The Box June 2009

11

Layered grid architecture

Application

Fabric: “Controlling things locally”: Access to, & control of, resources

Connectivity: “Talking to things”: communication (Internet protocols) & security

Resource: “Sharing single resources”: negotiating access, controlling use

Collective: “Managing multiple resources”: ubiquitous infrastructure services

User: “Specialized services”: user- or appln-specific distributed services

Internet Protocol Architecture (shown alongside): Application, Transport, Internet, Link

(“The Anatomy of the Grid,” 2001)

Page 11: Computing Outside The Box June 2009

12

Application

Infrastructure: Service oriented infrastructure

Page 12: Computing Outside The Box June 2009

13

Page 13: Computing Outside The Box June 2009

14

www.opensciencegrid.org

Page 14: Computing Outside The Box June 2009

15

www.opensciencegrid.org

Page 15: Computing Outside The Box June 2009

16

Application

Infrastructure: Service oriented infrastructure

Page 16: Computing Outside The Box June 2009

17

Application: Service oriented applications

Infrastructure: Service oriented infrastructure

Page 17: Computing Outside The Box June 2009

18

Page 18: Computing Outside The Box June 2009

19

As of Oct 19, 2008:

122 participants

105 services (70 data, 35 analytical)

Page 19: Computing Outside The Box June 2009

20

Microarray clustering using Taverna

1. Query and retrieve microarray data from a caArray data service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub

2. Normalize microarray data using GenePattern analytical service: node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService

3. Hierarchical clustering using geWorkbench analytical service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage

Workflow in/output

caGrid services

“Shim” services

others

Wei Tan
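The three-step shape of this workflow (retrieve, normalize, cluster) can be sketched locally. This is only an illustration under loud assumptions: the real workflow calls remote caGrid services through Taverna, whereas every function, gene name, and data value below is a hypothetical in-process stand-in, and the clustering is a plain single-linkage variant chosen for brevity.

```python
import math
from statistics import mean, pstdev


def retrieve_microarray():
    """Stand-in for step 1 (hypothetical data; the real step queries a caArray data service)."""
    return {
        "geneA": [1.0, 2.0, 3.0],
        "geneB": [2.0, 4.0, 6.0],
        "geneC": [3.0, 2.0, 1.0],
        "geneD": [9.0, 6.0, 3.0],
    }


def normalize(data):
    """Stand-in for step 2: z-score each gene's expression profile."""
    out = {}
    for gene, values in data.items():
        mu, sigma = mean(values), pstdev(values)
        out[gene] = [(v - mu) / sigma if sigma else 0.0 for v in values]
    return out


def hierarchical_clusters(data, k):
    """Stand-in for step 3: single-linkage agglomerative clustering down to k clusters."""
    clusters = [[gene] for gene in sorted(data)]

    def linkage(a, b):
        # Distance between two clusters = distance of their closest members.
        return min(math.dist(data[x], data[y]) for x in a for y in b)

    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


def run_workflow(k=2):
    """Chain the three steps, the way the workflow chains three services."""
    return hierarchical_clusters(normalize(retrieve_microarray()), k)
```

After normalization, genes with the same trend share a profile regardless of scale, so `run_workflow(2)` groups the two rising genes apart from the two falling ones.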

Page 20: Computing Outside The Box June 2009

21

Infrastructure

Applications

Page 21: Computing Outside The Box June 2009

22

Energy

Progress of adoption

Page 22: Computing Outside The Box June 2009

23

Energy

Progress of adoption

$$ $$$$

Page 23: Computing Outside The Box June 2009

24

Energy

Progress of adoption

$$ $$$$

Page 24: Computing Outside The Box June 2009

25

[Figure: connectivity (on log scale) versus time; eras marked: Science, Enterprise]

“When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances”

(George Gilder, 2001)

Grid Cloud

Page 25: Computing Outside The Box June 2009

26

Page 26: Computing Outside The Box June 2009

27

Page 27: Computing Outside The Box June 2009

28

US$3

Page 28: Computing Outside The Box June 2009

29

Credit: Werner Vogels

Page 29: Computing Outside The Box June 2009

30

Credit: Werner Vogels

Page 30: Computing Outside The Box June 2009

31

[Figure: Animoto EC2 image usage, day 1 to day 8; vertical axis from 0 to 4000]

Page 31: Computing Outside The Box June 2009

32

Software: Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways

Platform

Infrastructure

Page 32: Computing Outside The Box June 2009

33

Software: Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways

Platform

Infrastructure: Amazon, GoGrid, Sun, Microsoft, …

Page 33: Computing Outside The Box June 2009

34

Software: Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways

Platform: Amazon, Google, Microsoft, …

Infrastructure: Amazon, GoGrid, Sun, Microsoft, …

Page 34: Computing Outside The Box June 2009

35

Page 35: Computing Outside The Box June 2009

36

Dynamo: Amazon’s highly available key-value store (DeCandia et al., SOSP’07)

• Simple query model

• Weak consistency, no isolation

• Stringent SLAs (e.g., 300ms for 99.9% of requests; peak 500 requests/sec)

• Incremental scalability

• Symmetry

• Decentralization

• Heterogeneity
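An SLA stated as "300ms for 99.9% of requests" is a bound on a high percentile rather than on the mean. A minimal sketch of such a check follows; the nearest-rank percentile definition, the function names, and the sample latencies are assumptions made for illustration, not something the slide specifies.

```python
import math
from fractions import Fraction


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    # Fraction avoids floating-point rounding when computing the rank.
    rank = max(1, math.ceil(Fraction(str(pct)) / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_sla(latencies_ms, pct=99.9, bound_ms=300):
    """True if the pct-th percentile latency is within bound_ms."""
    return percentile(latencies_ms, pct) <= bound_ms


# Hypothetical measurements: 1000 requests, almost all fast.
one_outlier = [20] * 999 + [900]
two_outliers = [20] * 998 + [900, 900]
```

With 1000 samples, a single slow request still leaves 99.9% within the bound, while a second one breaks it, even though the mean latency barely changes.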

Page 36: Computing Outside The Box June 2009

Technologies used in Dynamo

Problem | Technique | Advantage

Partitioning | Consistent hashing | Incremental scalability

High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates

Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available

Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background

Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information
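To make the "consistent hashing gives incremental scalability" row concrete, here is a minimal sketch of a hash ring with virtual nodes. It is not Dynamo's implementation: the class name, the virtual-node count, and the node and key names are all assumptions for illustration.

```python
import bisect
import hashlib


def _position(label):
    """Map a string to a point on the ring (a 128-bit integer)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Illustrative consistent-hash ring; each node owns several virtual points."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key):
        """Return the node owning the first ring point clockwise from the key."""
        if not self._ring:
            raise LookupError("ring is empty")
        positions = [p for p, _ in self._ring]
        idx = bisect.bisect_right(positions, _position(key)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical usage: see which keys move when a fourth node joins.
ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to `node-d`; keys never shuffle between the existing nodes, which is what lets capacity grow one node at a time.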

Page 37: Computing Outside The Box June 2009

38

Application: Service oriented applications

Infrastructure: Service oriented infrastructure

Page 38: Computing Outside The Box June 2009

39

The Globus-based LIGO data grid

LIGO Gravitational Wave Observatory

• Replicating >1 Terabyte/day to 8 sites

• >100 million replicas so far

• MTBF = 1 month

Sites labeled: Birmingham, Cardiff, AEI/Golm

Page 39: Computing Outside The Box June 2009

40

Data replication service

Pull “missing” files to a storage system

[Diagram components: list of required files; Data Replication (Data Replication Service); Data Location (Replica Location Index, Local Replica Catalog); Data Movement (Reliable File Transfer Service, GridFTP)]

“Design and Implementation of a Data Replication Service Based on the Lightweight Data Replicator System,” Chervenak et al., 2005
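The "pull missing files" idea can be sketched as a set difference followed by a source lookup. This is a simplified illustration and not the Data Replication Service API: the function name, the dictionary-based index, and the example file names and URLs are all hypothetical.

```python
def plan_replication(required, local_catalog, replica_index):
    """Decide which required files to pull and from where.

    required      : iterable of logical file names the site needs
    local_catalog : set of logical file names already stored locally
    replica_index : dict mapping logical file name -> list of remote URLs
    Returns (transfers, unavailable): transfers is a list of
    (file name, source URL) pairs; unavailable lists files with no known replica.
    """
    transfers, unavailable = [], []
    for name in required:
        if name in local_catalog:
            continue  # already present, nothing to pull
        sources = replica_index.get(name, [])
        if sources:
            transfers.append((name, sources[0]))  # naive choice: first replica
        else:
            unavailable.append(name)
    return transfers, unavailable


# Hypothetical example data.
required = ["frame-001", "frame-002", "frame-003"]
local_catalog = {"frame-001"}
replica_index = {
    "frame-001": ["gsiftp://site-a.example.org/data/frame-001"],
    "frame-002": ["gsiftp://site-b.example.org/data/frame-002"],
}
transfers, unavailable = plan_replication(required, local_catalog, replica_index)
```

In a real deployment the transfer list would be handed to a reliable transfer component and the local catalog updated on success; the sketch stops at planning.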

Page 40: Computing Outside The Box June 2009

41

Specializing further …

User

Service Provider

“Provide access to data D at S1, S2, S3 with performance P”

Resource Provider

“Provide storage with performance P1, network with P2, …”

Replica catalog, user-level multicast, …

[Diagram: data D at sites S1, S2, S3, repeated at each level]

Page 41: Computing Outside The Box June 2009

42

Using IaaS in biomedical informatics

[Diagram labels: My servers, IaaS provider, Chicago, handle.net, BIRN]

Page 42: Computing Outside The Box June 2009

43

Clouds and supercomputers: Conventional wisdom?

Too slow

Too expensive

Clouds/clusters

Supercomputers

Loosely coupled applications

Tightly coupled applications

Page 43: Computing Outside The Box June 2009

44

Ed Walker, “Benchmarking Amazon EC2 for high-performance scientific computing,” ;login:, October 2008.

Page 44: Computing Outside The Box June 2009

45

Ed Walker, “Benchmarking Amazon EC2 for high-performance scientific computing,” ;login:, October 2008.

Page 45: Computing Outside The Box June 2009

46

Ed Walker, “Benchmarking Amazon EC2 for high-performance scientific computing,” ;login:, October 2008.

Page 46: Computing Outside The Box June 2009

47

Ed Walker, “Benchmarking Amazon EC2 for high-performance scientific computing,” ;login:, October 2008.

Page 47: Computing Outside The Box June 2009

48

D. Nurmi, J. Brevik, R. Wolski: QBETS: queue bounds estimation from time series. SIGMETRICS 2007: 379-380

Page 48: Computing Outside The Box June 2009

49

D. Nurmi, J. Brevik, R. Wolski: QBETS: queue bounds estimation from time series. SIGMETRICS 2007: 379-380

Page 49: Computing Outside The Box June 2009

50

D. Nurmi, J. Brevik, R. Wolski: QBETS: queue bounds estimation from time series. SIGMETRICS 2007: 379-380

Page 50: Computing Outside The Box June 2009

51

D. Nurmi, J. Brevik, R. Wolski: QBETS: queue bounds estimation from time series. SIGMETRICS 2007: 379-380

Page 51: Computing Outside The Box June 2009

52

Clouds and supercomputers: Conventional wisdom?

Good for rapid response

Too expensive

Clouds/clusters

Supercomputers

Loosely coupled applications

Tightly coupled applications

Page 52: Computing Outside The Box June 2009

53

Loosely coupled problems

• Ensemble runs to quantify climate model uncertainty

• Identify potential drug targets by screening a database of ligand structures against target proteins

• Study economic model sensitivity to parameters

• Analyze turbulence dataset from many perspectives

• Perform numerical optimization to determine optimal resource assignment in energy problems

• Mine collection of data from advanced light sources

• Construct databases of computed properties of chemical compounds

• Analyze data from the Large Hadron Collider

• Analyze log data from 100,000-node parallel computations

Page 53: Computing Outside The Box June 2009

54

Many many tasks: Identifying potential drug targets

2M+ ligands x protein target(s)

(Mike Kubal, Benoit Roux, and others)

Page 54: Computing Outside The Box June 2009

55

[Workflow diagram, running from start to end and producing a report]

Inputs:

• PDB protein descriptions: 1 protein (1MB)

• ZINC 3-D structures (ligands): 2M structures (6 GB)

• DOCK6 receptor and FRED receptor (1 per protein: defines pocket to bind to), from manually prepared DOCK6 and FRED rec files

• NAB script parameters (defines flexible residues, #MD steps) and NAB script template, used by BuildNABScript to produce the NAB script

Stages:

• DOCK6, FRED: ~4M x 60s x 1 cpu, ~60K cpu-hrs; select best ~5K from each

• Amber: ~10K x 20m x 1 cpu, ~3K cpu-hrs; select best ~500

  Amber prep: 2. AmberizeReceptor, 4. perl: gen nabscript

  Amber Score: 1. AmberizeLigand, 3. AmberizeComplex, 5. RunNABScript

• GCMC: ~500 x 10hr x 100 cpu, ~500K cpu-hrs

For 1 target: 4 million tasks, 500,000 cpu-hrs (50 cpu-years)
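The per-stage CPU-hour figures follow from tasks x time per task x CPUs per task. The short calculation below reproduces them from the numbers on the slide; the stage dictionary and function name are illustrative, and the slide's own totals are evidently rounded.

```python
# (tasks, seconds per task, CPUs per task), taken from the slide's stage labels
STAGES = {
    "DOCK6/FRED": (4_000_000, 60, 1),
    "Amber": (10_000, 20 * 60, 1),
    "GCMC": (500, 10 * 3600, 100),
}


def cpu_hours(tasks, seconds_per_task, cpus_per_task):
    """Total CPU-hours consumed by one stage."""
    return tasks * seconds_per_task * cpus_per_task / 3600


per_stage = {name: cpu_hours(*spec) for name, spec in STAGES.items()}
total = sum(per_stage.values())

for name, hours in per_stage.items():
    print(f"{name:12s} {hours:12,.0f} cpu-hrs")
print(f"{'total':12s} {total:12,.0f} cpu-hrs")
```

The exact products come to roughly 67K, 3.3K, and 500K CPU-hours, so the final GCMC stage dominates, consistent with the slide's rounded total of about 500,000 CPU-hours per target.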

Page 55: Computing Outside The Box June 2009

56

Page 56: Computing Outside The Box June 2009

57

DOCK on BG/P: ~1M tasks on 118,000 CPUs

• CPU cores: 118784

• Tasks: 934803

• Elapsed time: 7257 sec

• Compute time: 21.43 CPU years

• Average task time: 667 sec

• Relative Efficiency: 99.7% (from 16 to 32 racks)

• Utilization: Sustained: 99.6%, Overall: 78.3%

I/O per task:

• GPFS: 1 script (~5KB), 2 file read (~10KB), 1 file write (~10KB)

• RAM (cached from GPFS on first task per node): 1 binary (~7MB), static input data (~45MB)

Ioan Raicu, Zhao Zhang, Mike Wilde

[Figure: horizontal axis Time (secs)]
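The overall utilization figure can be cross-checked from the other numbers on the slide. The slide does not define the metric, so the sketch assumes overall utilization means compute time divided by the CPU time available (cores x elapsed time), and assumes a 365-day year; the variable names are illustrative.

```python
CORES = 118_784
TASKS = 934_803
ELAPSED_SEC = 7_257
COMPUTE_CPU_YEARS = 21.43

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

# CPU time that was available during the run, in CPU-years.
available_cpu_years = CORES * ELAPSED_SEC / SECONDS_PER_YEAR

# Assumed definition: fraction of available CPU time spent computing.
overall_utilization = COMPUTE_CPU_YEARS / available_cpu_years

# Average number of tasks each core processed during the run.
tasks_per_core = TASKS / CORES

print(f"available: {available_cpu_years:.2f} CPU-years")
print(f"overall utilization: {overall_utilization:.1%}")
print(f"tasks per core: {tasks_per_core:.2f}")
```

Under these assumptions the result lands within a fraction of a percentage point of the 78.3% reported, which suggests the gap between sustained and overall utilization comes from ramp-up and tail-off rather than from idle cores mid-run.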

Page 57: Computing Outside The Box June 2009

58

Managing 160,000 cores

Slower shared storage

High-speed local “disk”

Falkon

Page 58: Computing Outside The Box June 2009

59

Scaling POSIX to petascale

[Diagram labels: tiers Staging, Intermediate, Local; global file system holding a large dataset; CN-striped intermediate file system; torus and tree interconnects; Chirp (multicast); MosaStore (striping); LFS on each compute node (local datasets)]

Page 59: Computing Outside The Box June 2009

60

Efficiency for 4 second tasks and varying data size (1KB to 1MB) for CIO and GPFS up to 32K processors

Page 60: Computing Outside The Box June 2009

61

“Sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node

Ioan Raicu

Page 61: Computing Outside The Box June 2009

62

Same scenario, but with dynamic resource provisioning

Page 62: Computing Outside The Box June 2009

63

Data diffusion sine-wave workload: Summary

• GPFS: 5.70 hrs, ~8Gb/s, 1138 CPU hrs

• DD+SRP: 1.80 hrs, ~25Gb/s, 361 CPU hrs

• DD+DRP: 1.86 hrs, ~24Gb/s, 253 CPU hrs
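The summary is easier to read as ratios against the GPFS baseline. The sketch below derives them from the slide's figures only; the dictionary layout and function names are illustrative, and DD, SRP, and DRP are kept as the slide's own abbreviations.

```python
# (elapsed hours, CPU hours), as reported on the slide
RUNS = {
    "GPFS": (5.70, 1138),
    "DD+SRP": (1.80, 361),
    "DD+DRP": (1.86, 253),
}


def speedup(name, baseline="GPFS"):
    """How many times faster a run finished than the baseline."""
    return RUNS[baseline][0] / RUNS[name][0]


def cpu_saving(name, baseline):
    """Fraction of the baseline's CPU hours that a run avoided."""
    return 1 - RUNS[name][1] / RUNS[baseline][1]


for name in ("DD+SRP", "DD+DRP"):
    print(f"{name}: {speedup(name):.2f}x faster, "
          f"{cpu_saving(name, 'GPFS'):.0%} fewer CPU hrs than GPFS")
print(f"DD+DRP vs DD+SRP: {cpu_saving('DD+DRP', 'DD+SRP'):.0%} fewer CPU hrs")
```

Both data diffusion runs finish about three times faster than GPFS, and the DD+DRP run gives up about 3% in elapsed time relative to DD+SRP while using roughly 30% fewer CPU hours.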

Page 63: Computing Outside The Box June 2009

64

Clouds and supercomputers: Conventional wisdom?

Good for rapid response

Excellent

Clouds/clusters

Supercomputers

Loosely coupled applications

Tightly coupled applications

Page 64: Computing Outside The Box June 2009

65

“The computer revolution hasn’t happened yet.”

Alan Kay, 1997

Page 65: Computing Outside The Box June 2009

66

[Figure: connectivity (on log scale) versus time; eras marked: Science, Enterprise, Consumer]

“When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances”

(George Gilder, 2001)

Grid Cloud ????

Page 66: Computing Outside The Box June 2009

67

Energy Internet

The Shape of Grids to Come?

Page 67: Computing Outside The Box June 2009

Computation Institute

www.ci.uchicago.edu

Thank you!