Grid Computing July 2009

Grid computing, by Ian Foster, Computation Institute, Argonne National Lab & University of Chicago

Description

I presented this keynote talk at the WorldComp conference in Las Vegas on July 13, 2009. In it, I summarize what grid computing is about, focusing in particular on the "integration" function rather than the "outsourcing" function (what people call "cloud" today), using biomedical examples.

Transcript of Grid Computing July 2009

Page 1: Grid Computing July 2009

Grid computing
Ian Foster

Computation Institute

Argonne National Lab & University of Chicago

Page 2: Grid Computing July 2009

“When the network is as fast as the computer’s internal links, the machine disintegrates across the net into a set of special purpose appliances”
(George Gilder, 2001)

Page 3: Grid Computing July 2009

“I’ve been doing cloud computing since before it was called grid.”

Page 4: Grid Computing July 2009

“Computation may someday be organized as a public utility … The computing utility could become the basis for a new and important industry.”
John McCarthy (1961)

Page 5: Grid Computing July 2009

Scientific collaboration

Page 6: Grid Computing July 2009

Addressing urban health needs

Page 7: Grid Computing July 2009

Important characteristics

- We must integrate systems that may not have worked together before
- These are human systems, with differing goals, incentives, and capabilities
- All components are dynamic: change is the norm, not the exception
- Processes also evolve rapidly
- We are not building something simple like a bridge or an airline reservation system

Page 8: Grid Computing July 2009

We are dealing with complex adaptive systems

“A complex adaptive system is a collection of individual agents that have the freedom to act in ways that are not always predictable and whose actions are interconnected such that one agent’s actions changes the context for other agents.”
Crossing the Quality Chasm, IOM, 2001; pp. 312-13

- Non-linear and dynamic
- Agents are independent and intelligent
- Goals and behaviors often in conflict
- Self-organization through adaptation and learning
- No single point(s) of control
- Hierarchical decomposition has limited value

Page 9: Grid Computing July 2009

Ralph Stacey, Complexity and Creativity in Organizations, 1996

We need to function in the zone of complexity

[Figure: Stacey matrix. Axes: certainty about outcomes and agreement about outcomes, each running low to high. Regions: "plan and control" (high certainty, high agreement), "chaos" (low certainty, low agreement), and the "zone of complexity" between them.]

Page 10: Grid Computing July 2009

[Animation build: same Stacey diagram as the previous slide.]

Page 11: Grid Computing July 2009

“The Anatomy of the Grid,” 2001

The … problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering. This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form what we call a virtual organization (VO).

Page 12: Grid Computing July 2009

Examples (from AotG, 2001)

- “The application service providers, storage service providers, cycle providers, and consultants engaged by a car manufacturer to perform scenario evaluation during planning for a new factory”
- “Members of an industrial consortium bidding on a new aircraft”
- “A crisis management team and the databases and simulation systems that they use to plan a response to an emergency situation”
- “Members of a large, international, multiyear high-energy physics collaboration”

Page 13: Grid Computing July 2009

From the organizational behavior and management community

“[A] group of people who interact through interdependent tasks guided by common purpose [that] works across space, time, and organizational boundaries with links strengthened by webs of communication technologies”
— Lipnack & Stamps, 1997

Yes, but adding cyber-infrastructure:
- People plus computational agents & services
- Communication technologies plus IT infrastructure

Collaboration based on rich data & computing capabilities

Page 14: Grid Computing July 2009

NSF Workshops on Building Effective Virtual Organizations
[Search “BEVO 2008”]

Page 15: Grid Computing July 2009

The Grid paradigm

- Principles and mechanisms for dynamic VOs
- Leverage service-oriented architecture (SOA)
- Loose coupling of data and services
- Open software and architecture

[Timeline, 1995–2010: adoption spreading from computer science to physics, astronomy, engineering, biology, biomedicine, and healthcare]

Page 16: Grid Computing July 2009

We call these groupings virtual organizations (VOs): a set of individuals and/or institutions engaged in the controlled sharing of resources in pursuit of a common goal.

Healthcare = dynamic, overlapping VOs, linking:
- Patient – primary care
- Sub-specialist – hospital
- Pharmacy – laboratory
- Insurer – …

But the U.S. health system is marked by fragmented and inefficient VOs with insufficient mechanisms for controlled sharing.

“I advocate … a model of virtual integration rather than true vertical integration …”
G. Halvorson, CEO, Kaiser Permanente

Page 17: Grid Computing July 2009

The Grid paradigm and information integration

Layer functions, bottom to top:
- Make resources accessible over the network
- Name resources; move data around
- Make resources usable and useful
- Crosscutting: manage who can do what

[Diagram: data sources (radiology, medical records, pathology, genomics, labs; RHIO) exposed through platform services]

Page 18: Grid Computing July 2009

The Grid paradigm and information integration

[Diagram: the same data sources (radiology, medical records, pathology, genomics, labs; RHIO) feed platform services layered as publication, management, and integration, crosscut by security and policy. Above the platform: transform data into knowledge, enhance user cognitive processes, incorporate into business processes.]

Page 19: Grid Computing July 2009

The Grid paradigm and information integration

[Diagram: data sources (radiology, medical records, pathology, genomics, labs; RHIO) feed platform services (publication, management, integration) and value services (analysis, cognitive support, applications), crosscut by security and policy.]

Page 20: Grid Computing July 2009

We partition the multi-faceted interoperability problem

- Process interoperability: integrate work across the healthcare enterprise
- Data interoperability (see the sketch below)
  - Syntactic: move structured data among system elements
  - Semantic: use information across system elements
- Systems interoperability: communicate securely and reliably among system elements

[Diagram: the layered stack again: applications, analysis, integration, management, publication]
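A toy sketch of the data-interoperability split, assuming invented message fields and an assumed local-to-LOINC code map (the code values are stand-ins): parsing the agreed JSON structure is the syntactic step; translating local codes into a shared vocabulary is the semantic step.

```python
import json

# Hypothetical local-to-shared vocabulary map (semantic layer);
# the code values here are illustrative stand-ins.
LOCAL_TO_LOINC = {"GLU": "2345-7"}   # local glucose code -> assumed LOINC code

def receive(message: str) -> dict:
    """Syntactic interoperability: both sides agree on JSON structure."""
    return json.loads(message)

def normalize(record: dict) -> dict:
    """Semantic interoperability: map local codes to a shared vocabulary."""
    record["code"] = LOCAL_TO_LOINC.get(record["code"], record["code"])
    return record

msg = '{"code": "GLU", "value": 5.4, "units": "mmol/L"}'
print(normalize(receive(msg)))
```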

Page 21: Grid Computing July 2009

Security and policy: Managing who can do what

- Familiar division of labor
- Publication level: bridge between local and global
- Integration level: VO-specific policies, based on attributes
- Attribute authorities

Page 22: Grid Computing July 2009

Policy models, in increasing abstraction and expressiveness:

- Identity-based authZ: simplest, but not scalable
- Unix access control lists (discretionary access control, DAC): groups, directories, simple administration
- POSIX ACLs / MS ACLs: finer-grained admin policy
- Role-based access control (RBAC): separation of role/group from rule administration
- Mandatory access control (MAC): clearance, classification, compartmentalization
- Attribute-based access control (ABAC): generalization to arbitrary attributes (sketched below)

>>> Policy language abstraction level and expressiveness >>>
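The endpoint of this spectrum, ABAC, lends itself to a small illustration. A minimal sketch, assuming invented attribute names and an invented VO policy; real grid deployments expressed such policies in languages like XACML, with attributes asserted by attribute authorities.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    subject: dict = field(default_factory=dict)   # attributes from attribute authorities
    resource: dict = field(default_factory=dict)
    action: str = ""

# An ABAC policy is a predicate over attributes, not over identities or roles.
# This particular rule is hypothetical.
def vo_policy(req: Request) -> bool:
    return (req.subject.get("vo") == req.resource.get("vo")       # same virtual organization
            and req.subject.get("role") == "investigator"         # a role is just another attribute
            and (req.action == "read" or req.subject.get("irb_approved", False)))

req = Request(subject={"vo": "cancer-trial-7", "role": "investigator"},
              resource={"vo": "cancer-trial-7"},
              action="read")
print("permit" if vo_policy(req) else "deny")
```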

Page 23: Grid Computing July 2009


Globus / caGrid GAARDS

Page 24: Grid Computing July 2009

Publication: Make information accessible

- Make data available in a remotely accessible, reusable manner
- Leave mediation to the integration layer
- Gateway from local policy/protocol into wide-area mechanisms (transport, security, …); a minimal sketch follows
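A minimal sketch of the publication idea, assuming an arbitrary directory and port: expose local data read-only over a standard network protocol and leave all mediation to higher layers. Production grids used GridFTP or WSRF data services rather than plain HTTP.

```python
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve an (assumed, pre-existing) published directory read-only over HTTP.
handler = partial(SimpleHTTPRequestHandler, directory="/data/published")
HTTPServer(("0.0.0.0", 8080), handler).serve_forever()
```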

Page 25: Grid Computing July 2009


TeraGrid participants

Page 26: Grid Computing July 2009

Federating computers for physics data analysis

Page 27: Grid Computing July 2009


Page 28: Grid Computing July 2009

Earth System Grid

Main ESG Portal:
- 198 TB of data at four locations; 1,150 datasets; 1,032,000 files
- Includes the past 6 years of joint DOE/NSF climate modeling experiments
- 8,000 registered users
- Downloads to date: 49 TB, 176,000 files

CMIP3 (IPCC AR4) ESG Portal:
- 35 TB of data at one location; 74,700 files
- Generated by a modeling campaign coordinated by the Intergovernmental Panel on Climate Change; data from 13 countries, representing 25 models
- 1,900 registered projects
- Downloads to date: 387 TB, 1,300,000 files, 500 GB/day (average)
- 400 scientific papers published to date based on analysis of CMIP3 (IPCC AR4) data

ESG usage: over 500 sites worldwide
[Chart: ESG monthly download volumes]
Globus

Page 29: Grid Computing July 2009

[Diagram: an enterprise/grid interface service bridges DICOM protocols on the clinical side and Grid protocols (Web services) on the wide-area side. Plug-in adapters handle DICOM, XDS, HL7, and vendor-specific interfaces; a wide-area service actor drives remote operations.]

Children’s Oncology Group

Page 30: Grid Computing July 2009

Automating service creation, deployment

Introduce:
- Define service
- Create skeleton
- Discover types
- Add operations
- Configure security

gRAVI (Grid Remote Application Virtualization Infrastructure): wrap executables as services (see the sketch below)

[Diagram: Introduce creates an application service and deploys it into a container (transferring a GAR); the service advertises itself to an index service and stores artifacts in a repository service; clients discover the service, invoke it, and get results.]

caGrid, Introduce, gRAVI: Ohio State, U. Chicago
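A toy sketch of the gRAVI idea: wrap a command-line executable behind a network endpoint so remote clients can invoke it. The HTTP interface and the wrapped `sort` command are stand-ins; gRAVI itself generated WSRF grid services, with security configured through Introduce.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

COMMAND = ["sort"]   # stand-in for a scientific executable

class WrapHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # Run the wrapped executable on the posted input; capture its output.
        result = subprocess.run(COMMAND, input=body, capture_output=True)
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.end_headers()
        self.wfile.write(result.stdout)

HTTPServer(("0.0.0.0", 8081), WrapHandler).serve_forever()
```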

Page 31: Grid Computing July 2009

As of Oct 19, 2008: 122 participants, 105 services (70 data, 35 analytical)

Page 32: Grid Computing July 2009

Management: Naming and moving information

- Persistent, uniform global naming of objects, independent of type (see the sketch below)
- Orchestration of data movement among services

[Diagram: a dataset D is passed among services S1, S2, S3 in successive configurations.]
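A small sketch of the naming half of this layer, assuming invented names and endpoints: a replica location index maps a persistent, type-independent global name to the locations currently holding copies, in the spirit of the Globus Replica Location Service.

```python
from collections import defaultdict

class ReplicaLocationIndex:
    """Map global object names to the service URLs holding replicas."""
    def __init__(self):
        self._locations = defaultdict(set)   # global name -> set of URLs

    def register(self, name: str, url: str):
        self._locations[name].add(url)

    def unregister(self, name: str, url: str):
        self._locations[name].discard(url)

    def lookup(self, name: str) -> set:
        return set(self._locations[name])

rli = ReplicaLocationIndex()
rli.register("hdl:888/abc123", "gsiftp://s1.example.org/data/abc123")
rli.register("hdl:888/abc123", "gsiftp://s2.example.org/data/abc123")
print(rli.lookup("hdl:888/abc123"))
```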

Page 33: Grid Computing July 2009

LIGO Data Grid (LIGO Gravitational Wave Observatory)

- Replicating >1 terabyte/day to 8 sites
- 770 TB replicated to date: >120 million replicas
- MTBF = 1 month

[Map: replication sites including Birmingham, Cardiff, and AEI/Golm]

Ann Chervenak et al., ISI; Scott Koranda et al., LIGO
Globus

Page 34: Grid Computing July 2009

Data replication service: pull “missing” files to a storage system

[Diagram: the Data Replication Service takes a list of required files, consults the Replica Location Index and Local Replica Catalogs (data location), and drives the Reliable File Transfer Service and GridFTP (data movement) to copy files into local storage, registering the new replicas.]

“Design and Implementation of a Data Replication Service Based on the Lightweight Data Replicator System,” Chervenak et al., 2005

Layers: data replication, built on data location and data movement.
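A sketch of the service's core loop as the slide describes it: compare the required list against the local replica catalog, look up missing files in a replica index (data location), and transfer them (data movement). All interfaces are simplified stand-ins for the RLS/RFT/GridFTP components.

```python
def replicate_missing(required, local_catalog, replica_index, transfer):
    """Pull every required file that the local catalog lacks."""
    for name in required:
        if name in local_catalog:               # already replicated locally
            continue
        sources = replica_index.get(name, ())   # data location step
        if not sources:
            print(f"no known replica for {name}")
            continue
        transfer(sources[0], f"/storage/{name}")    # data movement step
        local_catalog.add(name)                     # register the new replica

local = {"fileA"}
index = {"fileB": ["gsiftp://s1.example.org/data/fileB"]}
replicate_missing(["fileA", "fileB"], local, index,
                  transfer=lambda src, dst: print(f"copy {src} -> {dst}"))
```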

Page 35: Grid Computing July 2009

Naming objects: A prerequisite to management

The naming problem:
- “Health objects” = patient information, images, records, etc.
- “Names” refer to health objects in records, files, databases, papers, reports, research, emails, etc.

Challenges:
- No systematic way of naming health objects
- Many health objects, like DICOM images and reports, include references to other objects through non-unique, ambiguous, PHI-tainted identifiers

A framework for distributed digital object services: Kahn, Wilensky, 1995

Page 36: Grid Computing July 2009

Health Object Identifier (HOI) naming system

uri:hdl://888.us.npi.1234567890.dicom/8A648C33-A5…4939EBE

- uri:hdl: HOI’s URI scheme identifier, based on Handle
- 888: CHI’s top-level naming authority
- us.npi.1234567890: National Provider ID, used in the hierarchical identifier namespace
- dicom: application context’s namespace, governed by the provider naming authority
- 8A648C33-A5…4939EBE: random string for the identifier body, PHI-free and guaranteed unique
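A sketch of minting an identifier with the structure above, reusing the slide's example field values; a random UUID stands in as the PHI-free, effectively unique identifier body.

```python
import uuid

def mint_hoi(naming_authority: str, npi: str, context: str) -> str:
    # A random UUID carries no patient information and collides only with
    # negligible probability, matching the "PHI-free and guaranteed unique"
    # requirement for the identifier body.
    body = uuid.uuid4().hex.upper()
    return f"uri:hdl://{naming_authority}.us.npi.{npi}.{context}/{body}"

print(mint_hoi("888", "1234567890", "dicom"))
# e.g. uri:hdl://888.us.npi.1234567890.dicom/4F2D9C...
```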

Page 37: Grid Computing July 2009


Data movement in clinical trials

Page 38: Grid Computing July 2009

Community public health: Digital retinopathy screening network

Page 39: Grid Computing July 2009

Integration: Making information useful

[Chart: degree of communication (0–100%) plotted against degree of prior syntactic and semantic agreement (0–100%), contrasting a rigid standards-based approach, a loosely coupled approach, and an adaptive approach.]

Page 40: Grid Computing July 2009

Integration via mediation

- Map between models
- Scoped to domain use
- Multiple concurrent uses
- Bottom-up mediation: between standards and versions; between local versions; in the absence of agreement (toy sketch below)

[Diagram (Levy 2000): a query in the source schema is reformulated against a global data model into a query over the union of exported source schemas, then optimized and handed to a query execution engine, which executes it across distributed sources through wrappers.]
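A toy sketch of mediation, assuming two invented source schemas: a query against the shared model is reformulated into each source's local schema, executed through per-source wrapper stubs (OGSA-DAI played this role in practice), and merged.

```python
# Per-source mappings from shared-model field names to local ones (invented).
MAPPINGS = {
    "site_a": {"patient_id": "pid", "specimen": "sample_code"},
    "site_b": {"patient_id": "PatientID", "specimen": "SPECIMEN_TYPE"},
}

def reformulate(query: dict, source: str) -> dict:
    """Rewrite a shared-model query into one source's local schema."""
    mapping = MAPPINGS[source]
    return {mapping[k]: v for k, v in query.items()}

def mediated_query(query: dict, wrappers: dict) -> list:
    """Reformulate, execute per source, and merge (distributed execution)."""
    results = []
    for source, run in wrappers.items():
        results.extend(run(reformulate(query, source)))
    return results

# Wrappers would issue real queries; stubs here.
wrappers = {
    "site_a": lambda q: [f"site_a rows for {q}"],
    "site_b": lambda q: [f"site_b rows for {q}"],
}
print(mediated_query({"patient_id": "p42", "specimen": "plasma"}, wrappers))
```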

Page 41: Grid Computing July 2009

ECOG 5202 integrated sample management

[Diagram: a web portal issues queries through OGSA-DQP and a mediator to OGSA-DAI data services at ECOG PCO, MD Anderson, and ECOG CC.]

Page 42: Grid Computing July 2009

Analytics: Transform data into knowledge

“The overwhelming success of genetic and genomic research efforts has created an enormous backlog of data with the potential to improve the quality of patient care and cost effectiveness of treatment.”

— US President’s Council of Advisors on Science and Technology, Personalized Medicine Themes, 2008

Page 43: Grid Computing July 2009

Microarray clustering using Taverna

1. Query and retrieve microarray data from a caArray data service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub
2. Normalize microarray data using a GenePattern analytical service: node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService
3. Hierarchical clustering using a geWorkbench analytical service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage

[Workflow legend: workflow in/output; caGrid services; “shim” services; others]

Wei Tan
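The same three steps, written as plain sequential calls; `invoke` is a stub standing in for a real client (Taverna drove these caGrid WSRF services; nothing below is the actual caGrid API).

```python
def invoke(endpoint: str, payload):
    """Stub: a real client would call the WSRF service and parse the response."""
    print(f"calling {endpoint}")
    return payload

data = invoke("cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub",
              {"query": "experiment-123"})                        # 1. retrieve
normalized = invoke("node255.broad.mit.edu:6060/wsrf/services/cagrid/"
                    "PreprocessDatasetMAGEService", data)         # 2. normalize
clusters = invoke("cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/"
                  "HierarchicalClusteringMage", normalized)       # 3. cluster
```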

Page 44: Grid Computing July 2009

Many, many tasks: Identifying potential drug targets

2M+ ligands × protein target(s)

(Mike Kubal, Benoit Roux, and others)

Page 45: Grid Computing July 2009

45

start

report

DOCK6Receptor

(1 per protein:defines pocket

to bind to)

ZINC3-D

structures

ligands complexes

NAB scriptparameters

(defines flexibleresidues,

#MDsteps)

Amber Score:1. AmberizeLigand

3. AmberizeComplex5. RunNABScript

end

BuildNABScript

NABScript

NABScript

Template

Amber prep:2. AmberizeReceptor4. perl: gen nabscript

FREDReceptor

(1 per protein:defines pocket

to bind to)

Manually prepDOCK6 rec file

Manually prepFRED rec file

1 protein(1MB)

6 GB2M

structures(6 GB)

DOCK6FRED~4M x 60s x 1 cpu

~60K cpu-hrs

Amber~10K x 20m x 1 cpu

~3K cpu-hrs

Select best ~500

~500 x 10hr x 100 cpu~500K cpu-hrsGCMC

PDBprotein

descriptions

Select best ~5KSelect best ~5K

For 1 target:4 million tasks

500,000 cpu-hrs(50 cpu-years)
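A sketch of the many-task pattern behind this pipeline, with an invented `dock_score` stub in place of a real DOCK6/FRED run: fan millions of independent tasks over a worker pool, then keep the best-scoring ligands for the costlier Amber stage.

```python
from concurrent.futures import ProcessPoolExecutor
import random

def dock_score(ligand_id: int) -> tuple:
    """Stub: pretend to dock one ligand and return (score, id)."""
    random.seed(ligand_id)          # deterministic fake score per ligand
    return (random.random(), ligand_id)

if __name__ == "__main__":
    ligands = range(100_000)        # stand-in for the 2M+ ZINC structures
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(dock_score, ligands, chunksize=1000))
    best = [lig for _, lig in sorted(scores)[:5000]]   # "select best ~5K"
    print(f"{len(best)} ligands advance to Amber scoring")
```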

Page 46: Grid Computing July 2009

DOCK on BG/P: ~1M tasks on 118,000 CPUs

- CPU cores: 118,784
- Tasks: 934,803
- Elapsed time: 7,257 sec
- Compute time: 21.43 CPU-years
- Average task time: 667 sec
- Relative efficiency: 99.7% (from 16 to 32 racks)
- Utilization: 99.6% sustained, 78.3% overall

[Plot: tasks over time (secs)]
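The overall utilization figure can be checked directly from the other numbers on the slide: compute time divided by the core-seconds available during the run.

```python
compute_s = 21.43 * 365 * 24 * 3600     # 21.43 CPU-years, in CPU-seconds
available_s = 118_784 * 7_257           # cores x elapsed wall-clock seconds
print(f"overall utilization ~ {compute_s / available_s:.1%}")
# prints ~78.4%, matching the slide's 78.3% up to rounding
```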

Page 47: Grid Computing July 2009

Scaling POSIX to petascale

[Diagram: a large dataset moves from the global file system through a CN-striped intermediate file system to per-compute-node local file systems (local datasets), over torus and tree interconnects; mechanisms include Chirp (multicast) and MosaStore (striping). Tiers: staging, intermediate, local.]

Page 48: Grid Computing July 2009

Efficiency for 4-second tasks and varying data sizes (1 KB to 1 MB) for CIO and GPFS, up to 32K processors

Page 49: Grid Computing July 2009

“Sine” workload, 2M tasks, 10 MB : 10 ms ratio, 100 nodes, GCC policy, 50 GB caches/node

Ioan Raicu

Page 50: Grid Computing July 2009


Same scenario, but with dynamic resource provisioning

Page 51: Grid Computing July 2009

Data diffusion sine-wave workload: summary

- GPFS: 5.70 hrs, ~8 Gb/s, 1,138 CPU-hrs
- DD+SRP: 1.80 hrs, ~25 Gb/s, 361 CPU-hrs
- DD+DRP: 1.86 hrs, ~24 Gb/s, 253 CPU-hrs

Page 52: Grid Computing July 2009

Recap

- Increased recognition that information systems and data understanding are a limiting factor: “… much of the promise associated with health IT requires high levels of adoption … and high levels of use of interoperable systems (in which information can be exchanged across unrelated systems) ….” (RAND COMPARE)
- The health system is a complex adaptive system: “There is no single point(s) of control. System behaviors are often unpredictable and uncontrollable, and no one is ‘in charge.’” (W. Rouse, NAE Bridge)
- With diverse and evolving requirements and user communities: “I advocate … a model of virtual integration rather than true vertical integration ….” (G. Halvorson, CEO, Kaiser Permanente)

Page 53: Grid Computing July 2009

Ralph Stacey, Complexity and Creativity in Organizations, 1996

Functioning in the zone of complexity

[Figure: the same Stacey matrix as before: certainty about outcomes vs. agreement about outcomes, low to high, with “plan and control,” “chaos,” and the zone of complexity between them.]

Page 54: Grid Computing July 2009

The Grid paradigm and information integration

[Diagram, repeated from earlier: data sources (radiology, medical records, pathology, genomics, labs; RHIO) feed platform services (publication, management, integration) and value services (analysis, cognitive support, applications), crosscut by security and policy.]

Page 55: Grid Computing July 2009


“The computer revolution hasn’t happened yet.”

Alan Kay, 1997

Page 56: Grid Computing July 2009

[Chart: connectivity (on a log scale) vs. time for science, enterprise, and consumer computing, annotated Grid → Cloud → ????]

“When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances”
(George Gilder, 2001)

Page 57: Grid Computing July 2009

Computation Institute
www.ci.uchicago.edu

Thank you!