Grid and e-Science Technologies

Post on 12-Jan-2016


Grid and e-Science Technologies. Simon Cox, Technical Director, Southampton Regional e-Science Centre. Summary. The Grid problem: Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations


Grid and e-Science Technologies

Simon Cox, Technical Director

Southampton Regional e-Science Centre

Summary

- The Grid problem: Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations
- Grid architecture: Protocol, service definition for interoperability & resource sharing
- Grid Middleware
  - Globus Toolkit: a source of protocol and API definitions, and reference implementations
  - Open Grid Services Architecture: represents next step in evolution
  - Condor: High throughput Computing
  - Web Services & W3C: leveraging e-business
- e-Science Projects: applying Grid concepts to applications

Grid Computing

The Grid Problem

“Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources”

- “The Anatomy of the Grid: Enabling Scalable Virtual Organizations” by Foster, Kesselman and Tuecke

Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals, assuming the absence of …

- central location
- central control
- omniscience
- existing trust

Why Grids? (1) e-Science

- A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour
- 1,000 physicists worldwide pool resources for peta-op analyses of petabytes of data
- Civil engineers collaborate to design, execute, & analyze shake table experiments
- Climate scientists visualize, annotate, & analyze terabyte simulation datasets
- An emergency response team couples real time data, weather model, population data

Grid Communities & Applications: Data Grids for High Energy Physics

[Figure: tiered data grid for high energy physics. Labels recovered from the figure follow.]

- Components: Online System; Offline Processor Farm (~20 TIPS); CERN Computer Centre; FermiLab (~4 TIPS); France, Italy and Germany Regional Centres; Caltech (~1 TIPS); Tier2 Centres (~1 TIPS each); Institutes (~0.25 TIPS); physics data cache; physicist workstations
- Tier labels: Tier 0, Tier 1, Tier 2, Tier 4
- Link rates: ~PBytes/sec; ~100 MBytes/sec; ~622 Mbits/sec or Air Freight (deprecated); ~622 Mbits/sec; ~1 MBytes/sec

Notes on the figure:

- There is a “bunch crossing” every 25 nsecs.
- There are 100 “triggers” per second.
- Each triggered event is ~1 MByte in size.
- Physicists work on analysis “channels”.
- Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server.
- 1 TIPS is approximately 25,000 SpecInt95 equivalents.

www.griphyn.org www.ppdg.net www.eu-datagrid.org
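As a rough sanity check, the trigger figures quoted on this slide can be combined with plain arithmetic. The sketch below uses only the numbers stated above (25 ns bunch crossing, 100 triggers per second, ~1 MByte per event); the variable and function names are illustrative, not taken from any experiment software.

```python
# Figures quoted on the slide.
BUNCH_CROSSING_INTERVAL_NS = 25   # one bunch crossing every 25 ns
TRIGGERS_PER_SECOND = 100         # events kept per second
EVENT_SIZE_MBYTES = 1             # ~1 MByte per triggered event


def crossings_per_second(interval_ns: float) -> float:
    """Number of bunch crossings per second for a given spacing in ns."""
    return 1e9 / interval_ns


def triggered_rate_mbytes(triggers_per_s: float, event_mbytes: float) -> float:
    """Data rate of triggered events in MBytes per second."""
    return triggers_per_s * event_mbytes


crossings = crossings_per_second(BUNCH_CROSSING_INTERVAL_NS)
rate = triggered_rate_mbytes(TRIGGERS_PER_SECOND, EVENT_SIZE_MBYTES)
crossings_per_trigger = crossings / TRIGGERS_PER_SECOND

print(f"{crossings:.0f} crossings/s, {rate} MBytes/s of triggered data")
print(f"about 1 trigger per {crossings_per_trigger:.0f} crossings")
```

The resulting 100 MBytes/sec agrees with the ~100 MBytes/sec link rate shown in the figure.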

Network for Earthquake Engineering Simulation

NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other

On-demand access to experiments, data streams, computing, archives, collaboration

NEESgrid: Argonne, Michigan, NCSA, UIUC, USC

Online Access to Scientific Instruments

Advanced Photon Source

[Figure labels: real-time collection; tomographic reconstruction; archival storage; wide-area dissemination; desktop & VR clients with shared controls]

DOE X-ray grand challenge: ANL, USC/ISI, NIST, U.Chicago

Why Grids? (2) e-Business

- Engineers at a multinational company collaborate on the design of a new product
- A multidisciplinary analysis in aerospace couples code and data in four companies
- An insurance company mines data from partner hospitals for fraud detection
- An application service provider offloads excess load to a compute cycle provider
- An enterprise configures internal & external resources to support e-Business workload

Grids: Why Now?

- Moore’s law => highly functional end-systems
- Ubiquitous Internet => universal connectivity
- Network exponentials produce dramatic changes in geometry and geography
  - 9-month doubling: double Moore’s law!
  - 1986-2001: x340,000; 2001-2010: x4000?
- New modes of working and problem solving emphasize teamwork, computation
- New business models and technologies facilitate outsourcing
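The network growth factors quoted above can be checked against the "9-month doubling" claim with a short calculation. This is a sketch only: the 18-month Moore's law doubling period is an assumption implied by "double Moore's law", not a figure stated on the slide, and the function names are illustrative.

```python
import math


def doubling_time_months(growth_factor: float, years: float) -> float:
    """Doubling time implied by growing `growth_factor`-fold over `years`."""
    return 12.0 * years / math.log2(growth_factor)


def growth_over(years: float, doubling_months: float) -> float:
    """Growth factor after `years` at a fixed doubling time in months."""
    return 2.0 ** (12.0 * years / doubling_months)


# Growth factors quoted on the slide.
past = doubling_time_months(340_000, 2001 - 1986)
projected = doubling_time_months(4000, 2010 - 2001)

# Assumption: Moore's law taken as an 18-month doubling period.
network_9y = growth_over(9, doubling_months=9)
moore_9y = growth_over(9, doubling_months=18)

print(f"1986-2001 implies doubling every {past:.1f} months")
print(f"2001-2010 implies doubling every {projected:.1f} months")
print(f"over 9 years: network x{network_9y:.0f} vs Moore's law x{moore_9y:.0f}")
```

Both quoted factors come out at roughly 9 to 10 months per doubling, which is consistent with the slide's claim.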

Elements of the Problem

- Resource sharing
  - Computers, storage, sensors, networks, …
  - Heterogeneity of device, mechanism, policy
  - Sharing conditional: negotiation, payment, …
- Coordinated problem solving
  - Integration of distributed resources
  - Compound quality of service requirements
- Dynamic, multi-institutional virtual organisations
  - Dynamic overlays on classic org structures
  - Map to underlying control mechanisms

http://www.globus.org/research/papers/anatomy.pdf

The Grid World: Current Status

- Dozens of major Grid projects in scientific & technical computing/research & education
  - Deployment, application, technology
- Some consensus on key concepts and technologies
  - Open source Globus Toolkit™ a de facto standard for major protocols & services
  - Far from complete or perfect, but out there, evolving rapidly, and large tool/user base
- Global Grid Forum a significant force
- Industrial interest emerging rapidly

http://www.gridforum.org

Grid Middleware (coordinate and authenticate use of grid services)

- Globus (and GGF grid-computing protocols)
  - Security Infrastructure (GSI)
  - Resource Allocation Mechanism (GRAM)
  - Resource Information System (GRIS)
  - Index Information Service (GIIS)
  - Grid-FTP
  - Metadirectory service (MDS 2.0+) coupled to LDAP server
- Condor (distributed high-throughput computing system)
  - Condor-G allows us to handle dispatching jobs to our Globus system
  - Active collaboration with the Condor development team at University of Wisconsin (Miron Livny)

The Globus Project: Making Grid computing a reality

- Close collaboration with real Grid projects in science and industry
- Development and promotion of standard Grid protocols to enable interoperability and shared infrastructure
- Development and promotion of standard Grid software APIs and SDKs to enable portability and code sharing
- The Globus Toolkit: Open source, reference software base for building grid infrastructure and applications
- Global Grid Forum: Development of standard protocols and APIs for Grid computing

http://www.gridforum.org
http://www.globus.org

Four Key Protocols

The Globus Toolkit centers around four key protocols:

- Connectivity layer:
  - Security: Grid Security Infrastructure (GSI)
- Resource layer:
  - Resource Management: Grid Resource Allocation Management (GRAM)
  - Information Services: Grid Resource Information Protocol (GRIP)
  - Data Transfer: Grid File Transfer Protocol (GridFTP)

The Globus Toolkit in One Slide

- Grid protocols (GSI, GRAM, …) enable resource sharing within virtual orgs; toolkit provides reference implementation (= Globus Toolkit services)
- Protocols (and APIs) enable other tools and services for membership, discovery, data mgmt, workflow, …

[Figure labels, grouped by protocol:]

- GSI (Grid Security Infrastructure): User; Proxy; authenticate & create proxy credential
- GRAM (Grid Resource Allocation & Management): Gatekeeper (factory); reliable remote invocation; create process; register; User process #1; User process #2; Proxy #2; Reporter (registry + discovery)
- MDS-2 (Meta Directory Service): GIIS: Grid Information Index Server (discovery); soft state registration; enquiry
- Other service (e.g. GridFTP): other GSI-authenticated remote service requests

Globus Toolkit: Evaluation (+)

- Good technical solutions for key problems, e.g.
  - Authentication and authorization
  - Resource discovery and monitoring
  - Reliable remote service invocation
  - High-performance remote data access
- This + good engineering is enabling progress
  - Good quality reference implementation, multi-language support, interfaces to many systems, large user base, industrial support
  - Growing community code base built on tools

Globus Toolkit: Evaluation (-)

- Protocol deficiencies, e.g.
  - Heterogeneous basis: HTTP, LDAP, FTP
  - No standard means of invocation, notification, error propagation, authorization, termination, …
- Significant missing functionality, e.g.
  - Databases, sensors, instruments, workflow, …
  - Virtualization of end systems (hosting envs.)
- Little work on total system properties, e.g.
  - Dependability, end-to-end QoS, …
  - Reasoning about system properties


http://www.cs.wisc.edu/condor

The Condor Project (Established ‘85)

Distributed High Throughput Computing research performed by a team of ~25 faculty, full time staff and students who:

- face software engineering challenges in a distributed UNIX/Linux/NT environment,
- are involved in national and international collaborations,
- actively interact with academic and commercial users,
- maintain and support a large distributed production environment,
- and educate and train students.

Funding: US Govt. (DoD, DoE, NASA, NSF), AT&T, IBM, INTEL, Microsoft, UW-Madison

What is Condor?

- Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing facility.
- Condor uses ClassAd Matchmaking to make sure that everyone is happy.
- Features
  - Unix and NT
  - Operational since 1986
  - Manages more than 1300 CPUs at UW-Madison
  - Software available free on the web
  - More than 150 Condor installations worldwide in academia and industry
  - Non-dedicated resources
  - Job checkpoint and migration

What is High-Throughput Computing?

- High-performance: CPU cycles/second under ideal circumstances.
  - “How fast can I run simulation X on this machine?”
- High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances.
  - “How many times can I run simulation X in the next month using all available machines?”
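The difference between the two questions can be made concrete with a small calculation. Everything in the sketch below is hypothetical: the machine names, run times and availability fractions are made-up illustration values, and the model ignores scheduling overhead, checkpointing and failures.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    hours_per_run: float       # wall-clock hours for one run of simulation X
    available_fraction: float  # share of the month the owner lets jobs run


def fastest_run_hours(machines):
    """High-performance view: best single-run time on any one machine."""
    return min(m.hours_per_run for m in machines)


def runs_per_month(machines, hours_in_month=30 * 24):
    """High-throughput view: completed runs across all machines in a month."""
    total = 0
    for m in machines:
        usable_hours = hours_in_month * m.available_fraction
        total += int(usable_hours // m.hours_per_run)
    return total


# Hypothetical pool: one dedicated fast node and two part-time desktops.
pool = [
    Machine("fast-dedicated", hours_per_run=2.0, available_fraction=1.0),
    Machine("desktop-a", hours_per_run=6.0, available_fraction=0.5),
    Machine("desktop-b", hours_per_run=6.0, available_fraction=0.5),
]

best_single = fastest_run_hours(pool)
fast_only = runs_per_month(pool[:1])
whole_pool = runs_per_month(pool)
print(best_single, fast_only, whole_pool)
```

In this toy pool the slower, part-time desktops do nothing for the best single-run time, but they still raise the monthly run count from 360 to 480.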

Some HTC Challenges

Condor does whatever it takes to run your jobs, even if some machines…

- Crash (or are disconnected)
- Run out of disk space
- Don’t have your software installed
- Are frequently needed by others
- Are far away & managed by someone else

What is ClassAd Matchmaking?

- Condor uses ClassAd Matchmaking to make sure that work gets done within the constraints of both users and owners.
- Users (jobs) have constraints:
  - “I need an Alpha with 256 MB RAM”
- Owners (machines) have constraints:
  - “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
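The two-sided nature of matchmaking can be illustrated with a toy model built from the two example constraints above. This is a minimal sketch, not Condor's ClassAd language or API: the `Ad` class, the attribute names and the greedy pairing loop are all assumptions made for illustration, with Python functions standing in for ClassAd requirement expressions.

```python
from typing import Callable, List, Tuple

# A requirement sees the ad's own attributes and the candidate's attributes.
Requirement = Callable[[dict, dict], bool]


class Ad:
    """Toy stand-in for a ClassAd: attributes plus a requirements predicate."""

    def __init__(self, attrs: dict, requirements: Requirement) -> None:
        self.attrs = attrs
        self.requirements = requirements


def mutually_acceptable(a: Ad, b: Ad) -> bool:
    """A match needs both sides to accept each other."""
    return a.requirements(a.attrs, b.attrs) and b.requirements(b.attrs, a.attrs)


def matchmake(jobs: List[Ad], machines: List[Ad]) -> List[Tuple[str, str]]:
    """Greedily pair each job with the first free machine both sides accept."""
    free = list(machines)
    pairs = []
    for job in jobs:
        for machine in free:
            if mutually_acceptable(job, machine):
                pairs.append((job.attrs["name"], machine.attrs["name"]))
                free.remove(machine)
                break
    return pairs


def needs_alpha_256(me: dict, other: dict) -> bool:
    # "I need an Alpha with 256 MB RAM"
    return other["arch"] == "ALPHA" and other["memory_mb"] >= 256


def owner_away_and_not_bob(me: dict, other: dict) -> bool:
    # "Only run jobs when I am away ... and never run jobs owned by Bob."
    return bool(me["owner_away"]) and other["owner"] != "bob"


def accept_anything(me: dict, other: dict) -> bool:
    return True


alice_job = Ad({"name": "alice-job", "owner": "alice"}, needs_alpha_256)
bob_job = Ad({"name": "bob-job", "owner": "bob"}, needs_alpha_256)
alpha1 = Ad({"name": "alpha1", "arch": "ALPHA", "memory_mb": 512,
             "owner_away": True}, owner_away_and_not_bob)
intel1 = Ad({"name": "intel1", "arch": "INTEL", "memory_mb": 1024,
             "owner_away": True}, accept_anything)

matches = matchmake([bob_job, alice_job], [alpha1, intel1])
print(matches)
```

Bob's job stays unmatched: the Alpha machine's owner rejects it, and the job itself rejects the Intel machine. Either side can veto a pairing.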

Condor Pool Architecture

Mathematicians Solve NUG30

- Looking for the solution to the NUG30 quadratic assignment problem
- An informal collaboration of mathematicians and computer scientists
- Condor-G delivered 3.46E8 CPU seconds in 7 days (peak 1009 processors) in U.S. and Italy (8 sites)

14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23

MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
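To put the NUG30 figures in perspective, the quoted CPU time can be converted into an average processor count. The sketch below uses only the numbers on the slide, plus a 365-day year for the CPU-years conversion. It also checks that the listed sequence is a permutation of 1 to 30, the form a solution to a size-30 quadratic assignment problem takes.

```python
CPU_SECONDS = 3.46e8      # delivered by Condor-G, as quoted
DAYS = 7
PEAK_PROCESSORS = 1009

wall_seconds = DAYS * 24 * 3600
average_processors = CPU_SECONDS / wall_seconds
cpu_years = CPU_SECONDS / (365 * 24 * 3600)
utilisation_vs_peak = average_processors / PEAK_PROCESSORS

solution = [14, 5, 28, 24, 1, 3, 16, 15, 10, 9, 21, 2, 4, 29, 25,
            22, 13, 26, 17, 30, 6, 20, 19, 8, 18, 7, 27, 12, 11, 23]
is_permutation = sorted(solution) == list(range(1, 31))

print(f"average of about {average_processors:.0f} processors busy for 7 days")
print(f"about {cpu_years:.1f} CPU-years; {utilisation_vs_peak:.0%} of peak on average")
print(f"listed solution is a permutation of 1..30: {is_permutation}")
```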

What Is Condor-G?

- Enhanced version of Condor that provides robust job management for Globus Toolkit
  - Robust replacement for globusrun
  - Provides extensive fault-tolerance
  - Brings Condor’s job management features to Globus jobs
- Two Parts
  - Globus Universe
  - GlideIn
- Excellent example of applying the general purpose Globus Toolkit to solve a particular problem (i.e. high-throughput computing) on the Grid

Why Use Condor-G

- Condor
  - Designed to run jobs within a single administrative domain
- Globus Toolkit
  - Designed to run jobs across many administrative domains
- Condor-G
  - Combine the strengths of both

Web Services

- Increasingly popular standards-based framework for accessing network applications
  - W3C standardization; Microsoft, IBM, Sun, others
- XML and XML Schema
  - Representing data in a portable format
- WSDL: Web Services Description Language
  - Interface Definition Language for Web services
- SOAP: Simple Object Access Protocol
  - XML-based RPC protocol; common WSDL target
- WS-Inspection
  - Conventions for locating service descriptions
- UDDI: Universal Description, Discovery, & Integration
  - Directory for Web services
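To show what "XML-based RPC" means in practice, the sketch below builds and parses a minimal SOAP 1.1 style envelope using Python's standard library. The envelope namespace is the SOAP 1.1 one; the application namespace `urn:example:jobs`, the `submitJob` operation and its parameters are hypothetical, and a real service would publish its operations in a WSDL document rather than rely on hand-built messages like this.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
APP_NS = "urn:example:jobs"  # hypothetical application namespace


def build_request(operation: str, params: dict) -> str:
    """Wrap an RPC-style call in a SOAP Envelope/Body."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_request(xml_text: str):
    """Recover the operation name and parameters from the envelope."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    call = body[0]  # first child of Body is the call element
    operation = call.tag.split("}", 1)[1]
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


message = build_request("submitJob", {"executable": "sim_x", "count": 3})
operation, params = parse_request(message)
print(message)
print(operation, params)
```

All values travel as text inside the XML, so the integer `3` comes back as the string `"3"`. Typing is the job of XML Schema and the service description, not of the envelope.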

XML Web Services Framework

[Figure: layered diagram of the XML Web Services Framework. Labels recovered from the figure follow.]

- Syntax (XML)
- Structure (XML Schemas)
- Wire: Envelope & Extensibility (SOAP)
- Description: Service Description (WSDL); Process Orchestration (XLANG)
- Discovery: Inspection (DISCO); Directory (UDDI)
- Other labels: Attachments; Security; Routing; Reliability
- Legend: W3C Rec; W3C WG; Future

“New” Globus: Open Grid Services Architecture (OGSA)

- Service orientation to virtualize resources
- From Web services:
  - Standard interface definition mechanisms: multiple protocol bindings, multiple implementations, local/remote transparency
- Building on Globus Toolkit:
  - Grid service: semantics for service interactions
  - Management of transient instances (& state)
  - Factory, Registry, Discovery, other services
  - Reliable and secure transport
- Multiple hosting targets: J2EE, .NET, “C”, …

http://www.globus.org/research/papers/ogsa.pdf
http://www.globus.org/research/papers/gsspec.pdf

OGSA Service Model

- System comprises (typically a few) persistent services & (potentially many) transient services
- All services adhere to specified Grid service interfaces and behaviours
  - Reliable invocation, lifetime management, discovery, authorization, notification, upgradeability, concurrency, manageability
- Interfaces for managing Grid service instances
  - Factory, registry, discovery, lifetime, etc.

=> Reliable, secure management of distributed state
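The factory, registry and lifetime ideas in this service model can be sketched as a toy in-process model. None of the class or method names below come from the OGSA or Grid service specifications; they are hypothetical, and time is passed in explicitly as a plain number so the behaviour is easy to follow.

```python
import itertools


class ServiceInstance:
    """A transient service: has a handle, some state, and an expiry time."""

    def __init__(self, handle: str, expires_at: float) -> None:
        self.handle = handle
        self.expires_at = expires_at
        self.state = {}


class Registry:
    """Soft-state registry: entries disappear once their lifetime lapses."""

    def __init__(self) -> None:
        self._entries = {}

    def register(self, instance: ServiceInstance) -> None:
        self._entries[instance.handle] = instance

    def discover(self, now: float):
        """Drop expired instances, then return the handles still alive."""
        self._entries = {h: s for h, s in self._entries.items()
                         if s.expires_at > now}
        return sorted(self._entries)


class Factory:
    """Persistent service that creates and registers transient instances."""

    def __init__(self, registry: Registry, prefix: str = "svc") -> None:
        self._registry = registry
        self._prefix = prefix
        self._ids = itertools.count(1)

    def create(self, now: float, lifetime: float) -> ServiceInstance:
        handle = f"{self._prefix}-{next(self._ids)}"
        instance = ServiceInstance(handle, now + lifetime)
        self._registry.register(instance)
        return instance


def extend_lifetime(instance: ServiceInstance, now: float, lifetime: float) -> None:
    """Client keep-alive: push the expiry time further into the future."""
    instance.expires_at = max(instance.expires_at, now + lifetime)


registry = Registry()
factory = Factory(registry)
a = factory.create(now=0, lifetime=10)
b = factory.create(now=0, lifetime=10)
extend_lifetime(a, now=5, lifetime=10)   # a now expires at 15, b still at 10

alive_early = registry.discover(now=8)
alive_late = registry.discover(now=12)
print(alive_early, alive_late)
```

The instance whose client stopped refreshing it simply ages out of the registry. The point of the soft-state approach is that no explicit clean-up call is needed when a client disappears.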

Using OGSA to Construct Grid Environments

[Figure: three panels. Labels recovered from each panel follow.]

- (a) Simple Hosting Environment: Factory; Registry; H2R Mapper; Service, Service, Service, ...
- (b) Virtual Hosting Environment: Factory; Registry; H2R Mapper; Service, Service, Service, ...; two inner environments each labelled F, R, F, M, S, S, S
- (c) Compound Services: E2E Factory; E2E Reg; E2E H2R Mapper; E2E S, E2E S, E2E S; two inner environments labelled F1, R, M, S, S, S and F2, R, M, S, S, S

In each case, Registry handle is effectively the unique name for the virtual organization.

Evolution of Globus

- Initial exploration (1996-1999; Globus 1.0)
  - Extensive application experiments; core protocols
- Data Grids (1999-??; Globus 2.0+)
  - Large-scale data management and analysis
- Open Grid Services Architecture (2001-??, Globus 3.0)
  - Integration with Web services, hosting environments, resource virtualization
  - Databases, higher-level services
- Radically scalable systems (2003-??)
  - Sensors, wireless, ubiquitous computing

Summary

- The Grid problem: Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations
- Grid architecture: Protocol, service definition for interoperability & resource sharing
- Grid Middleware
  - Globus Toolkit: a source of protocol and API definitions, and reference implementations
  - Open Grid Services Architecture: represents next step in evolution
  - Condor: High throughput Computing
  - Web Services & W3C: leveraging e-business
- e-Science Projects: applying Grid concepts to applications