Chapter 4: Introduction to Grid and its Evolution

Prepared by: NITIN PANDYA, Assistant Professor, SVBIT

Overview
- Background: What is the Grid?
- Related technologies
- Grid applications
- Communities
- Grid tools
- Case studies

What is a Grid?
- Many definitions exist in the literature.
- Early definition (Foster and Kesselman, 1998): a computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.
- Kleinrock, 1969: We will probably see the spread of computer utilities, which, like present electric and telephone utilities, will service individual homes and offices across the country.

Grid computing (1)
- Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations (I. Foster)

Grid computing (2)
- Information grid: large-scale access to distributed data (the Web)
- Data grid: management and processing of very large distributed data sets
- Computing grid: a meta-computer

Parallelism vs. grids: a few reminders
- Grids date back only to 1996.
- Parallelism is older (first classification in 1972).
- Motivations:
  - need for more computing power (weather forecasting, atomic simulation, genomics)
  - need for more storage capacity (petabytes and more)
  - in a word: improve performance
- Three ways to improve performance:
  - Work harder: use faster hardware
  - Work smarter: optimize algorithms
  - Get help: use more computers

Performance
- Ideally, performance grows linearly with the number of processors.
- Speed-up: if $T_S$ is the best time to process a problem sequentially, then the parallel processing time with $P$ processors should ideally be $T_P = T_S / P$. The speedup is defined as $S = T_S / T_P$.

- The speedup is limited by Amdahl's law: any parallel program has a purely sequential part $F$ and a parallelizable part $T_{\parallel}$, so $T_S = F + T_{\parallel}$ and the speedup is bounded: $S = \dfrac{F + T_{\parallel}}{F + T_{\parallel}/P} < P$ (for $F > 0$ and $P > 1$).
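To make the bound concrete, here is a minimal Python sketch that evaluates the speedup formula above. The function names and the 10/90 split between sequential and parallelizable time are assumptions chosen for illustration, not values from the slides.

```python
def amdahl_speedup(sequential_time, parallel_time, processors):
    """Speedup S = (F + T_par) / (F + T_par / P) for P processors."""
    if processors < 1:
        raise ValueError("processors must be >= 1")
    total = sequential_time + parallel_time
    return total / (sequential_time + parallel_time / processors)


def speedup_limit(sequential_time, parallel_time):
    """Value the speedup approaches as P grows without bound: (F + T_par) / F."""
    if sequential_time <= 0:
        raise ValueError("the limit is finite only when the sequential part is > 0")
    return (sequential_time + parallel_time) / sequential_time


if __name__ == "__main__":
    F, T_PAR = 10.0, 90.0  # assumed split: 10% sequential, 90% parallelizable
    for p in (1, 2, 10, 100, 1000):
        print(f"P={p:5d}  speedup={amdahl_speedup(F, T_PAR, p):6.2f}")
    print("limit as P grows:", speedup_limit(F, T_PAR))
```

With this assumed split, adding processors beyond a few hundred barely helps, because the sequential part dominates the remaining run time.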

- Scale-up: if $T_{P,S}$ is the time to solve a problem of size $S$ with $P$ processors, then $T_{P,S}$ should also be the time to solve a problem of size $n \cdot S$ with $n \cdot P$ processors.

Why do we need Grids?
- Many large-scale problems cannot be solved by a single computer.
- Data and resources are globally distributed.

Background: Related technologies
- Cluster computing
- Peer-to-peer computing
- Internet computing

Cluster computing
- Idea: put some PCs together and get them to communicate.
- Cheaper to build than a mainframe or supercomputer.
- Clusters come in different sizes.
- Scalable: a cluster can grow by adding more PCs.

Cluster Architecture (figure)

Peer-to-Peer computing
- Connect to other computers.
- Can access files from any computer on the network.
- Allows data sharing without going through a central server.
- The decentralized approach is also useful for the Grid.

Peer-to-Peer architecture (figure)

Internet computing
- Idea: there are many idle PCs on the Internet.
- They can perform other computations while not being used.
- Cycle scavenging: rely on getting free time on other people's computers.
- Example: SETI@home
- What are the advantages and disadvantages of cycle scavenging?

Some Grid Applications
- Distributed supercomputing
- High-throughput computing
- On-demand computing
- Data-intensive computing
- Collaborative computing

Grid Users
- Many levels of users:
  - Grid developers
  - Tool developers
  - Application developers
  - End users
  - System administrators

Some Grid challenges
- Data movement
- Data replication
- Resource management
- Job submission

Computational grid
- Hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities (I. Foster)

- Performance criteria: security, reliability, computing power, latency, throughput, scalability, services

Grid characteristics
- Large scale
- Heterogeneity
- Multiple administrative domains
- Autonomy and coordination
- Dynamicity
- Flexibility
- Extensibility
- Security

Levels of cooperation in a computing grid
- End system (computer, disk, sensor): multithreading, local I/O
- Cluster: synchronous communications, distributed shared memory (DSM), parallel I/O, parallel processing
- Intranet/Organization: heterogeneity, distributed administration, distributed file systems and databases, load balancing, access control
- Internet/Grid: global supervision, brokers, negotiation, cooperation

Basic services
- Authentication/Authorization/Traceability
- Activity control (monitoring)
- Resource discovery
- Resource brokering
- Scheduling
- Job submission, data access/migration and execution
- Accounting

Layered Grid Architecture (by analogy to the Internet architecture)
- Application
- Collective: coordinating multiple resources (ubiquitous infrastructure services, application-specific distributed services)
- Resource: sharing single resources (negotiating access, controlling use)
- Connectivity: talking to things (communication via Internet protocols, and security)
- Fabric: controlling things locally (access to, and control of, resources)
- Internet protocol architecture, for comparison: Application, Transport, Internet, Link
(From I. Foster)

Elements of the Problem
- Resource sharing
  - Computers, storage, sensors, networks
  - Heterogeneity of device, mechanism, policy
  - Sharing is conditional: negotiation, payment
- Coordinated problem solving
  - Integration of distributed resources
  - Compound quality-of-service requirements
- Dynamic, multi-institutional virtual organizations
  - Dynamic overlays on classic organization structures
  - Map to underlying control mechanisms

(From I. Foster)

Resources
- Description
- Advertising
- Cataloging
- Matching
- Claiming
- Reserving
- Checkpointing

Resource management (1)
- Services and protocols depend on the infrastructure.
- Some parameters:
  - stability of the infrastructure (same set of resources or not)
  - freshness of the resource availability information
  - reservation facilities
  - multiple-resource or single-resource brokering

- Example of a request: "I need from 10 to 100 CEs (computing elements), each with at least 512 MB of RAM and a computing power of 150 Mflops."

Resource management and scheduling (1)
- Levels of scheduling:
  - job scheduling (global level; performance metric: throughput)
  - resource scheduling (performance metrics: fairness, utilization)
  - application scheduling (performance metrics: response time, speedup, produced data)
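The example request for 10 to 100 computing elements with at least 512 MB of RAM and 150 Mflops can be read as a matching problem over a catalog of advertised resources. The sketch below is one possible illustration in Python. The `ComputeElement` and `Request` types, their field names, and the "prefer the fastest" policy are all assumptions made for the example and do not correspond to any particular grid middleware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeElement:
    """Hypothetical advertised description of one computing element."""
    name: str
    ram_mb: int
    mflops: float
    available: bool = True


@dataclass(frozen=True)
class Request:
    """Hypothetical request: between min_count and max_count suitable elements."""
    min_count: int
    max_count: int
    min_ram_mb: int
    min_mflops: float


def match(request, catalog):
    """Return up to max_count suitable elements, or [] if fewer than min_count qualify."""
    suitable = [
        ce for ce in catalog
        if ce.available
        and ce.ram_mb >= request.min_ram_mb
        and ce.mflops >= request.min_mflops
    ]
    if len(suitable) < request.min_count:
        return []  # the request cannot be satisfied at all
    # Assumed policy: hand out the fastest elements first.
    suitable.sort(key=lambda ce: ce.mflops, reverse=True)
    return suitable[:request.max_count]


if __name__ == "__main__":
    # Synthetic catalog, invented for the example.
    catalog = [
        ComputeElement(f"ce{i:03d}", ram_mb=256 * (1 + i % 4), mflops=100 + 10 * i)
        for i in range(40)
    ]
    chosen = match(Request(10, 100, 512, 150), catalog)
    print(len(chosen), "elements matched")
```

A real broker would also have to deal with stale availability information and with claiming or reserving the matched elements, which this sketch leaves out.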

- Mapping/scheduling process:
  - resource discovery and selection
  - assignment of tasks to computing resources
  - data distribution
  - task scheduling on the computing resources
  - (communication scheduling)

Resource management and scheduling (2)
- Individual performance goals are not necessarily consistent with the global (system) performance.

- Grid problems:
  - predictions are not definitive: the platform is dynamic
  - heterogeneous platforms
  - checkpointing and migration

A Resource Management System Example (Globus)
[Figure. Component labels: Application, Broker, Co-allocator, Information Service, GRAM (three instances), local resource managers (LSF, Condor, NQE). Arrow labels: RSL, RSL specialization, Ground RSL, Simple ground RSL, Queries & Info.]
- RSL: Resource Specification Language
- NQE: Network Queuing Environment (batch management; developed by Cray Research)
- LSF: Load Sharing Facility (task scheduling and load balancing; developed by Platform Computing)

Resource information (1)
- What is to be stored?
  - virtual organizations, people, computing resources, software packages, communication resources, event producers, devices
  - what about data?

- A key issue in such dynamic environments
- A first approach: a (distributed) directory (LDAP)
  - easy to use
  - tree structure
  - distribution
  - static
  - mostly read; updates are inefficient
  - hierarchical
  - poor procedural language

Resource information (2)
- Goal:
  - dynamicity
  - complex relationships
  - frequent updates
  - complex queries
- A second approach: a (relational) database
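As a rough illustration of why a relational store suits frequent updates and complex queries, the sketch below keeps resource information in an in-memory SQLite database using Python's standard `sqlite3` module. The schema (`site`, `node`), the column names, and the sample rows are invented for the example and are not taken from any grid information service.

```python
import sqlite3


def build_catalog():
    """Create an in-memory catalog with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE site (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE node (
            id       INTEGER PRIMARY KEY,
            site_id  INTEGER NOT NULL REFERENCES site(id),
            ram_mb   INTEGER NOT NULL,
            cpu_load REAL    NOT NULL
        );
        """
    )
    conn.executemany("INSERT INTO site VALUES (?, ?)", [(1, "siteA"), (2, "siteB")])
    conn.executemany(
        "INSERT INTO node VALUES (?, ?, ?, ?)",
        [(1, 1, 512, 0.9), (2, 1, 1024, 0.2), (3, 2, 2048, 0.1), (4, 2, 256, 0.0)],
    )
    return conn


def update_load(conn, node_id, cpu_load):
    """Frequent update: refresh the load reported by one node."""
    conn.execute("UPDATE node SET cpu_load = ? WHERE id = ?", (cpu_load, node_id))


def idle_nodes_per_site(conn, min_ram_mb, max_load):
    """Complex query: count lightly loaded, large-memory nodes per site."""
    cursor = conn.execute(
        """
        SELECT site.name, COUNT(*)
        FROM node JOIN site ON site.id = node.site_id
        WHERE node.ram_mb >= ? AND node.cpu_load <= ?
        GROUP BY site.name
        ORDER BY site.name
        """,
        (min_ram_mb, max_load),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = build_catalog()
    print(idle_nodes_per_site(conn, 512, 0.5))
    update_load(conn, 1, 0.3)  # node 1 becomes lightly loaded
    print(idle_nodes_per_site(conn, 512, 0.5))
```

The join with grouping expresses a relationship between sites and nodes that a purely hierarchical directory lookup handles less naturally.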

Programming on the grid: potential programming models
- Message passing (PVM, MPI)
- Distributed shared memory
- Data parallelism (HPF, HPC++)
- Task parallelism (Condor)
- Client/server (RPC)
- Agents
- Integration systems (CORBA, DCOM, RMI)

Program execution: issues
- Parallelize the program with the right job structure, communication patterns/procedures, and algorithms
- Discover the available resources
- Select the suitable resources
- Allocate or reserve these resources
- Migrate the data
- Initiate computations
- Monitor the executions (checkpoints?)
- React to changes
- Collect results
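The execution steps above can be pictured with a toy simulation. This is only a sketch under strong assumptions: the "computation" is a running sum over step numbers, resources are plain names, failures are injected through a hypothetical `failures` dictionary, and a checkpoint is just a saved step counter plus partial result. It shows how checkpointing lets a job react to a resource failure by resuming elsewhere instead of starting over.

```python
def execute(n_steps, resources, failures, checkpoint_every=3):
    """Run n_steps of a toy computation, migrating to the next resource on failure.

    failures maps a resource name to the step at which it fails (hypothetical
    fault injection). Returns (result, log).
    """
    candidates = list(resources)            # discover the available resources
    checkpoint_step, checkpoint_value = 0, 0
    log = []
    while candidates:
        resource = candidates.pop(0)        # select and allocate one resource
        # Migrate the data: restart from the last checkpoint.
        step, partial = checkpoint_step, checkpoint_value
        log.append(f"start {resource} at step {step}")
        failed = False
        while step < n_steps:               # initiate and monitor the computation
            if failures.get(resource) == step:
                log.append(f"{resource} failed at step {step}")
                failed = True               # react to the change
                break
            partial += step                 # the toy "computation"
            step += 1
            if step % checkpoint_every == 0:
                checkpoint_step, checkpoint_value = step, partial
        if not failed:
            return partial, log             # collect the results
    raise RuntimeError("no resources left to run the job")


if __name__ == "__main__":
    result, log = execute(10, ["r1", "r2"], {"r1": 5})
    print(result)
    for line in log:
        print(line)
```

In this run the second resource resumes from the checkpoint taken at step 3, so only the work done between the checkpoint and the failure is repeated.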

Data management
- It was long neglected, though it is a key issue.
- Issues:
  - indexing
  - retrieval
  - replication
  - caching
  - traceability (auditing)
  - and security

Some Grid-Related Projects
- Globus
- Condor
- Nimrod-G

Globus Grid Toolkit
- Open-source toolkit for building Grid systems and applications
- Enabling technology for the Grid
- Share computing power, databases, and other tools securely online
- Facilities for:
  - Resource monitoring
  - Resource discovery
  - Resource management
  - Security
  - File management

Data Management in Globus Toolkit
- Data movement
  - GridFTP
  - Reliable File Transfer (RFT)
- Data replication
  - Replica Location Service (RLS)
  - Data Replication Service (DRS)

GridFTP
- High-performance, secure, reliable data transfer protocol
- Optimized for wide-area networks
- Superset of the Internet FTP protocol
- Features:
  - Multiple data channels for parallel transfers
  - Partial file transfers
  - Third-party transfers
  - Reusable data channels
  - Command pipelining

More GridFTP features
- Auto-tuning of parameters
- Striping: transfer data in parallel among multiple senders and receivers instead of just one
- Extended block mode:
  - Send data in blocks
  - Block size and offset are known
  - Data can arrive out of order
  - Allows multiple streams

Striping Architecture
- Use striped servers
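The idea behind extended block mode, as described above, is that each block carries its own offset, so blocks sent over several streams can arrive in any order and still be written to the right place. The Python sketch below illustrates only that idea. It is not the GridFTP wire format, and the function names are assumptions made for the example.

```python
import random


def split_into_blocks(data, block_size):
    """Cut data into (offset, payload) blocks, as a sender might."""
    return [
        (offset, data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def reassemble(blocks, total_size):
    """Write each block at its offset; the arrival order does not matter."""
    buffer = bytearray(total_size)
    received = 0
    for offset, payload in blocks:
        buffer[offset:offset + len(payload)] = payload
        received += len(payload)
    if received != total_size:
        raise ValueError("missing or duplicated data")
    return bytes(buffer)


if __name__ == "__main__":
    data = bytes(range(256)) * 4
    blocks = split_into_blocks(data, 100)
    random.shuffle(blocks)  # simulate out-of-order arrival over several streams
    assert reassemble(blocks, len(data)) == data
    print("reassembled", len(data), "bytes from", len(blocks), "blocks")
```

The byte-count check is a deliberately simple completeness test. A real receiver would track which byte ranges have arrived so that it can also request the missing ones.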

Limitations of GridFTP
- Not a web service protocol (does not employ SOAP, WSDL, etc.)
- Requires the client to maintain an open socket connection throughout the transfer
  - Inconvenient for long transfers
- Cannot recover from client failures

GridFTP (figure)

Reliable File Transfer (RFT)
- Web service with job-scheduler functionality for data movement
- User provides source and destination URLs
- Service writes the job description to a database and moves the files
- Service methods for querying transfer status

RFT (figure)

Replica Location Service (RLS)
- Registry that keeps track of where replicas exist on physical storage systems
- Users or services register files in RLS when the files are created
- Distributed registry
  - May consist of multiple servers at different sites
  - Increases scale
  - Provides fault tolerance

Replica Location Service (RLS), continued
- Logical file name: unique identifier for the contents of a file
- Physical file name: location of a copy of the file on a storage system
- A user can provide a logical name and ask for replicas
- Or query to find the logical name associated with a physical file location
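The two lookups just described (logical name to replicas, and physical location back to logical name) amount to a pair of mappings. The sketch below is a single-process Python illustration of that data structure. The `ReplicaRegistry` class, its method names, and the example URLs are hypothetical and do not reflect the actual RLS interface, which is distributed across servers.

```python
from collections import defaultdict


class ReplicaRegistry:
    """Toy in-memory registry of logical-to-physical file name mappings."""

    def __init__(self):
        self._replicas = defaultdict(set)  # logical name -> set of physical names
        self._logical_of = {}              # physical name -> logical name

    def register(self, logical_name, physical_name):
        """Record that physical_name holds a copy of logical_name."""
        existing = self._logical_of.get(physical_name)
        if existing is not None and existing != logical_name:
            raise ValueError(f"{physical_name} is already registered as {existing}")
        self._replicas[logical_name].add(physical_name)
        self._logical_of[physical_name] = logical_name

    def replicas(self, logical_name):
        """All known physical locations for a logical file name."""
        return sorted(self._replicas.get(logical_name, ()))

    def logical_name(self, physical_name):
        """Reverse lookup: which logical file does this location hold?"""
        return self._logical_of.get(physical_name)


if __name__ == "__main__":
    registry = ReplicaRegistry()
    # Hypothetical names, for illustration only.
    registry.register("lfn:run42.dat", "gsiftp://site-a.example.org/data/run42.dat")
    registry.register("lfn:run42.dat", "gsiftp://site-b.example.org/store/run42.dat")
    print(registry.replicas("lfn:run42.dat"))
    print(registry.logical_name("gsiftp://site-b.example.org/store/run42.dat"))
```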

Data Replication Service (DRS)
- Pull-based replication capability
- Implemented as a web service
- Higher-level data management service built on top of RFT and RLS
- Goal: ensure that a specified set of files exists on a storage site
  - First, query RLS to locate the desired files
  - Next, create a transfer request using RFT
  - Finally, register the new replicas with RLS

Condor
- Original goal: high-throughput computing
- Harvest wasted CPU power from other machines
- Can also be used on a dedicated cluster
- Condor-G: Condor interface to Globus resources

Earth System Grid
- Provide climate scientists with access to large datasets
- Data are generated by computational models, which requires massive computational power
- Most scientists work with subsets of the data
- Requires access to local copies of the data

ESG Infrastructure
- Archival storage systems and disk storage systems at several sites
- Storage resource managers and GridFTP servers to provide access to storage systems
- Metadata catalog services
- Replica location services
- Web portal user interface

Earth System Grid (figure)

Earth System Grid Interface (figure)

Laser Interferometer Gravitational Wave Observatory (LIGO)
- Instruments at two sites to detect gravitational waves
- Each experiment run produces millions of files
- Scientists at other sites want these datasets on local storage
- LIGO deploys RLS servers at each site to register local mappings and collect information about mappings at other sites

Large Scale Data Replication for LIGO
- Goal: detection of gravitational waves
- Three interferometers at two sites
- Generate 1 TB of data daily
- Need to replicate this data across 9 sites to make it available to scientists
- Scientists need to learn where data items are and how to access them

LIGO (figure)

LIGO Solution
- Lightweight Data Replicator (LDR)
- Uses parallel data streams, tunable TCP windows, and tunable write/read buffers
- Tracks where copies of specific files can be found
- Stores descriptive information (metadata) in a database
  - Can select files based on description rather than filename

TeraGrid
- NSF high-performance computing facility
- Nine distributed sites, each with different capabilities, e.g., computation power, archiving facilities, visualization software
- Applications may require more than one site
- Data sizes on the order of gigabytes or terabytes

TeraGrid (figure)

TeraGrid
- Solution: use GridFTP and RFT with a front-end command-line tool (tgcp)
- Benefits of the system:
  - Simple user interface
  - High-performance data transfer capability
  - Ability to recover from both client and server software failures
  - Extensible configuration

TGCP Details
- Idea: hide low-level GridFTP commands from users
- Copy the file smallfile.dat in a working directory to another system:

    tgcp smallfile.dat tg-login.sdsc.teragrid.org:/users/ux454332

- Equivalent GridFTP command:

    globus-url-copy -p 8 -tcp-bs 1198372 \
        gsiftp://tg-gridftprr.uc.teragrid.org:2811/home/navarro/smallfile.dat \
        gsiftp://tg-login.sdsc.teragrid.org:2811/users/ux454332/smallfile.dat

The reality
- We have spent a lot of time talking about "The Grid"
- There is the Web and the Internet
- Is there a single Grid?

The reality
- Many types of Grids exist:
  - Private vs. public
  - Regional vs. global
  - All-purpose vs. particular scientific problem