Autonomic Grid Computing: Autonomics for eScience
Manish Parashar, Rutgers University, USA & Omer Rana, Cardiff University, UK
Tutorial at eScience '07, Bangalore, India, 12/10/07
Pervasive Computation Ecosystems & Next Generation eScience/Engineering
• Pervasive Computational Ecosystems (Pervasive Grids)
– address applications in an end-to-end manner
– on-demand/ad-hoc access to geographically distributed computing, communication and information resources
• computers, networks, data archives, instruments, observatories, experiments, sensors/actuators, ambient information, etc.
• Knowledge-based, information/data-driven, context/content-aware, computationally intensive, pervasive applications
– symbiotically and opportunistically combine services/computations, real-time information, experiments, and observations to manage, control, predict, adapt, optimize, …
• crisis management; monitoring and predicting natural phenomena; monitoring and managing engineered systems; optimizing business processes
Autonomic Grid Ecosystems and Knowledge-based, Dynamic, Information-driven Investigation
Information-driven Management of Subsurface Geosystems: The Instrumented Oil Field (with UT-CSM, UT-IG, OSU, UMD, ANL)
• A data-driven + model-driven closed loop:
– detect and track changes in data during production
– invert data for reservoir properties
– detect and track reservoir changes
– assimilate data and reservoir properties into the evolving reservoir model
– use simulation and optimization to guide future production
Management of Subsurface/Near-Surface Geosystems
• Landfills, oilfields, underground pollution, and undersea reservoirs, coupled through models, simulation, data and control
• Vision: diverse geosystems – similar solutions
Management of the Ruby Gulch Waste Repository (with UT-CSM, INL, OU)
• Ruby Gulch Waste Repository / Gilt Edge Mine, South Dakota
– ~20 million cubic yards of waste rock
– AMD (acid mine drainage) impacting drinking water supplies
• Monitoring system
– multi-electrode resistivity system (523 electrodes); one data point every 2.4 seconds from any 4 electrodes
– temperature & moisture sensors in four wells
– flowmeter at the bottom of the dump
– weather station
– manually sampled chemical/air ports in wells
– approx. 40K measurements/day
“Towards Dynamic Data-Driven Management of the Ruby Gulch Waste Repository,” M. Parashar, et al., DDDAS Workshop, ICCS 2006, Reading, UK, LNCS, Springer Verlag, Vol. 3993, pp. 384–392, May 2006.
Adaptive Fusion of Stochastic Information for Imaging Fractured Vadose Zones
• Near-real-time monitoring, characterization and prediction of flow through fractured rocks
• Closed loop: inverse modeling of system responses yields parameters, boundary & initial conditions; forward modeling yields a prediction; the prediction is compared with observations; a good match feeds the application, a bad match triggers redesign of the sensing network, and the cycle repeats
Data-Driven Forest Fire Simulation (U of AZ)
• Predict the behavior and spread of wildfires (intensity, propagation speed and direction, modes of spread)
– based on both dynamic and static environmental and vegetation conditions
– factors include fuel characteristics and configurations, chemical reactions, balances between different modes of heat transfer, topography, and fire/atmosphere interactions
“Self-Optimization of Large Scale Wild Fire Simulations,” J. Yang, H. Chen, S. Hariri and M. Parashar, Proceedings of the 5th International Conference on Computational Science (ICCS 2005), Atlanta, GA, USA, Springer-Verlag, May 2005.
The Instrumented Ocean Ecosystem
• A loop linking observations, scientific understanding, and forecasts
NJ Turnpike Delaware River Bridge – Focus of Fatigue Study
System for Laser Treatment of Cancer – UT Austin (Source: L. Demkowicz, UT Austin)
Synthetic Environment for Continuous Experimentation – Purdue University (Source: A. Chaturvedi, Purdue Univ.)
Structural Health Monitoring and Critical Event Prediction – Stanford University (Source: C. Farhat, SU)
• On-board sensing & processing and on-line prognosis & processing, with damage detection and material characterization
• High-performance off-line prognosis by modeling and simulation, fed by data for model update
• Outputs: pilot display of crisis; on-demand maintenance with detailed report
Integrated Wireless Phone Based Emergency Response System – Notre Dame (Source: G. Madey, ND)
• Detect abnormal patterns in mobile call activity and locations
• Initiate dynamic data-driven simulations to predict the evolution of the abnormality
• Initiate higher-resolution data collection in localities of interest
• Interface with emergency response Decision Support Systems
Auto-Steered Information-Decision Processes for Electrical Systems Asset Management – Iowa State Univ. (Source: J. McCalley, Iowa State)
• Develop a hardware-software prototype for auto-steering the information-decision cycles inherent to managing operations, maintenance, & planning of high-voltage electric power transmission systems
• The architecture spans wireless sensor network transceivers (2.4 or 5 GHz ISM band) and local communication boxes, base stations with port servers and network switches, a data processor, wide-area links (satellite, Ethernet radio, optical fiber, power line, Internet), secure access, and a remote control headquarters
Many Application Areas …
• Hazard prevention, mitigation and response
– earthquakes, hurricanes, tornados, wild fires, floods, landslides, tsunamis, terrorist attacks
• Critical infrastructure systems
– condition monitoring and prediction of future capability
• Transportation of humans and goods
– safe, speedy, and cost-effective transportation networks and vehicles (air, ground, space)
• Energy and environment
– safe and efficient power grids; safe and efficient operation of regional collections of buildings
• Health
– reliable and cost-effective health care systems with improved outcomes
• Enterprise-wide decision making
– coordination of dynamic distributed decisions for supply chains under uncertainty
• Next-generation communication systems
– reliable wireless networks for homes and businesses
• …
Report of the Workshop on Dynamic Data Driven Applications Systems, F. Darema et al., March 2006, www.dddas.org (Source: M. Rotea, NSF)
Some Application Characteristics (Vectors)
• Close the loop!
– acquire, assimilate, analyze/predict, control, …
• Manage uncertainty
– incomplete knowledge, system/application dynamics, …
• Opportunistic
– applications do not depend on the availability of the information, but can opportunistically use information if it is available
• Compositional
– multiple (independent) application entities cooperate with each other to collectively provide the end-to-end capabilities
• Control
– provide sensor steering/autonomic control capabilities using actuation devices; a minimal loop sketch follows below
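These vectors share a common runtime shape: a continuous acquire/assimilate/predict/control loop. A minimal, illustrative Python sketch of that loop; every function here is an invented placeholder, not any particular system's API:

```python
# Minimal close-the-loop sketch: acquire, assimilate, analyze/predict,
# control. All functions are illustrative placeholders.
import random

def acquire():                    # read sensors/instruments
    return {"reading": random.gauss(0.0, 1.0)}

def assimilate(model, obs):       # fold the observation into the model state
    model["state"] = 0.9 * model["state"] + 0.1 * obs["reading"]
    return model

def predict(model):               # analyze/predict from the updated model
    return model["state"] * 1.1

def control(forecast):            # steer sensors/actuators when needed
    if abs(forecast) > 1.0:
        print(f"actuate: correct drift, forecast={forecast:.2f}")

model = {"state": 0.0}
for _ in range(5):                # in practice the loop runs continuously
    model = assimilate(model, acquire())
    control(predict(model))
```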
The Research Challenge: Addressing Uncertainty!
• System uncertainty
– very large scales
– ad hoc structures/behaviors
– dynamics
– heterogeneity
• capability, connectivity, reliability, guarantees, QoS
– lack of guarantees
– lack of common/complete knowledge
• Information uncertainty
– availability, resolution, quality of information
– device capability, operation, calibration
– trust in data, data models
– semantics
• Application uncertainty
– dynamic behaviors
• space-time adaptivity
– dynamic and complex couplings
– dynamic and complex (opportunistic) interactions
– software/systems engineering issues
• emergent rather than by design
• Must be addressed at multiple levels
– algorithms
• asynchronous/chaotic, failure-tolerant, …
• uncertainty estimation/characterization (probabilistic collocation)
– programming systems
• adaptive, application/system-aware, proactive, …
– infrastructure
• decoupled, self-managing, resilient, …
Pervasive Grid Computing – Research Issues, Opportunities
• Programming systems/models for data integration and runtime self-management
– components and compositions capable of adapting behavior, interactions and information
– correctness, consistency, performance, quality-of-service constraints
• Content-based asynchronous and decentralized discovery and access services
– semantics, metadata definition, indexing, querying, notification
• Data management mechanisms for data acquisition and transport with real-time, space and data-quality constraints
– high data volumes/rates, heterogeneous data qualities, sources
– in-network aggregation, integration, assimilation, caching
• Runtime execution services that guarantee correct, reliable execution with predictable and controllable response time
– data assimilation, injection, adaptation
• Security, trust, access control, data provenance, audit trails, accounting
A layered stack (bottom to top): organizations (physical resources); Grid middleware infrastructure; virtual organizations and virtual machines; virtualization (create and manage VOs and VMs); abstraction (support abstract behaviors/assumptions); programming system (programming model + abstract machine); applications.
Programming Pervasive Grid Systems
– Programming system
• programming model, languages/abstractions – syntax + semantics
– entities, operations, rules of composition, models of coordination/communication
• abstract machine – execution context and assumptions
• infrastructure, middleware and runtime
– Hide vs. expose uncertainty?
• virtual homogeneity, ease of programming
• robustness
• cross-layer adaptation and optimization
– the inverted stack …
“Conceptual and Implementation Models for the Grid,” M. Parashar and J.C. Browne, Proceedings of the IEEE, Special Issue on Grid Computing, IEEE Press, Vol. 93, No. 3, March 2005.
Project AutoMate: Autonomics for Applications
• Conceptual models and implementation architectures
– programming systems based on popular programming models
– asynchronous computational engines
– content-based coordination and messaging middleware
– amorphous and emergent overlays
• WWW: http://automate.rutgers.edu
• Architecture layers:
– Autonomic Grid Applications
– Programming System: autonomic components, dynamic composition, opportunistic interactions, collaborative monitoring/control (Accord programming framework)
– Decentralized Coordination Engine: agent framework, decentralized reactive tuple space (Rudder coordination middleware)
– Semantic Middleware Services: content-based discovery, associative messaging (Meteor/Squid content-based middleware; ontology, taxonomy)
– Content Overlay: content-based routing engine, self-organizing overlay
– crosscutting Sesame/DAIS protection service
“AutoMate: Enabling Autonomic Grid Applications,” M. Parashar et al., Cluster Computing: The Journal of Networks, Software Tools, and Applications, Vol. 9, No. 2, pp. 161–174, 2006.
Efforts @ TASSL
• Programming abstractions and systems (GridMap/iZone, Accord, Seine, Salsa)
– discovering, querying, interacting with, and controlling instrumented physical systems
– adapting behaviors and interactions/compositions of application components/services
• Data quality and uncertainty management
– online characterization of information quality and estimation of uncertainty, so that it can effectively drive decision making
• Autonomic infrastructure (CometG, Meteor, Squid)
– asynchronous computation, communication and coordination
– data acquisition, wide-area streaming and in-transit manipulation
– content-based data/service discovery and composition
– decoupled associative messaging
– application-driven dynamic management of sensor/actuator systems
• overlay management, iso-clustering, computation/communication/power tradeoffs
Accord: Rule-Based Programming System
• Accord is a programming system that supports the development of autonomic applications.
– Enables definition of autonomic components with programmable behaviors and interactions.
– Enables runtime composition and autonomic management of these components using dynamically defined rules:
• dynamic specification of adaptation behaviors using rules
• enforcement of adaptation behaviors by invoking sensors and actuators
• runtime conflict detection and resolution
• 3 prototypes: object-based, component-based (CCA), and service-based (Web Services); a minimal rule-evaluation sketch follows the citation below.
“Accord: A Programming Framework for Autonomic Applications,” H. Liu and M. Parashar, IEEE Transactions on Systems, Man and Cybernetics, Special Issue on Engineering Autonomic Systems, IEEE Press, Vol. 36, No. 3, pp. 341–352, 2006.
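The paper describes Accord rules as if-condition-then-action expressions evaluated by an element manager over a component's sensors and actuators, with conflict detection among fired rules. A minimal sketch of that pattern; the class and method names (`Rule`, `ElementManager`, `inject_rule`) are illustrative, not Accord's actual API:

```python
# Illustrative sketch of Accord-style rule evaluation; names are
# hypothetical, not the actual Accord API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    name: str
    priority: int                                   # for conflict resolution
    condition: Callable[[Dict[str, float]], bool]   # evaluated over sensors
    action: Callable[[], None]                      # invokes actuators

class ElementManager:
    """Monitors sensors and enforces dynamically defined adaptation rules."""
    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def inject_rule(self, rule: Rule) -> None:      # rules arrive at runtime
        self.rules.append(rule)

    def step(self, sensors: Dict[str, float]) -> None:
        fired = [r for r in self.rules if r.condition(sensors)]
        if fired:  # naive conflict resolution: highest priority wins
            max(fired, key=lambda r: r.priority).action()

# Example: shrink an element's buffer when memory pressure is high.
mgr = ElementManager()
mgr.inject_rule(Rule(
    name="reduce-buffer-under-pressure", priority=10,
    condition=lambda s: s["mem_used_pct"] > 90,
    action=lambda: print("actuator: set_buffer_size(small)"),
))
mgr.step({"mem_used_pct": 95.0})
```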
Autonomic Element/Behaviors in Accord
• An autonomic element pairs a computational element with an element manager and exposes three ports:
– functional port (the element's computational interface)
– control port (sensors and actuators)
– operational port (rule injection and management)
• The element manager combines internal state, contextual state, and rules to drive event generation, actuator invocation, and invocation of other interfaces
• For composition, a composition manager takes the application workflow, application requirements and application strategies, and derives per-element interaction rules and behavior rules
LLC-Based Self-Management within Accord
• Element/service managers are augmented with LLC (limited look-ahead control) controllers
– the manager monitors the state/execution context of elements
– it enforces adaptation actions determined by the controller
– controller advice augments human-defined rules
• In a self-managing element, the element manager passes internal state and contextual state to the LLC controller, which evaluates a model of the computational element against an optimization function and returns advice
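To make the controller's role concrete, here is a rough, illustrative sketch of limited look-ahead control: enumerate action sequences over a short horizon against a model of the element, and return the first action of the cheapest sequence as advice. The one-dimensional state model, cost function, and action set are assumptions, not Accord's implementation:

```python
# Minimal limited look-ahead control (LLC) sketch; model, cost and action
# set are illustrative assumptions.
from itertools import product

def llc_advise(state, actions=(-1.0, 0.0, 1.0),
               model=lambda s, a: 0.9 * s + a,       # assumed element model
               cost=lambda s, a: s * s + 0.1 * a * a,
               horizon=3):
    """Search all action sequences over the horizon; return the first
    action of the cheapest sequence (receding-horizon control)."""
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            total += cost(s, a)
            s = model(s, a)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq[0]

print(llc_advise(5.0))   # the advice an element manager would enforce
```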
For Example: Optimizing Via Component Replacement
For Example: Optimizing Via Component Adaptation
For Example: Healing Via Component Replacement
Decentralized (Decoupled/Asynchronous) Content-based Middleware Services
Pervasive Grid Environment
Self-Managing Overlay (SquidTON)
The Instrumented Oil Field of the Future (UT-CSM, UT-IG, RU, OSU, UMD, ANL)
– Detect and track changes in data during production
– Invert data for reservoir properties
– Detect and track reservoir changes
– Assimilate data and reservoir properties into the evolving reservoir model
– Use simulation and optimization to guide future production and future data acquisition strategy
• Production of oil and gas can take advantage of installed sensors that monitor the reservoir's state as fluids are extracted
• Knowledge of the reservoir's state during production can result in better engineering decisions
– economic evaluation; physical characteristics (bypassed oil, high-pressure zones); production techniques for safe operating conditions in complex and difficult areas
“Application of Grid-Enabled Technologies for Solving Optimization Problems in Data-Driven Reservoir Studies,” M. Parashar, H. Klie, U. Catalyurek, T. Kurc, V. Matossian, J. Saltz and M. Wheeler, Future Generation Computer Systems (FGCS): The International Journal of Grid Computing: Theory, Methods and Applications, Elsevier Science Publishers, Vol. 21, Issue 1, pp. 19–26, 2005.
Effective Oil Reservoir Management: Well Placement/Configuration
• Why is it important?
– better utilization/cost-effectiveness of existing reservoirs
– minimizing adverse effects on the environment
• Better management leaves less bypassed oil; bad management leaves much bypassed oil
Dynamic Data-Driven Reservoir Management: “Closing the Loop” Using Optimization
• START: based on the present subsurface knowledge and numerical model, optimize economic revenue, environmental hazard, …
• Make a management decision and plan optimal data acquisition
• Acquire remote sensing data; improve knowledge of the subsurface to reduce uncertainty
• Update knowledge of the model, improve the numerical model, and repeat
• Supporting components: a dynamic decision system; dynamic data-driven assimilation (data assimilation, subsurface characterization, experimental design); autonomic Grid middleware; Grid data management and processing middleware
An Autonomic Well Placement/Configuration Workflow
• An optimization service (SPSA, VFSA, or exhaustive search) generates guesses and sends them to the IPARS factory
• If a guess is not in the MySQL database, the factory instantiates IPARS with the guess as parameter and starts parallel IPARS instances, which connect to DISCOVER
• If a guess is in the database, the response is sent to the clients and a new guess is requested from the optimizer
• DISCOVER notifies clients, and clients interact with IPARS
• The workflow runs on the AutoMate programming system/Grid middleware, drawing on history/archived data, sensor/context data, and external feeds (oil prices, weather, etc.); a sketch of the guess-caching loop follows below
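A schematic of that guess-caching loop: the optimizer never re-runs the simulator for a guess it has already evaluated. The service names follow the slide, but the in-memory dict standing in for the MySQL database and the `next_guess`/`ipars_simulate` callables are invented stand-ins, not real AutoMate APIs:

```python
# Sketch of the well-placement loop: cache simulator evaluations so the
# optimizer never re-runs IPARS for a guess it has already seen.
import random

def next_guess(best):  # e.g. a VFSA/SPSA step; here: random perturbation
    y, z = best
    return (y + random.randint(-2, 2), z + random.randint(-2, 2))

def ipars_simulate(guess):  # stand-in for a parallel IPARS instance
    y, z = guess
    return -((y - 5) ** 2 + (z - 7) ** 2)   # toy objective: revenue

cache = {}                        # plays the role of the MySQL database
best, best_val = (0, 0), float("-inf")
for _ in range(100):
    guess = next_guess(best)
    if guess not in cache:        # "if guess not in DB: instantiate IPARS"
        cache[guess] = ipars_simulate(guess)
    val = cache[guess]            # "if guess in DB: send response to clients"
    if val > best_val:
        best, best_val = guess, val
print(best, best_val)
```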
Autonomic Oil Well Placement/Configuration (VFSA)
• Permeability and pressure contours; 3 wells, 2D profile
• Contours of NEval(y,z,500): exhaustive search requires NY×NZ (450) evaluations, and the minimum appears in the contour plot
• The VFSA solution “walk” found it after 20 (81) evaluations
“An Autonomic Reservoir Framework for the Stochastic Optimization of Well Placement,” V. Matossian, M. Parashar, W. Bangerth, H. Klie, M.F. Wheeler, Cluster Computing: The Journal of Networks, Software Tools, and Applications, Kluwer Academic Publishers, Vol. 8, No. 4, pp. 255–269, 2005; “Autonomic Oil Reservoir Optimization on the Grid,” V. Matossian, V. Bhat, M. Parashar, M. Peszynska, M. Sen, P. Stoffa and M.F. Wheeler, Concurrency and Computation: Practice and Experience, John Wiley and Sons, Vol. 17, Issue 1, pp. 1–26, 2005.
Wide Area Data Streaming in the Fusion Simulation Project (FSP)
• Wide-area data streaming requirements
– enable high-throughput, low-latency data transfer to support near-real-time access to the data
– minimize related overhead on the executing simulation
– adapt to network conditions to maintain desired QoS
– handle network failures while eliminating data loss
• GTC runs on teraflop/petaflop supercomputers; an end-to-end system with monitoring routines, over 40 Gbps links, supports data archiving, data replication, large-scale data analysis, post-processing, user monitoring and visualization
Autonomic Data Streaming for Coupled Fusion Simulation Workflows
• Grid-based coupled fusion simulation workflow
– runs for days between ORNL (TN), NERSC (CA), PPPL (NJ) and Rutgers (NJ)
– generates multi-terabytes of data
– data coupled, analyzed and visualized at runtime
• Services, connected over a data-Grid middleware and logistical networking backbone:
– Simulation: Simulation Service (SS)
– Analysis/Viz: Data Analysis Service (DAS)
– Storage: Data Storage Service (DSS)
– Transfer: Autonomic Data Streaming Service (ADSS), a managed service comprising a Buffer Manager Service (BMS) and a Data Transfer Service (DTS)
ADSS Implementation
• The ADSS couples the BMS and DTS with an element/service manager and an LLC controller
• Measured inputs from the SS and the network: bandwidth, buffer size, and data rate (captured as service state and contextual state)
• The LLC model predicts buffer size and data rate; an optimization function drives the controller
• Rule-based adaptation (backed by a rule base) acts on the DTS/BMS, directing data either to the Grid middleware (logistical networking) backbone or to the local hard disk
Adaptive Data Transfer
• No congestion in intervals 1–9
– data transferred over the WAN
• Congestion at intervals 9–19
– the controller recognizes the congestion and advises the element manager, which in turn adapts the DTS to transfer data to local storage (LAN)
• Adaptation continues until the network is no longer congested
– data sent to local storage by the DTS falls to zero at the 19th controller interval
[Figure: data transferred by the DTS (MB) to WAN vs. LAN and available bandwidth (Mb/s) per controller interval (0–24), with the congestion window marked.]
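The WAN/LAN switch the plot describes reduces to a threshold rule. A sketch, with an invented bandwidth threshold and sink functions:

```python
# Sketch of the congestion-driven DTS adaptation: stream to the WAN while
# bandwidth is healthy, divert to local (LAN) storage under congestion.
# The 40 Mb/s threshold and the sink functions are illustrative.
CONGESTION_THRESHOLD_MBPS = 40.0

def send_wan(block): print(f"WAN <- {block}")
def send_lan(block): print(f"LAN <- {block}")   # local storage fallback

def dts_step(block, measured_bandwidth_mbps):
    if measured_bandwidth_mbps >= CONGESTION_THRESHOLD_MBPS:
        send_wan(block)          # no congestion: normal wide-area path
    else:
        send_lan(block)          # congested: divert until congestion clears

for interval, bw in enumerate([100, 95, 30, 25, 90], start=1):
    dts_step(f"block-{interval}", bw)
```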
Adaptive Buffer Management
• Uniform buffer management is used when the data generation rate is constant
• Aggregate buffer management is triggered when congestion increases
[Figure: blocks grouped by the BMS (1 block = 4 MB) and bandwidth (Mb/s) per controller interval (0–26): uniform buffer management before congestion, aggregate buffer management as congestion increases.]
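A toy version of that BMS policy: group more 4 MB blocks per transfer as congestion rises, so fewer, larger transfers hit the congested link. The thresholds and grouping function are assumptions, not the ADSS code:

```python
# Toy BMS policy sketch: aggregate more 4 MB blocks per transfer when
# the link is congested; numbers are illustrative.
BLOCK_MB = 4

def blocks_to_group(bandwidth_mbps):
    if bandwidth_mbps >= 80:      # uncongested: uniform buffering
        return 1
    # congested: aggregate, more aggressively as bandwidth drops
    return min(8, max(2, int(80 / max(bandwidth_mbps, 1))))

for bw in (100, 60, 30, 10):
    n = blocks_to_group(bw)
    print(f"{bw:>3} Mb/s -> group {n} blocks ({n * BLOCK_MB} MB per transfer)")
```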
Adaptation of the Workflow
• Create multiple instances of the Autonomic Data Streaming Service (ADSS) when the effective network transfer rate dips below a threshold (around 100 Mb/s in our case)
• The maximum number of instances is kept within a fixed limit
[Figure: % network throughput and number of ADSS instances vs. data generation rate (Mb/s); the simulation's transfer path fans out from ADSS-0 to additional instances (ADSS-1, ADSS-2), each with its own data transfer buffer.]
% network throughput is the difference between the maximum and the current network transfer rate; an instance-scaling sketch follows below.
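The scale-out rule can be sketched the same way; the threshold follows the slide, while the cap and the controller shape are invented:

```python
# Sketch of ADSS instance scaling: add an instance when the effective
# transfer rate drops below the threshold, up to a cap. Illustrative only.
THRESHOLD_MBPS, MAX_INSTANCES = 100.0, 4

def adjust_instances(current, effective_rate_mbps):
    if effective_rate_mbps < THRESHOLD_MBPS and current < MAX_INSTANCES:
        return current + 1       # fan out: spawn another streaming instance
    return current

instances = 1
for rate in (120, 95, 70, 60, 110):
    instances = adjust_instances(instances, rate)
    print(f"rate={rate} Mb/s -> {instances} ADSS instance(s)")
```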
Autonomic Data Streaming and In-Transit Data Manipulation: Two-Level Cooperative Self-Management
• First level: application-level data streaming
– proactive QoS management based on online control and user-defined policies
• Second level: adaptive runtime management strategies for in-transit data manipulation
– quick, opportunistic and reactive
– adaptive buffering and processing techniques performed at each in-transit node
– masks idle time of in-transit data by performing additional computations in the data pipeline
• Data blocks flow from the data producer (the simulation, with its LLC controller and service manager providing application-level proactive management) through in-transit nodes (process, buffer, forward, under in-transit-level reactive management) to the coupled data consumer or sink; a sketch of the in-transit node follows below
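A compact sketch of the in-transit node's reactive choice: process data opportunistically while the outgoing link has slack, and simply buffer and forward when the queue backs up. The queue limit is an assumed knob, not a value from the paper:

```python
# Sketch of an in-transit node: process data opportunistically while the
# outgoing link has slack, otherwise buffer and forward. Illustrative.
from collections import deque

class InTransitNode:
    def __init__(self, queue_limit=8):
        self.buffer = deque()
        self.queue_limit = queue_limit

    def on_block(self, block, link_busy):
        self.buffer.append(block)
        if link_busy or len(self.buffer) > self.queue_limit:
            self.forward()                   # reactive: drain under pressure
        else:
            self.process_then_forward()      # opportunistic extra computation

    def process_then_forward(self):
        block = self.buffer.popleft()
        print(f"processed+forwarded {block}")

    def forward(self):
        while self.buffer:
            print(f"forwarded {self.buffer.popleft()}")

node = InTransitNode()
node.on_block("block-1", link_busy=False)
node.on_block("block-2", link_busy=True)
```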
Summary of Results
• Experimental setup
– “in-transit” nodes (clusters): Rutgers (“Frea”) and PPPL (“Portalx” and “Gridn”)
– data generation: NERSC (“Seaborg”) and ORNL (“RAM”)
• Adaptive processing at in-transit nodes reduced idle time per data item from 40% to 2%
• Quality of data received at the sink improved with adaptation at in-transit nodes
• End-to-end cooperative management
– results in better QoS than standalone mechanisms at the two levels operating without cooperation
– 25% less data was written at the data generation end in the face of congestion at in-transit nodes
Healthcare Scenario – Patient-Driven Intervention
• Support for “continuous” monitoring of an individual: “individualised healthcare”
– post-operative care (short term)
– long-term home care
• Medical sensors of different types
– reaction to intake of particular medication
– manage progression of disease
– assess post-operative care
Key Issues
• Power management
– limited power usage (periodic transmissions)
– power through body movements
– mechanisms for energy storage and use of multiple power sources
• Data quality
– reduce total data volume transmitted
• Development of implantable “biocompatible” sensors
– use of specialist “blood analyte” sensors
– “biosensing”: signals generated from biological phenomena (glucose and lactate enzymes)
• Fusing data from multiple sensors
Devising ‘Trends’ Algorithms Based on Single Data Points (reality check: single points are gappy and error-prone)
• Compare continuous data: 576 blood glucose concentration measurements (every 5 min over 24 hours, then the same after 6 months of a drug intervention in the same individual)
[Figure: blood glucose concentration (mmol/L, 0–10) vs. time of day (00:00–23:20), pre-intervention and post-intervention traces.]
Healthcare@Home – Integrated Inputs (©2005 Cardiff University / WeSC; from Ed Conley)
• Sensors connect over Bluetooth
• Data may be recorded and maintained at various places
• Need for analysis at a central site
Logical Architecture
• End-to-end patient view
• Overall system/server architecture view
• Intended users: clinicians (doctors, nurses, etc.), patients, researchers, policy makers
@ Home – Sending Medical Readings to the Server
(1) Patient uses glucose meter
(2) Device sends reading via Bluetooth to a hub
(3) Hub sends reading via GPRS to the H@H server
On Receiving Medical Data from Sensors
• Readings arriving over GPRS are validated
• Validated data is saved to the DCCR (Diabetes Continuing Care Reference) and a notification is generated; a minimal ingest sketch follows below
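A minimal sketch of that ingest path; the plausibility range, alert threshold, and the `save_to_dccr`/`notify` stubs are all invented for illustration:

```python
# Illustrative ingest pipeline: validate a glucose reading, persist it to
# the DCCR store, and raise a notification. All names/ranges are assumed.
PLAUSIBLE_MMOL_L = (1.0, 30.0)   # reject sensor glitches outside this range

def save_to_dccr(patient_id, reading):
    print(f"DCCR <- {patient_id}: {reading} mmol/L")

def notify(patient_id, message):
    print(f"notify({patient_id}): {message}")

def on_reading(patient_id, reading):
    lo, hi = PLAUSIBLE_MMOL_L
    if not (lo <= reading <= hi):
        notify(patient_id, f"discarded implausible reading {reading}")
        return
    save_to_dccr(patient_id, reading)
    if reading > 11.0:           # assumed hyperglycemia alert threshold
        notify(patient_id, "glucose above alert threshold")

on_reading("patient-42", 12.3)
```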
Autonomic (Physics/Model/System-Driven) Application Runtime Management in V-Grid
• A Virtual Grid Resource Autonomic Runtime Manager (ARM) loops over each level of the Grid/application hierarchy:
– V-Grid Monitoring (self-observation, context-awareness): system states (CPU, memory, bandwidth, availability, etc.) and application states (computation/communication ratio, nature of applications, application dynamics)
– V-Grid Deduction (self-adaptation, self-optimization, self-healing): identify and characterize natural regions, define objective functions and management strategy, define VCUs
– V-Grid Execution: partition, map and tune
– self-learning feeds back across iterations
• NR: application natural regions; VCU: virtual computational unit
• V-Grid maps virtual resource units onto a capability-ordered resource hierarchy: computing nodes, divisional/departmental resources, and institutional resources over the WAN (e.g., IBM SP2, E10K, Beowulf Linux clusters, workstations)
Autonomic Runtime Management
• Self-observation & analysis drive self-optimization & execution: the dynamic driver application is partitioned/composed (and repartitioned/recomposed) by the autonomic partitioner into virtual computation units (VCUs), mapped onto virtual resource units of a heterogeneous, dynamic computational environment
• Monitoring & context-aware services: an application monitoring service and a resource monitoring service
• System state characterization: a performance prediction module and a system capability module (CPU, memory, bandwidth, availability, access policy) feed a resource history module and a system state synthesizer, yielding a normalized resource metric (NRM)
• Application state characterization: nature of adaptation, application dynamics, and computation/communication yield a normalized work metric (NWM)
• An objective function synthesizer turns both metrics into prescriptions for mapping, distribution, redistribution and execution
• Autonomic scheduling: deduction engines combine the current system state and current application state to drive global and local Grid scheduling, with VGTS (virtual Grid time scheduling) and VGSS (virtual Grid space scheduling) at each level
Experimental Evaluation – Autonomic Runtime Management
• Experiment setup: IBM SP4 cluster (DataStar at the San Diego Supercomputer Center, 1,632 processors in total); SP4 (p655) node: 8 processors (1.5 GHz), 16 GB memory, 6.0 GFlops
• RM3D: refinement levels = 3, refinement factor = 2
• Performance gain: AHMP 30%–42%, LPA 20%–29% (GPA: greedy; LPA: level-based)
[Figure: overall performance – execution time (sec) for the RM3D application (1000 time steps, size 256x64x64) on 64–1280 processors for GPA, LPA and SBC+AHMP.]
ARMaDA: Application-Sensitive Adaptations
• PAC tuple, 5-component metric
• Octant approach: application runtime state
• GrACE (ISP) and Vampire (pBD-ISP, GMISP+SP) partitioners
• ARMaDA framework selects partitioners based on
– computation/communication
– application dynamics
– nature of adaptation
• RM3D, 64 processors on “Blue Horizon”
– 100 steps, base grid 128x32x32
– 3 levels, RF = 2, regrid every 4 steps
ARMaDA: System-Sensitive Adaptations
• System characteristics obtained using NWS
• RM3D compressible turbulence application
– 128x64x64 base (coarse) grid
– 3 levels, factor-2 refinement
• System/environment: University of Texas at Austin (32 nodes), Rutgers (16 nodes)
• A resource monitoring tool feeds CPU, memory and bandwidth measurements to a capacity calculator; the system-sensitive partitioner combines the weights and capacities to produce partitions for the application:
C_k = w_p * P_k + w_m * M_k + w_b * B_k
(the capacity C_k of node k is a weighted sum of its processing, memory and bandwidth availability; a small sketch follows below)
[Figure: execution time (secs) on 4–32 processors, non-system-sensitive vs. system-sensitive partitioning, plus a table of static vs. dynamic sensing times (roughly 805.54 s vs. 423.72 s; per-processor values garbled in extraction).]
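To make the capacity formula concrete, a small sketch that normalizes per-node capacities into partition shares; the weights and readings are invented, not measured NWS values:

```python
# Sketch: turn per-node CPU/memory/bandwidth readings into relative
# capacities C_k = w_p*P_k + w_m*M_k + w_b*B_k, then into partition shares.
W_P, W_M, W_B = 0.5, 0.3, 0.2          # relative importance of P, M, B

nodes = {                               # normalized availability in [0, 1]
    "utexas-01": {"P": 0.9, "M": 0.8, "B": 0.6},
    "rutgers-01": {"P": 0.5, "M": 0.7, "B": 0.9},
}

capacity = {k: W_P * v["P"] + W_M * v["M"] + W_B * v["B"]
            for k, v in nodes.items()}
total = sum(capacity.values())
for k, c in capacity.items():
    print(f"{k}: capacity={c:.2f}, workload share={c / total:.1%}")
```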
Handle Different Resource Situations
• Objectives: efficiency, performance, survivability
• When resources are scarce (less space, more time): application-level out-of-core (ALOC)
• When resources are under-utilized (less time, more space): application-level pipelining (ALP)
• ALP trades space (resource) for time (performance); ALOC trades time (performance) for space (resource); a schematic mode test follows below
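The choice between the two modes reduces to a memory-pressure test. A schematic, with invented thresholds; the real runtime considers richer state:

```python
# Schematic mode selection between ALP and ALOC based on memory headroom.
# Thresholds are illustrative assumptions.
def select_mode(mem_free_fraction):
    if mem_free_fraction < 0.15:
        return "ALOC"   # scarce memory: page working set to disk, pay time
    if mem_free_fraction > 0.5:
        return "ALP"    # plentiful memory: pipeline stages, pay space
    return "NORMAL"

for free in (0.05, 0.3, 0.7):
    print(free, "->", select_mode(free))
```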
Experimental Results – ALP
• Experiment setup: IBM SP4 cluster (DataStar at the San Diego Supercomputer Center, 1,632 processors in total); SP4 (p655) node: 8 processors (1.5 GHz), 16 GB memory, 6.0 GFlops
• Performance gain: up to 40% on 512 processors
[Figure: execution time (sec) for the RM3D application (1000 time steps, size 128x32x32) on 16–512 processors, AHMP without ALP vs. AHMP with ALP.]
Effects of Finite Memory – ALOC
• Testbed: Intel Pentium 4 CPU, 1.70 GHz, Linux 2.4 kernel; cache size 256 KB, physical memory 512 MB, swap space 1 GB
[Figure: two plots of processing time (ms/operation) versus allocated memory (MB) on this testbed.]
Experimental Results – ALOC
• Testbed: Beowulf cluster (Frea at Rutgers, 64 processors); Intel Pentium 4 CPU, 1.70 GHz, Linux 2.4 kernel; cache size 256 KB, physical memory 512 MB, swap space 1 GB
[Figure: number of page faults and execution time for the RM3D application (100 time steps, size 128x32x32, 4 refinement levels) across regridding steps; the non-ALOC run reaches a crash point, while the ALOC run continues.]
Conclusion
• Pervasive computational ecosystems and eScience
– unprecedented opportunity
• can enable a new generation of knowledge-based, data- and information-driven, context-aware, computationally intensive, pervasive applications
– unprecedented research challenges
• scale, complexity, heterogeneity, dynamism, reliability, uncertainty, …
• applications, algorithms, measurements, data/information, software
• Autonomic Grid computing
– addresses the complexity of pervasive Grid environments
• Project AutoMate: autonomics for eScience on the Grid
– semantics + autonomics
– Accord, Rudder/Comet, Meteor, Squid, Topos, …
• More information, publications, software
– www.caip.rutgers.edu/~parashar/
– [email protected]
Thank You!