Transparent Cross-Border Migration of Parallel Multi Node Applications
Dominic Battré, Matthias Hovestadt, Odej Kao, Axel Keller, Kerstin Voss
Cracow Grid Workshop 2007
Axel Keller 2
Outline
Motivation
The Software Stack
Cross-Border Migration
Summary
HPC4U: Highly Predictable Clusters for Internet-Grids
EC funded project in FP6
AssessGrid: Advanced Risk Assessment & Management for Trustable Grids
EC funded project in FP6
The Gap between Grid and RMS
[Figure: a user request with an SLA reaches the grid middleware, which hands the job to the local RMS of machines M1, M2, M3. Reliability? Quality of Service? Best Effort!]
User asks for SLA
Grid middleware realizes job by means of local RMS
BUT: RMS offer Best Effort
Need: SLA-aware RMS (Guaranteed!)
HPC4U: Highly Predictable Clusters for Internet-Grids
Objective
• Software-only solution for an SLA-aware, fault tolerant infrastructure, offering reliability and QoS, and acting as active Grid component
Key Features
• System level checkpointing
• Job migration
• Job types: sequential and MPI-parallel
• Planning based scheduling
HPC4U: Planning Based Scheduling
[Figure: new jobs entering queues (queuing systems) versus new jobs placed in a machine/time plan (planning systems)]

                                   queuing systems    planning systems
planned time frame                 present            present and future
new job requests                   insert in queues   re-planning
assignment of planned start time   no                 all requests
runtime estimation                 not necessary      mandatory
backfilling                        optional           yes, implicit
advance reservations               not possible       yes, trivial
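The properties of a planning system (every request gets a start time, runtime estimates are mandatory, backfilling is implicit, advance reservations are trivial) can be illustrated with a small sketch. This is not the HPC4U or OpenCCS scheduler; the class `Plan` and its method names are hypothetical, and the cluster is assumed to consist of identical nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Toy planning-based scheduler for `nodes` identical nodes (hypothetical)."""
    nodes: int
    # Each entry is (start, end, nodes_used, job_name).
    entries: list = field(default_factory=list)

    def _free_nodes(self, start, end):
        """Smallest number of free nodes at any time in [start, end)."""
        # Usage only rises at entry start times, so it is enough to check the
        # interval start and every entry start inside the interval.
        points = {start} | {s for s, _, _, _ in self.entries if start < s < end}
        used = max(
            sum(n for s, e, n, _ in self.entries if s <= t < e) for t in points
        )
        return self.nodes - used

    def submit(self, name, nodes, runtime, earliest=0):
        """Plan a request and return its start time (runtime is mandatory)."""
        if nodes > self.nodes:
            raise ValueError("request exceeds cluster size")
        # The earliest feasible start is either `earliest` or the end of an
        # already planned job, so only those times need to be tried.
        ends = {e for _, e, _, _ in self.entries if e > earliest}
        for start in sorted({earliest} | ends):
            if self._free_nodes(start, start + runtime) >= nodes:
                self.entries.append((start, start + runtime, nodes, name))
                return start
        raise RuntimeError("unreachable: the plan is empty after the last job")


plan = Plan(nodes=4)
plan.submit("A", nodes=4, runtime=10)                 # starts immediately
plan.submit("R", nodes=4, runtime=5, earliest=20)     # advance reservation
plan.submit("B", nodes=2, runtime=8)                  # fits the gap before R
```

Job B lands in the gap between A and the reservation without any separate backfilling step, which is what "yes, implicit" in the table refers to.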
HPC4U: Software Stack
[Figure: HPC4U software stack with the components User-/Broker-Interface, CLI, Negotiation, Scheduler, SSC, RMS, the Process, Network and Storage subsystems, and the Cluster]
HPC4U: Checkpointing Cycle
[Figure: message sequence between RMS and the Process, Network and Storage subsystems]
1. CP job + halt
2. In-transit packets
3. Return: "Checkpoint completed!"
4. Snapshot!
5. Link to snapshot
6. Resume job
7. Job running again
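The order of the seven steps can be written down as a short orchestration sketch. The subsystem objects and all method names (`checkpoint_and_halt`, `save_in_transit_packets`, `link_checkpoint`, ...) are hypothetical stand-ins, and which subsystem handles which step is an assumption read off the step labels, not a documented HPC4U interface.

```python
class Recorder:
    """Stand-in for a subsystem; it only records the calls it receives."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getattr__(self, action):
        # Any unknown attribute becomes a callable that logs its own name.
        def call(*args):
            self.log.append(f"{self.name}.{action}")
            return f"{self.name}-{action}-handle"
        return call


def checkpoint_cycle(process, network, storage):
    """Run one checkpoint in the order of steps 1 to 7."""
    process.checkpoint_and_halt()        # 1. CP job + halt
    network.save_in_transit_packets()    # 2. in-transit packets
    # 3. "Checkpoint completed!" is modelled by the calls above returning.
    snapshot = storage.snapshot()        # 4. snapshot
    storage.link_checkpoint(snapshot)    # 5. link checkpoint to snapshot
    process.resume()                     # 6. resume job
    return snapshot                      # 7. job is running again


log = []
handle = checkpoint_cycle(
    Recorder("process", log), Recorder("network", log), Recorder("storage", log)
)
```

The job stays halted between steps 1 and 6, so the snapshot and the process checkpoint describe the same point in time.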
Cross Border Migration: Intra Domain
[Figure: two clusters, each with User-/Broker-Interface, CLI, CRM, Negotiation, Scheduler, SSC, PP, RMS and the Process, Network and Storage subsystems]
Cross Border Migration: Target Retrieval
[Figure: two clusters, each with User-/Broker-Interface, CLI, CRM, Negotiation, Scheduler, SSC, PP, RMS and the Process, Network and Storage subsystems]
Cross Border Migration: Checkpoint Migration
[Figure: two clusters, each with User-/Broker-Interface, CLI, CRM, Negotiation, Scheduler, SSC, PP, RMS and the Process, Network and Storage subsystems]
Cross Border Migration: Remote Execution
[Figure: two clusters, each with User-/Broker-Interface, CLI, CRM, Negotiation, Scheduler, SSC, PP, RMS and the Process, Network and Storage subsystems]
Cross Border Migration: Result Migration
[Figure: two clusters, each with User-/Broker-Interface, CLI, CRM, Negotiation, Scheduler, SSC, PP, RMS and the Process, Network and Storage subsystems]
Cross-Border Migration: Using Globus
[Figure: software stack with a WS-AG interface next to CLI and CRM, connected to a Broker]
• WS-AG implementation based on GT4
• Developed in EU project AssessGrid
• Source specifies SLA / file staging parameters
  - Subset of JSDL (POSIX Jobs)
• Resource determination via broker
• Source directly contacts destination
• Destination pulls migration data via GridFTP
• Destination pushes result data back to source
• Source uses WSRF event notification
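The migration steps (target retrieval via the broker, checkpoint migration, remote execution, result migration, notification of the source) can be traced in a small in-memory simulation. Everything here is hypothetical: the class `Site`, the dictionary-based SLA, and the plain Python assignments that stand in for GridFTP transfers and WSRF notifications.

```python
class Site:
    """Hypothetical model of one cluster taking part in a migration."""

    def __init__(self, name, free_nodes):
        self.name = name
        self.free_nodes = free_nodes
        self.files = {}            # file name -> content, stands in for storage
        self.notifications = []    # stands in for WSRF event notifications

    def accepts(self, sla):
        """A site agrees to an SLA if it has enough free nodes (simplified)."""
        return self.free_nodes >= sla["nodes"]


def find_target(broker_sites, source, sla):
    """Resource determination via broker: first other site accepting the SLA."""
    for site in broker_sites:
        if site is not source and site.accepts(sla):
            return site
    return None


def migrate(source, broker_sites, job, sla):
    """Move a checkpointed job to another site; return the target name or None."""
    target = find_target(broker_sites, source, sla)
    if target is None:
        return None
    # Checkpoint migration: the destination pulls the data from the source.
    target.files[job] = source.files.pop(job)
    # Remote execution: resuming from the checkpoint is only simulated.
    result = f"result-of-{target.files.pop(job)}"
    # Result migration: the destination pushes the result back to the source.
    source.files[job + ".out"] = result
    # The source learns about completion through a notification.
    source.notifications.append((job, "finished", target.name))
    return target.name


a, b, c = Site("A", 0), Site("B", 2), Site("C", 8)
a.files["job1"] = "ckpt1"
chosen = migrate(a, [a, b, c], "job1", {"nodes": 4})
```

In this run site B is skipped because it cannot satisfy the four-node SLA, so the job moves to site C and its result ends up back on site A.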
Ongoing Work: Introducing Risk Management
[Figure: software stack extended with Risk Assessor, Consultant Service and Monitoring, connected to a Broker via WS-AG]
• Topic of EU project: AssessGrid
• Incorporated in SLA Provider
• Estimates risk for agreeing an SLA
• Considers probability of failure in schedule
• Assessment based on historical data
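As a rough illustration of an assessment based on historical data, the sketch below derives a failure rate from past observations and turns it into a probability of failure (PoF) for a planned job. The model is an assumption made for this example only: node failures are treated as independent and exponentially distributed, so that PoF = 1 - exp(-rate × nodes × hours). The talk does not state which model the AssessGrid Risk Assessor uses, and both function names are hypothetical.

```python
import math


def failure_rate(observed_failures, observed_node_hours):
    """Failures per node hour, estimated from historical monitoring data."""
    return observed_failures / observed_node_hours


def probability_of_failure(rate_per_node_hour, nodes, hours):
    """PoF of a job under the assumed independent, exponential failure model."""
    return 1.0 - math.exp(-rate_per_node_hour * nodes * hours)


# Example: 5 node failures were seen in 100,000 node hours of history.
rate = failure_rate(5, 100_000)
pof = probability_of_failure(rate, nodes=32, hours=10)
print(f"estimated PoF: {pof:.4f}")
```

A provider could compare such an estimate with the PoF upper bound requested in the SLA before agreeing to it.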
Summary: Best Effort is not Enough
Cross-border migration and risk assessment provide new means to increase the reliability of Grid computing.
More information
• Read the paper
• AssessGrid: www.assessgrid.eu
• HPC4U: www.hpc4u.eu
• OpenCCS: www.openccs.eu
Thanks for your attention!
Contents
BACKUP
Scheduling Aspects
• Execution Time
  - Exact start time
  - Earliest start time, latest finish time
• User provides stage-in files by time X
• Provider keeps stage-out files until time Y
• Provisional Reservations
• Job Priorities
• Job Suspension
HPC4U
Motivation: Fault Tolerance
• Commercial Grid users need SLAs
• Providers cautious on adoption
  - Reason: Business case risk
  - Missed deadlines due to system failures
  - Penalties to be paid
• Solution: Prevention with Fault Tolerance
• Fault tolerance mechanisms available, but
  1. Application modification mandatory
  2. Overall solution (system software, process, storage, file system, network) required
  3. Combination with Grid migration missing
HPC4U Objective
Software-only solution for an SLA-aware, fault tolerant infrastructure, offering reliability and QoS, acting as active Grid component
Key features
• Definition and implementation of SLAs
• Resource reservation for guaranteed QoS
• Application-transparent fault tolerance
HPC4U: Concept
1. SLA negotiation as an explicit statement of expectations and obligations in a business relationship between provider and customer
2. Reservation of CPU, storage and network for desired time interval
3. Job start in checkpointing environment
4. In case of system failure: job migration / restart with respect to SLA
Phases of Operation
[Figure: timeline of the lifetime of an SLA with the phases Negotiation, Pre-Runtime, Runtime (StageIn, Computation, StageOut) and Post-Runtime; labels: acceptance (or rejection) of SLA, allocation of system resources]
• Negotiation of SLA
• Pre-Runtime: Configuration of Resources
  - e.g. network, storage, compute nodes
• Runtime: Stage-In, Computation, Stage-Out
• Post-Runtime: Re-configuration
Phase: Pre-Runtime
• Task of Pre-Runtime Phase
  - Configuration of all allocated resources
  - Goal: Fulfill requirements of SLA
• Reconfiguration affects all HPC4U elements
  - Resource Management System: e.g. configuration of assigned compute nodes
  - Storage Subsystem: e.g. initialization of a new data partition
  - Network Subsystem: e.g. configuration of network infrastructure
Phase: Runtime
• Runtime Phase = lifetime of job in system
  - adherence to SLA has to be assured
  - FT mechanisms have to be utilized
• Phase consists of three distinct steps
  - Stage-In: transmission of required input data from Grid customer to compute resource
  - Computation: execution of application
  - Stage-Out: transmission of generated output data from compute resource back to Grid customer
Phase: Post-Runtime
• Task of Post-Runtime Phase: Re-configuration of all resources
  - e.g. re-configuration of network
  - e.g. deletion of checkpoint datasets
  - e.g. deletion of temporary data
• Counterpart to Pre-Runtime Phase
  - Allocation of resources ends
  - Update of schedules in RMS and storage
  - Resources are available for new jobs
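The phases of operation form a simple linear life cycle, which can be sketched as a small state machine. The enum `Phase`, the list `ORDER` and the function `next_phase` are names invented for this illustration. The `REJECTED` state reflects the "acceptance (or rejection) of SLA" outcome of the negotiation.

```python
from enum import Enum


class Phase(Enum):
    NEGOTIATION = "negotiation"
    PRE_RUNTIME = "pre-runtime"      # configuration of allocated resources
    STAGE_IN = "stage-in"            # runtime, step 1
    COMPUTATION = "computation"      # runtime, step 2
    STAGE_OUT = "stage-out"          # runtime, step 3
    POST_RUNTIME = "post-runtime"    # re-configuration, resources released
    DONE = "done"
    REJECTED = "rejected"


# Order of phases for an accepted SLA.
ORDER = [
    Phase.NEGOTIATION,
    Phase.PRE_RUNTIME,
    Phase.STAGE_IN,
    Phase.COMPUTATION,
    Phase.STAGE_OUT,
    Phase.POST_RUNTIME,
    Phase.DONE,
]


def next_phase(phase, sla_accepted=True):
    """Return the phase that follows `phase` in the life cycle of an SLA."""
    if phase is Phase.NEGOTIATION and not sla_accepted:
        return Phase.REJECTED
    if phase in (Phase.DONE, Phase.REJECTED):
        raise ValueError("the lifetime of the SLA has ended")
    return ORDER[ORDER.index(phase) + 1]


phase = Phase.NEGOTIATION
while phase is not Phase.DONE:
    phase = next_phase(phase)
```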
PROCESS
Subsystems
Process Subsystem
• checkpointing of network
• cooperative checkpointing protocol (CCP)
Network Subsystem
• checkpoint network state
Storage Subsystem
• provision of storage
• provision of snapshot
STORAGE
Storage subsystem
Functionalities
• Negotiates the storage part of the SLA
• Provides storage capacity at a given QoS level
• Provides FT mechanisms
Requirement: manage multiple jobs running on the same SR
[Figure: Computing and Storage layers; the Virtual Storage Manager (VSM) holds the logical space (data layout strategies) and uses the interface VSM - SR to reach the physical space of Storage Resource 1 and Storage Resource 2]
Data Container concept
Idea: create storage environment for applications at a desired QoS level with abstraction of physical devices
Components:
[Figure: a Job performs file I/O (read, write, open, …) on a File System, which performs block I/O (read, write, ioctl) on a Logical Volume; inside the Data Container, a block address mapping with data layout policies (e.g., simple striping) maps the logical space via block I/O onto the physical devices of the Storage Resource]
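Simple striping, named above as an example of a data layout policy, maps each logical block to a device and a block on that device. The function below is a generic round-robin striping sketch. Its name and parameters are chosen for this example and do not describe the actual mapping used by the storage subsystem.

```python
def stripe_map(logical_block, num_devices, stripe_blocks=1):
    """Map a logical block to (device index, physical block on that device).

    Consecutive runs of `stripe_blocks` logical blocks form one stripe unit,
    and stripe units are distributed over the devices round-robin.
    """
    stripe, offset = divmod(logical_block, stripe_blocks)
    device = stripe % num_devices
    physical = (stripe // num_devices) * stripe_blocks + offset
    return device, physical


# Layout of the first six logical blocks on three devices.
layout = [stripe_map(block, num_devices=3) for block in range(6)]
print(layout)  # [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

Spreading consecutive blocks over several devices lets sequential I/O use all devices in parallel, which is one way to reach a bandwidth level agreed in the SLA.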
Data container properties
Storage part of the SLA
• Data container section
  - Size
  - File system type
  - Number of nodes that need to access the data container (private/shared)
• Performance section
  - Application I/O profile
  - Benchmark
  - Bandwidth (in MB/s or IO/s)
  - Or default configuration
• Dependability section
  - Data redundancy type (within a cluster)
  - Snapshot needed or not
  - Data replication or not (between clusters)
• Job specific section
  - Job's time to schedule and time to finish
Fault Tolerance Mechanisms
RAID: Tolerate the failure of one or more disks
RAIN: Tolerate the failure of one or more nodes
Implementation: Hardware, Software
Storage FT mechanisms rely on special data layouts
Software
Storage Snapshot
Data container snapshot
Provide instantaneous copy of data containers
Technique used: Copy-On-Write (COW)
• create multiple copies of data without duplicating all the data blocks
With checkpoint, it allows application restart from a previous running stage
Impact on SR performance
• Taken into account at negotiation time
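The copy-on-write idea can be shown with a toy block volume: taking a snapshot copies nothing, and an old block is preserved only when it is about to be overwritten. The class `CowVolume` is a hypothetical illustration of the general technique, not the snapshot implementation of the storage subsystem.

```python
class CowVolume:
    """Toy block volume with copy-on-write snapshots (hypothetical)."""

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks
        # One dict per snapshot: block index -> content at snapshot time,
        # filled lazily when a block is first overwritten afterwards.
        self.snapshots = []

    def snapshot(self):
        """Create a snapshot instantly; no data block is copied here."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, data):
        """Overwrite a block, saving the old content where a snapshot needs it."""
        for saved in self.snapshots:
            # setdefault keeps an earlier saved copy, so each snapshot stores
            # at most one old version per block.
            saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def read_snapshot(self, snap_id, index):
        """Read a block as it was when snapshot `snap_id` was taken."""
        return self.snapshots[snap_id].get(index, self.blocks[index])


volume = CowVolume(4)
volume.write(0, b"a")
first = volume.snapshot()
volume.write(0, b"b")          # only now is the old block b"a" preserved
assert volume.read_snapshot(first, 0) == b"a"
```

The extra copy on the first write after a snapshot is the kind of performance impact that the slide says is taken into account at negotiation time.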
Redundant data layout
Snapshot: single node job restart after node failure
Characteristics:
• The job is running on a single node
• The data container is private to that node
• Data container snapshot resides on the same storage resource
Steps:
1. Node failure
2. Restore job's data from previous snapshot
3. Start data container
4. Restore job's state from previous checkpoint
5. Job restart
[Figure: Job and Storage Resource, annotated with steps 1 to 5]
Interfaces with other components
[Figure: Storage Subsystem and its interfaces; labels: RMS, VSM, data containers, Storage Resource (SR), Interface VSM - RMS, Interface VSM - SR, Network (socket, RDMA, …), Exanodes, wrappers for classical storage arrays, SR_type1, SR_type2, open source, proprietary]
• client-server
• callbacks
ASSESSGRID
Grid Fabric Layer with Risk Assessor
• NegotiationManager
- Agr./Agr.Fact. WS
- checks whether offer complies with template
- initiation of file transfers
• Scheduler
- creates tentative schedules for offers
• Risk Assessor
• Consultant Service
- records data
• Monitoring
- runtime behavior
Precautionary Fault-Tolerance
How many spare resources are available at execution time?
• Use of planning based scheduler
Estimating Risk for a Job Execution
Use of planning based scheduler
• How much slack time is available for fault tolerance?
• How much effort do I undertake for fault tolerance?
What is the considered risk of resource failure?
[Figure: timeline from Earliest Start Time to Latest Finish Time, divided into Execution Time and Slack Time]
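The slack time is what remains of the window between earliest start time and latest finish time once the execution time is subtracted. The sketch below computes it and, as an added assumption not taken from the talk, estimates how many failures the slack could absorb if each failure costs at most one checkpoint interval of lost work plus a fixed restart overhead. Both function names are hypothetical.

```python
from datetime import datetime, timedelta


def slack_time(earliest_start, latest_finish, execution_time):
    """Slack = (latest finish - earliest start) - execution time."""
    slack = (latest_finish - earliest_start) - execution_time
    if slack < timedelta(0):
        raise ValueError("execution time does not fit into the SLA window")
    return slack


def tolerable_failures(slack, checkpoint_interval, restart_overhead):
    """Failures that fit into the slack under the assumed cost per failure."""
    return slack // (checkpoint_interval + restart_overhead)


slack = slack_time(
    earliest_start=datetime(2007, 10, 15, 8, 0),
    latest_finish=datetime(2007, 10, 15, 20, 0),
    execution_time=timedelta(hours=9),
)
failures = tolerable_failures(
    slack,
    checkpoint_interval=timedelta(minutes=30),
    restart_overhead=timedelta(minutes=15),
)
```

With a 12 hour window and 9 hours of execution the slack is 3 hours, which under these assumptions covers four failures.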
Risk Assessment
[Figure legend: low risk, middle risk, high risk]
Estimate risk for agreeing an SLA
• consider risk of resource failure
• estimate risk for a job execution
• initiate precautionary FT mechanisms
Risk Management at Job Execution
[Figure: Risk Management with the labels Risk Assessment, Business Model (price, penalty), Weekend/Holiday/Workday, Schedule (SLAs, best effort), Redundancy Measures, Events, Decisions, Actions]
Detection of Bottlenecks
Consultant Service
• Analysis of SLA violation
  - Estimated risk for the job
  - Planned FT mechanisms
• Monitoring Information
  - Job
  - Resources
Data Mining
• Find connections between SLA violations
• Detect weak points in the provider's infrastructure
WS-AG
Components
Implementation with Globus Toolkit 4
Why Globus?
• Utility: Authentication, Authorization, Delegation, RFT, MDS, WS-Notification
• Impact
Problem 1: GRAM (Grid Resource Allocation and Management)
• State machine, incl. File-Staging, Delegation of Credentials, RSL
• Cannot use it: written for batch schedulers, not for planning schedulers
Problem 2: Deviations from WS-AG spec.
• Different Namespaces WS-A, WS-RF
Implementation with Globus Toolkit 4
Technical Challenges
• xs:anyType
  - Wrote custom serializers/deserializers
• Substitution groups
  - Used in ItemConstraint (Creation Constraints)
  - Cannot be mapped to Java by Axis
  - Replaced by xs:anyType, used as DOM tree
• CreationConstraints
  - Namespace prefixes in XPaths meaningless
  - Need for WSDL and interpretation for xs:all, xs:choice, and friends
Context
<wsag:Context>
  …
  <wsag:AgreementInitiator>
    <AG:DistinguishedName> /C=DE/O=… </AG:DistinguishedName>
  </wsag:AgreementInitiator>
  <wsag:AgreementResponder>EPR</…>
  <AG:ServiceUsers>
    <AG:ServiceUser>DN</…>
  </AG:ServiceUsers>
  …
</wsag:Context>
Context
Terms
Creation Constraints
Terms, SDTs
• Conjunction of terms
• Common structure of templates
• WS-AG too powerful/difficult to fully support
• Service Description Term (one)
  - assessgrid:ServiceDescription (extension of abstract ServiceTermType)
  - jsdl:POSIXExecutable (executable, arguments, environment)
  - jsdl:Application (mis-)used for libraries
  - jsdl:Resources
  - jsdl:DataStaging *
  - assessgrid:PoF (upper bound)
Terms, GuaranteeTerms
• No hierarchy but two meta guarantees
  - ProviderFulfillsAllObligations, e.g. Reward: 1000 EUR, Penalty: 1000 EUR
  - ConsumerFulfillsAllObligations, e.g. Reward: 0 EUR, Penalty: 1000 EUR
• First violation is responsible for failure
  - No hardware problem, then user fault
• Other Guarantees
  - Execution Time: any start time (best effort), exact start time, or earliest start time and latest finish time
  - User provides StageIn files by time X
  - Provider keeps StageOut files until time Y
[Figure labels: No timely execution, No stage-out]
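The rule that the first violation is responsible for a failure can be sketched as a small settlement function. The amounts are the example values from the slide. The function `settle` and its reading of the two meta guarantees (the provider earns the reward if nothing is violated, and the party that violates first pays its penalty) are an interpretation made for this example.

```python
def settle(violations, provider_reward=1000, provider_penalty=1000,
           consumer_penalty=1000):
    """Return (responsible party, provider balance in EUR).

    `violations` is a list of (time, party) tuples, where party is either
    "provider" or "consumer". Only the earliest violation counts.
    """
    if not violations:
        # Both meta guarantees hold: the provider receives its reward.
        return None, provider_reward
    _, responsible = min(violations)   # first violation decides
    if responsible == "provider":
        return "provider", -provider_penalty
    return "consumer", consumer_penalty


# The consumer delivered the stage-in files late (t=10), which later caused
# the provider to miss the finish time (t=30): the consumer is responsible.
outcome = settle([(30, "provider"), (10, "consumer")])
```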
Terms
• SLA does not contain requirements of fault tolerance mechanisms
  - Covered by asserted PoF, penalty and loss of reputation
• Compulsory Assessment Intervals not really useful for us
  - How often do you assess that job was allocated for asserted time?
• Preferences too complicated
CreationConstraints
• Difficult to support Namespaces:
  - //wsag:…/assessgrid:… - prefixes are just strings
• Very difficult to support structural information
  - xs:group, xs:all, xs:choice, xs:sequence
• Possible but difficult to support xs:restriction xs:simple
  - Check for enumeration (xs:restriction of xs:string)
  - Check for valid dates (xs:restriction of xs:date)
  - Everything else close to impossible: {min,max}{In,Ex}clusive, totalDigits, fractionDigits, length, … probably useless
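Two of these points can be made concrete with Python's standard library: a prefix in a path expression only has meaning together with a prefix-to-URI mapping, and enumeration and date restrictions are simple to check. The namespace URIs and element names below are placeholders invented for this example, not the real WS-Agreement or AssessGrid namespaces. The date check accepts plain ISO dates only and ignores the optional time zone suffix that xs:date allows.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Placeholder document; the urn:example:* namespaces are made up.
DOC = """<a:Terms xmlns:a="urn:example:wsag" xmlns:b="urn:example:assessgrid">
  <b:StartDate>2007-10-15</b:StartDate>
  <b:Priority>high</b:Priority>
</a:Terms>"""

root = ET.fromstring(DOC)

# The document uses prefix "b", the path uses "assessgrid" or "x". All of
# them match, because only the namespace URI behind the prefix counts.
by_name = root.find("assessgrid:Priority", {"assessgrid": "urn:example:assessgrid"})
by_other = root.find("x:Priority", {"x": "urn:example:assessgrid"})

# The same prefix bound to a different URI matches nothing.
missing = root.find("assessgrid:Priority", {"assessgrid": "urn:example:other"})


def check_enumeration(value, allowed):
    """Enumeration restriction of xs:string: the value must be listed."""
    return value in allowed


def check_date(value):
    """Simplified xs:date check: accept YYYY-MM-DD calendar dates only."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True
```

A constraint checker therefore has to carry the namespace bindings along with every XPath it receives, which is why the prefixes on their own are "just strings".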