Hans Peter Schwefel
PhD Course: Performance & Reliability Analysis, Day 6 Part II, Spring 04

PhD Course: Performance & Reliability Analysis of IP-based Communication Networks
Henrik Schiøler, Hans-Peter Schwefel, Mark Crovella, Søren Asmussen
• Day 1 Basics & Simple Queueing Models (HPS)
• Day 2 Traffic Measurements and Traffic Models (MC)
• Day 3 Advanced Queueing Models & Stochastic Control (HPS)
• Day 4 Network Models and Application (HS)
• Day 5 Simulation Techniques (HPS, SA)
• Day 6 Reliability Aspects (HPS)
Organized by HP Schwefel & H Schiøler
Content
1. Basic concepts
   • Fault, failure, fault-tolerance, redundancy types
   • Availability, reliability, dependability
   • Life-times, hazard rates
2. Mathematical models
   • Availability models, life-time models, repair models
   • Performability
3. Availability and Network Management
   • Rolling upgrade, planned downtime, etc.
4. Methods & Protocols for fault-tolerant systems
   a) SW Fault-Tolerance
   b) Network Availability
      • IP Routing, HSRP/VRRP, SCTP
   c) Server/Service Availability
      • Server farms, cluster solutions, distributed reliability
Demonstration: Fault-tolerant call control systems
Motivation: Failure Types in Communication NWs
(WANs, PSTNs)
• IP networks designed to be 'robust' but not highly available
• End-to-end availability
  – Public Internet: below 99% (see e.g. http://www.netmon.com/IRI.asp)
  – PSTN: 99.94%
• Operation, Administration, and Maintenance (OAM) one major source of outage times
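As a back-of-the-envelope check of these availability figures, the gap between "below 99%" and "99.94%" can be translated into expected yearly downtime. A minimal sketch (the function name is illustrative):

```python
def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given steady-state availability."""
    minutes_per_year = 365 * 24 * 60  # 525 600 minutes
    return (1.0 - availability) * minutes_per_year

# Public Internet at 99%: about 3.7 days of outage per year
internet = downtime_minutes_per_year(0.99)
# PSTN at 99.94%: about 5.3 hours per year
pstn = downtime_minutes_per_year(0.9994)
```

So the roughly one order of magnitude in the availability figures corresponds to days versus hours of outage per year.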
Overview: Role of OAM
(block diagram; adapted from S. Uhle, Siemens ICM N)
• System High Availability: keep the system stable, overcome outages
  – Supervision & Auditing: resource supervision, health/lifetime supervision, overload protection
  – Continuous Execution: redundancy and diversity, state replication, message distribution
  – Error Handling: detection and recovery, tracing and logging, escalation and alarming, operational modes
• Network Element HA
  – NE-Interface: multi-homing, stacks and protocols, IP fail-over
• Network HA: network design, network redundancy/fail-over
• Data HA: backup and restore, regional redundancy, rolling data upgrade
• Fault Management: repair and replacement, HA control & verification, system diagnosis
• OAM
  – Rolling Upgrade: rolling upgrade, patch procedure, migration
  – Startup & Shutdown: system startup, graceful shutdown
OAM Concepts I: Startup/Upgrade/Shutdown
• Startup
  – Put components into a stable operational state
  – Caution: potential for high load (synchronisation etc.) in the start-up phase
  – Concept for start-up of larger sub-systems needed (e.g. mutual dependence)
• Upgrades
  – Rolling upgrade
    • Allows upgrades of single components while their replicates are in operation
    • Consistency/data-compatibility problems
  – Patch concept (SW)
    • Incremental changes instead of full SW re-installation
    • Goal: zero or minimal outage of components
• Graceful shutdown
  – Take components safely out of operation
  – Possible steps:
    • Stop accepting / redirect new tasks
    • Finish existing tasks
    • Synchronize data and isolate the component
• HW: hot plug/swap ability
  – E.g. interface cards
• Life testing
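The three graceful-shutdown steps above can be sketched as a small state machine. This is only an illustration of the sequence (class and attribute names are made up here, not from any real OAM framework):

```python
import queue

class Component:
    """Sketch of graceful shutdown: stop accepting, drain, sync & isolate."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.accepting = True
        self.state = "operational"
        self.completed = []

    def submit(self, task) -> bool:
        if not self.accepting:
            return False              # step 1: new tasks refused/redirected
        self.tasks.put(task)
        return True

    def graceful_shutdown(self):
        self.accepting = False        # 1. stop accepting new tasks
        while not self.tasks.empty(): # 2. finish existing tasks
            self.completed.append(self.tasks.get())
        self.state = "isolated"       # 3. synchronize data, isolate component
```

In a real system step 2 runs concurrently with request handling and step 3 involves state replication to the remaining nodes; the ordering of the steps is the point here.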
OAM Concepts II: Monitoring/Supervision
• Resource supervision and overload protection
  – E.g. CPU load, queue lengths, traffic volumes
  – Alarming → operator intervention, e.g. upgrades
  – Signalling to reduce overload at the source
  – Graceful degradation
• Logging & tracing
  – For off-line analysis of incidents
  – Correlation of traces for system analysis
  – Problem: adequate granularity of logging data
• Service concepts/contracts
  – Reaction to alarms
  – System recovery modes
  – Spare-parts handling
  – Qualified technicians, availability (24/7?)
• High availability of the OAM system
  – Redundancy (frequently separate OAM network(s))
  – Prioritisation of OAM traffic
  – Handling/storing of OAM/logging data
General Approaches: Fault-Tolerance
• Basic requirements for fault-tolerance
  – Number of fault types and number of faults is bounded
  – Existence of redundancy (structural: equal/diversity; functional; information; time redundancy/retries)
• Functional parts of fault-tolerant systems
  – Fault detection (& diagnosis)
    • Replication and comparison (identical realisations are not suitable against design errors!)
    • Inversion (e.g. mathematical functions)
    • Acceptance tests (necessary conditions on the result, e.g. ranges)
    • Timing behavior (time-outs)
  – Fault isolation: prevent spreading
    • Isolation of functional components, e.g. atomic actions, layering model
  – Fault recovery
    • Backward: retry, journalling (restart), checkpointing & rollback (non-trivial in distributed cooperating systems!)
    • Forward: move to a consistent, acceptable, safe new state; but loss of result
    • Compensation, e.g. TMR, FEC
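Compensation by TMR (Triple Modular Redundancy) masks a single faulty replica through a majority vote over three independently computed results. A minimal sketch:

```python
def tmr_vote(r1, r2, r3):
    """Return the majority result of three replicas (TMR voter).

    Masks one faulty replica; if all three disagree, no majority
    exists and the fault can only be detected, not compensated.
    """
    if r1 == r2 or r1 == r3:
        return r1
    if r2 == r3:
        return r2
    raise ValueError("no majority: more than one replica faulty")
```

Note that this assumes results are directly comparable; as the SW fault-tolerance slide below points out, majority votes for complex result types are not trivial.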
Software Fault-Tolerance
• Mainly: design errors & user interaction (as opposed to production errors, wear-out, etc.)
• Observations/estimates (experience in computing centers; the numbers are a bit old, however)
  – 0.25 to 10 errors per 1000 lines of code
  – Only about 30% of error reports by users are accepted as errors by the vendor
  – Reaction times (updates/patches): weeks to months
  – Reliability has not improved nearly as much as for hardware, for various reasons:
    • Immensely increased software complexity/size
    • Faster HW → more operations executed per time unit → higher hazard rate
  – However, relatively short down-times (as opposed to HW): ca. 25 min
• Approaches for fault-tolerance
  – Software diversity (n-version programming, back-to-back tests of components)
    • However, the same specification often leads to similar errors
    • Forced diversity as improvement (?)
    • Majority votes for complex result types are not trivial
Network Connectivity: Basic Resilience
• Loss of network connectivity
  – Link failures (cable cut)
  – Router/component failures along the path
• Basic resilience features on (almost) every protocol layer
  – L1+L2: FEC, link-layer retransmission, resilient switching (e.g. Resilient Packet Ring)
  – L3, IP: dynamic routing
  – TCP: retransmissions
  – Application: application-level retransmissions, loss-resilient coding (e.g. VoIP, video)
• IP-layer network resilience: dynamic routing, e.g. OSPF
  – 'Hello' packets used to determine adjacencies and link states
  – Missing hello packets (typically 3) indicate outages of links or routers
  – Link states propagated through link-state advertisements (LSAs)
  – Updated link-state information (adjacencies) leads to modified path selection
(Figure: protocol stack with Application (L5-7), TCP/UDP (L4), IP (L3), Link-Layer (L2))
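The hello mechanism directly bounds how long a neighbor failure goes undetected: with the slide's "typically 3" missed hellos, detection takes roughly three hello intervals (RFC 2328 uses a RouterDeadInterval for this, commonly four times the 10 s default HelloInterval; the simplified model here follows the slide). A sketch:

```python
def ospf_detection_time(hello_interval_s: float, missed_hellos: int = 3) -> float:
    """Approximate time until a dead link/neighbor is detected:
    the neighbor is declared down after `missed_hellos` consecutive
    hello packets fail to arrive."""
    return missed_hellos * hello_interval_s

# Common 10 s hello interval: ~30 s until detection,
# before LSA flooding and SPF recomputation even start.
detection = ospf_detection_time(10.0)
```

This is why the next slide quotes 30 seconds up to several minutes until routing tables reconverge: detection is only the first stage, followed by LSA propagation and path recomputation.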
Dynamic Routing: Improvements
• Drawbacks of dynamic routing
  – Long duration until newly converged routing tables (30 sec up to several minutes)
  – Rerouting not possible if the first router (gateway) fails
• Improvements
  – Speed-up: pre-determined secondary paths (e.g. via MPLS)
  – 'Local' router redundancy: Hot Standby Router Protocol (HSRP, RFC 2281) & Virtual Router Redundancy Protocol (VRRP, RFC 2338)
    • Multiple routers on the same LAN
    • Master performs packet routing
    • Fail-over by migration of the 'virtual' MAC address
(Figure: NE 1 and NE 2 attached via HUB 1/HUB 2 to Router 1 and Router 2, which together form one HSRP/VRRP 'virtual router'; a single IP address of the virtual router in the network provides client transparency)
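The core of HSRP/VRRP is a priority-based master election among the routers sharing the virtual address. The sketch below shows only that election logic, not the packet-level protocol (advertisement timers, preemption, virtual-MAC handling are omitted; class and method names are illustrative):

```python
class VirtualRouter:
    """Sketch of VRRP-style redundancy: the highest-priority live router
    is master and answers for the single virtual IP; on its failure the
    backup takes over, transparently for hosts using the virtual IP."""

    def __init__(self, virtual_ip: str):
        self.virtual_ip = virtual_ip
        self.routers = {}                 # name -> [priority, alive]

    def add_router(self, name: str, priority: int):
        self.routers[name] = [priority, True]

    def fail(self, name: str):
        self.routers[name][1] = False

    def master(self):
        alive = {n: p for n, (p, up) in self.routers.items() if up}
        return max(alive, key=alive.get) if alive else None
```

In the real protocols the backup detects the master's failure by missing periodic advertisements, then starts answering ARP for the virtual IP with the virtual MAC.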
Stream Control Transmission Protocol (SCTP)
• Defined in RFC 2960 (see also RFC 3257, RFC 3286)
• Initial purpose: signalling transport
• Features
  – Reliable, full-duplex unicast transport (performs retransmissions)
  – TCP-friendly flow control (+ many other features of TCP)
  – Multi-streaming, in-sequence delivery within streams
    → avoids head-of-line blocking (performance issue)
  – Multi-homing: hosts with multiple IP addresses, path monitoring (heart-beat mechanism), transparent fail-over to secondary paths
    • Useful for provisioning of network reliability
(Figure: one SCTP association between Host A (IPa1, IPa2) and Host B (IPb1, IPb2) over separate networks)
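The multi-homing fail-over can be sketched as follows: heartbeats probe every destination address, and after too many consecutive unacknowledged heartbeats a path is considered inactive (RFC 2960 controls this with the Path.Max.Retrans parameter, default 5). The sketch abstracts from the actual chunk exchange; names are illustrative:

```python
class SctpAssociation:
    """Sketch of SCTP path monitoring: traffic uses the first path whose
    consecutive heartbeat failures stay within path_max_retrans."""

    def __init__(self, paths, path_max_retrans: int = 5):
        self.paths = list(paths)          # ordered: primary path first
        self.path_max_retrans = path_max_retrans
        self.failed_heartbeats = {p: 0 for p in paths}

    def heartbeat_missed(self, path):
        self.failed_heartbeats[path] += 1

    def heartbeat_acked(self, path):
        self.failed_heartbeats[path] = 0  # path confirmed reachable again

    def active_path(self):
        for p in self.paths:
            if self.failed_heartbeats[p] <= self.path_max_retrans:
                return p                  # fail-over: next path in order
        return None                       # association itself fails
```

The fail-over is transparent to the application, which only sees the association, not the individual paths.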
Resilience for State-less Applications
• State-less applications
  – Server response depends only on the client request
  – Principle in most IETF protocols, e.g. HTTP
  – If at all, state is stored in the client, e.g. 'cookies'
• Redundancy in 'server farms' (same LAN)
  – Multiple servers
  – IP address mapped to (replicated) 'load balancing' units
    • HW load balancers
    • SW load balancers
  – Load balancers forward requests to a specific server MAC address according to a server selection policy, e.g.
    • Random selection
    • Round-robin
    • Shortest response-time first
• Server failure
  – The load balancer simply stops selecting the failed server
• Examples: web services, firewalls
(Figure: clients on the Internet reach, via firewall/switch, a gateway node (with backup) that dispatches to service nodes 1-5)
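The three selection policies, including the exclusion of failed servers, fit in a few lines. A minimal sketch (class and method names are illustrative; `servers` maps each server name to its last measured response time):

```python
import itertools
import random

class LoadBalancer:
    """Sketch of the three server selection policies above;
    failed servers are excluded from selection."""

    def __init__(self, servers):
        self.servers = servers                  # name -> response time [s]
        self.alive = {s: True for s in servers}
        self._rr = itertools.cycle(servers)

    def _healthy(self):
        return [s for s in self.servers if self.alive[s]]

    def pick_random(self):
        return random.choice(self._healthy())

    def pick_round_robin(self):
        # cycle through all servers, skipping failed ones
        # (assumes at least one server is alive)
        while True:
            s = next(self._rr)
            if self.alive[s]:
                return s

    def pick_shortest_response(self):
        return min(self._healthy(), key=self.servers.get)
```

Because the applications are state-less, any healthy server can serve any request, which is exactly what makes these simple policies sufficient.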
Cluster Framework
• SW layer across several servers within the same subnetwork
• Functionality
  – Load balancer/dispatching functionality
  – Node management / node supervision / node elimination
  – Network connection: IP aliasing (unsolicited/gratuitous ARP reply)
    → fail-over times of a few seconds (up to 10 minutes if the router/switch does not support unsolicited ARP!)
• Single-image view (one IP address for the cluster)
  – For communication partners
  – For OAM
• Possible platform: server blades
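The IP-aliasing fail-over works by broadcasting an unsolicited (gratuitous) ARP reply: the node taking over announces the cluster's virtual IP with its own MAC address, so switches and neighbors update their caches immediately. The sketch below only builds the Ethernet frame (layout per RFC 826); actually sending it would require a raw socket and is omitted:

```python
import socket
import struct

def gratuitous_arp_reply(virtual_ip: str, new_mac: bytes) -> bytes:
    """Build the Ethernet frame of an unsolicited ARP reply: sender and
    target protocol address are both the cluster's virtual IP, so
    neighbours rebind the virtual IP to the new node's MAC."""
    broadcast = b"\xff" * 6
    # Ethernet header: destination (broadcast), source, EtherType 0x0806 (ARP)
    frame = broadcast + new_mac + struct.pack("!H", 0x0806)
    ip = socket.inet_aton(virtual_ip)
    # ARP header: htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, op=2 (reply)
    frame += struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    frame += new_mac + ip       # sender hardware / protocol address
    frame += broadcast + ip     # target hardware / protocol address
    return frame
```

The slide's warning applies here: if the router or switch ignores unsolicited ARP, neighbors keep the stale MAC binding until their ARP cache entries time out, which is what stretches fail-over to minutes.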
Reliability Middleware: Example RTP
• Software layer on top of the cluster framework
  – All fault-tolerant operations are done within the cluster
  – One virtual IP address for the whole cluster (UDP dispatcher redirects incoming messages) → 'one system image'
  – Transparency for the client
• Redundancy at the level of processes:
  – Each process is replicated among the servers of the cluster
  – Only a faulty process is failed over
• Notion of node disappears:
  – Use of logical names (replicated processes can be in the same node)
• Middleware functionality (example: Reliable Telco Platform, FSC)
  – Context management
  – Reliable inter-process communication
  – Load balancing for network traffic (UDP/SS7 dispatcher)
  – Ticket manager
  – Events/alarming
  – Support of rolling upgrade
  – Node manager
    • Supervision of processes
    • Communication of process status to other nodes
  – Trace manager (debugging)
  – Administrative framework (CLI, GUI, SNMP)
Example Architecture: Cluster Middleware
(Figure: layered architecture with a single-image view across the nodes. Top: third-party applications (GSM HLR, IN payment, Parlay/OSA, management agents) over a common application framework and component framework with common basic functions (Networker, Signalware, OPS/RAC, Corba, basic Apps-APIs). Below: Reliant Telco Platform (RTP) on top of the cluster framework (Primecluster, SUNcluster); each node runs OS (e.g. Solaris) on HW.)
Distributed Redundancy: Reliable Server Pooling
• Distributed architecture
  – Servers need only an IP address
  – Entities offering the same service are grouped into pools, accessible by a pool handle
• Pool users (PU) contact servers (PE) after receiving the response to a name-resolution request sent to a name server (NS)
  – Name server monitors PEs
  – Messages for dynamic registration and de-registration
  – Flat name space
• Architecture and pool-access protocols (ASAP, ENRP) defined in the IETF RSerPool WG
• Failure detection and fail-over performed in the ASAP layer of the client
(Figure: a pool user resolves a name at the name server(s) and fails over between pool elements PE (A) and PE (B) of the server pool; name servers handle (de-)registration, monitoring, and [state-sharing])
ASAP = Aggregate Server Access Protocol
ENRP = Endpoint Name Resolution Protocol
RSerPool: More Details
• RSerPool scenario
  – Each PE contains a full implementation of the functionality, no distributed sub-processes (different from RTP)
    → reduced granularity for possible load balancing
• RSerPool name space
  – Flat name space → easier management, better performance (no recursive requests)
  – All name servers in an operational scope hold the same info (about all pools in the operational scope)
• Load balancing
  – Load factors sent by the PE to the name server (initiated by the PE)
  – In resolution requests, the name server communicates load factors to the PU
  – The PU can use load factors in selection policies
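RSerPool leaves the actual selection policy to the PU; one natural use of the load factors is weighted-random selection proportional to spare capacity. This is a hypothetical policy sketch, not a policy defined by the RSerPool drafts (`select_pe` and the `rng` hook are made up for illustration; load factors are taken as values in [0, 1]):

```python
import random

def select_pe(pool: dict, rng=random.random):
    """Pick a pool element with probability proportional to its spare
    capacity (1 - load factor). `pool` maps PE name -> load factor."""
    weights = {pe: 1.0 - load for pe, load in pool.items()}
    total = sum(weights.values())
    x = rng() * total                 # uniform point in [0, total)
    for pe, w in weights.items():
        x -= w
        if x <= 0:
            return pe
    return pe                         # guard against rounding at the edge
```

A PU using this policy sends most requests to lightly loaded PEs while still spreading load across the whole pool.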
RSerPool Protocols: Functionality (current status in IETF)
• ENRP
  + State-sharing between name servers
• ASAP
  + (De-)registration of pool elements (PE, name server)
  + Supervision of pool elements by a name server
  + Name resolution (PU, name server)
  + PE selection according to policy (& load factors) (PU)
  + Failure detection based on the transport layer (e.g. SCTP timeout)
  + Support of fail-over to another pool element (PU-PE)
  + Business cards (pool-to-pool communication)
  + Last will
  + Simple cookie mechanism (usable for pool-element state information) (PU-PE)
  + Under discussion: application-layer acknowledgements (PU-PE)
RSerPool and Server Blades
• Server blades
  – Many processor cards in one 19" rack → space efficiency
  – Integrated switch (possibly duplicated) for external communication
  – Backplane for internal communication (duplicated)
  – Management support: automatic configuration & code replication
• Combination: RSerPool on server blades
  – No additional load-balancing SW necessary (but less granularity)
  – Works on any heterogeneous system without additional implementation effort (e.g. server blades + Sun workstations)
  – Standardized protocol stacks, no implementation effort in clients (except for use of the RSerPool API)
References/Acknowledgments
• M. Bozinovski, T. Renier, K. Larsen: 'Reliable Call Control', project reports and presentations, CPK/CTIF and Siemens ICM N, 2001-2004.
• Siemens ICM N, S. Uhle and colleagues.
• Fujitsu Siemens Computers (FSC), 'Reliable Telco Platform', slides.
• E. Jessen, 'Dependability and Fault-Tolerance of Computing Systems', lecture notes (German), TU Munich, 1996.
Summary
1. Basic concepts
   • Fault, failure, fault-tolerance, redundancy types
   • Availability, reliability, dependability
   • Life-times, hazard rates
2. Mathematical models
   • Availability models, life-time models, repair models
   • Performability
3. Availability and Network Management
   • Rolling upgrade, planned downtime, etc.
4. Methods & Protocols for fault-tolerant systems
   a) SW Fault-Tolerance
   b) Network Availability
      • IP Routing, HSRP/VRRP, SCTP
   c) Server/Service Availability
      • Server farms, cluster solutions, distributed reliability
Demonstration: Fault-tolerant call control systems
Course Overview & Feedback
PhD Course: Overview
• Day 1 Basics & Simple Queueing Models (HPS)
• Day 2 Traffic Measurements and Traffic Models (MC)
• Day 3 Advanced Queueing Models & Stochastic Control (HPS)
• Day 4 Network Models and Application (HS)
• Day 5 Simulation Techniques (SA)
• Day 6 Reliability Aspects (HPS,HS)
Day 1: Basics & Simple Queueing Models
1. Intro & review of basic stochastic concepts
   • Random variables, exponential distributions, stochastic processes, Poisson process & CLT
   • Markov processes: discrete time, continuous time, Chapman-Kolmogorov eqs., steady-state solutions
2. Simple queueing models
   • Kendall notation
   • M/M/1 queues, birth-death processes
   • Multiple servers, load dependence, service disciplines
   • Transient analysis
   • Priority queues
3. Simple analytic models for bursty traffic
   • Bulk-arrival queues
   • Markov-modulated Poisson processes (MMPPs)
4. MMPP/M/1 queues and quasi-birth-death processes
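As a reminder of the Day 1 material, the standard steady-state M/M/1 results fit in a few lines; a minimal sketch (function name is illustrative):

```python
def mm1_metrics(lam: float, mu: float):
    """Steady-state M/M/1 queue: utilisation rho, mean number in
    system N = rho/(1-rho), mean sojourn time via Little's law
    T = N/lambda. Requires rho < 1 for stability."""
    rho = lam / mu
    assert rho < 1, "queue unstable for rho >= 1"
    n_mean = rho / (1.0 - rho)
    t_mean = n_mean / lam          # Little's law: N = lambda * T
    return rho, n_mean, t_mean
```

For example, arrival rate 1 and service rate 2 give utilisation 0.5, one customer in the system on average, and a mean sojourn time of one time unit.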
Day 2: Traffic Measurements and Models (Mark)
• Morning: performance evaluation
  – Marginals, heavy tails
  – Autocorrelation
  – Superposition of flows, alpha and beta traffic
  – Long-range dependence
• Afternoon: network engineering
  – Single-link analysis
    • Spectral analysis, wavelets, forecasting, anomaly detection
  – Multiple links
    • Principal component analysis
Day 3: Advanced Queueing Models
1. Matrix-exponential (ME) distributions
   • Definition & properties: moments, distribution function
   • Examples
   • Truncated power-tail distributions
2. Queueing systems with ME distributions
   • M/ME/1//N and M/ME/1 queue
   • ME/M/1 queue
3. Performance impact of long-range-dependent traffic
   • N-Burst model, steady-state performance analysis
4. Transient analysis
   • Parameters, first-passage-time analysis (M/ME/1)
5. Stochastic control: Markov decision processes
   • Definition
   • Solution approaches
   • Example
Day 4: Network Models (Henrik)
1. Queueing networks
   • Jackson networks
   • Flow-balance equations, product-form solution
2. Deterministic network calculus
   • Arrival curves
   • Service curves: concatenation, prioritization
   • Queue lengths, output flow
   • Loops
   • Examples: leaky buckets
Day 5: Simulation Techniques (Hans, Søren)
1. Basic concepts
   • Principles of discrete-event simulation
   • Simulation tools
2. Random number generation
3. Output analysis
   • Terminating simulations
   • Non-terminating/steady-state case
4. Rare-event simulation I: importance sampling
5. Rare-event simulation II: adaptive importance sampling
Day 6: Reliability Aspects
1. Basic concepts
   • Fault, failure, fault-tolerance, redundancy types
   • Availability, reliability, dependability
   • Life-times, hazard rates
2. Mathematical models
   • Availability models, life-time models, repair models
   • Performability models
3. Availability and network management
   • Rolling upgrade, planned downtime, etc.
4. Methods & protocols for fault-tolerant systems
   a) SW fault-tolerance
   b) Network availability
      • IP routing, HSRP/VRRP, SCTP
   c) Server/service availability
      • Server farms, cluster solutions, distributed reliability
Demonstration: Fault-tolerant call control systems