Hytönen, Roni; Tshala, Alison; Schreier, Jan; Holopainen ...
Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.
![Page 2: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/2.jpg)
Reinforcement Learning for Autonomic Network Repair, M. Littman, N. Ravi, E. Fenson, R. Howard, 2004
Reinforcement learning
– Used to solve Markov decision problems (MDPs)
– States, actions, rewards, transitions, transition probabilities
– Agent explores an environment in which it perceives its current state and takes actions to reach new states
– A reward is associated with every state
– Reinforcement learning tries to find a policy that maximizes the cumulative reward for a task
![Page 3: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/3.jpg)
(Simplified) Reinforcement Learning example
Which direction should the agent move?
[Figure: grid with the Agent and two Goal States]
![Page 4: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/4.jpg)
Reinforcement Learning example (cont)
Agent makes random moves until a Goal state is reached
[Figure: grid with two Goal States; arrows show the Agent moving step by step towards a Goal State]
![Page 5: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/5.jpg)
Reinforcement Learning example (cont)
Now a policy is associated with the state from which the goal state was reached
[Figure: grid with two Goal States]
![Page 6: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/6.jpg)
Reinforcement Learning example (cont)
Now if at some point state S (which has a policy associated with it) is reached from state S’, a policy is assigned to S’ as well
[Figure: grid with two Goal States and the states S and S’]
![Page 7: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/7.jpg)
Reinforcement Learning example (cont)
After a number of iterations the optimal policies have been formed
[Figure: grid with two Goal States]
![Page 8: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/8.jpg)
Reinforcement Learning example (cont)
The corresponding state rewards
Goal State   -1   -2   -3
-1           -2   -3   -2
-2           -3   -2   -1
-3           -2   -1   Goal State
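The state rewards listed above can be reproduced with a short value-iteration sketch. It assumes a 4x4 grid with goal states in the top-left and bottom-right corners, a reward of -1 per move, and no discounting; these details are read off the grid rather than stated on the slide, and the function name `grid_state_values` is purely illustrative.

```python
def grid_state_values(size=4, goals=((0, 0), (3, 3)), step_reward=-1):
    """Value iteration on a deterministic grid.

    Assumptions (not from the slides): goal states are terminal with value 0,
    every move costs `step_reward`, and the agent may move up, down, left or right.
    """
    values = [[0] * size for _ in range(size)]
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    changed = True
    while changed:
        changed = False
        new_values = [row[:] for row in values]
        for r in range(size):
            for c in range(size):
                if (r, c) in goals:
                    continue  # terminal states keep value 0
                best = None
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < size and 0 <= nc < size):
                        continue  # moves off the grid are not allowed
                    candidate = step_reward + values[nr][nc]
                    if best is None or candidate > best:
                        best = candidate
                if best != values[r][c]:
                    new_values[r][c] = best
                    changed = True
        values = new_values
    return values


state_values = grid_state_values()
for row in state_values:
    print(row)
```

Each value ends up as the negated number of moves to the nearest goal, so a greedy policy that always steps to the neighbour with the highest value reaches a goal by a shortest path.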
![Page 9: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/9.jpg)
Implemented concept
Reinforcement learning is used to restore network connectivity after a failure
Starting state: no connectivity, Goal state: connectivity
Actions: PingGateway, PingIP, DNSLookup, UseCachedIP, FixIP, RenewLease
Learned policy is shown in the picture
Prototype implemented
Nice concept but not very useful…
![Page 10: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/10.jpg)
Approaches to Building Self Healing Systems using Dependency Analysis, J. Gao, G. Kar, P. Kermani, 2004
Problems
– Is there a way to automatically determine the root cause(s) of degraded performance of, e.g., an Internet shopping site?
– Provided that the root cause(s) can be determined, are there ways to fix the problem automatically?
![Page 11: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/11.jpg)
Architecture
Distributed System
– A typical multi-tier e-Business system (web access, database)
The Monitoring System
– Includes monitoring agents that monitor 1) the response time of the system from user’s perspective and 2) the application components (servlets, EJBs,…)
The Dependency Matrix
– Which transactions depend on which system components
Self-healing Engine
– Launched when a performance problem is noticed by the monitoring system
![Page 12: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/12.jpg)
Problem description
Based on previous work a dependency matrix can be formed
The matrix informs which customer transactions depend on which system resources
Using this matrix the system resource that causes a performance problem can be tracked down
The initial goal was to minimize the number of transactions needed to find the root cause of a problem
This problem is found to be NP-hard -> a heuristic solution is presented
![Page 13: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/13.jpg)
Solution
A solution cannot be guaranteed if two or more matrix columns are identical
Assume that 1) all matrix columns are different and 2) there is only one broken system component
– Now the solution can be found by the following algorithm
The set of all resources is denoted S. The set of all transactions is denoted T
1) Run all transactions in T one by one
2) If a transaction succeeds, remove all resources that this transaction depends on from S
3) Finally only one resource is left in S. This is the broken resource.
![Page 14: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/14.jpg)
Solution (cont)
If the fixed set of customer transactions cannot locate the root cause of a performance problem, synthetic transactions need to be created and executed
Many practical difficulties exist in doing so
No testing
![Page 15: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/15.jpg)
Ensembles of Models for Automated Diagnosis of System Performance Problems, S. Zhang, I. Cohen, M. Goldszmidt, J. Symons, A. Fox, 2005
Ensemble = collection
SLA contains Service Level Objectives (SLO)
– SLO example: “Server downtime < X sec in a day”
Problem: Which system metrics correlate with SLO violations?
– Example system metrics: CPU metrics, memory, I/O, network activity coming in and out of servers, swap space usage, paging, etc.
Tree Augmented Naïve Bayes (TAN) models
– Determine which low-level metrics most likely contributed to an SLO violation
– A mapping function is learned by the algorithm
![Page 16: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/16.jpg)
TAN model example
“Given SLO state (SLO violation) S, what is the most predictive set of system-level metrics for S?”
Combinations of metrics more predictive of SLO violations than individual metrics
Small numbers of metrics (3-8) usually sufficient to predict SLO violation
![Page 17: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/17.jpg)
Multiple TAN models
TAN models that are built using data collected under some conditions don't work well on data collected under different conditions -> need to maintain multiple TAN models
The model that best suits the current conditions is chosen by using Brier score
– Brier score is similar to Mean Squared Error (MSE) and offers a fine grained evaluation of a model
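As a rough illustration of model selection by Brier score, the sketch below computes the mean squared difference between each model's predicted violation probability and the observed outcome (1 = SLO violation, 0 = compliance) and picks the model with the lowest score. The model names and numbers are made up for the example, and the slides do not say over which window of recent data the paper evaluates the score.

```python
def brier_score(predicted_probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not outcomes or len(predicted_probabilities) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted_probabilities, outcomes)]
    return sum(squared_errors) / len(outcomes)


def select_model(model_predictions, outcomes):
    """Return the name of the model with the lowest (best) Brier score."""
    return min(
        model_predictions,
        key=lambda name: brier_score(model_predictions[name], outcomes),
    )


# Hypothetical recent observations: 1 means the SLO was violated.
outcomes = [1, 0, 1, 1]
model_predictions = {
    "model_a": [0.9, 0.2, 0.8, 0.7],  # built under conditions like the current ones
    "model_b": [0.6, 0.5, 0.4, 0.5],  # built under different conditions
}
best_model = select_model(model_predictions, outcomes)
print(best_model, brier_score(model_predictions[best_model], outcomes))
```

Unlike plain accuracy, the score rewards well-calibrated probabilities, which is what makes it a fine-grained way to compare models.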
![Page 18: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/18.jpg)
Results
Ensembles of models outperform single model
Also do slightly better than workload specific approach
– Indicates that some workload conditions too complex for single model
BA = Balanced Accuracy
FA = False Alerts
Det = Detections
![Page 19: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/19.jpg)
TAN summary
Ensembles of models perform better than a single model
The approach allows for rapid adaptation to changing conditions
No domain specific knowledge is required
Different workloads seem to be characterized by different metric-attribution “signatures” (future work)
![Page 20: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/20.jpg)
Towards Autonomic Web Services: Achieving Self-Healing Using Web Services, S. Gurguis, A. Zeid, 2005
CBE-log is a representation format into which log files of all different applications can be converted
Diagnosis Engine selects a set of repair actions
The Symptoms Database is an XML-file containing symptoms and recovery actions
Rule Engine decides which repair actions should be taken based on the Policy Database
No prototype implemented
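To make the role of the Symptoms Database more concrete, here is a minimal sketch of a lookup from log messages to recovery actions. The XML layout, the element names and the action names are all hypothetical; the slides only state that the database is an XML file containing symptoms and recovery actions, and the paper's actual record format (shown only as a picture) may differ. The Rule Engine and Policy Database are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical symptoms database; NOT the record format used in the paper.
SYMPTOMS_XML = """
<symptoms>
  <symptom id="db-connection-refused">
    <pattern>connection refused</pattern>
    <action>restart-database-service</action>
  </symptom>
  <symptom id="disk-full">
    <pattern>no space left on device</pattern>
    <action>purge-temporary-files</action>
  </symptom>
</symptoms>
"""


def load_symptoms(xml_text):
    """Parse the XML into a list of (pattern, action) pairs."""
    root = ET.fromstring(xml_text)
    return [
        (symptom.findtext("pattern"), symptom.findtext("action"))
        for symptom in root.findall("symptom")
    ]


def diagnose(log_message, symptoms):
    """Return the recovery actions whose pattern occurs in the log message."""
    message = log_message.lower()
    return [action for pattern, action in symptoms if pattern in message]


symptoms = load_symptoms(SYMPTOMS_XML)
actions = diagnose("ERROR: Connection refused by host db1", symptoms)
print(actions)
```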
![Page 21: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/21.jpg)
A typical record in the Symptom Database presented in the picture
Possible application: legacy systems
![Page 22: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/22.jpg)
Reflection, Self-Awareness and Self-Healing in OpenORB, G. Blair, G. Coulson, et al., 2002
OMG (Object Management Group)
– An open membership, not-for-profit consortium that produces and maintains computer industry specifications for interoperable enterprise applications
OMG CORBA (Common Object Request Broker Architecture)
– Open, vendor-independent architecture and infrastructure that computer applications use to work together over networks
– Supports communication between different types of operating systems, programming languages and networks
– Interfaces defined in OMG IDL (Interface Definition Language)
– Mappings exist between IDL and C, C++, Java, COBOL, Smalltalk, Ada, Lisp, Python, and IDLscript
OpenORB
– Provides a Java implementation of the OMG CORBA 2.4.2 specification
![Page 23: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/23.jpg)
Example, OMG IDL <-> C mappings
![Page 24: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/24.jpg)
OpenORB self-healing
Meta-interface supports access to the underlying platform
Open ORB supports the ability to discover meta-information about the current system, both in terms of its structure and ongoing behaviour
System properties can also be adapted by using the appropriate meta-interfaces
Management component can be introduced (dynamically) into the various meta-space models
??
![Page 25: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/25.jpg)
Measuring the Effectiveness of Self-Healing Autonomic Systems, A. Brown, C. Redlin, 2005
SPEC (Standard Performance Evaluation Corporation)
– Non-profit corporation that maintains a standardized set of relevant benchmarks applicable to the newest generation of high-performance computers
SPEC jAppServer2004
– Benchmark for measuring the performance of J2EE application servers
– An end-to-end application which exercises all major J2EE technologies
Based on jAppServer2004 a benchmarking system was created that is capable of quantifying the autonomic self-healing capability of a large-scale J2EE software solution
The system is used in various production environments
![Page 26: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/26.jpg)
The Architecture
30 different types of disturbances representing common failure modes can be injected into the system under test (SUT)
– Component shutdowns, data loss, resource exhaustion, load surges, operator errors, ...
Two metrics are used to evaluate the SUT’s self-healing capacity
1) How effectively the SUT heals itself
– Basically measured by counting how many requests jAppServer2004 gets right under a disturbance, compared to normal working conditions
2) How autonomic the healing response is
– A 90-question survey is used
![Page 27: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/27.jpg)
The Survey
The 90-question survey assigns points to the SUT based on the level of automation present in its response to each disturbance (based on IBM’s autonomic computing maturity model)
– 0 points for a basic manual response, 1 point for a managed response, 2 for predictive, 4 for adaptive, and 8 for autonomic
“...Our baseline run on SUT #1 resulted in an average healing effectiveness score of 0.79 and an autonomic maturity score of 0.15 (both out of 1.0), indicating a relatively low level of autonomic self-healing capability. In comparison, SUT #2 attained an effectiveness score of 0.83 and a maturity score of 0.22. Comparing the two results indicates that SUT #2’s system management technology provided a small—but measurable—improvement in autonomic capability...”
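The two metrics can be approximated with the sketch below. It rests on two assumptions that the slides do not confirm: healing effectiveness is taken as the ratio of correctly served requests under a disturbance to those served under normal conditions, and the maturity score is normalized by dividing the collected points by the maximum attainable (8 points per answer) so that it falls between 0 and 1. The function names and sample numbers are illustrative only.

```python
# Points per maturity level, as listed on the slide.
MATURITY_POINTS = {
    "basic": 0,
    "managed": 1,
    "predictive": 2,
    "adaptive": 4,
    "autonomic": 8,
}


def healing_effectiveness(correct_under_disturbance, correct_under_normal):
    """Assumed definition: fraction of normal throughput kept under a disturbance."""
    if correct_under_normal <= 0:
        raise ValueError("baseline request count must be positive")
    return correct_under_disturbance / correct_under_normal


def maturity_score(levels):
    """Assumed normalization: collected points divided by the maximum possible."""
    if not levels:
        raise ValueError("at least one survey answer is required")
    total = sum(MATURITY_POINTS[level] for level in levels)
    return total / (MATURITY_POINTS["autonomic"] * len(levels))


# Made-up example: four survey answers and one disturbance run.
answers = ["basic", "managed", "autonomic", "adaptive"]
print(healing_effectiveness(790, 1000))
print(maturity_score(answers))
```

Because the point scale doubles at the upper levels, a system needs mostly adaptive or autonomic responses to score well, which fits the low maturity scores quoted above.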
![Page 28: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/28.jpg)
Personal Autonomic Computing Self-Healing Tool, R. Sterritt, S. Chung, 2004
A self-healing tool consisting of a pulse monitor and a health monitor
Used in a PC environment
Pulse Monitoring application (PBM) is a UDP-based peer-to-peer application which
1) Checks whether hosts are providing a ‘heartbeat’ or not
2) Indicates the health level of the system (state of processes)
3) Reboots a neighbor if no heartbeat is heard from it (security?)
Health Monitoring runs on a host and restarts a process on the same host if it’s not responding
Combines three old concepts: watchdog processes, hello-mechanism, and remote control
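The heartbeat check at the core of the pulse monitor can be sketched without any networking. The class below only tracks when each peer was last heard from and reports peers that have been silent for longer than a timeout; the class name, the host names and the timeout value are assumptions for illustration. In the actual tool the pulses arrive over UDP and a silent neighbor is rebooted, neither of which is modelled here.

```python
class HeartbeatTable:
    """Tracks the last heartbeat time of each peer (timestamps in seconds)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def record(self, host, timestamp):
        """Note that a heartbeat from `host` was received at `timestamp`."""
        self.last_seen[host] = timestamp

    def silent_hosts(self, now):
        """Return the hosts whose last heartbeat is older than the timeout."""
        return sorted(
            host
            for host, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


table = HeartbeatTable(timeout_seconds=5.0)
table.record("pc-a", 100.0)
table.record("pc-b", 103.0)
silent = table.silent_hosts(now=107.0)
print(silent)
```

Passing timestamps in explicitly keeps the logic easy to test; a real monitor would feed it from a socket receive loop and a clock such as `time.monotonic()`.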
![Page 29: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/29.jpg)
The Architecture
Pulse Monitor (Java) communicates with platform-specific Health Monitor (C) through JNI
Main monitor monitors Pulse monitor and Health monitor
![Page 30: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/30.jpg)
Testing
A proof-of-concept prototype system was built on the MS Windows platform
Future topics: more autonomic functionality & supported platforms
Maybe useful when human administration not possible (sensor networks?)
![Page 31: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/31.jpg)
Conclusions
1) Reinforcement Learning for Autonomic Network Repair
– Learn autonomically the best sequence of actions to repair a network outage
– Prototype implemented and tested (useful?)
2) Approaches to Building Self Healing Systems using Dependency Analysis
– Determine the root cause of degraded performance and try to fix it
– No testing, use 3. instead?
3) Ensembles of Models for Automated Diagnosis of System Performance Problems
– Suitable (tested) system for (Hewlett Packard) server systems
– Pinpoints causes of SLO violation
4) Towards Autonomic Web Services: Achieving Self-Healing Using Web Services
– Autonomic web server healing system
– No testing
![Page 32: Resiliency and self-healing Visa Holopainen, visa@netlab.tkk.fi.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649ebe5503460f94bc8de6/html5/thumbnails/32.jpg)
Conclusions
1) Reflection, Self-Awareness and Self-Healing in OpenORB
– ?
2) Measuring the Effectiveness of Self-Healing Autonomic Systems
– Suitable system for J2EE server systems
– Provides users with a quantitative way to measure the self-healing capability of their IT systems
– Implemented and in use
3) Personal Autonomic Computing Self-Healing Tool
– Enables a group of PCs to monitor the health of each other
– Applications?
– Prototype implemented
Overall much discussion about server self-healing