UBL: unsupervised behavior learning for predicting
performance anomalies in virtualized cloud systems
Dean, Daniel Joseph and Nguyen, Hiep and Gu, Xiaohui, Proceedings of the 9th international conference on Autonomic computing (ICAC'12), San Jose,
California, USA, 2012.
Summarized by: Drew Wicke November 30th 2015
Overview
• Introduction
• Self-Organizing Maps (SOMs)
• Experimental Setup
• Experiments and Results
• Conclusion & Critique
Introduction
• Problem:
• Anomaly prediction in IaaS (Infrastructure as a Service) clouds
• Challenges
• VMs are black boxes to the provider
• Thousands of concurrent jobs
• Impossible to get labeled training data
Unsupervised Learning
• Unlabeled training data = No reward or error signal
• Clustering (K-means, DBSCAN, Birch, etc.)
• Latent Variables (Principal Component Analysis)
• Neural Networks (Self-Organizing Maps, Adaptive Resonance Theory)
Self-Organizing Maps
• No labeled training data
• Maps high dimensional space to low dimensions
• Keeps topological order
• Predict both known and unknown anomalies
SOM Training
• Data is normalized to [0, 100]
• 32 x 32 lattice network (1024 total neurons)
• Weights are randomly initialized to [0,100]
• K-Fold cross validation for learning phase (K = 3)
• Maps each data sample to its closest neuron using the Euclidean distance metric
Weight Update
• Weight update: W(t+1) = W(t) + N(v, t) · L(t) · (D(t) − W(t))
• W(t) - weight at time t
• D(t) - data input vector
• N(v, t) - neighborhood function (Gaussian) based on the lattice distance to a neighbor neuron v
• L(t) - learning rate (set to 0.7)
• Iterated over input data 10 times
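The training loop above can be sketched in NumPy. The slides specify only the lattice size, initialization range, learning rate, and iteration count; the feature count, the neighborhood width sigma, and the random training data below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
side, n_features = 32, 4                   # 32 x 32 lattice; 4 metrics is an assumed example
weights = rng.uniform(0, 100, size=(side, side, n_features))  # random init in [0, 100]
data = rng.uniform(0, 100, size=(500, n_features))            # stand-in for normalized samples

learning_rate = 0.7                        # L(t), fixed per the slides
sigma = 2.0                                # Gaussian neighborhood width (assumed value)
ys, xs = np.mgrid[0:side, 0:side]          # lattice coordinates of every neuron

for _ in range(10):                        # 10 passes over the input data
    for d in data:
        # Best-matching unit: neuron whose weight vector is closest (Euclidean)
        dist = np.linalg.norm(weights - d, axis=2)
        by, bx = np.unravel_index(np.argmin(dist), dist.shape)
        # Gaussian neighborhood N(v, t) over lattice distance to the winner
        lattice_d2 = (ys - by) ** 2 + (xs - bx) ** 2
        n = np.exp(-lattice_d2 / (2 * sigma ** 2))
        # W(t+1) = W(t) + N(v, t) * L(t) * (D(t) - W(t))
        weights += n[:, :, None] * learning_rate * (d - weights)
```

Because the update is a convex combination of the old weight and the input, weights stay inside the [0, 100] range of the normalized data.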
Distance Equations
• Distance function:
• w - weight vectors for neurons i and j
• Neighborhood Area Size:
• Top, Left, Right, Bottom Neurons
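A minimal sketch of the neighborhood area size for one neuron, summing distances to its top, bottom, left, and right neighbors. The slides do not transcribe the distance function itself, so Manhattan distance between weight vectors is an assumption, as are the function name and the edge handling:

```python
import numpy as np

def neighborhood_area_size(weights, y, x):
    """Sum of distances from neuron (y, x) to its 4-connected lattice neighbors."""
    side = weights.shape[0]
    total = 0.0
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:   # top, bottom, left, right
        ny, nx = y + dy, x + dx
        if 0 <= ny < side and 0 <= nx < side:           # skip neighbors off the lattice edge
            total += np.abs(weights[y, x] - weights[ny, nx]).sum()  # Manhattan distance (assumed)
    return total
```

Neurons in sparsely populated regions of the map sit far from their neighbors in weight space, so a large neighborhood area size flags a rarely seen, likely anomalous, system state.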
Unsupervised Anomaly Prediction
• System states:
• Normal
• Pre-Failure
• Failure
Anomaly Prediction
• Threshold-based classification into the 3 system states based on neighborhood area size
• Threshold value selected as the 85th percentile of the sorted neighborhood area sizes
• Alarm only after 3 consecutive anomalous samples
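A sketch of the alarm logic above. The 85th-percentile threshold and the 3-consecutive-samples rule come from the slides; the function name and the reset behavior after a normal sample are assumptions:

```python
import numpy as np

def classify_stream(training_area_sizes, stream_area_sizes):
    """Return indices in the stream where an anomaly alarm is raised."""
    # Threshold: 85th percentile of the neighborhood area sizes seen in training
    threshold = np.percentile(training_area_sizes, 85)
    consecutive, alarms = 0, []
    for i, size in enumerate(stream_area_sizes):
        if size > threshold:
            consecutive += 1
            if consecutive == 3:          # alarm only on the 3rd consecutive anomalous sample
                alarms.append(i)
        else:
            consecutive = 0               # a normal sample resets the count (assumed)
    return alarms
```

Requiring three consecutive anomalous samples filters out transient spikes that would otherwise trigger false alarms.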
Anomaly Cause Inference
• Indicated by the difference between nearby normal states and the anomalous state; not an exact root cause
• Distance metrics computed for the 5 normal neurons nearest the anomalous neuron
• Sort the metrics from high to low
• Each neuron votes for the feature it indicates as the cause of the problem
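The voting step might be sketched as follows. The feature names are hypothetical examples, and using each neuron's largest absolute weight difference to cast its vote is an assumption about how a vote is decided:

```python
import numpy as np
from collections import Counter

def infer_cause(anomalous_w, normal_ws, feature_names):
    """Majority vote over the 5 nearby normal neurons on the most deviant feature."""
    votes = Counter()
    for w in normal_ws:
        diff = np.abs(anomalous_w - w)                      # per-feature deviation
        votes[feature_names[int(np.argmax(diff))]] += 1     # vote for the largest deviation
    return votes.most_common(1)[0][0]

# Hypothetical usage: the anomalous neuron deviates most on the first metric
features = ["cpu", "mem", "net", "disk"]
anomalous = np.array([90.0, 10.0, 10.0, 10.0])
normals = [np.array([10.0, 10.0, 10.0, 10.0]) for _ in range(5)]
cause = infer_cause(anomalous, normals, features)
```

This yields a hint about which metric drove the anomaly, not a definitive root cause, matching the caveat in the slide.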
Decentralized
• The learning method is run inside a VM
• Uses residual resources
• Monitors resources and moves to a host with sufficient resources to learn
Experimental Setup
• RUBiS online auction benchmark
• NASA web server trace (July 1995) modulates the request rate
• SLO violation if average request response time >100ms
• Faults
• MemLeak - memory-intensive program on the VM running the database
• CpuLeak - gradually increasing CPU consumption competes with the database for CPU
• NetHog - large number of HTTP requests to the web server
Experimental Setup
• IBM System S - high-performance data stream processing system (SLO: average processing time < 20ms)
• ClarkNet web server trace from August 1995 modulates the data arrival rate
• Faults
• MemLeak - start a memory-intensive program in one randomly selected processing element (PE)
• CpuHog - CPU bound program competes with a random PE
• Bottleneck - set a low CPU cap for the VM running a random PE
Experimental Setup
• Hadoop - sorting application (sample app)
• SLO violation is marked when job does not make progress.
• 3 VMs for Map and 6 for Reduce
• 12 GB of data to process
• Faults:
• MemLeak - memory leak bug in map tasks: memory is allocated from the heap without being released
• CpuHog - inject an infinite-loop bug into all map tasks
Experiment Measures
• ROC - Receiver Operating Characteristic Curves
• tradeoff between the true positive rate and the false positive rate
• Achieved lead time - amount of time an alarm precedes the SLO violation
Comparisons
• PCA (Principal Component Analysis)
• k-NN scheme (k-nearest neighbor)
• Both need normal and anomalous training data, unlike SOM, which needs only normal data
Prediction Accuracy Results
• In all experiments the SOM method achieves better prediction accuracy than PCA and k-NN
[ROC curves for RUBiS, IBM System S, and Hadoop]
* (UBL-kPtS) denotes the UBL scheme using k-point moving-average smoothing
Lead Time Results
• In all experiments the SOM-based UBL method achieves the longest lead times
Anomaly Cause Inference Results
• System S achieves near perfect inference.
Scalability Results
System Overhead
• Overall, UBL is lightweight
Conclusions
• Black-box unsupervised behavior learning and anomaly prediction for IaaS
• Predict unknown performance anomalies
• Provides hints to causes of anomalies
• Prediction accuracy up to 98% true positive rate and 1.7% false positive rate
• Advance alarms with up to 47s of lead time
Critique
• A different method for initializing the SOM weights rather than random (e.g., Principal Components Initialization)
• Was the method able to maintain the SLOs?
• What were the features?
• “For each fault injection we repeated the experiment 30-40 times.” Not useful for repeatability.
• No confidence intervals on the lead times
• K-NN is a supervised learning algorithm. Why not compare to another unsupervised learning method?
• What value of k was used?
• Did they mean k-means?
Interesting
• Overall a very interesting paper
• This method is patented
• Recently licensed to Google!
Xiaohui (Helen) Gu: http://www.csc.ncsu.edu/faculty/gu/
Thank You
• Questions?