UBL: unsupervised behavior learning for predicting
performance anomalies in virtualized cloud systems
Dean, Daniel Joseph and Nguyen, Hiep and Gu, Xiaohui, Proceedings of the 9th international conference on Autonomic computing (ICAC'12), San Jose,
California, USA, 2012.
Summarized by: Drew Wicke November 30th 2015
Overview
• Introduction
• Self-Organizing Maps (SOMs)
• Experimental Setup
• Experiments and Results
• Conclusion & Critique
Introduction
• Problem:
• Anomaly prediction in IaaS (Infrastructure as a Service) clouds
• Challenges
• VMs are black boxes to the provider
• Thousands of concurrent jobs
• Impossible to get labeled training data
Unsupervised Learning
• Unlabeled training data = No reward or error signal
• Clustering (K-means, DBSCAN, Birch, etc.)
• Latent Variables (Principal Component Analysis)
• Neural Networks (Self-Organizing Maps, Adaptive Resonance Theory)
Self-Organizing Maps
• No labeled training data
• Maps high dimensional space to low dimensions
• Keeps topological order
• Predict both known and unknown anomalies
SOM Training
• Data is normalized to [0, 100]
• 32 x 32 lattice network (1024 total neurons)
• Weights are randomly initialized to [0,100]
• K-Fold cross validation for learning phase (K = 3)
• Maps each data sample to its closest neuron using the Euclidean distance metric
Weight Update
• Weight update: W(t+1) = W(t) + N(v, t) · L(t) · (D(t) − W(t))
• W(t) - weight at time t
• D(t) - data input vector
• N(v, t) - neighborhood function (Gaussian) based on the lattice distance to a neighbor neuron v
• L(t) - learning rate (set to 0.7)
• Iterated over input data 10 times
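The training loop above can be sketched in NumPy. The slides specify only the lattice size, initialization range, learning rate, and iteration count; the feature count, the neighborhood width sigma, and the random training data below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
side, n_features = 32, 4                   # 32 x 32 lattice; 4 metrics is an assumed example
weights = rng.uniform(0, 100, size=(side, side, n_features))  # random init in [0, 100]
data = rng.uniform(0, 100, size=(500, n_features))            # stand-in for normalized samples

learning_rate = 0.7                        # L(t), fixed per the slides
sigma = 2.0                                # Gaussian neighborhood width (assumed value)
ys, xs = np.mgrid[0:side, 0:side]          # lattice coordinates of every neuron

for _ in range(10):                        # 10 passes over the input data
    for d in data:
        # Best-matching unit: neuron whose weight vector is closest (Euclidean)
        dist = np.linalg.norm(weights - d, axis=2)
        by, bx = np.unravel_index(np.argmin(dist), dist.shape)
        # Gaussian neighborhood N(v, t) over lattice distance to the winner
        lattice_d2 = (ys - by) ** 2 + (xs - bx) ** 2
        n = np.exp(-lattice_d2 / (2 * sigma ** 2))
        # W(t+1) = W(t) + N(v, t) * L(t) * (D(t) - W(t))
        weights += n[:, :, None] * learning_rate * (d - weights)
```

Because the update is a convex combination of the old weight and the input, weights stay inside the [0, 100] range of the normalized data.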
Distance Equations
• Distance function:
• w - weight vectors for neurons i and j
• Neighborhood Area Size:
• Top, Left, Right, Bottom Neurons
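A minimal sketch of the neighborhood area size for one neuron, summing distances to its top, bottom, left, and right neighbors. The slides do not transcribe the distance function itself, so Manhattan distance between weight vectors is an assumption, as are the function name and the edge handling:

```python
import numpy as np

def neighborhood_area_size(weights, y, x):
    """Sum of distances from neuron (y, x) to its 4-connected lattice neighbors."""
    side = weights.shape[0]
    total = 0.0
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:   # top, bottom, left, right
        ny, nx = y + dy, x + dx
        if 0 <= ny < side and 0 <= nx < side:           # skip neighbors off the lattice edge
            total += np.abs(weights[y, x] - weights[ny, nx]).sum()  # Manhattan distance (assumed)
    return total
```

Neurons in sparsely populated regions of the map sit far from their neighbors in weight space, so a large neighborhood area size flags a rarely seen, likely anomalous, system state.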
Unsupervised Anomaly Prediction
• System states:
• Normal
• Pre-Failure
• Failure
Anomaly Prediction
• Threshold-based classification into the 3 system states based on neighborhood area size
• Threshold value selected as the 85th percentile of the sorted neighborhood area sizes
• Alarm only after 3 consecutive anomalous samples
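A sketch of the alarm logic above. The 85th-percentile threshold and the 3-consecutive-samples rule come from the slides; the function name and the reset behavior after a normal sample are assumptions:

```python
import numpy as np

def classify_stream(training_area_sizes, stream_area_sizes):
    """Return indices in the stream where an anomaly alarm is raised."""
    # Threshold: 85th percentile of the neighborhood area sizes seen in training
    threshold = np.percentile(training_area_sizes, 85)
    consecutive, alarms = 0, []
    for i, size in enumerate(stream_area_sizes):
        if size > threshold:
            consecutive += 1
            if consecutive == 3:          # alarm only on the 3rd consecutive anomalous sample
                alarms.append(i)
        else:
            consecutive = 0               # a normal sample resets the count (assumed)
    return alarms
```

Requiring three consecutive anomalous samples filters out transient spikes that would otherwise trigger false alarms.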
Anomaly Cause Inference
• Indicated by the difference between nearby normal states and the anomalous state; not an exact root cause
• Distance metrics computed for the 5 normal neurons nearest the anomalous neuron
• Sort the metrics from high to low
• Each neuron votes for the feature it indicates as the cause of the problem
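The voting step might be sketched as follows. The feature names are hypothetical examples, and using each neuron's largest absolute weight difference to cast its vote is an assumption about how a vote is decided:

```python
import numpy as np
from collections import Counter

def infer_cause(anomalous_w, normal_ws, feature_names):
    """Majority vote over the 5 nearby normal neurons on the most deviant feature."""
    votes = Counter()
    for w in normal_ws:
        diff = np.abs(anomalous_w - w)                      # per-feature deviation
        votes[feature_names[int(np.argmax(diff))]] += 1     # vote for the largest deviation
    return votes.most_common(1)[0][0]

# Hypothetical usage: the anomalous neuron deviates most on the first metric
features = ["cpu", "mem", "net", "disk"]
anomalous = np.array([90.0, 10.0, 10.0, 10.0])
normals = [np.array([10.0, 10.0, 10.0, 10.0]) for _ in range(5)]
cause = infer_cause(anomalous, normals, features)
```

This yields a hint about which metric drove the anomaly, not a definitive root cause, matching the caveat in the slide.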
Decentralized
• The learning method is run inside a VM
• Uses residual resources
• Monitors resources and moves to a host with sufficient resources to learn
Experimental Setup
• RUBiS online auction benchmark
• NASA web server trace (July 1995) modulates the request rate
• SLO violation if average request response time >100ms
• Faults
• MemLeak - memory-intensive program on the VM running the database
• CpuLeak - gradually increasing CPU consumption competes with the database for CPU
• NetHog - large number of HTTP requests to the web server
Experimental Setup
• IBM System S - high-performance data stream processing system (SLO: average processing time < 20ms)
• ClarkNet web server trace from August 1995 modulates the data arrival rate
• Faults
• MemLeak - start a memory-intensive program in one randomly selected processing element (PE)
• CpuHog - CPU bound program competes with a random PE
• Bottleneck - set a low CPU cap for the VM running a random PE
Experimental Setup
• Hadoop - sorting application (sample app)
• SLO violation is marked when job does not make progress.
• 3 VMs for Map and 6 for Reduce
• 12 GB of data to process
• Faults:
• MemLeak - memory leak bug in map tasks: memory is allocated from the heap without being released
• CpuHog - inject an infinite-loop bug into all map tasks
Experiment Measures
• ROC - Receiver Operating Characteristic Curves
• tradeoff between the true positive rate and the false positive rate
• Achieved lead time - amount of time an alarm precedes the SLO violation
Comparisons
• PCA (Principal Component Analysis)
• k-NN scheme (k-nearest neighbor)
• Both need normal and anomalous training data, unlike SOM, which needs only normal data
Prediction Accuracy Results
• In all experiments the SOM method achieves better prediction accuracy than PCA and k-NN
[ROC curves for RUBiS, IBM System S, and Hadoop]
* (UBL-kPtS) denotes the UBL scheme using k-point moving-average smoothing
Lead Time Results
• In all experiments the SOM-based UBL method achieves the longest lead times
Anomaly Cause Inference Results
• System S achieves near perfect inference.
Scalability Results
System Overhead
• Overall, UBL is lightweight
Conclusions
• Black-box unsupervised behavior learning and anomaly prediction for IaaS
• Predict unknown performance anomalies
• Provides hints to causes of anomalies
• Prediction accuracy up to 98% true positive rate and 1.7% false positive rate
• Advance alarms with up to 47s of lead time
Critique
• A different method for initializing the SOM weights rather than random (e.g., Principal Components Initialization)
• Was the method able to maintain the SLOs?
• What were the features?
• “For each fault injection we repeated the experiment 30-40 times.” Not useful for repeatability.
• No confidence intervals on the lead times
• K-NN is a supervised learning algorithm. Why not compare to another unsupervised learning method?
• What value of k was used?
• Did they mean k-means?
Interesting
• Overall a very interesting paper
• This method is patented
• Recently licensed to Google!
Xiaohui (Helen) Gu: http://www.csc.ncsu.edu/faculty/gu/
Thank You
• Questions?