Statistical Data Reduction for Efficient Application Performance Monitoring
Lingyun Yang, Jennifer M. Schopf, Catalin L. Dumitrescu, Ian Foster
University of Chicago
Argonne National Laboratory
Introduction
In distributed and shared systems:
– The performance of resources changes dynamically
– Variability in resource performance can have a major influence on application performance
To deliver dependable and sustained performance to applications:
– Performance monitoring and anomaly diagnosis are necessary
What is the problem?
A system can be characterized by a set of system metrics:
– M = (m1, m2, …, mn)
– Example: (CPU load, bandwidth, free memory size, number of open files, …)
Application performance can be described quantitatively by a performance metric Y:
– Example: number of computations finished per unit time
Goal: monitor the performance of system components (the values of M) so that, if an anomaly appears in application performance (the value of Y), we can diagnose its cause.
Solution Challenges
– Computer systems and applications continue to increase in complexity and size
– Interactions among components are poorly understood
– Instrumentation produces tremendous volumes of data
> Results in complexity for data analysis and anomaly diagnosis
This requires a data reduction strategy that:
– Reduces the number of system metrics that a monitoring system must manage (necessary)
– Retains the interesting characteristics of the performance data (sufficient)
Outline
Problems
> Data Reduction Strategy
– Two observations
– Redundant system metrics reduction
– Statistical variable selection
Experiments
Conclusion
Two Observations
Some system metrics may capture the same or similar information:
– They are correlated with each other
– Only one is necessary; the others are redundant
Not all system metrics are related to a particular application's performance:
– Some system metrics are unrelated to the performance of the application, and are therefore unnecessary
A two-step data reduction strategy

Redundant system metrics reduction
Clustering-based method:
– Use the correlation coefficient (r) to measure the degree of correlation between two system metrics
– Group metrics with high correlation coefficients into clusters
– Eliminate all but one of the metrics in each cluster
Two questions:
– What threshold value t to use (determined experimentally)
– How to compare r against t
How to compare
Traditional method: direct mathematical comparison
– Is r > t?
Problems:
– Only a limited number of sample data points are available
– r may change across data collected during different runs
– We may group uncorrelated metrics together purely by chance
[Figure: sample correlation coefficient between the number of transfers issued per second and the number of memory pages cached per second, for 20 runs of the Cactus application]
Z-test
Reduces false errors given a limited number of samples, and avoids grouping uncorrelated metrics into one cluster.
Z-test:
– A statistical method
– Determines whether an observed correlation coefficient is statistically significantly larger than the threshold value (95% confidence in this work)
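A common way to carry out such a test is the Fisher z-transformation, under which the transformed correlation coefficient is approximately normal with variance 1/(n − 3). The sketch below is illustrative rather than the authors' exact procedure; the function name, the clamping detail, and the hardcoded one-sided critical value are my assumptions:

```python
import math

def z_test_corr_greater(r, t, n):
    """One-sided Fisher z-test: is |r| significantly larger than t?

    r: sample correlation coefficient between two metrics
    t: threshold correlation (e.g. 0.95)
    n: number of samples (must be > 3)
    Returns True when the correlation exceeds t at 95% confidence.
    """
    # The Fisher transformation makes the sampling distribution of r
    # approximately normal with variance 1 / (n - 3).
    z_r = math.atanh(min(abs(r), 0.999999))  # clamp to avoid atanh(1)
    z_t = math.atanh(t)
    z_stat = (z_r - z_t) * math.sqrt(n - 3)
    # One-sided critical value for alpha = 0.05 is about 1.645.
    return z_stat > 1.645
```

With only 20 samples, even an observed r = 0.96 is not significantly above a threshold of 0.95, which illustrates why a direct r > t comparison over-groups metrics.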
Redundant metrics reduction algorithm
Given a set of samples, we proceed as follows:
– Perform the Z-test on the correlation coefficient between every pair of system metrics.
– Group two metrics into one cluster only when the absolute value of their correlation coefficient is statistically significantly larger than the threshold value.
– The result of this computation is a set of system metric clusters.
– The system metrics in each cluster are strongly correlated, so one metric from each cluster is kept as the representative of the cluster while the others are deleted as redundant.
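The steps above can be sketched as follows. The transcript does not specify the exact clustering procedure, so this sketch merges significantly correlated pairs with a union-find structure and keeps the first metric of each cluster as its representative; those details, and the helper names, are assumptions:

```python
import math
from itertools import combinations

def reduce_redundant_metrics(names, corr, n_samples, t=0.95):
    """Cluster metrics whose pairwise |r| passes a one-sided Fisher
    z-test against threshold t, then keep one metric per cluster.

    names: list of metric names
    corr: dict mapping (i, j) index pairs to the sample correlation r
    n_samples: number of observations used to estimate each r
    Returns the representative metric names, sorted.
    """
    parent = list(range(len(names)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        r = abs(corr.get((i, j), 0.0))
        z_r = math.atanh(min(r, 0.999999))
        z_stat = (z_r - math.atanh(t)) * math.sqrt(n_samples - 3)
        if z_stat > 1.645:             # significantly correlated above t
            parent[find(i)] = find(j)  # merge into one cluster

    # Keep the first metric seen in each cluster as its representative.
    reps = {}
    for i, name in enumerate(names):
        reps.setdefault(find(i), name)
    return sorted(reps.values())
```

Union-find makes the grouping transitive: if a~b and b~c both pass the test, all three land in one cluster and two of them are dropped.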
Outline
Problems
Data Reduction Strategy
– Two observations
– Redundant system metrics reduction
> – Statistical variable selection
Experiments
Conclusion
Statistical Variable Selection
Some of these system metrics may not be related to our chosen performance metric.
We identify the subset of all system metrics that is necessary to capture the performance metric.
This form of data reduction is also known as variable selection.
We use the Backward Elimination (BE) stepwise regression method to select the system metrics.
BE stepwise regression method
The system metrics concerned: X = (x1, x2, …, xn)
The application performance metric: Y
Steps:
1. Fit the full model Y = β0 + β1x1 + β2x2 + … + βnxn
2. Which xi is the most useless in this model? Calculate the F value of each xi; the F value captures its contribution to the model.
3. Is the smallest F value below the predefined significance value? If yes, delete the corresponding xi and go to step 1.
4. All metrics left are useful for capturing the variation of Y.
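A minimal sketch of the BE loop, assuming ordinary least-squares fits and a partial-F statistic computed by comparing the residual sum of squares of the full model against each one-variable-smaller model. The cutoff f_crit = 4.0 is illustrative; the slides only say a predefined significance value is used:

```python
import numpy as np

def backward_elimination(X, y, f_crit=4.0):
    """Backward-elimination stepwise regression (a sketch).

    X: (n_samples, n_metrics) matrix of system-metric values
    y: (n_samples,) application performance metric
    f_crit: partial-F cutoff; a metric whose F value falls below it
            is judged useless and dropped (value is illustrative).
    Returns the column indices of the metrics retained.
    """
    def rss(cols):
        # Residual sum of squares of an OLS fit with an intercept.
        A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return float(resid @ resid)

    kept = list(range(X.shape[1]))
    while len(kept) > 1:
        rss_full = rss(kept)
        dof = len(y) - len(kept) - 1   # residual degrees of freedom
        # Partial F: how much the RSS grows when one metric is removed.
        f_vals = [(rss([c for c in kept if c != j]) - rss_full)
                  / (rss_full / dof) for j in kept]
        worst = int(np.argmin(f_vals))
        if f_vals[worst] >= f_crit:
            break                      # every remaining metric is useful
        kept.pop(worst)                # drop the least significant metric
    return kept
```

Each iteration refits the model without the weakest metric, exactly as in steps 1–3 above, and stops when even the smallest F value clears the cutoff.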
Outline
Problems
Data Reduction Strategy
> Experiments
– Application and data collection
– Two criteria
– Experiment methodology
– Results
Conclusion
Application and Data Collection
Application: Cactus
Testbed: six Linux machines at UCSD
Data collected at 0.033 Hz (one sample every ~30 seconds) for 24 hours
Every data point includes 600+ system metric values and 1 application performance value
System metrics are collected on each machine using three utilities:
– (1) The sar command of the SYSSTAT tool set
– (2) Network Weather Service (NWS) sensors
– (3) The Unix command ping
Two criteria
Reduction degree (RD) -- necessary:
– Total percentage of system metrics eliminated
Coefficient of determination (R2) -- sufficient:
– A statistical measurement
– Indicates the fraction of the total variability in application performance that can be explained by the selected system metrics
– A larger R2 value means the selected system metrics better capture the variation in application performance
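For a linear model, R2 is computed as 1 − SS_res/SS_tot: the fraction of the total variability in y explained by the fit. A minimal sketch (the function name and the least-squares fit with an intercept are my assumptions):

```python
import numpy as np

def r_squared(X_sel, y):
    """Coefficient of determination R^2 for a linear model of y
    on the selected system metrics.

    X_sel: (n_samples, k) values of the selected metrics
    y: (n_samples,) observed application performance
    """
    A = np.column_stack([np.ones(len(y)), X_sel])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)                 # unexplained variability
    ss_tot = float(((y - y.mean()) ** 2).sum())   # total variability
    return 1.0 - ss_res / ss_tot
```

R2 = 1 means the selected metrics explain the performance variation perfectly; R2 near 0 means they explain almost none of it.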
Experiment methodology
The 24-hour data set is partitioned into 12 equal-sized chunks. The first chunk is used as training data; the remaining 11 chunks are used as verification data.
Two-step experiment:
Data Reduction
– Use the training data to select system metrics
Verification: Are these system metrics sufficient? Is the result stable? How does this method compare with other strategies?
– RAND randomly picks a subset of system metrics equal in number to those selected by our strategy
– MAIN uses a subset of 75 system metrics that other work commonly uses to model application performance
Data reduction using training data
As the threshold increases, RD decreases, since fewer system metrics are grouped into clusters and removed as redundant.
As the threshold increases, R2 increases, since more information is available to model the application performance.
RD = 0.78 and R2 = 0.98 when the threshold value t = 0.95: a total of 141 of the original 628 system metrics were selected (RD = (628 − 141)/628 ≈ 0.78).
System metrics selected on one machine
Name        Measurement
wtps        Total number of write requests per second issued to the physical disk
activepg    Number of active (recently touched) pages in memory
proc/s      Total number of processes created per second
rxpck/s     Total number of packets received per second
txpck/s     Total number of packets transmitted per second
coll/s      Number of collisions that happened per second while transmitting packets
kbbuffers   Amount of memory used as buffers by the kernel, in kilobytes
ip-frag     Number of IP fragments currently in use
runq-sz     Run queue length (number of processes waiting for run time)
ldavg-5     System load average for the past 5 minutes
ldavg-15    System load average for the past 15 minutes
campg/s     Number of additional memory pages cached by the system per second
dentunusd   Number of unused cache entries in the directory cache
file-sz     Number of used file handles
rtsig-sz    Number of queued RT signals
cswch/s     Number of context switches per second
Latency     Amount of time required to transmit a TCP message to a target machine
bandwidth   Speed with which data can be sent to a target machine per second
AvailCPU    Fraction of CPU available to a newly-started process
FreeMem     Amount of unused space in memory
Verification
[Figure: R2 values of SDR, MAIN, and RAND across the verification data]
SDR exhibited an average R2 value of 0.907, which is 55.0% and 98.5% higher than those of RAND and MAIN, respectively.
The system metrics selected by SDR are significantly more efficient than the alternatives for capturing Cactus performance.
Verification Results Analysis
The system metrics selected by our strategy are:
– Sufficient to capture the variation in application performance (average R2 value of 0.907)
– Stable (high R2 values over a long period: 24 hours)
– Better than the other two strategies considered
Conclusion
Statistical data reduction strategy:
– Reduces redundant system metrics that convey the same information
> Clustering-based method + Z-test
– Reduces unnecessary system metrics that are unrelated to application performance
> BE stepwise regression method
Identifies system metrics that are:
– Necessary (high reduction degree value)
– Sufficient to capture application behavior (higher R2 value than the other strategies)
Contact
Lingyun Yang: [email protected]
Jennifer M. Schopf: [email protected]
Catalin L. Dumitrescu: [email protected]
Ian Foster: [email protected]