RayWicks(IBMUSA) Trending HO

7/29/2019 RayWicks(IBMUSA) Trending HO

1/12

8/5/200

Trending CMG Brazil (c) Ray Wicks

2008

IBM 2008

Predictive Statistics (Trending)

a Tutorial

CMG Brazil

Ray Wicks

561-236-5846

[email protected]

[email protected]


Trade Marks, Copyrights & Stuff

Some foils that appear in this presentation are not in the handout. This is to prevent you from looking ahead and spoiling my jokes and surprises.

This presentation is copyright by Ray Wicks 2008.

Many terms are trademarks of different companies and are owned by them.

This session is sponsored by

Abstract

Predictive Statistics (Trending) A Tutorial

This session reviews some of the trending techniques that can be useful in capacity planning. The basic statistical concept of regression analysis will be introduced and examined, and simple linear regression analysis will be shown.

How Accurate Is It?

[Figure: prediction plotted against time, starting at t0]

Starting from an initial point of maybe dubious accuracy, we apply a growth rate (also dubious) and then recommend actions costing lots of money.

Accuracy

[Figure: two panels of prediction plotted against time, starting at t0]

Accuracy is found in values that are close to the expected curve. This closeness implies an expected bound or variation in reality. So a thicker line makes sense.

How Accurate Is It?

[Figure: two panels of prediction plotted against time from t0, each marking a prediction p at time t]

At time t, is the prediction a precise point p or a fuzzy patch?

Statistical Discourse

Blah, blah, blah

[Figure: standard normal density for X from -4 to 4, plotted in Excel with =NORMDIST(x,0,1,0); the slide carries the labels Perceptual Structure and Conceptual Structure]

A Conversation

You: The answer is 42.67.

Them: I measured it and the answer is 42.663!

You: Give me a break.

Them: I just want to be exact.

You: OK the answer is around 42.67.

Them: How far around?

You: ????



Confidence Interval or How Thick is the Line?

$P[\mu - 2\sigma < X < \mu + 2\sigma] = 0.954$

$P[\mu - 1.96\sigma < X < \mu + 1.96\sigma] = 0.95$ or 95%

$[L, U]$ is called the $100(1-\alpha)\%$ confidence interval.

$1-\alpha$ is called the level of confidence associated with $[L, U]$.

[Figure: standard normal density, =NORMDIST(x,0,1,0), for X from -4 to 4 with the tail cutoffs marked at $z_{\alpha/2}$, alongside a plot of prediction against time from t0]

Confidence Interval

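$[\bar{x} - 1.96\,\sigma/\sqrt{n},\ \bar{x} + 1.96\,\sigma/\sqrt{n}]$

$[\bar{x} - z_{\alpha/2}\,\sigma/\sqrt{n},\ \bar{x} + z_{\alpha/2}\,\sigma/\sqrt{n}]$

Using a Standard Normal Probability table, 95% confidence (2 tail) is found by looking for the z score that leaves a tail probability of 0.025, which is z = 1.96.

In Excel: =CONFIDENCE(α, σ, n)

=CONFIDENCE(0.05,1,100) = 0.196 (that is, 1.96 × 1/√100)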
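The half-width of the thick line can also be computed outside Excel. The following is a minimal sketch in Python, assuming a known standard deviation and a normal distribution; the function names are illustrative and not part of the original slides. The argument order follows Excel's CONFIDENCE(alpha, sigma, n).

```python
from math import sqrt
from statistics import NormalDist


def confidence_half_width(alpha: float, sigma: float, n: int) -> float:
    """Half-width of the 100(1 - alpha)% confidence interval for a mean."""
    # z score that leaves alpha/2 in the upper tail of the standard normal
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sigma / sqrt(n)


def confidence_interval(mean: float, alpha: float, sigma: float, n: int) -> tuple[float, float]:
    """Lower and upper bounds [L, U] around a sample mean."""
    half = confidence_half_width(alpha, sigma, n)
    return mean - half, mean + half


if __name__ == "__main__":
    # alpha = 0.05, sigma = 1, n = 100 gives a half-width near 0.196
    print(round(confidence_half_width(0.05, 1.0, 100), 3))
```

With alpha = 0.05 the z score is about 1.96, so the half-width is 1.96 times sigma divided by the square root of n. Quadrupling the number of observations halves the thickness of the line.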


Summary Statistics

Given a list of numbers X = {Xi}, i = 1 to n:

Term                                 Formula                  Excel                   PS View
Count (number of items)              n                        =COUNT(X)               Number of points plotted
Average                              X̄ = Sum(X)/n             =AVERAGE(X)             Center of gravity
Median                               X[ROUND DOWN 1+n*0.5]    =MEDIAN(X)              Middle number
Variance                             V = Σ(Xi - X̄)²/n         =VAR(X)                 Spread of data
Standard Deviation                   s = SQRT(V)              =STDEV(X)               Spread of data
Coefficient of Variation (Std/Avg)   CV = s/X̄                                         Spread of data around average
Minimum                              First in sorted list     =MIN(X)                 Bottom of plot
Maximum                              Last in sorted list      =MAX(X)                 Top of plot
Range                                [Minimum, Maximum]                               Distance between top and bottom
90th percentile                      X[ROUND DOWN 1+n*0.9]    =PERCENTILE(X,0.9)      10% from the top
Confidence interval                  Look in book             =CONFIDENCE(0.05,s,n)   Expected variability of average (a thick line)

Percentile formulae assume a sorted list, low to high.

Excel's =VAR(X) and =STDEV(X) divide by n-1; =VARP(X) and =STDEVP(X) divide by n, as in the Formula column.
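The formulas in the table above can be sketched in a few lines of Python. This is an illustration, not the author's code: the function name `summarize` and the dictionary keys are assumptions, the variance uses the divide-by-n form from the Formula column, and the 90th percentile uses the table's rank rule, so it can differ from Excel's PERCENTILE, which interpolates.

```python
import math
import statistics


def summarize(xs: list[float]) -> dict[str, float]:
    """Summary statistics for a list of numbers, following the table's formulas."""
    n = len(xs)
    ordered = sorted(xs)                        # percentile rule assumes low-to-high order
    avg = sum(xs) / n
    var = sum((x - avg) ** 2 for x in xs) / n   # divide by n, as in the Formula column
    std = math.sqrt(var)
    # Rank rule X[ROUND DOWN 1 + n*0.9], using 1-based positions in the sorted list
    rank90 = min(n, math.floor(1 + n * 0.9))
    return {
        "count": n,
        "average": avg,
        "median": statistics.median(xs),        # averages the two middle values when n is even
        "variance": var,
        "std_dev": std,
        "cv": std / avg,
        "minimum": ordered[0],
        "maximum": ordered[-1],
        "p90": ordered[rank90 - 1],
    }


if __name__ == "__main__":
    for name, value in summarize([62.3, 64.3, 70.8, 71.1, 75.8]).items():
        print(f"{name:>9}: {value}")
```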


Linear Regression (for Trending)

[Figure: MIPS Used (0 to 1000) versus Week (0 to 200) with linear trendline y = 3.0504x + 385.42, R² = 0.7881]

Obtain a useful fit of the data (y = mx + b) and then extend the values of X to obtain predicted values of Y. But remember, as Niels Bohr said: Prediction is very hard to do. Especially about the future.
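The fit-then-extend idea can be sketched as follows. This is a minimal least-squares example in Python; the weekly MIPS values are hypothetical and are not the data behind the chart, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b


def predict(m, b, x):
    """Extend the fitted line to a new value of x."""
    return m * x + b


if __name__ == "__main__":
    weeks = [0, 10, 20, 30, 40]          # hypothetical history
    mips = [385, 420, 445, 478, 507]     # hypothetical MIPS used
    m, b = fit_line(weeks, mips)
    # Extrapolate one year past the start of the history
    print(f"y = {m:.4f}x + {b:.2f}; week 52 -> {predict(m, b, 52):.1f}")
```

The extrapolated value is only as good as the assumption that week 52 behaves like weeks 0 to 40.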



Trending Assumptions & Questions

The future will be like the past.

How much history is too much?

You should look at era segments.

Shape and scale of graph can be interesting.

You may need more than numbers... The business and technical environment?

Be smart and lazy.

What questions are you answering?

[Figure: CPU% (0 to 80) versus Week (0 to 150)]

Reality

[Figure: MIPS Used (0 to 1800) versus Week (0 to 200) with linear trendline y = 3.0504x + 385.42, R² = 0.7881]

Linear regression predictions assume that the future looks like the past.

Coding Implementation: The Butterfly Effect

Algorithm 1:

$X_{n+1} = s X_n$ if $X_n < 0.5$

$X_{n+1} = s (1 - X_n)$ otherwise

In Excel: cell Xn+1 is =IF(Xn
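Algorithm 1 can be iterated in a few lines to see the butterfly effect. The sketch below is in Python rather than Excel; the parameter value s = 1.9 and the starting point 0.3 are assumptions chosen for illustration, and the function names are not from the slides.

```python
def tent_step(x: float, s: float) -> float:
    """One iteration of Algorithm 1."""
    return s * x if x < 0.5 else s * (1.0 - x)


def trajectory(x0: float, s: float, steps: int) -> list[float]:
    """Starting value followed by `steps` iterations of Algorithm 1."""
    xs = [x0]
    for _ in range(steps):
        xs.append(tent_step(xs[-1], s))
    return xs


if __name__ == "__main__":
    s = 1.9                                  # assumed parameter, 0 < s <= 2
    a = trajectory(0.3, s, 60)
    b = trajectory(0.3 + 1e-10, s, 60)       # same start, nudged in the 10th decimal
    for n in (0, 10, 20, 30, 40):
        print(n, abs(a[n] - b[n]))
```

The gap between the two runs grows by roughly a factor of s at each step, so a difference far below measurement accuracy becomes visible within a few dozen iterations. The same applies to rounding differences between two codings of the same formula.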



Excel Help

Searching Excel Help for "R Squared" returns:

RSQ: Returns the square of the Pearson product moment correlation coefficient through data points in known_y's and known_x's. For more information, see PEARSON. The r-squared value can be interpreted as the proportion of the variance in y attributable to the variance in x.

Correlation

[Figure: scatter plot of DASD I/O Rate (0 to 7000) versus CPU% (0 to 100)]

$\text{Correlation} = \mathrm{COV}(X,Y)/(\sigma_x \sigma_y) = \sigma_{xy}^2/(\sigma_x \sigma_y) = E[(x-\mu_x)(y-\mu_y)]/(\sigma_x \sigma_y)$

Correlation lies in [-1, 1].

=CORREL(CPU%,DASDIO) = 0.86
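The covariance definition translates directly into code. A minimal Python sketch follows, with illustrative function names; squaring the result gives the quantity Excel reports as RSQ for a single x variable. The sample data is the small Units of Work and CPU% set used elsewhere in this handout.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation: covariance divided by the two standard deviations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)


def r_squared(xs, ys):
    """Square of the correlation coefficient."""
    return correlation(xs, ys) ** 2


if __name__ == "__main__":
    units = [1.3, 1.4, 1.45, 1.5, 1.6]
    cpu = [62.3, 64.3, 70.8, 71.1, 75.8]
    print(round(correlation(units, cpu), 4), round(r_squared(units, cpu), 4))
```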


Briefly: Correlation is not Causality

Cause → Effect (sufficient cause)

~Effect → ~Cause (necessary cause)

R² or CORR(C,E) may indicate a linear relationship without there being a causal connection.

In cities of various sizes:

C = number of TVs is highly correlated with E = number of murders.

C = religious events is highly correlated with E = number of suicides.

Causality & Correlation

Claim: Eating Cheerios will lower your cholesterol

Cause → Effect

Cause: Eating Cheerios

Effect: Lower Cholesterol

Test: Real cause

Intervening Variable

Cheerios Lower Cholesterol

Bacon & Eggs Cholesterol

Bacon & Eggs Lower Cholesterol

There is a correlation between eating Cheerios and lower cholesterol, but is there a causal relationship?

X



Matrix Solution for Linear Fit

$B = (M^T M)^{-1} M^T Y$

Solve for Y = B0 + B1*X

M is 5x2 (a column of ones and the X column). YH is the fitted value and YA is the average of Y.

ones   X      Y      YH       Sq(YH-YA)     Sq(Y-YA)
1      1.3    62.3   61.765   50.339025     43.0336
1      1.4    64.3   66.495   5.593225      20.7936
1      1.45   70.8   68.86    5.7678E-24    3.7636
1      1.5    71.1   71.225   5.593225      5.0176
1      1.6    75.8   75.955   50.339025     48.1636

Avg Y = 68.86

R2 = 0.9262   =(SUM(F3:F7)/SUM(G3:G7))

MT is 2x5 (array formulas are entered with ctl-shift-enter):

1     1     1      1     1
1.3   1.4   1.45   1.5   1.6

MT*M is 2x2:

5      7.25
7.25   10.563

INV(MTM) is 2x2:

42.25   -29
-29     20

INV(MTM)*MT is 2x5:

4.55   1.65   0.2   -1.25   -4.15
-3     -1     0     1       3

INV(MTM)*MT*Y is 2x1:

0.275   B0
47.3    B1
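The spreadsheet steps above can be checked with a short script. This Python sketch works the same matrix formula by hand for the two-coefficient case, so the 2x2 inverse is written out explicitly; the function names are illustrative, and a general solution would use a matrix library instead.

```python
def normal_equations_fit(xs, ys):
    """Solve B = (M^T M)^-1 M^T Y for the model Y = B0 + B1*X."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    # M has a column of ones and a column of x values, so M^T M is 2x2
    mtm = [[n, sx], [sx, sxx]]
    det = mtm[0][0] * mtm[1][1] - mtm[0][1] * mtm[1][0]
    inv = [[mtm[1][1] / det, -mtm[0][1] / det],
           [-mtm[1][0] / det, mtm[0][0] / det]]
    # M^T Y is 2x1
    mty = [sum(ys), sum(x * y for x, y in zip(xs, ys))]
    b0 = inv[0][0] * mty[0] + inv[0][1] * mty[1]
    b1 = inv[1][0] * mty[0] + inv[1][1] * mty[1]
    return b0, b1


def r_squared(xs, ys, b0, b1):
    """Sum of Sq(YH-YA) divided by sum of Sq(Y-YA), as in the spreadsheet."""
    ya = sum(ys) / len(ys)
    yh = [b0 + b1 * x for x in xs]
    return sum((h - ya) ** 2 for h in yh) / sum((y - ya) ** 2 for y in ys)


if __name__ == "__main__":
    xs = [1.3, 1.4, 1.45, 1.5, 1.6]
    ys = [62.3, 64.3, 70.8, 71.1, 75.8]
    b0, b1 = normal_equations_fit(xs, ys)
    print(round(b0, 3), round(b1, 3), round(r_squared(xs, ys, b0, b1), 4))
```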


Excel Solution

[Figure: CPU% (50 to 80) versus Units of Work (1.2 to 1.7) with linear trendline y = 47.3x + 0.275, R² = 0.9262]

Impact of Outlier

[Figure: CPU% (50 to 100) versus Units of Work (1.2 to 1.7) with an outlier in the data; the linear trendline becomes y = -50.8x + 149.06, R² = 0.2358]

A perfect fit is always possible

[Figure: CPU% (50 to 80) versus Units of Work (1.2 to 1.65) with polynomial trendline y = 58111x^4 - 338194x^3 + 736689x^2 - 711801x + 257442, R² = 1]

Albeit meaningless in this case.
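Five points can always be matched exactly by a fourth-degree polynomial, which is why R² reaches 1. The Python sketch below uses Lagrange interpolation as a stand-in for Excel's polynomial trendline (an assumption; with five points and degree four both describe the same curve) and then evaluates the curve outside the data to show why the fit is meaningless for prediction.

```python
def lagrange_interpolate(xs, ys, x):
    """Value at x of the unique polynomial of degree len(xs)-1 through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


if __name__ == "__main__":
    units = [1.3, 1.4, 1.45, 1.5, 1.6]
    cpu = [62.3, 64.3, 70.8, 71.1, 75.8]
    # Inside the data the curve passes through every observation
    for x, y in zip(units, cpu):
        print(x, y, lagrange_interpolate(units, cpu, x))
    # Outside the data the "perfect" curve predicts an impossible CPU%
    print(2.0, lagrange_interpolate(units, cpu, 2.0))
```

At 2.0 units of work the curve returns a CPU% in the thousands, while the straight-line fit through the same points stays near 95.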



Confidence of Fit.

[Figure: CPU% (50 to 85) versus Units of Work (1.2 to 1.7) with linear trendline y = 47.3x + 0.275, R² = 0.9262, and lower bound (LB) and upper bound (UB) curves around the fit]

SAS


Analyze -> Linear Regression


Run

Root MSE          1.72313    R-Square   0.9262
Dependent Mean   68.86000    Adj R-Sq   0.9017
Coeff Var         2.50236

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1              0.27500         11.20033      0.02     0.9820
X           X            1             47.30000          7.70606      6.14     0.0087



Results


Residuals

For each Xi, plot the residual e = Yi - YHi (observed value minus fitted value).

[Figure: Residual (-20 to 10) versus Units of Work (0 to 900)]

Look for a random distribution around 0.
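Residuals are simple to compute once a line has been fitted. The Python sketch below uses the coefficients 47.3 and 0.275 from the small Units of Work example in this handout. The run count is a crude, assumed check for a pattern and is not from the original slides: long stretches of same-sign residuals (few runs) suggest the scatter around 0 is not random.

```python
def residuals(xs, ys, slope, intercept):
    """Observed value minus fitted value for each point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def sign_runs(res):
    """Number of runs of consecutive residuals that share the same sign."""
    signs = [r >= 0 for r in res]
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)


if __name__ == "__main__":
    units = [1.3, 1.4, 1.45, 1.5, 1.6]
    cpu = [62.3, 64.3, 70.8, 71.1, 75.8]
    res = residuals(units, cpu, 47.3, 0.275)
    print([round(r, 3) for r in res], "runs:", sign_runs(res))
```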


Interesting Case

[Figure: CPU% (0 to 40) versus Blocks (0 to 800) with linear trendline y = 0.0335x, R² = 0.8569]

Notice the points are below the line until >600. Typical of DB/DC. Means less efficient as the load increases? The residuals have a pattern. That usually means a second level effect.

Regression other than Linear

[Figure: CPU% (0 to 40) versus Blocks (0 to 800) with exponential trendline y = 1.234e^(0.0043x), R² = 0.9457]

Exponential fit is useful when computing compound growth
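One common way to obtain an exponential fit is to run a linear regression on the logarithm of y. The Python sketch below takes that approach; it is an assumption that this matches how any particular tool computes its exponential trendline, and the function names are illustrative. The exponent b converts to a compound growth rate per unit of x as e^b - 1.

```python
from math import exp, log


def fit_exponential(xs, ys):
    """Fit y = a * exp(b*x) by least squares on ln(y); every y must be positive."""
    n = len(xs)
    ln_ys = [log(y) for y in ys]
    x_bar = sum(xs) / n
    l_bar = sum(ln_ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxl = sum((x - x_bar) * (l - l_bar) for x, l in zip(xs, ln_ys))
    b = sxl / sxx
    a = exp(l_bar - b * x_bar)
    return a, b


def growth_per_unit(b):
    """Compound growth rate per unit of x implied by exponent b."""
    return exp(b) - 1


if __name__ == "__main__":
    # Synthetic points generated from the curve shown on the slide
    xs = [0, 100, 200, 400, 800]
    ys = [1.234 * exp(0.0043 * x) for x in xs]
    a, b = fit_exponential(xs, ys)
    print(round(a, 3), round(b, 4), f"{growth_per_unit(b):.3%} per block")
```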



Perceptual to Conceptual Dissonance?

[Figure: weekly time series from 05/21/04 to 11/05/04, vertical scale 0 to 0.9, with linear trendline y = -0.0002x + 8.2996, R² = 0.4388]

(PS: It's a line)

(CS: Not a good line)

Perceptual to Conceptual Dissonance

[Figure: weekly time series from 05/21/04 to 11/05/04, vertical scale 0.74 to 0.84, with linear trendline y = -0.0002x + 8.2996, R² = 0.4388]

(CS: Variability is scale independent)

(PS: Visual variability is scale dependent)

PS to CS Dissonance

[Figure: weekly time series from 05/21/04 to 11/05/04, vertical scale 0.72 to 0.84, with cubic polynomial trendline y = -6E-08x^3 + 0.0063x^2 - 241.55x + 3E+06, R² = 0.7817]

(CS: fit looks good)

(PS: Polynomial fit looks good)

???

[Figure: time series with the horizontal axis extended from 05/21/04 to 03/25/05, vertical scale 0 to 0.9]

In 144 Days, the $ will be worthless.



Regression Analysis is not a Crystal Ball

[Figure: time series from 1/18/07 to 7/17/07, vertical scale 1.28 to 1.37]

Philosophical Remark

In reaching a conclusion, we negotiate between the potential perceptual structures and the potential conceptual structures and memory events.

[Figure: diagram labeled Sensation, Context (Lights Up) and Negotiation, with a chart (vertical scale 0.74 to 0.84) and trendline y = -0.0002x + 8.2996, R² = 0.4388]

Model Building: Which is Best?

X1   X2   X3   X4   Y
7    26   6    60   78.5
1    29   15   52   74.3
11   56   8    20   104.3
11   31   8    47   87.6
7    52   6    33   95.9
11   55   9    22   109.2
3    71   17   6    102.7
1    31   22   44   72.5
2    54   18   22   93.1
21   47   4    26   115.9
1    40   23   34   83.8
11   66   9    12   113.3
10   68   8    12   109.4

Stepwise procedure to find the best combination of variables.

Y = b + a1X1

Y = b + a1X1 + a2X2

Y = b + a2X2 + a3X3

Y = b + a1X1 + a2X2 + a3X3 + a4X4

Using Hald data from Draper

Stepwise Results

Stepwise Analysis: Table of Results for General Stepwise

X4 entered.

             df   SS            MS            F             Significance F
Regression   1    1831.89616    1831.89616    22.7985202    0.000576232
Residual     11   883.8669169   80.3515379
Total        12   2715.763077

            Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   117.5679312    5.262206511      22.34194552    1.62424E-10   105.9858927    129.1499696
X4          -0.738161808   0.154595996      -4.774779597   0.000576232   -1.078425302   -0.397898315

X1 entered.

             df   SS            MS            F             Significance F
Regression   2    2641.000965   1320.500482   176.6269631   1.58106E-08
Residual     10   74.76211216   7.476211216
Total        12   2715.763077

            Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   103.0973816    2.123983606      48.53963154    3.32434E-13   98.36485126    107.829912
X4          -0.613953628   0.048644552      -12.62122063   1.81489E-07   -0.722340445   -0.505566811
X1          1.439958285    0.13841664       10.40307211    1.10528E-06   1.131546793    1.748369777

No other variables could be entered into the model. Stepwise ends.

Using Add-In from Levine
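The first step of the stepwise procedure can be reproduced by hand: fit each variable alone and enter the one that explains the most variance. The Python sketch below does only that first step on the Hald data; the entry and removal criteria of the add-in are not given here, so this is an illustration under that simplifying assumption, with illustrative function names.

```python
HALD = {
    "X1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "X2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
    "X3": [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8],
    "X4": [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12],
}
Y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4]


def single_variable_r2(xs, ys):
    """R squared of the simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def first_entry(candidates, ys):
    """Name of the variable with the highest single-variable R squared, plus all scores."""
    scores = {name: single_variable_r2(xs, ys) for name, xs in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best, scores = first_entry(HALD, Y)
    for name, r2 in scores.items():
        print(name, round(r2, 4))
    print("first variable entered:", best)
```

The score for X4 equals the regression sum of squares divided by the total sum of squares in the first table, 1831.89616 / 2715.763077.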



Looking for I/O = F(MIPS). Don't give up too quickly

[Figure: I/O (0 to 16000) versus MIPS (1500 to 4500) with linear trendline y = 2.4545x, R² = 0.3726. Y intercept forced to 0.]

Look at ratio in time

[Figure: IO/MIPS ratio (0 to 5) by hour of day, 0:00 to 23:00]

Trending: What to Do?

[Figure: Average In & Ready, 90th %ile series, vertical scale 0 to 40, horizontal scale 0 to 400]

Options?

[Figure: Average In & Ready, 90th %ile series (vertical scale 0 to 45, horizontal scale 0 to 500) with a linear trendline and an exponential trendline y = 7.2692e^(0.0042x), R² = 0.6615]

How About A Polynomial?

[Figure: Average In & Ready, 90th %ile series (vertical scale 0 to 100, horizontal scale 0 to 500) with a polynomial trendline]

A polynomial can be made to fit about any wandering data within the bounds of the data [min,max]. Beyond the bounds, any prediction is suspect.

$Y = b_0 + b_1 X + b_2 X^2 + b_3 X^3 + \dots + b_n X^n$

Time Series

A time series is a sequence of observations which are ordered in time (or space). If observations are made on some phenomenon throughout time, it is most sensible to display the data in the order in which they arose, particularly since successive observations will probably be dependent. Time series are best displayed in a scatter plot. The series value X is plotted on the vertical axis and time t on the horizontal axis. Time is called the independent variable (in this case however, something over which you have little control).

There are two kinds of time series data:

1. Continuous, where we have an observation at every instant of time, e.g. lie detectors, electrocardiograms. We denote this using observation X at time t, X(t).

2. Discrete, where we have an observation at (usually regularly) spaced intervals. We denote this as Xt.

See http://www.cas.lancs.ac.uk/glossary_v1.1/tsd.html#timeseries

Bibliography

Applied Regression Analysis, Draper & Smith, Wiley. Definitive source for regression analysis. Highly technical.

Statistical Concepts and Methods, Bhattacharyya & Johnson, Wiley, 1977. This has both a discussion of meaning and the formulae.

Applied Statistics for Engineers and Scientists, Levine, Ramsey & Smidt, Prentice Hall, 2001. This has a good approach to statistics and Excel implementations. CD comes with the book which has some powerful Excel Add-ins.

The Art of Computer Systems Performance Analysis, by Raj Jain, Wiley. I like this one. For performance analysis and capacity planning, it is thorough and complete. A very good reference. It may be hard to find.

Chaos Under Control, by Peak & Frame, Freeman & Co.

http://www.itl.nist.gov/div898/handbook/pmc/pmc.htm is a good web site to explore statistics.
