
    Predictive Statistics (Trending)

    A Tutorial

    CMG Brazil

    Ray Wicks

    561-236-5846

    [email protected]

    [email protected]


    Trade Marks, Copyrights & Stuff

    Some foils that appear in this presentation are not in the handout. This is to prevent you from looking ahead and spoiling my jokes and surprises.

    This presentation is copyright by Ray Wicks 2008.

    Many terms are trademarks of different companies and are owned by them.

    This session is sponsored by


    Abstract

    Predictive Statistics (Trending) A Tutorial

    This session reviews some of the trending techniques which can be useful in capacity planning. The basic statistical concept of regression analysis will be examined, and simple linear regression will be shown.

    This session is sponsored by


    How Accurate Is It?

    [Chart: Prediction vs. Time, starting at t0]

    Starting from an initial point of maybe dubious accuracy, we apply a growth rate (also dubious) and then recommend actions costing lots of money.


    Accuracy

    [Chart: Prediction vs. Time, starting at t0]

    Accuracy is found in values that are close to the expected curve. This closeness implies an expected bound or variation in reality. So a thicker line makes sense.

    How Accurate Is It?

    [Charts: Prediction vs. Time, with the prediction at time t marked as a point p]

    At time t, is the prediction a precise point p or a fuzzy patch?


    Statistical Discourse

    Blah, blah, blah

    [Chart: standard normal density, =NORMDIST(x,0,1,0), for x from -4 to 4]

    Perceptual Structure

    Conceptual Structure


    A Conversation

    You: The answer is 42.67.

    Them: I measured it and the answer is 42.663!

    You: Give me a break.

    Them: I just want to be exact.

    You: OK the answer is around 42.67.

    Them: How far around?

    You: ????


    Confidence Interval, or How Thick Is the Line?

    P[m - 2s < X < m + 2s] = 0.954

    P[m - 1.96s < X < m + 1.96s] = 0.95 or 95%

    [L,U] is called the 100(1-α)% confidence interval.

    1-α is called the level of confidence associated with [L,U].

    [Charts: standard normal density, =NORMDIST(x,0,1,0), with the z(α/2) tail cut-offs marked; and the Prediction vs. Time chart from t0]


    Confidence Interval

    [x̄ - 1.96·σ/√n , x̄ + 1.96·σ/√n]

    [x̄ - z(α/2)·σ/√n , x̄ + z(α/2)·σ/√n]

    Using a standard normal probability table, the 95% confidence (two-tail) value is found by looking for a tail area of 0.025, which gives z = 1.96.

    In Excel: =CONFIDENCE(α, σ, n)

    =CONFIDENCE(0.05, 1, 100) = 0.196 (that is, 1.96·σ/√n)
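    As a cross-check outside Excel, here is a minimal Python sketch (not part of the original handout) of the same margin, z(α/2)·σ/√n, that =CONFIDENCE(α, σ, n) returns; the example values are the ones used above.

from statistics import NormalDist

def confidence_margin(alpha, sigma, n):
    # Half-width of the 100*(1 - alpha)% confidence interval for a sample mean:
    # z(alpha/2) * sigma / sqrt(n), the same quantity Excel's CONFIDENCE returns.
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    return z * sigma / n ** 0.5

print(confidence_margin(0.05, 1, 100))        # ~0.196, so the interval is xbar +/- 0.196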


    Summary

    Given a list of numbers X = {Xi}, i = 1 to n

    Statistics
    Term | Formula | Excel | PS View
    Count (number of items) | n | =COUNT(X) | Number of points plotted
    Average | X̄ = Sum(X)/n | =AVERAGE(X) | Center of gravity
    Median | X[round down 1 + n*0.5] | =MEDIAN(X) | Middle number
    Variance | V = Sum((Xi - X̄)²)/n | =VAR(X) | Spread of data
    Standard Deviation | s = SQRT(V) | =STDEV(X) | Spread of data
    Coefficient of Variation (Std/Avg) | CV = s/X̄ | | Spread of data around average
    Minimum | First in sorted list | =MIN(X) | Bottom of plot
    Maximum | Last in sorted list | =MAX(X) | Top of plot
    Range | [Minimum, Maximum] | | Distance between top and bottom
    90th percentile | X[round down 1 + n*0.9] | =PERCENTILE(X, 0.9) | 10% from the top
    Confidence interval | Look in book | =CONFIDENCE(0.05, s, n) | Expected variability of the average (a thick line)

    Note: Percentile formulae assume a sorted list, low to high.
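    The same summary statistics can be computed outside a spreadsheet. Below is a minimal Python sketch (not from the handout; the sample values are made up) following the table's formulas, including its round-down percentile rule.

import statistics

X = [12.0, 15.5, 14.2, 18.9, 11.7, 16.3, 13.8, 17.4, 12.9, 15.1, 19.6, 14.7]   # hypothetical data

n   = len(X)                          # Count
avg = statistics.mean(X)              # Average, Sum(X)/n
med = statistics.median(X)            # Median
var = statistics.pvariance(X)         # Variance, Sum((Xi - avg)^2)/n
std = statistics.pstdev(X)            # Standard deviation, sqrt(V)
cv  = std / avg                       # Coefficient of variation
lo, hi = min(X), max(X)               # Minimum / Maximum, i.e. the Range end points

S   = sorted(X)                       # percentile formulae assume a sorted list
p90 = S[int(1 + n * 0.9) - 1]         # 90th percentile: X[round down 1 + n*0.9], 1-based

print(n, avg, med, var, std, cv, lo, hi, p90)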


    Linear Regression (for Trending)

    y = 3.0504x + 385.42, R² = 0.7881

    [Chart: MIPS Used vs. Week with the fitted trend line]

    Obtain a useful fit of the data (y = mx + b) and then extend the values of X to obtain predicted values of Y. But remember, as Niels Bohr said: Prediction is very hard to do, especially about the future.
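    A minimal Python sketch of the same idea (not the handout's spreadsheet; the weekly MIPS numbers below are simulated): fit y = mx + b to the history, then extend X to obtain predicted values of Y.

import numpy as np

rng = np.random.default_rng(1)
weeks = np.arange(1, 157)                                        # three years of weekly history
mips  = 385.42 + 3.05 * weeks + rng.normal(0, 60, weeks.size)    # simulated "MIPS Used"

m, b = np.polyfit(weeks, mips, 1)                                # least-squares slope and intercept
r2   = np.corrcoef(weeks, mips)[0, 1] ** 2
print(f"y = {m:.4f}x + {b:.2f}, R^2 = {r2:.4f}")

future   = np.arange(157, 209)                                   # extend X one more year
forecast = m * future + b                                        # valid only if the future is like the past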


    Trending Assumptions & Questions

    The future will be like the past. How much history is too much? You should look at Era segments. Shape and scale of the graph can be interesting. You may need more than numbers: what about the business and technical environment? Be smart and lazy. What questions are you answering?

    [Chart: CPU% vs. Week, weeks 0 to 150]


    Reality

    y = 3.0504x + 385.42, R² = 0.7881

    [Chart: MIPS Used vs. Week]

    Linear regression predictions assume that the future looks like the past.


    Coding Implementation: The Butterfly Effect

    Algorithm 1: Xn+1 = s·Xn if Xn < 0.5

    Xn+1 = s·(1 - Xn) otherwise

    In Excel: cell Xn+1 is =IF(Xn < 0.5, s*Xn, s*(1 - Xn))
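    A minimal Python sketch of Algorithm 1 (not in the original handout) shows the butterfly effect: two starting values that differ by one part in a million diverge completely after a few dozen iterations, so a tiny error in the input eventually dominates the output.

def step(x, s=1.9999):
    # Algorithm 1: the tent map
    return s * x if x < 0.5 else s * (1 - x)

a, b = 0.400000, 0.400001          # nearly identical starting points
for i in range(1, 61):
    a, b = step(a), step(b)
    if i % 15 == 0:
        print(f"iteration {i:2d}: a={a:.6f}  b={b:.6f}  |a-b|={abs(a - b):.2e}")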


    Excel Help

    Searching Excel Help for R Squared returns:

    RSQ: Returns the square of the Pearson product moment correlation coefficient through data points in known_y's and known_x's. For more information, see PEARSON. The r-squared value can be interpreted as the proportion of the variance in y attributable to the variance in x.


    Correlation

    [Chart: DASD I/O Rate vs. CPU%]

    Correlation = COV(X,Y) / (σx·σy) = σxy / (σx·σy) = E[(x - μx)(y - μy)] / (σx·σy)

    Correlation is in [-1, 1]. In Excel: =CORREL(CPU%, DASDIO) = 0.86
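    A minimal Python sketch of the same calculation (hypothetical CPU% and DASD I/O values, not the chart's data): the covariance divided by the two standard deviations, which matches Excel's =CORREL.

import numpy as np

cpu     = np.array([20, 35, 41, 55, 62, 70, 83, 90], dtype=float)       # hypothetical CPU%
dasd_io = np.array([900, 1500, 2100, 2600, 3400, 4100, 5200, 6100], dtype=float)

cov  = np.mean((cpu - cpu.mean()) * (dasd_io - dasd_io.mean()))         # COV(X,Y)
corr = cov / (cpu.std() * dasd_io.std())                                # COV / (sigma_x * sigma_y)
print(corr, np.corrcoef(cpu, dasd_io)[0, 1])                            # the two values agree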


    Briefly: Correlation is not Causality

    Cause → Effect (sufficient cause); ~Effect → ~Cause (necessary cause)

    R² or CORR(C,E) may indicate a linear relationship without there being a causal connection.

    In cities of various sizes: C = number of TVs is highly correlated with E = number of murders.

    C = religious events is highly correlated with E = number of suicides.


    Causality & Correlation

    Claim: Eating Cheerios will lower your cholesterol

    Cause → Effect

    Cause: Eating Cheerios
    Effect: Lower Cholesterol
    Test: Real cause, or an intervening variable?

    [Diagram: Cheerios → Lower Cholesterol (crossed out); intervening variable Bacon & Eggs → Cholesterol; Bacon & Eggs → Lower Cholesterol]

    There is a correlation between eating Cheerios and lower cholesterol, but is there a causal relationship?


    Matrix Solution for Linear Fit

    B = (Mᵀ·M)⁻¹·Mᵀ·Y, solving for Y = B0 + B1·X

    X    Y     YH      (YH-Ȳ)²   (Y-Ȳ)²
    1.3  62.3  61.765  50.3390   43.0336
    1.4  64.3  66.495   5.5932   20.7936
    1.45 70.8  68.860   0.0000    3.7636
    1.5  71.1  71.225   5.5932    5.0176
    1.6  75.8  75.955  50.3390   48.1636

    Ȳ (Avg) = 68.86; R² = SUM((YH-Ȳ)²) / SUM((Y-Ȳ)²) = 0.9262 (in Excel: =SUM(F3:F7)/SUM(G3:G7))

    M is 5x2: a column of 1s and the X values (array formulas entered with Ctrl-Shift-Enter).
    Mᵀ is 2x5: [1 1 1 1 1; 1.3 1.4 1.45 1.5 1.6]
    Mᵀ·M is 2x2: [5 7.25; 7.25 10.563]
    (Mᵀ·M)⁻¹ is 2x2: [42.25 -29; -29 20]
    (Mᵀ·M)⁻¹·Mᵀ is 2x5: [4.55 1.65 0.2 -1.25 -4.15; -3 -1 0 1 3]
    (Mᵀ·M)⁻¹·Mᵀ·Y is 2x1: B0 = 0.275, B1 = 47.3
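    The same normal-equations solution, B = (MᵀM)⁻¹MᵀY, can be checked with a short Python sketch (not part of the handout) using the five (X, Y) points above; it reproduces B0 = 0.275, B1 = 47.3 and R² ≈ 0.9262.

import numpy as np

X = np.array([1.3, 1.4, 1.45, 1.5, 1.6])
Y = np.array([62.3, 64.3, 70.8, 71.1, 75.8])

M = np.column_stack([np.ones_like(X), X])     # 5x2 design matrix: a column of 1s, then X
B = np.linalg.inv(M.T @ M) @ M.T @ Y          # B = (Mt*M)^-1 * Mt * Y
yhat = M @ B
r2 = np.sum((yhat - Y.mean()) ** 2) / np.sum((Y - Y.mean()) ** 2)
print(B, r2)                                  # ~[0.275, 47.3], 0.9262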


    Excel Solution

    y = 47.3x + 0.275, R² = 0.9262

    [Chart: CPU% vs. Units of Work with the fitted line]


    Impact of Outlier

    y = -50.8x + 149.06, R² = 0.2358

    [Chart: CPU% vs. Units of Work with an outlier and the re-fitted line]


    A perfect fit is always possible

    y = 58111x⁴ - 338194x³ + 736689x² - 711801x + 257442, R² = 1

    [Chart: CPU% vs. Units of Work with the 4th-order polynomial fit]

    Albeit meaningless in this case.
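    A short Python sketch of the point (not from the handout): with n points, a polynomial of degree n-1 always passes through every point, so R² = 1 regardless of whether the model means anything. It reuses the five (X, Y) points from the matrix-solution slide; the chart's own coefficients may come from different data.

import numpy as np

X = np.array([1.3, 1.4, 1.45, 1.5, 1.6])
Y = np.array([62.3, 64.3, 70.8, 71.1, 75.8])

coeffs = np.polyfit(X, Y, 4)                  # degree 4 through 5 points: exact interpolation
yhat   = np.polyval(coeffs, X)
r2 = 1 - np.sum((Y - yhat) ** 2) / np.sum((Y - Y.mean()) ** 2)
print(round(r2, 6))                           # 1.0 -- a "perfect", but meaningless, fit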


    Confidence of Fit

    y = 47.3x + 0.275, R² = 0.9262

    [Chart: CPU% vs. Units of Work showing the fitted line (Linear CPU%) with lower and upper confidence bounds (LB, UB)]


    SAS


    Analyze -> Linear Regression


    Run

    Root MSE = 1.72313        R-Square = 0.9262
    Dependent Mean = 68.86000     Adj R-Sq = 0.9017
    Coeff Var = 2.50236

    Parameter Estimates
    Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
    Intercept  Intercept   1   0.27500            11.20033         0.02     0.9820
    X          X           1  47.30000             7.70606         6.14     0.0087


    Results


    Residuals

    For each Xi, plot the residual ei = Yi - Ŷi (observed minus fitted).

    [Chart: Residual vs. Units of Work]

    Look for a random distribution around 0.
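    A minimal Python sketch of the residual check (simulated data, not the chart's): fit the line, compute ei = Yi - Ŷi for each Xi, and plot the residuals against X looking for random scatter around 0.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = np.linspace(50, 850, 40)                   # hypothetical Units of Work
y = 0.08 * x + 5 + rng.normal(0, 3, x.size)    # hypothetical measurements

m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)                    # e_i = Y_i - Yhat_i

plt.scatter(x, residuals)
plt.axhline(0, color="grey")
plt.xlabel("Units of Work")
plt.ylabel("Residual")
plt.show()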


    Interesting Case

    y = 0.0335x, R² = 0.8569

    [Chart: CPU% vs. Blocks with a fitted line through the origin]

    Notice the points are below the line until >600. Typical of DB/DC. Does this mean it is less efficient as the load increases? The residuals have a pattern; that usually means a second-level effect.


    Regression other than Linear

    y = 1.234e^(0.0043x), R² = 0.9457

    [Chart: CPU% vs. Blocks with an exponential fit]

    An exponential fit is useful when computing compound growth.
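    A minimal Python sketch of an exponential fit, y = a·e^(bx), done the usual way by fitting a straight line to ln(y) (the data below is simulated, not the chart's); b is then the continuous compound growth rate.

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(50, 750, 30)                                 # hypothetical Blocks
y = 1.2 * np.exp(0.0045 * x) * np.exp(rng.normal(0, 0.05, x.size))

b, ln_a = np.polyfit(x, np.log(y), 1)                        # straight-line fit to ln(y)
a = np.exp(ln_a)
print(f"y = {a:.3f}e^({b:.4f}x)")                            # compare: the chart's y = 1.234e^(0.0043x)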


    Perceptual to Conceptual Dissonance?

    [Chart: weekly series from 05/21/04 to 11/05/04, plotted with a y-axis from 0 to 0.9]

    (PS: It's a line)

    y = -0.0002x + 8.2996, R² = 0.4388 (CS: Not a good line)

    Perceptual to Conceptual Dissonance

    [Chart: the same weekly series re-plotted with the y-axis zoomed to 0.74-0.84]

    y = -0.0002x + 8.2996, R² = 0.4388 (CS: Variability is scale independent)

    (PS: Visual variability is scale dependent)


    PS to CS Dissonance

    y = -6E-08x³ + 0.0063x² - 241.55x + 3E+06, R² = 0.7817 (CS: fit looks good)

    [Chart: the same weekly series with the polynomial fit, y-axis 0.72-0.84]

    (PS: Polynomial fit looks good)


    ???

    [Chart: the weekly series extended from 05/21/04 out to 03/25/05, y-axis 0 to 0.9]

    In 144 Days, the $ will be worthless.


    Regression Analysis is not a Crystal Ball

    [Chart: daily values between about 1.28 and 1.37, from 1/18/07 to 7/17/07]


    Philosophical Remark

    In reaching a conclusion, we negotiate between the potential perceptual structures and the potential conceptual structures and memory events.

    [Diagram: Sensation and Context ("Lights Up") lead to Negotiation; illustrated with the y = -0.0002x + 8.2996, R² = 0.4388 chart]


    Model Building: Which is Best?

    X1 X2 X3 X4 Y

    7 26 6 60 78.5

    1 29 15 52 74.3

    11 56 8 20 104.3

    11 31 8 47 87.6

    7 52 6 33 95.9

    11 55 9 22 109.2

    3 71 17 6 102.7

    1 31 22 44 72.5

    2 54 18 22 93.1

    21 47 4 26 115.9

    1 40 23 34 83.8

    11 66 9 12 113.3

    10 68 8 12 109.4

    Stepwise procedure to find the best combination of variables:

    Y = b + a1X1
    Y = b + a1X1 + a2X2
    Y = b + a2X2 + a3X3
    Y = b + a1X1 + a2X2 + a3X3 + a4X4

    Using Hald data from Draper.


    Stepwise Results

    Stepwise Analysis

    Table of Results for General Stepwise

    X4 entered.

    df SS MS F Significance F

    Regress ion 1 1831.89616 1831.89616 22.7985202 0.000576232

    Residual 11 883.8669169 80.3515379

    Total 12 2715.763077

    Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

    Intercept 117.5679312 5.262206511 22.34194552 1.62424E-10 105.9858927 129.1499696

    X4 -0.738161808 0.154595996 -4.774779597 0.000576232 -1.078425302 -0.397898315

    X1 entered.

    df SS MS F Significance F

    Regression 2 2641.000965 1320.500482 176.6269631 1.58106E-08

    Residual 10 74.76211216 7.476211216

    Total 12 2715.763077

    Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

    Intercept 103.0973816 2.123983606 48.53963154 3.32434E-13 98.36485126 107.829912

    X4 -0.613953628 0.048644552 -12.62122063 1.81489E-07 -0.722340445 -0.505566811X1 1.439958285 0.13841664 10.40307211 1.10528E-06 1.131546793 1.748369777

    No other variables could be entered into the model. Stepwise ends.

    Using Add-In from Levine
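    Below is a minimal Python sketch of forward stepwise selection on the Hald data above. It is not the Levine add-in and uses a crude R² improvement threshold instead of an F test; with this threshold it enters X4, then X1, and then stops, matching the table above.

import numpy as np

# Hald data from the previous slide: columns X1..X4, then Y.
data = np.array([
    [ 7, 26,  6, 60,  78.5], [ 1, 29, 15, 52,  74.3], [11, 56,  8, 20, 104.3],
    [11, 31,  8, 47,  87.6], [ 7, 52,  6, 33,  95.9], [11, 55,  9, 22, 109.2],
    [ 3, 71, 17,  6, 102.7], [ 1, 31, 22, 44,  72.5], [ 2, 54, 18, 22,  93.1],
    [21, 47,  4, 26, 115.9], [ 1, 40, 23, 34,  83.8], [11, 66,  9, 12, 113.3],
    [10, 68,  8, 12, 109.4]])
X, y = data[:, :4], data[:, 4]

def r_squared(cols):
    # R^2 of the least-squares fit of y on an intercept plus the chosen columns.
    M = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    yhat = M @ np.linalg.lstsq(M, y, rcond=None)[0]
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

chosen = []
while len(chosen) < 4:
    candidates = [c for c in range(4) if c not in chosen]
    best = max(candidates, key=lambda c: r_squared(chosen + [c]))
    if r_squared(chosen + [best]) - r_squared(chosen) < 0.01:   # crude stopping rule
        break
    chosen.append(best)
    print(f"X{best + 1} entered, R^2 = {r_squared(chosen):.4f}")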


    Looking for I/O = F(MIPS). Don't give up too quickly.

    y = 2.4545x, R² = 0.3726

    [Chart: I/O vs. MIPS with a fitted line through the origin]

    Y intercept forced to 0.


    Look at the ratio over time

    [Chart: IO/MIPS ratio by hour of day, 0:00 to 23:00]


    Trending: What to Do?

    Average In & Ready

    [Chart: 90th percentile series]


    Options?

    Average In & Ready

    y = 7.2692e^(0.0042x), R² = 0.6615

    [Chart: 90th percentile series with linear and exponential fits]


    How About A Polynomial?

    Average In & Ready

    [Chart: 90th percentile series with a polynomial fit]

    A polynomial can be made to fit about any wandering data within the bounds of the data [min, max]. Beyond the bounds, any prediction is suspect.

    Y = b0 + b1X + b2X² + b3X³ + ... + bnXⁿ
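    A minimal Python sketch of the warning (the data is simulated, not the chart's): a higher-order polynomial tracks the data inside [min, max], but evaluating it even a little beyond the data can give values that are far off the underlying trend.

import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 400, 25)                        # hypothetical history
y = 7 + 0.05 * x + rng.normal(0, 2, x.size)        # roughly linear data plus noise

coeffs = np.polyfit(x, y, 6)                       # 6th-order polynomial "fits" the wander
inside  = np.polyval(coeffs, 350)                  # within [min, max]: close to the data
outside = np.polyval(coeffs, 500)                  # beyond max(x): suspect
print(inside, outside, 7 + 0.05 * 500)             # compare the extrapolation with the trend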


    Time Series

    A time series is a sequence of observations which are ordered in time (or space). If observations are made on some phenomenon throughout time, it is most sensible to display the data in the order in which they arose, particularly since successive observations will probably be dependent. Time series are best displayed in a scatter plot. The series value X is plotted on the vertical axis and time t on the horizontal axis. Time is called the independent variable (in this case, however, something over which you have little control).

    There are two kinds of time series data:
    1. Continuous, where we have an observation at every instant of time, e.g. lie detectors, electrocardiograms. We denote this using observation X at time t, X(t).
    2. Discrete, where we have observations at (usually regularly) spaced intervals. We denote this as Xt.

    See http://www.cas.lancs.ac.uk/glossary_v1.1/tsd.html#timeseries


    Bibliography

    Applied Regression Analysis, Draper & Smith, Wiley. Definitive source for regression analysis. Highly technical.

    Statistical Concepts and Methods, Bhattacharyya & Johnson, Wiley, 1977. This has both a discussion of meaning and the formulae.

    Applied Statistics for Engineers and Scientists, Levine, Ramsey & Smidt, Prentice Hall, 2001. This has a good approach to statistics and Excel implementations. A CD comes with the book which has some powerful Excel Add-ins.

    The Art of Computer Systems Performance Analysis, by Raj Jain, Wiley. I like this one. For performance analysis and capacity planning, it is thorough and complete. A very good reference. It may be hard to find.

    Chaos Under Control, by Peak & Frame, Freeman & Co.

    http://www.itl.nist.gov/div898/handbook/pmc/pmc.htm is a good web site to explore statistics.