RayWicks(IBMUSA) Trending HO
7/29/2019 RayWicks(IBMUSA) Trending HO
1/12
8/5/200
Trending CMG Brazil (c) Ray Wicks
2008
IBM 2008
Predictive Statistics (Trending)
a Tutorial
CMG Brazil
Ray Wicks
561-236-5846
IBM 2008
Trade Marks, Copyrights & Stuff
Some foils that appear in this presentation are not in the handout. This is to prevent you from looking ahead and spoiling my jokes and surprises.
This presentation is copyright by Ray Wicks 2008.
Many terms are trademarks of different companies and are owned by them.
This session is sponsored by
IBM 2008
Abstract
Predictive Statistics (Trending) A Tutorial
This session reviews some of the trending techniques which can be useful in capacity planning. The basic statistical concept of regression analysis will be introduced and examined. Simple linear regression analysis will be shown.
This session is sponsored by
IBM 2008
How Accurate Is It?
[Figure: Prediction vs. Time, starting at t0]
Starting from an initial point of maybe dubious accuracy, we apply a growth
rate (also dubious) and then recommend actions costing lots of money.
IBM 2008
Accuracy
[Figure: two Prediction vs. Time plots starting at t0]
Accuracy is found in values that are close to the expected curve. This closeness implies an expected bound or variation in reality. So a thicker line makes sense.
IBM 2008
How Accurate Is It?
[Figure: two Prediction vs. Time plots starting at t0, each marking a prediction p at time t]
At time t, is the prediction a precise point p or a fuzzy patch?
IBM 2008
Statistical Discourse
Blah, blah, blah
[Figure: standard normal density for X from -4 to 4, plotted in Excel with =NORMDIST(x,0,1,0)]
Perceptual Structure
Conceptual Structure
IBM 2008
A Conversation
You: The answer is 42.67.
Them: I measured it and the answer is 42.663!
You: Give me a break.
Them: I just want to be exact.
You: OK, the answer is around 42.67.
Them: How far around?
You: ????
IBM 2008
Confidence Interval or How Thick is the Line?
$P[\mu - 2\sigma < X < \mu + 2\sigma] = 0.954$
$P[\mu - 1.96\sigma < X < \mu + 1.96\sigma] = 0.95$ or 95%
$[L, U]$ is called the $100(1-\alpha)\%$ confidence interval.
$1-\alpha$ is called the level of confidence associated with $[L, U]$.
[Figure: standard normal density for X from -4 to 4 (=NORMDIST(x,0,1,0)) with the $z_{\alpha/2}$ cutoff marked, next to a Prediction vs. Time plot starting at t0]
IBM 2008
Confidence Interval
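$[\bar{X} - 1.96\,\sigma/\sqrt{n},\ \bar{X} + 1.96\,\sigma/\sqrt{n}]$
$[\bar{X} - z_{\alpha/2}\,\sigma/\sqrt{n},\ \bar{X} + z_{\alpha/2}\,\sigma/\sqrt{n}]$
Using a Standard Normal Probability table, 95% confidence (2 tail) is found by looking for the z score that leaves 0.025 in each tail, which is z = 1.96.
In Excel: =CONFIDENCE(alpha, sigma, n)
=CONFIDENCE(0.05,1,100) = 0.196, which is 1.96 × 1/√100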
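The half-width that Excel's CONFIDENCE returns can be sketched in a few lines of Python. This is a minimal illustration, assuming a normal distribution with known sigma; the function names are hypothetical and not part of the original handout.

```python
from math import sqrt
from statistics import NormalDist


def confidence_half_width(alpha: float, sigma: float, n: int) -> float:
    """Half-width z_{alpha/2} * sigma / sqrt(n) of a two-sided normal interval."""
    # z score that leaves alpha/2 in the upper tail (about 1.96 for alpha = 0.05)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * sigma / sqrt(n)


def confidence_interval(mean: float, alpha: float, sigma: float, n: int):
    """Return (L, U), the 100(1 - alpha)% interval around the sample mean."""
    half = confidence_half_width(alpha, sigma, n)
    return mean - half, mean + half


if __name__ == "__main__":
    print(round(confidence_half_width(0.05, 1.0, 1), 3))    # single observation
    print(round(confidence_half_width(0.05, 1.0, 100), 3))  # average of 100
    print(confidence_interval(42.67, 0.05, 1.0, 100))
```

With alpha = 0.05 and sigma = 1 the half-width is about 1.96 for a single observation and about 0.196 for an average of 100 observations, so the line gets thinner as n grows.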
IBM 2008
Summary Statistics
Given a list of numbers X = {Xi}, i = 1 to n:

Term                                Formula                             Excel                   PS View
Count (number of items)             n                                   =COUNT(X)               Number of points plotted
Average                             $\bar{X}$ = Sum(X)/n                =AVERAGE(X)             Center of gravity
Median                              X[ROUND DOWN 1+n*0.5]               =MEDIAN(X)              Middle number
Variance                            $V = \sum (X_i - \bar{X})^2 / n$    =VAR(X)                 Spread of data
Standard Deviation                  $s = \sqrt{V}$                      =STDEV(X)               Spread of data
Coefficient of Variation (Std/Avg)  $CV = s / \bar{X}$                                          Spread of data around average
Minimum                             First in sorted list                =MIN(X)                 Bottom of plot
Maximum                             Last in sorted list                 =MAX(X)                 Top of plot
Range                               [Minimum, Maximum]                                          Distance between top and bottom
90th percentile                     X[ROUND DOWN 1+n*0.9]               =PERCENTILE(X,0.9)      10% from the top
Confidence interval                 Look in book                        =CONFIDENCE(0.05,s,n)   Expected variability of average (a thick line)

Percentile formulae assume a sorted list, low to high.
Excel's =VAR(X) and =STDEV(X) divide by n-1; =VARP(X) and =STDEVP(X) divide by n, as in the Formula column.
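The formulas in the summary table can be sketched in Python as follows. This is an illustration only: the function name `summarize` and the dictionary keys are hypothetical, the variance divides by n as in the formula column, and the sample values are the five CPU% readings used in the matrix example later in this handout.

```python
import math


def summarize(values):
    """Summary statistics following the slide's formulas (variance divides by n)."""
    data = sorted(values)  # the percentile picks assume low-to-high order
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    std = math.sqrt(variance)

    def pick(p):
        # Slide rule X[ROUND DOWN 1+n*p], 1-based, clamped to the last item.
        k = min(n, math.floor(1 + n * p))
        return data[k - 1]

    return {
        "count": n,
        "average": mean,
        "median": pick(0.5),
        "variance": variance,
        "std": std,
        "cv": std / mean,
        "min": data[0],
        "max": data[-1],
        "p90": pick(0.9),
    }


if __name__ == "__main__":
    for name, value in summarize([62.3, 64.3, 70.8, 71.1, 75.8]).items():
        print(name, value)
```

For an even number of items the slide's median rule picks the upper of the two middle values, whereas Excel's MEDIAN averages them, so the two can differ slightly.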
IBM 2008
Linear Regression (for Trending)
[Figure: MIPS Used vs. Week, weeks 0 to 200, y-axis 0 to 1000, with linear trend line y = 3.0504x + 385.42, R² = 0.7881]
Obtain a useful fit of the data (y = mx + b) and then extend the values of X to obtain predicted values of Y. But remember, as Niels Bohr said: Prediction is very hard to do. Especially about the future.
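A minimal sketch of that fit-then-extend step, using the closed-form least-squares slope and intercept. The weekly MIPS readings below are invented for illustration (they are not the data behind the chart), and `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def predict(m, b, x):
    """Extend the fitted line to a new x (for trending, a future week)."""
    return m * x + b


if __name__ == "__main__":
    # Hypothetical weekly MIPS readings, invented for illustration only.
    weeks = [0, 1, 2, 3, 4]
    mips = [400.0, 403.0, 406.0, 409.0, 412.0]
    m, b = fit_line(weeks, mips)
    print(m, b, predict(m, b, 10))
```

The prediction at week 10 lies outside the observed weeks, which is exactly where the "future looks like the past" assumption carries all the weight.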
IBM 2008
Trending Assumptions & Questions
The future will be like the past.
How much history is too much?
You should look at Era segments.
Shape and scale of graph can be interesting.
You may need more than numbers.... The business and technical environment?
Be smart and lazy.
What questions are you answering?
[Figure: CPU% vs. Week, weeks 0 to 150, y-axis 0 to 80]
IBM 2008
Reality
[Figure: MIPS Used vs. Week, weeks 0 to 200, y-axis 0 to 1800, with linear trend line y = 3.0504x + 385.42, R² = 0.7881]
Linear regression predictions assume that the future looks like the past.
IBM 2008
Coding Implementation
The Butterfly Effect
Algorithm 1:
$X_{n+1} = s X_n$ if $X_n < 0.5$
$X_{n+1} = s (1 - X_n)$ otherwise
In Excel: cell Xn+1 is =IF(Xn
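Algorithm 1 can be sketched in Python to show the sensitivity to the starting value. The value s = 1.99, the starting point 0.3 and the perturbation 1e-10 are assumptions chosen for illustration; the handout does not state which values were used.

```python
def tent_step(x, s):
    """One iteration of Algorithm 1: s*x below 0.5, s*(1 - x) otherwise."""
    return s * x if x < 0.5 else s * (1.0 - x)


def trajectory(x0, s, steps):
    """Return [x0, x1, ..., x_steps] produced by repeated tent_step calls."""
    xs = [x0]
    for _ in range(steps):
        xs.append(tent_step(xs[-1], s))
    return xs


def max_gap(x0, delta, s, steps):
    """Largest distance between the runs started at x0 and at x0 + delta."""
    a = trajectory(x0, s, steps)
    b = trajectory(x0 + delta, s, steps)
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    # Assumed illustration values: s = 1.99, start 0.3, perturbation 1e-10.
    print(max_gap(0.3, 1e-10, 1.99, 100))
```

While both runs sit on the same side of 0.5 the gap between them is multiplied by s at every step, so a tiny difference in the starting point soon becomes visible in the prediction.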
IBM 2008
Excel Help
Searching Excel Help for "R Squared" returns:
RSQ: Returns the square of the Pearson product moment correlation coefficient through data points in known_y's and known_x's. For more information, see PEARSON. The r-squared value can be interpreted as the proportion of the variance in y attributable to the variance in x.
IBM 2008
Correlation
[Figure: scatter plot of DASD I/O Rate (0 to 7000) vs. CPU% (0 to 100)]
Correlation $= \mathrm{Cov}(X,Y) / (\sigma_x \sigma_y) = \sigma_{xy} / (\sigma_x \sigma_y) = E[(x - \mu_x)(y - \mu_y)] / (\sigma_x \sigma_y)$
Correlation $\in [-1, 1]$
=CORREL(CPU%,DASDIO) = 0.86
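The correlation formula can be written out directly. This is a sketch with a hypothetical function name; the tiny data sets in the example are invented to show the two extremes of the [-1, 1] range.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation: covariance divided by the two standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


if __name__ == "__main__":
    # Invented values: a perfectly rising and a perfectly falling relationship.
    print(correlation([1, 2, 3], [2, 4, 6]))
    print(correlation([1, 2, 3], [6, 4, 2]))
```

Dividing by n or by n-1 makes no difference here, because the same factor appears in the covariance and in the product of the standard deviations.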
IBM 2008
Briefly: Correlation is not Causality
Cause → Effect (sufficient cause)
~Effect → ~Cause (necessary cause)
R² or CORR(C,E) may indicate a linear relationship without there being a causal connection.
In cities of various sizes:
C = number of TVs is highly correlated with E = number of murders.
C = religious events is highly correlated with E = number of suicides.
IBM 2008
Causality & Correlation
Claim: Eating Cheerios will lower your cholesterol
Cause → Effect
Cause: Eating Cheerios
Effect: Lower Cholesterol
Test: Real cause
Intervening Variable
Cheerios Lower Cholesterol
Bacon & Eggs Cholesterol
Bacon & Eggs Lower Cholesterol
There is a correlation between Eating Cheerios and lower
Cholesterol but is there a causal relationship?
X
IBM 2008
Matrix Solution for Linear Fit
$B = (M^{T} M)^{-1} M^{T} Y$
Solve for $Y = B_0 + B_1 X$

M is 5x2 (the column of 1s and the X column):
1   X      Y      YH       Sq (YH-YA)   Sq (Y-YA)
1   1.3    62.3   61.765   50.339025    43.0336
1   1.4    64.3   66.495   5.593225     20.7936
1   1.45   70.8   68.86    5.7678E-24   3.7636
1   1.5    71.1   71.225   5.593225     5.0176
1   1.6    75.8   75.955   50.339025    48.1636
Avg        68.86
R2 = 0.9262  =(SUM(F3:F7)/SUM(G3:G7))

MT is 2x5 (ctl-shift-enter):
1     1     1      1     1
1.3   1.4   1.45   1.5   1.6

MT*M is 2x2:
5      7.25
7.25   10.563

INV(MTM) is 2x2:
42.25   -29
-29     20

IMTM*MT is 2x5:
4.55   1.65   0.2   -1.25   -4.15
-3     -1     0     1       3

IMTMMT*Y is 2x1:
0.275   B0
47.3    B1
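The same matrix steps can be traced in Python without any library, because M^T M is only 2x2. This is a sketch using the five (X, Y) pairs from the table; the function names are hypothetical.

```python
def fit_by_normal_equations(xs, ys):
    """Solve B = (M^T M)^-1 M^T Y for M = [1, x] rows; returns (B0, B1)."""
    n = len(xs)
    # M^T M = [[n, sum x], [sum x, sum x^2]]
    a, b, d = float(n), sum(xs), sum(x * x for x in xs)
    det = a * d - b * b
    inv = ((d / det, -b / det), (-b / det, a / det))
    # M^T Y = [sum y, sum x*y]
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    b0 = inv[0][0] * t0 + inv[0][1] * t1
    b1 = inv[1][0] * t0 + inv[1][1] * t1
    return b0, b1


def r_squared(xs, ys, b0, b1):
    """Sum of Sq(YH-YA) over sum of Sq(Y-YA), as in the spreadsheet."""
    ya = sum(ys) / len(ys)
    explained = sum((b0 + b1 * x - ya) ** 2 for x in xs)
    total = sum((y - ya) ** 2 for y in ys)
    return explained / total


if __name__ == "__main__":
    xs = [1.3, 1.4, 1.45, 1.5, 1.6]
    ys = [62.3, 64.3, 70.8, 71.1, 75.8]
    b0, b1 = fit_by_normal_equations(xs, ys)
    print(round(b0, 3), round(b1, 3), round(r_squared(xs, ys, b0, b1), 4))
```

The inverse is computed from the 2x2 determinant formula, which is fine for a teaching example; for larger or badly scaled problems a library least-squares routine is the safer choice.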
IBM 2008
Excel Solution
[Figure: CPU% vs. Units of Work with linear trend line y = 47.3x + 0.275, R² = 0.9262]
IBM 2008
Impact of Outlier
[Figure: CPU% vs. Units of Work with an outlier, y-axis 50 to 100; the linear trend line becomes y = -50.8x + 149.06, R² = 0.2358]
IBM 2008
A perfect fit is always possible
[Figure: CPU% vs. Units of Work with a fourth-order polynomial trend line $y = 58111x^4 - 338194x^3 + 736689x^2 - 711801x + 257442$, R² = 1]
Albeit meaningless in this case.
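Why a perfect fit is always possible can be sketched with Lagrange interpolation: through n points with distinct X values there is a polynomial of degree at most n-1 that hits every point, so R² = 1 says nothing about predictive value. The function name below is hypothetical, and the data are the five (Units of Work, CPU%) pairs from the matrix example in this handout.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate at x the polynomial of degree <= n-1 through all (xs, ys) points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                # Basis factor: equals 1 at xi and 0 at every other data point.
                term *= (x - xj) / (xi - xj)
        total += term
    return total


if __name__ == "__main__":
    xs = [1.3, 1.4, 1.45, 1.5, 1.6]
    ys = [62.3, 64.3, 70.8, 71.1, 75.8]
    for x, y in zip(xs, ys):
        print(x, y, lagrange_eval(xs, ys, x))  # reproduces each observed value
    # Outside the data range, compare with the linear fit 47.3x + 0.275.
    print(lagrange_eval(xs, ys, 1.7), 47.3 * 1.7 + 0.275)
```

Printing the value at 1.7 next to the linear fit gives a feel for how differently the two models behave once they leave the observed range.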
IBM 2008
Confidence of Fit.
[Figure: CPU% vs. Units of Work with linear fit y = 47.3x + 0.275, R² = 0.9262, and lower-bound (LB) and upper-bound (UB) curves around the line]
IBM 2008
SAS
IBM 2008
Analyze -> Linear Regression
IBM 2008
Run
Root MSE          1.72313    R-Square   0.9262
Dependent Mean   68.86000    Adj R-Sq   0.9017
Coeff Var         2.50236

Parameter Estimates
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1              0.27500         11.20033      0.02     0.9820
X           X            1             47.30000          7.70606      6.14     0.0087
IBM 2008
Results
IBM 2008
Residuals
For each Xi, plot the residual $e_i = Y_i - \hat{Y}_i$ (observed value minus fitted value).
[Figure: Residual (-20 to 10) vs. Units of Work (0 to 900)]
Look for random distribution around 0.
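Computing the residuals is a one-liner once the line is fitted. The sketch below uses the five-point example and the fitted line y = 47.3x + 0.275 from the matrix slides; `sign_changes` is a hypothetical, very crude helper for spotting long runs of same-signed residuals, not a formal test of randomness.

```python
def residuals(xs, ys, b0, b1):
    """Observed minus fitted value for every point."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def sign_changes(values):
    """Count how often consecutive residuals switch sign."""
    signs = [v > 0 for v in values]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


if __name__ == "__main__":
    xs = [1.3, 1.4, 1.45, 1.5, 1.6]
    ys = [62.3, 64.3, 70.8, 71.1, 75.8]
    res = residuals(xs, ys, 0.275, 47.3)
    print([round(r, 3) for r in res])
    print(round(sum(res), 9), sign_changes(res))
```

Least-squares residuals from a fit with an intercept always sum to zero, so the sum is only a sanity check; it is the pattern along the X axis that carries the information.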
IBM 2008
Interesting Case
[Figure: CPU% vs. Blocks (0 to 800) with linear fit y = 0.0335x, R² = 0.8569]
Notice the points are below the line until >600. Typical of DB/DC. Means less efficient as the load increases? The residuals have a pattern. That usually means a second level effect.
IBM 2008
Regression other than Linear
[Figure: CPU% vs. Blocks (0 to 800) with exponential fit $y = 1.234 e^{0.0043x}$, R² = 0.9457]
Exponential fit is useful when computing compound growth.
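One common way to obtain an exponential fit y = a·e^(bx) is to fit a straight line to ln(y). The sketch below assumes that approach and positive y values; the function name is hypothetical, and the sample points are generated from the curve shown on the slide rather than taken from the measured data.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on ln(y); returns (a, b)."""
    logs = [math.log(y) for y in ys]  # requires every y > 0
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxl / sxx
    a = math.exp(mean_l - b * mean_x)
    return a, b


if __name__ == "__main__":
    # Synthetic points generated from the slide's curve, for illustration only.
    xs = [0, 100, 200, 300, 400, 500, 600, 700, 800]
    ys = [1.234 * math.exp(0.0043 * x) for x in xs]
    a, b = fit_exponential(xs, ys)
    print(a, b)
    print("growth per unit of x:", math.exp(b) - 1.0)
```

The compound growth rate per unit of x is e^b - 1, which is the number a capacity plan usually wants.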
IBM 2008
Perceptual to Conceptual Dissonance?
[Figure: weekly time series, 05/21/04 to 11/05/04, y-axis 0 to 0.9, with linear trend line]
(PS: It's a line)
y = -0.0002x + 8.2996
R² = 0.4388 (CS: Not a good line)
IBM 2008
Perceptual to Conceptual Dissonance
[Figure: weekly time series, 05/21/04 to 11/05/04, y-axis 0.74 to 0.84, with linear trend line]
y = -0.0002x + 8.2996
R² = 0.4388 (CS: Variability is scale independent)
(PS: Visual Variability is scale dependent)
IBM 2008
PS to CS Dissonance
y = -6E-08x³ + 0.0063x² - 241.55x + 3E+06
R² = 0.7817 (CS: fit looks good)
[Figure: weekly time series, 05/21/04 to 11/05/04, y-axis 0.72 to 0.84, with cubic polynomial trend line]
(PS: Polynomial fit looks good)
IBM 2008
???
[Figure: time series with the time axis extended from 05/21/04 to 03/25/05, y-axis 0 to 0.9]
In 144 Days, the $ will be worthless.
IBM 2008
Regression Analysis is not a Crystal Ball
[Figure: time series from 1/18/07 to 7/17/07, y-axis 1.28 to 1.37]
IBM 2008
Philosophical Remark
In reaching a conclusion, we negotiate between the potential perceptual structures and the potential conceptual structures and memory events.
[Diagram: Sensation, Context (Lights Up) and Negotiation, shown with the chart y = -0.0002x + 8.2996, R² = 0.4388, y-axis 0.74 to 0.84]
IBM 2008
Model Building: Which is Best?
X1 X2 X3 X4 Y
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
11 55 9 22 109.2
3 71 17 6 102.7
1 31 22 44 72.5
2 54 18 22 93.1
21 47 4 26 115.9
1 40 23 34 83.8
11 66 9 12 113.3
10 68 8 12 109.4
Stepwise procedure to find the best combination of variables.
Y = b + a1X1
Y = b + a1X1 + a2X2
Y = b + a2X2 + a3X3
Y = b + a1X1 + a2X2 + a3X3 + a4X4
Using Hald data from Draper.
IBM 2008
Stepwise Results
Stepwise Analysis
Table of Results for General Stepwise

X4 entered.
             df   SS            MS            F             Significance F
Regression    1   1831.89616    1831.89616    22.7985202    0.000576232
Residual     11   883.8669169   80.3515379
Total        12   2715.763077

            Coefficients    Standard Error   t Stat          P-value       Lower 95%      Upper 95%
Intercept   117.5679312     5.262206511      22.34194552     1.62424E-10   105.9858927    129.1499696
X4          -0.738161808    0.154595996      -4.774779597    0.000576232   -1.078425302   -0.397898315

X1 entered.
             df   SS            MS            F             Significance F
Regression    2   2641.000965   1320.500482   176.6269631   1.58106E-08
Residual     10   74.76211216   7.476211216
Total        12   2715.763077

            Coefficients    Standard Error   t Stat          P-value       Lower 95%      Upper 95%
Intercept   103.0973816     2.123983606      48.53963154     3.32434E-13   98.36485126    107.829912
X4          -0.613953628    0.048644552      -12.62122063    1.81489E-07   -0.722340445   -0.505566811
X1          1.439958285     0.13841664       10.40307211     1.10528E-06   1.131546793    1.748369777

No other variables could be entered into the model. Stepwise ends.
Using Add-In from Levine.
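The first step of a forward stepwise search can be sketched by scoring each single-variable model on the Hald data and keeping the best one. This is only an illustration of the idea, not the add-in's algorithm: real stepwise procedures use F or p-value thresholds to enter and remove variables, and the function names here are hypothetical.

```python
# Hald data as listed on the slide: (X1, X2, X3, X4, Y).
HALD = [
    (7, 26, 6, 60, 78.5), (1, 29, 15, 52, 74.3), (11, 56, 8, 20, 104.3),
    (11, 31, 8, 47, 87.6), (7, 52, 6, 33, 95.9), (11, 55, 9, 22, 109.2),
    (3, 71, 17, 6, 102.7), (1, 31, 22, 44, 72.5), (2, 54, 18, 22, 93.1),
    (21, 47, 4, 26, 115.9), (1, 40, 23, 34, 83.8), (11, 66, 9, 12, 113.3),
    (10, 68, 8, 12, 109.4),
]


def r_squared_single(xs, ys):
    """R^2 of the one-variable least-squares fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def first_step(rows):
    """Score every single-variable model; return (best variable, all scores)."""
    ys = [row[-1] for row in rows]
    scores = {}
    for k in range(len(rows[0]) - 1):
        xs = [row[k] for row in rows]
        scores[f"X{k + 1}"] = r_squared_single(xs, ys)
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best, scores = first_step(HALD)
    print(best, {name: round(r2, 4) for name, r2 in scores.items()})
```

On this data the best single predictor is X4, with R² equal to the ratio of the regression and total sums of squares in the first table above (1831.896 / 2715.763, about 0.6745).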
IBM 2008
Looking for I/O = F(MIPS). Don't give up too quickly
[Figure: scatter plot of I/O (0 to 16000) vs. MIPS (1500 to 4500) with trend line y = 2.4545x, R² = 0.3726]
Y intercept forced to 0.
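When the intercept is forced to 0, the least-squares slope reduces to a single ratio of sums. A minimal sketch, with a hypothetical function name and invented sample values:

```python
def fit_through_origin(xs, ys):
    """Least-squares slope m of y = m*x with the intercept forced to 0."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


if __name__ == "__main__":
    # Invented values for illustration: I/O exactly twice the MIPS figure.
    print(fit_through_origin([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The slope is then an average I/O per MIPS weighted toward the larger MIPS values, which is one reason to also look at the ratio itself over time.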
IBM 2008
Look at ratio in time
[Figure: IO/MIPS ratio by hour of day, 0:00 to 23:00, y-axis 0 to 5]
IBM 2008
Trending: What to Do?
[Figure: Average In & Ready, 90th%ile series, x-axis 0 to 400, y-axis 0 to 40]
IBM 2008
Options?
[Figure: Average In & Ready, 90th%ile series with linear and exponential trend lines, x-axis 0 to 500; exponential fit $y = 7.2692 e^{0.0042x}$, R² = 0.6615]
IBM 2008
How About A Polynomial?
[Figure: Average In & Ready, 90th%ile series with polynomial trend line, x-axis 0 to 500, y-axis 0 to 100]
A polynomial can be made to fit about any wandering data within the bounds of the data [min, max]. Beyond the bounds, any prediction is suspect.
$Y = b_0 + b_1 X + b_2 X^2 + b_3 X^3 + \dots + b_n X^n$
IBM 2008
Time Series
A time series is a sequence of observations which are ordered in time (or space). If observations are made on some phenomenon throughout time, it is most sensible to display the data in the order in which they arose, particularly since successive observations will probably be dependent. Time series are best displayed in a scatter plot. The series value X is plotted on the vertical axis and time t on the horizontal axis. Time is called the independent variable (in this case however, something over which you have little control).
There are two kinds of time series data:
1. Continuous, where we have an observation at every instant of time, e.g. lie detectors, electrocardiograms. We denote this using observation X at time t, X(t).
2. Discrete, where we have an observation at (usually regularly) spaced intervals. We denote this as Xt.
See http://www.cas.lancs.ac.uk/glossary_v1.1/tsd.html#timeseries
IBM 2008
Bibliography
Applied Regression Analysis, Draper & Smith, Wiley. Definitive source for regression analysis. Highly technical.
Statistical Concepts and Methods, Bhattacharyya & Johnson, Wiley, 1977. This has both a discussion of meaning and the formulae.
Applied Statistics for Engineers and Scientists, Levine, Ramsey & Smidt, Prentice Hall, 2001. This has a good approach to statistics and Excel implementations. CD comes with the book which has some powerful Excel Add-ins.
The Art of Computer Systems Performance Analysis, by Raj Jain, Wiley. I like this one. For performance analysis and capacity planning, it is thorough and complete. A very good reference. It may be hard to find.
Chaos Under Control, by Peak & Frame, Freeman & Co.
http://www.itl.nist.gov/div898/handbook/pmc/pmc.htm is a good web site to explore statistics.