CPSC 531: Output Data Analysis

Instructor: Anirban Mahanti
Office: ICT 745
Email: mahanti@cpsc.ucalgary.ca
Class Location: TRB 101
Lectures: TR 15:30 – 16:45 hours

Slides primarily adapted from: “The Art of Computer Systems Performance Analysis” by Raj Jain, Wiley 1991. [Chapters 12, 13, and 25]


Outline
• Measures of Central Tendency
  • Mean, Median, Mode
• How to Summarize Variability?
• Comparing Systems Using Sample Data
• Comparing Two Alternatives
• Transient Removal


Measures of Central Tendency (1)
• Sample mean: the sum of all observations divided by the total number of observations
  • Always exists and is unique
  • Gives equal weight to all observations
  • Strongly affected by outliers
• Sample median: list the observations in increasing order; the observation in the middle of the list is the median
  • Even number of observations: mean of the middle two values
  • Always exists and is unique
  • Resistant to outliers (compared to the mean)
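The contrast between the two indices can be sketched in a few lines of Python. The response times below are hypothetical values chosen for illustration; only the standard `statistics` module is assumed.

```python
from statistics import mean, median

# Hypothetical response times in ms (not from the slides).
times = [9, 10, 10, 11, 10]
# The same sample with one extreme observation appended.
times_with_outlier = times + [1000]

# Without the outlier, the mean and the median agree.
print(mean(times), median(times))

# With the outlier, the mean moves a long way; the median does not.
print(mean(times_with_outlier), median(times_with_outlier))
```

A single extreme observation is enough to pull the mean away from where most of the data lies, which is why the median is described as resistant.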


Measures of Central Tendency (2)
• Sample mode: plot a histogram of the observations and find the bucket with the peak frequency; the midpoint of this bucket is the mode
  • A mode may not exist (e.g., all samples have equal weight)
  • More than one mode may exist (e.g., a bimodal distribution)
  • If there is only one mode, the distribution is unimodal

[Figure: three plots of the PDF f(x) against x, with the modes marked]
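The histogram procedure for the mode can be sketched as follows. The function name `histogram_mode`, the fixed bucket width, and the sample data are assumptions made for illustration; returning an empty list when every bucket is equally full is one possible way to express "no mode".

```python
from collections import Counter

def histogram_mode(observations, bucket_width):
    """Return the midpoint(s) of the histogram bucket(s) with peak frequency."""
    # Assign each observation to a bucket index: [0, w) -> 0, [w, 2w) -> 1, ...
    counts = Counter(int(x // bucket_width) for x in observations)
    peak = max(counts.values())
    # If several buckets exist and all are equally full, there is no mode.
    if len(counts) > 1 and all(c == peak for c in counts.values()):
        return []
    # Otherwise report the midpoint of every bucket that reaches the peak.
    return sorted((b + 0.5) * bucket_width for b, c in counts.items() if c == peak)

data = [1.2, 3.5, 4.1, 4.4, 4.9, 5.3, 5.8, 7.7, 9.0]  # hypothetical sample
print(histogram_mode(data, bucket_width=2))
```

The result depends on the chosen bucket width, so the mode obtained this way is a property of the histogram as much as of the data.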


Measures of Central Tendency (3)
• Is the data categorical?
  • Yes: use the mode, e.g., the most used resource in a system
• Is the total of interest?
  • Yes: use the mean, e.g., total response time for Web requests
• Is the distribution skewed?
  • Yes: use the median, which is less influenced by outliers than the mean
  • No: use the mean. Why?


Common Misuses of Means (1)

• The usefulness of the mean depends on the number of observations and the variance
  • E.g., two response time samples: 10 ms and 1000 ms. The mean is 505 ms! Correct index but useless.
• Using the mean without regard to skewness

             System A    System B
             10          5
             9           5
             11          5
             10          4
             10          31
  Mean:      10          10
  Mode:      10          5
  Min, Max:  [9, 11]     [4, 31]
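The System A and System B figures can be checked with a short sketch that uses only Python's standard `statistics` module. The variable names are chosen here for illustration.

```python
from statistics import mean, mode

# Observations for the two systems, as listed on the slide.
system_a = [10, 9, 11, 10, 10]
system_b = [5, 5, 5, 4, 31]

for name, obs in (("A", system_a), ("B", system_b)):
    # Same mean for both systems, but very different mode and spread.
    print(name, "mean:", mean(obs), "mode:", mode(obs),
          "min,max:", (min(obs), max(obs)))
```

Both means equal 10, yet for System B that value is far from the typical observation because a single reading of 31 dominates the sum.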


Common Misuses of Means (2)

• Computing the mean of a product by multiplying means
  • The mean of a product equals the product of the means if the two random variables are independent
  • If x and y are correlated, then in general E(xy) ≠ E(x)E(y)
• Example: the average number of users in the system is 23 and the average number of processes per user is 2. Is the average number of processes in the system 46?
  • No! The number of processes spawned by users depends on the load.
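A tiny numeric sketch shows how the two quantities can differ. The two snapshots below are invented so that the averages match the slide's 23 users and 2 processes per user; they are not measurements.

```python
from statistics import mean

# Hypothetical snapshots: (users logged in, processes per user).
# Heavier load comes with more processes per user, so the two are correlated.
snapshots = [(10, 1), (36, 3)]

users = [u for u, _ in snapshots]
per_user = [p for _, p in snapshots]
total_processes = [u * p for u, p in snapshots]

product_of_means = mean(users) * mean(per_user)  # E(x) * E(y)
mean_of_product = mean(total_processes)          # E(xy)

print(product_of_means, mean_of_product)
```

With these assumed values the product of the means is 46, while the actual average number of processes is 59.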


Outline
• Measures of Central Tendency
• How to Summarize Variability?
• Comparing Systems Using Sample Data
• Comparing Two Alternatives
• Transient Removal


Summarizing Variability
• Summarizing by a single number is rarely enough
• Given two systems with the same mean, we generally prefer the one with less variability

[Figure: two frequency histograms of response time, each with mean = 2 s. In one, 80% of responses take 1.5 s and 20% take 4 s; in the other, 60% take about 0.001 s and 40% take about 5 s.]

• Indices of dispersion: range, variance, 10- and 90-percentiles, semi-interquartile range, and mean absolute deviation


Range
• Easy to calculate: range = max – min
• In many scenarios, not very useful:
  • Min may be zero
  • Max may be an “outlier”
  • With more samples, max may keep increasing and min may keep decreasing → no “stable” point
• Range is useful if system performance is bounded


Variance and Standard Deviation
• Given a sample of n observations {x1, x2, …, xn}, the sample variance is calculated as:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \quad \text{where } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

• Sample variance: s^2 (in the square of the unit of observation)
• Sample standard deviation: s (in the unit of observation)
• Note the (n-1) in the variance computation
  • (n-1) of the n differences are independent
  • Given (n-1) differences, the nth difference can be computed
  • The number of independent terms is the degrees of freedom (df)


Standard Deviation (SD)
• Standard deviation and mean have the same units, so the SD is preferred over the variance
  • E.g. a) Mean = 2 s, SD = 2 s; high variability?
  • E.g. b) Mean = 2 s, SD = 0.2 s; low variability?
• Another widely used measure: the coefficient of variation (C.O.V.)
  • C.O.V. = ratio of standard deviation to mean
  • C.O.V. does not have any units
  • C.O.V. shows the magnitude of variability
  • C.O.V. in (a) is 1 and in (b) is 0.1
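The sample variance with its (n-1) divisor, the standard deviation, and the C.O.V. can be computed directly from their definitions. The sketch below uses hypothetical observations, and the helper names are illustrative, not part of the course material.

```python
from math import sqrt

def sample_variance(xs):
    """Sample variance s^2 with the (n-1) divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def coefficient_of_variation(xs):
    """C.O.V. = sample standard deviation divided by the sample mean."""
    x_bar = sum(xs) / len(xs)
    return sqrt(sample_variance(xs)) / x_bar

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical response times in seconds
s = sqrt(sample_variance(xs))
print(f"s^2 = {sample_variance(xs):.3f}, s = {s:.3f}, "
      f"C.O.V. = {coefficient_of_variation(xs):.3f}")
```

Because the C.O.V. is unitless, it allows the variability of quantities measured in different units to be compared.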


Percentiles, Quantiles, Quartiles
• Lower and upper bounds expressed in percents or as fractions
  • 90-percentile → 0.9-quantile
  • α-quantile: sort and take the [(n-1)α+1]th observation
    • [] means round to the nearest integer
• Quartiles divide the data into parts at 25%, 50%, 75% → quartiles (Q1, Q2, Q3)
  • 25% of the observations ≤ Q1 (the first quartile)
  • The second quartile Q2 is also the median
• The range (Q3 – Q1) is the interquartile range
  • (Q3 – Q1)/2 is the semi-interquartile range (SIQR)
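The sort-and-index rule for quantiles can be sketched as below. The function name, the sample data, and the choice to round exact halves upward are assumptions; the slide only says to round to the nearest integer.

```python
def quantile(observations, alpha):
    """Return the alpha-quantile: the [(n-1)*alpha + 1]th sorted observation."""
    xs = sorted(observations)
    # 1-based position, rounded to the nearest integer (halves round up here).
    position = int((len(xs) - 1) * alpha + 1 + 0.5)
    return xs[position - 1]

data = [2, 4, 4, 5, 7, 8, 9, 11, 30]  # hypothetical sample, n = 9

q1, q2, q3 = (quantile(data, a) for a in (0.25, 0.50, 0.75))
siqr = (q3 - q1) / 2  # semi-interquartile range

print("Q1, Q2, Q3:", q1, q2, q3)
print("SIQR:", siqr)
print("0.9-quantile:", quantile(data, 0.9))
```

Other quantile conventions (for example, ones that interpolate between neighbouring observations) can give slightly different values for small samples.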


Mean Absolute Deviation

• Mean absolute deviation is calculated as:

$$\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$$


Influence of Outliers

• Range: considerably
• Sample variance: considerably, but less than the range
• Mean absolute deviation: less than the variance
  • Doesn’t square (aka magnify) the outliers
• SIQR: very resistant
• Use SIQR as the index of dispersion whenever the median is used as the index of central tendency
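To see how each index of dispersion reacts to an outlier, the sketch below computes all four for a hypothetical sample, with and without one extreme value. The data, the helper names, and the half-up rounding in `quantile` are assumptions for illustration.

```python
from statistics import mean, variance

def quantile(xs, alpha):
    """[(n-1)*alpha + 1]th sorted observation, position rounded to nearest."""
    xs = sorted(xs)
    return xs[int((len(xs) - 1) * alpha + 1.5) - 1]

def dispersion(xs):
    """Range, sample variance, mean absolute deviation and SIQR of xs."""
    x_bar = mean(xs)
    return {
        "range": max(xs) - min(xs),
        "variance": variance(xs),
        "mad": mean([abs(x - x_bar) for x in xs]),
        "siqr": (quantile(xs, 0.75) - quantile(xs, 0.25)) / 2,
    }

clean = [8, 9, 9, 10, 10, 10, 11, 11, 12]  # hypothetical sample
dirty = clean[:-1] + [100]                 # largest value replaced by an outlier

for label, xs in (("clean", clean), ("with outlier", dirty)):
    print(label, dispersion(xs))
```

In this example the range and variance change by large factors, the mean absolute deviation changes less, and the SIQR does not change at all.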


Outline
• Measures of Central Tendency
• How to Summarize Variability?
• Comparing Systems Using Sample Data
  • Sample vs. Population
  • Confidence Interval for Mean
• Comparing Two Alternatives
• Transient Removal


Comparing Systems Using Sample Data

The words “sample” and “example” have a common root – “essample” (French)

One sample does not prove a theory - a sample is just an example

The point is that definite statements cannot be made about the characteristics of all systems.

However, probabilistic statements about the range of most systems can be made

Confidence interval concept as a building block


Sample versus Population
• Generate 1 million random numbers with mean μ and SD σ and put them in an urn
• Draw a sample of n observations
  • {x1, x2, …, xn} has mean x̄ and standard deviation s
  • x̄ is likely different from μ!
• The population mean is unknown or impossible to obtain in many real-world scenarios
  • Therefore, obtain an estimate of μ from x̄


Confidence Interval for the Mean
• Define bounds c1 and c2 such that: Prob{c1 < μ < c2} = 1-α
  • (c1, c2) is the confidence interval
  • α is the significance level
  • 100(1-α) is the confidence level
• Typically a small α is desired → confidence level of 90%, 95%, or 99%
• One approach (for a 90% interval): take k samples, find the sample means, sort them, and take the [1+0.05(k-1)]th as c1 and the [1+0.95(k-1)]th as c2


Central Limit Theorem
• We do not need many samples. Confidence intervals can be determined from one sample because x̄ ~ N(μ, σ/sqrt(n))
  • The SD of the sample mean, σ/sqrt(n), is called the standard error
• Using the CLT, a 100(1-α)% confidence interval for a population mean is
  (x̄ - z[1-α/2] s/sqrt(n), x̄ + z[1-α/2] s/sqrt(n))
  • z[1-α/2] is the (1-α/2)-quantile of a unit normal variate (and is obtained from a table!)
  • s is the sample SD


Confidence Interval Example
• CPU times obtained by repeating an experiment 32 times. The sorted set consists of {1.9,2.7,2.8,2.8,2.8,2.9,3.1,3.1,3.2,3.2,3.3,3.4,3.6,3.7,3.8,3.9,3.9,4.1,4.1,4.2,4.2,4.4,4.5,4.5,4.8,4.9,5.1,5.1,5.3,5.6,5.9}
• Mean = 3.9, standard deviation (s) = 0.95, n = 32
• For a 90% confidence interval, z[1-α/2] = 1.645, and we get 3.90 ± (1.645)(0.95)/sqrt(32) = (3.62, 4.17)
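The interval in this example can be reproduced from the summary statistics alone. The sketch below assumes Python's standard `statistics.NormalDist` in place of a printed normal table; the function name is chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, s, n, confidence):
    """CLT-based confidence interval for the mean (intended for n >= 30)."""
    alpha = 1 - confidence
    # (1 - alpha/2)-quantile of the unit normal, instead of a table lookup.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * s / sqrt(n)  # z times the standard error
    return x_bar - half_width, x_bar + half_width

# Summary statistics from the slide: mean 3.90, s = 0.95, n = 32.
low, high = z_confidence_interval(3.90, 0.95, 32, 0.90)
print(f"90% CI: ({low:.3f}, {high:.3f})")
```

Computed this way, the upper bound comes out near 4.176, which the slide reports as 4.17; the small difference is a matter of rounding.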


Meaning of Confidence Interval

[Figure: an interval from x̄ - c to x̄ + c, centred on x̄; there is a 90% chance that this interval contains μ]

• What does this mean? With 90% confidence, we can say the population mean is within the above bounds; that is, the chance of error is 10%.
  • E.g., take 100 samples and construct CIs. In about 10 cases, the interval will not contain the population mean.


Length of Confidence Interval
• Let z[1-α/2] s/sqrt(n) = c. Then, z[1-α/2] = (c sqrt(n))/s
  • Larger s implies a wider confidence interval
  • Larger n implies a shorter confidence interval
    • → with more observations, we are better able to predict the population mean
    • → the square-root n relationship implies that increasing the observations by a factor of 4 only cuts the confidence interval by a factor of 2
• The confidence interval computation, as described here, works for n ≥ 30.


What if n not large?
• For smaller samples, we can construct confidence intervals only if the observations come from a normally distributed population:

$$\left(\bar{x} - t_{[1-\alpha/2;\,n-1]}\,\frac{s}{\sqrt{n}},\ \ \bar{x} + t_{[1-\alpha/2;\,n-1]}\,\frac{s}{\sqrt{n}}\right)$$

• t[1-α/2;n-1] is the (1-α/2)-quantile of a t-variate with (n-1) degrees of freedom


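Testing for a Zero Mean
• Check if the measured value is significantly different from zero
  • Determine the confidence interval
  • Then check if zero is inside the interval
• The procedure is applicable to any other value a

[Figure: confidence intervals for the mean relative to 0, labelled “Mean is zero” when the interval includes 0 and “Mean is nonzero” when it does not]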


Outline
• Measures of Central Tendency
• How to Summarize Variability?
• Comparing Systems Using Sample Data
• Comparing Two Alternatives
• Transient Removal


Comparing Two Alternatives
• Often interested in comparing systems
  • “naïve” VOD vs. “batching” VOD (assignment 3)
  • “SJF” vs. “FIFO” request scheduling (assignment 1)
• Statistical techniques for such comparison:
  • Paired Observations
  • Unpaired Observations (we will omit this!)
  • Approximate Visual Test

Did you use any of these in your assignments?


Paired Observations (1)
• n experiments with a one-to-one correspondence between the test on system A and the test on system B
  • No correspondence => unpaired
  • This test uses the zero mean idea…
• Treat the two samples as one sample of n pairs
  • For each pair, compute the difference
  • Construct a confidence interval for the difference
• CI includes zero => systems not significantly different


Paired Observations (2)

• Six similar workloads used on two systems: {(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)}. Is one system better?
• The performance differences are {-13.7, 13.1, -2.8, -1.1, -3.0, 5.6}
  • Sample mean = -0.32, sample SD = 9.03
  • CI = -0.32 ± t sqrt(81.62/6) = -0.32 ± t(3.69)
  • The 0.95-quantile of t with 5 DFs is 2.015
  • 90% confidence interval = (-7.75, 7.11)
  • The systems are not significantly different, as zero is in the CI
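The paired-observations calculation can be sketched with the standard library. The t-quantile 2.015 is taken from the slide rather than computed, since Python's `statistics` module has no t distribution; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Paired measurements (system A, system B) for the six workloads on the slide.
pairs = [(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)]

# 0.95-quantile of t with n-1 = 5 degrees of freedom, as given on the slide.
t_quantile = 2.015

# Treat the two samples as one sample of differences.
diffs = [a - b for a, b in pairs]
d_bar = mean(diffs)
half_width = t_quantile * stdev(diffs) / sqrt(len(diffs))
low, high = d_bar - half_width, d_bar + half_width

# Zero-mean test: the systems differ only if zero lies outside the interval.
significantly_different = not (low <= 0 <= high)

print(f"mean difference = {d_bar:.2f}, SD = {stdev(diffs):.2f}")
print(f"90% CI = ({low:.2f}, {high:.2f}), different: {significantly_different}")
```

For other sample sizes or confidence levels the t-quantile would have to come from a table or from a library that provides the t distribution.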


Approximate Visual Test
• Compute the confidence intervals for the means
• If the CIs don’t overlap, one system is better than the other

[Figure: three pairs of confidence intervals with their means marked]
  • CIs do not overlap => alternatives different
  • CIs overlap and the mean of one is in the CI of the other => not significantly different
  • CIs overlap but the mean of one is not in the CI of the other => need more testing
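The three cases of the visual test can be written as a small decision function. The function name, the tuple representation of an interval, and the returned labels are assumptions made for this sketch.

```python
def visual_test(mean_a, ci_a, mean_b, ci_b):
    """Classify two alternatives from their means and confidence intervals.

    Each interval is a (low, high) tuple.
    """
    overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
    if not overlap:
        return "different"
    # Overlapping intervals: check whether either mean lies in the other CI.
    a_in_b = ci_b[0] <= mean_a <= ci_b[1]
    b_in_a = ci_a[0] <= mean_b <= ci_a[1]
    if a_in_b or b_in_a:
        return "not significantly different"
    return "need more testing"

# Hypothetical means and intervals, one per case.
print(visual_test(10, (9, 11), 14, (13, 15)))
print(visual_test(10, (8, 12), 11, (9, 13)))
print(visual_test(10, (8, 12), 15, (11, 19)))
```

The third outcome is a prompt for a more careful comparison, such as the paired-observations test when the measurements can be paired.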


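Determining Sample Size
• Goal: find the smallest sample size n that provides the desired confidence in the results
• Method:
  • Take a small set of preliminary measurements
  • Estimate the variance from the measurements
  • Use the estimate to determine the sample size for the desired accuracy
• r% accuracy => ±r% at 100(1-α)% confidence

$$\bar{x} \pm z\,\frac{s}{\sqrt{n}} = \bar{x}\left(1 \pm \frac{r}{100}\right) \quad\Rightarrow\quad n = \left(\frac{100\,z\,s}{r\,\bar{x}}\right)^2$$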
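The sample-size formula n = (100 z s / (r x̄))^2 can be sketched as below. The preliminary mean and standard deviation are hypothetical values, and `statistics.NormalDist` stands in for a normal table; the result is rounded up because n must be an integer.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(x_bar, s, r, confidence):
    """Smallest n giving +/- r% accuracy at the given confidence level."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # (1 - alpha/2)-quantile of N(0, 1)
    return ceil((100 * z * s / (r * x_bar)) ** 2)

# Hypothetical preliminary measurements: mean 20 s, standard deviation 5 s.
n = required_sample_size(x_bar=20, s=5, r=1, confidence=0.95)
print("observations needed for 1% accuracy at 95% confidence:", n)
```

Since n grows with the square of 1/r, halving the allowed error roughly quadruples the number of observations required.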


Outline
• Measures of Central Tendency
• How to Summarize Variability?
• Comparing Systems Using Sample Data
• Comparing Two Alternatives
• Transient Removal


Transient Removal
• In many simulations, we are interested in steady-state performance
  • Remove the initial transient state
• However, defining exactly what constitutes the end of the transient state is difficult!
• Several heuristics have been developed:
  • Long runs
  • Proper initialization
  • Truncation
  • Initial data deletion
  • Moving average of replications
  • Batch means


Long Runs
• Use very long runs
• Impact of the transient state becomes negligible
• Wasteful use of resources
• How long is “long enough”?
• The Raj Jain text recommends that this method not be used in isolation


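Batch Means
• Run the simulation for a long duration
• Divide the observations (N) into m batches, each of size n
• Compute the variance of the batch means using the procedure shown, for n = 2, 3, 4, 5, …
• Plot the variance vs. batch size

1) Compute the batch means:

$$\bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij}, \quad i = 1, 2, \ldots, m$$

2) Compute the overall mean:

$$\bar{\bar{x}} = \frac{1}{m}\sum_{i=1}^{m} \bar{x}_i$$

3) Compute the variance of the batch means:

$$\mathrm{Var}(\bar{x}) = \frac{1}{m-1}\sum_{i=1}^{m}\left(\bar{x}_i - \bar{\bar{x}}\right)^2$$

[Figure: variance of batch means plotted against batch size n, with regions labelled “Ignore” and “Transient interval”]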
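The three steps of the batch-means procedure can be sketched as follows. The observation series is a made-up deterministic sequence with a decaying start, used only to exercise the code; taking m as the integer part of N/n (dropping leftover observations) is an assumption.

```python
from statistics import mean, variance

def batch_mean_variance(observations, n):
    """Variance of the batch means for batches of size n."""
    m = len(observations) // n  # number of complete batches
    # Step 1: mean of each batch.
    batch_means = [mean(observations[i * n:(i + 1) * n]) for i in range(m)]
    # Steps 2 and 3: statistics.variance subtracts the overall mean of the
    # batch means and divides the sum of squares by (m - 1).
    return variance(batch_means)

# Hypothetical simulation output: a transient decaying from 20, then a
# small repeating pattern around a steady level.
observations = [20 - k if k < 10 else 10 + (k % 3) for k in range(40)]

# Variance of batch means for n = 2, 3, ...; at least two batches are needed.
curve = [(n, batch_mean_variance(observations, n))
         for n in range(2, len(observations) // 2 + 1)]
for n, var in curve:
    print(n, round(var, 3))
```

Plotting the second column against the first gives the variance versus batch size curve that the slide uses to judge the length of the transient interval.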