
CPSC 531: Output Data Analysis

Instructor: Anirban Mahanti
Office: ICT 745
Email: [email protected]
Location: TRB 101
Lectures: TR 15:30 – 16:45 hours

Slides primarily adapted from: “The Art of Computer Systems Performance Analysis” by Raj Jain, Wiley 1991. [Chapters 12, 13, and 25]


Outline
• Measures of Central Tendency
  • Mean, Median, Mode
• How to Summarize Variability?
• Comparing Systems Using Sample Data
• Comparing Two Alternatives
• Transient Removal


Measures of Central Tendency (1)
• Sample mean: the sum of all observations divided by the total number of observations
  • Always exists and is unique
  • Gives equal weight to all observations
  • Strongly affected by outliers
• Sample median: list the observations in increasing order; the observation in the middle of the list is the median
  • Even number of observations: mean of the middle two values
  • Always exists and is unique
  • Resistant to outliers (compared to the mean)


[Figure: PDF f(x) plotted against x]

Measures of Central Tendency (2)
• Sample mode: plot a histogram of the observations and find the bucket with the peak frequency; the midpoint of this bucket is the mode
  • Mode may not exist (e.g., all samples have equal weight)
  • More than one mode may exist (e.g., bimodal)
  • If there is only one mode, the distribution is unimodal

[Figures: plots of PDF f(x) against x, with the mode(s) marked]


Measure of Central Tendency (3)

• Is data categorical?
  • Yes: use mode, e.g., most used resource in a system
• Is total of interest?
  • Yes: use mean, e.g., total response time for Web requests
• Is distribution skewed?
  • Yes: use median. Median is less influenced by outliers than mean.
  • No: use mean. Why?


Common Misuses of Means (1)

• Usefulness of the mean depends on the number of observations and the variance
  • E.g., two response time samples: 10 ms and 1000 ms. Mean is 505 ms! Correct index but useless.
• Using the mean without regard to skewness

             System A   System B
             10         5
             9          5
             11         5
             10         4
             10         31
  Mean:      10         10
  Mode:      10         5
  Min, Max:  [9, 11]    [4, 31]
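The System A / System B comparison can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the helper name `summarize` is made up for illustration. It also reports the median, which the slide does not list, to show how it tracks the typical value in the skewed sample.

```python
from statistics import mean, median, mode

# Observations for the two systems, as listed on the slide.
system_a = [10, 9, 11, 10, 10]
system_b = [5, 5, 5, 4, 31]

def summarize(observations):
    """Return the indices of central tendency plus the extremes."""
    return {
        "mean": mean(observations),
        "median": median(observations),
        "mode": mode(observations),
        "min": min(observations),
        "max": max(observations),
    }

summary_a = summarize(system_a)
summary_b = summarize(system_b)
print("System A:", summary_a)
print("System B:", summary_b)
```

Both systems have the same mean, but for System B the single outlier (31) pulls the mean well away from the median and mode.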


Common Misuses of Means (2)

• Mean of a product by multiplying means
  • Mean of a product equals the product of the means if the two random variables are independent.
  • If x and y are correlated, E(xy) ≠ E(x)E(y)
• Avg. users in system 23; avg. processes/user 2. Avg. # of processes in system? Is it 46?
  • No! Number of processes spawned by users depends on the load.
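A small numeric illustration of why multiplying the means can mislead. The three measurement intervals below are invented for the example (they are not course data); they are chosen so that the averages match the slide (23 users, 2 processes per user) while users spawn fewer processes when the system is busier.

```python
from statistics import mean

# Hypothetical measurements over three intervals (made-up values):
# more users in the system -> fewer processes per user.
users = [13, 23, 33]
processes_per_user = [3, 2, 1]

# Multiplying the two averages.
product_of_means = mean(users) * mean(processes_per_user)

# Averaging the actual number of processes in each interval.
mean_of_products = mean(u * p for u, p in zip(users, processes_per_user))

print("E(x)E(y) =", product_of_means)
print("E(xy)    =", round(mean_of_products, 2))
```

Because the two quantities are negatively correlated here, the true average number of processes is lower than the product of the means.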


Outline
• Measures of Central Tendency
• How to Summarize Variability?
• Comparing Systems Using Sample Data
• Comparing Two Alternatives
• Transient Removal


Summarizing Variability
• Summarizing by a single number is rarely enough.
• Given two systems with the same mean, we generally prefer the one with less variability.

[Figure: two frequency histograms of response time, each with mean = 2 s. In the first, 80% of responses take 1.5 s and 20% take 4 s. In the second, 60% take ~0.001 s and 40% take ~5 s.]

• Indices of dispersion: range, variance, 10- and 90-percentiles, semi-interquartile range, and mean absolute deviation


Range
• Easy to calculate: range = max – min
• In many scenarios, not very useful:
  • Min may be zero
  • Max may be an “outlier”
  • With more samples, max may keep increasing and min may keep decreasing → no “stable” point
• Range is useful if system performance is bounded


Variance and Standard Deviation
• Given a sample of n observations {x1, x2, …, xn}, the sample variance is calculated as:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad \text{where } \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

• Sample variance: s² (in the square of the unit of observation)
• Sample standard deviation: s (in the unit of observation)
• Note the (n-1) in the variance computation
  • Only (n-1) of the n differences are independent
  • Given (n-1) differences, the nth difference can be computed
  • The number of independent terms is the degrees of freedom (df)
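A direct transcription of the sample variance formula into Python, as a sketch. The function names and the sample values are illustrative assumptions; the standard library's `statistics.variance` uses the same (n-1) divisor and can serve as a cross-check.

```python
from math import sqrt

def sample_variance(observations):
    """Sample variance with the (n - 1) divisor."""
    n = len(observations)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(observations) / n                      # sample mean
    squared_diffs = ((x - x_bar) ** 2 for x in observations)
    return sum(squared_diffs) / (n - 1)                # n - 1 degrees of freedom

def sample_sd(observations):
    """Sample standard deviation, in the unit of the observations."""
    return sqrt(sample_variance(observations))

# Illustrative response times (made-up values) with one outlier.
times = [5, 5, 5, 4, 31]
print("s^2 =", sample_variance(times))
print("s   =", round(sample_sd(times), 2))
```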


Standard Deviation (SD)
• Standard deviation and mean have the same units
  • Preferred!
  • E.g. a) Mean = 2 s, SD = 2 s; high variability?
  • E.g. b) Mean = 2 s, SD = 0.2 s; low variability?
• Another widely used measure: C.O.V (coefficient of variation)
  • C.O.V = ratio of standard deviation to mean
  • C.O.V does not have any units
  • C.O.V shows the magnitude of variability
  • C.O.V in (a) is 1 and in (b) is 0.1


Percentiles, Quantiles, Quartiles
• Lower and upper bounds expressed in percents or as fractions
  • 90-percentile → 0.9-quantile
  • α-quantile: sort and take the [(n-1)α+1]th observation
    • [] means round to nearest integer
• Quartiles divide the data into parts at 25%, 50%, 75% → quartiles (Q1, Q2, Q3)
  • 25% of the observations ≤ Q1 (the first quartile)
  • Second quartile Q2 is also the median
• The range (Q3 – Q1) is the interquartile range
• (Q3 – Q1)/2 is the semi-interquartile range (SIQR)
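The sort-and-pick rule for the α-quantile can be sketched as follows. The function name and the nine sample values are made up for illustration, and rounding exact halves upward is an assumption (the slide only says "round to nearest integer"). Note that library routines often interpolate between observations, so they may return slightly different values.

```python
from math import floor

def quantile(observations, alpha):
    """alpha-quantile: the [(n - 1) * alpha + 1]th sorted observation."""
    ordered = sorted(observations)
    n = len(ordered)
    # 1-based position, rounded to the nearest integer
    # (halves rounded up; this tie-breaking rule is an assumption).
    position = floor((n - 1) * alpha + 1 + 0.5)
    return ordered[position - 1]

# Illustrative response times (made-up values).
times = [2, 4, 4, 5, 7, 8, 9, 11, 30]

q1, q2, q3 = (quantile(times, a) for a in (0.25, 0.50, 0.75))
siqr = (q3 - q1) / 2          # semi-interquartile range
p90 = quantile(times, 0.90)   # 90-percentile

print("Q1, Q2, Q3 =", q1, q2, q3)
print("SIQR =", siqr)
print("90-percentile =", p90)
```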


Mean Absolute Deviation

Mean absolute deviation is calculated as:

$$\frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$$
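A minimal sketch of the mean absolute deviation in Python. The function name and the two small samples are illustrative assumptions, chosen so that one sample is tightly clustered and the other contains an outlier.

```python
from statistics import mean

def mean_absolute_deviation(observations):
    """Average absolute distance of the observations from their mean."""
    x_bar = mean(observations)
    return sum(abs(x - x_bar) for x in observations) / len(observations)

# Illustrative samples (made-up values) with the same mean of 10.
clustered = [10, 9, 11, 10, 10]
with_outlier = [5, 5, 5, 4, 31]

print("clustered:   ", mean_absolute_deviation(clustered))
print("with outlier:", mean_absolute_deviation(with_outlier))
```

Unlike the variance, the deviations are not squared, so a single outlier inflates the result less.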


Influence of Outliers

• Range: considerably
• Sample variance: considerably, but less than range
• Mean absolute deviation: less than variance
  • Doesn’t square (aka magnify) the outliers
• SIQR: very resistant
• Use SIQR for index of dispersion whenever median is used as index of central tendency


Outline
• Measures of Central Tendency
• How to Summarize Variability?
• Comparing Systems Using Sample Data
  • Sample vs. Population
  • Confidence Interval for Mean
• Comparing Two Alternatives
• Transient Removal


Comparing Systems Using Sample Data

The words “sample” and “example” have a common root – “essample” (French)

One sample does not prove a theory - a sample is just an example

The point is that definite statements cannot be made about the characteristics of all systems.

However, probabilistic statements about the range of most systems can be made

Confidence interval concept as a building block


Sample versus Population
• Generate 1 million random numbers with mean μ and SD σ and put them in an urn
• Draw a sample of n observations
  • {x1, x2, …, xn} has mean x̄ and standard deviation s
  • x̄ is likely different from μ!
• The population mean is unknown or impossible to obtain in many real-world scenarios
  • Therefore, obtain an estimate of μ from x̄


Confidence Interval for the Mean
• Define bounds c1 and c2 such that: Prob{c1 < μ < c2} = 1-α
  • (c1, c2) is the confidence interval
  • α is the significance level
  • 100(1-α) is the confidence level
• Typically a small α is desired
  • Confidence level 90%, 95% or 99%
• One approach: take k samples, find the sample means, sort them, and take the [1+0.05(k-1)]th as c1 and the [1+0.95(k-1)]th as c2 (a 90% interval)


Central Limit Theorem
• We do not need many samples. Confidence intervals can be determined from one sample because x̄ ~ N(μ, σ/sqrt(n))
• The SD of the sample mean, σ/sqrt(n), is called the standard error
• Using the CLT, a 100(1-α)% confidence interval for a population mean is

$$\left(\bar{x} - z_{1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \bar{x} + z_{1-\alpha/2}\,\frac{s}{\sqrt{n}}\right)$$

  • z[1-α/2] is the (1-α/2)-quantile of a unit normal variate (and is obtained from a table!)
  • s is the sample SD


Confidence Interval Example
• CPU times obtained by repeating an experiment 32 times. The sorted set consists of {1.9,2.7,2.8,2.8,2.8,2.9,3.1,3.1,3.2,3.2,3.3,3.4,3.6,3.7,3.8,3.9,3.9,4.1,4.1,4.2,4.2,4.4,4.5,4.5,4.8,4.9,5.1,5.1,5.3,5.6,5.9}
• Mean = 3.9, standard deviation (s) = 0.95, n = 32
• For a 90% confidence interval z[1-α/2] = 1.645, and we get 3.90 ± (1.645)(0.95)/sqrt(32) = (3.62, 4.17)
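The interval can be recomputed from the summary statistics on the slide (mean 3.90, s = 0.95, n = 32). This is a sketch: the function name is made up, and the normal quantile is taken from Python's `statistics.NormalDist` instead of a table. Since the inputs are rounded to two decimals, the last digit of the bounds may differ slightly from the slide's values.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sample_sd, n, confidence=0.90):
    """Large-sample (n >= 30) confidence interval for the population mean."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # (1 - alpha/2)-quantile of N(0, 1)
    half_width = z * sample_sd / sqrt(n)      # z times the standard error
    return sample_mean - half_width, sample_mean + half_width

low, high = z_confidence_interval(3.90, 0.95, 32, confidence=0.90)
print(f"90% CI: ({low:.2f}, {high:.2f})")
```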


Meaning of Confidence Interval

[Figure: interval from x̄ - c to x̄ + c centered on x̄; 90% chance that this interval contains μ]

• What does this mean? With 90% confidence, we can say the population mean is within the above bounds; that is, the chance of error is 10%.
  • E.g., take 100 samples and construct CIs. In about 10 cases, the interval will not contain the population mean.


Length of Confidence Interval
• Let z[1-α/2]s/sqrt(n) = c
  • Then, z[1-α/2] = (c·sqrt(n))/s
  • Larger s implies wider confidence interval
  • Larger n implies shorter confidence interval
    • → with more observations, we are better able to predict the population mean
    • → the square-root-of-n relationship implies that increasing the observations by a factor of 4 only cuts the confidence interval by a factor of 2
• Confidence interval computation, as described here, works for n ≥ 30.


What if n not large?
• For smaller samples, we can construct confidence intervals only if the observations come from a normally distributed population

$$\left(\bar{x} - t_{[1-\alpha/2;\,n-1]}\,\frac{s}{\sqrt{n}},\ \bar{x} + t_{[1-\alpha/2;\,n-1]}\,\frac{s}{\sqrt{n}}\right)$$

• t[1-α/2;n-1] is the (1-α/2)-quantile of a t-variate with (n-1) degrees of freedom


Testing for a Zero Mean
• Check if the measured value is significantly different from zero
  • Determine the confidence interval
  • Then check if zero is inside the interval
• Procedure applicable to any other value a

[Figure: confidence intervals for the mean shown relative to 0, labeled “Mean is zero” and “Mean is nonzero”]


Comparing Two Alternatives
• Often interested in comparing systems
  • “Naïve” VOD vs. “batching” VOD (assignment 3)
  • “SJF” vs. “FIFO” request scheduling (assignment 1)
• Statistical techniques for such comparison:
  • Paired observations
  • Unpaired observations (we will omit this!)
  • Approximate visual test
• Did you use any of these in your assignments?


Paired Observations (1)
• n experiments with one-to-one correspondence between the test on system A and the test on system B
  • No correspondence => unpaired
  • This test uses the zero mean idea…
• Treat the two samples as one sample of n pairs
• For each pair, compute the difference
• Construct a confidence interval for the difference
• CI includes zero => systems not significantly different


Paired Observations (2)

• Six similar workloads used on two systems. {(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)} Is one system better?
• The performance differences are {-13.7, 13.1, -2.8, -1.1, -3.0, 5.6}
• Sample mean = -0.32, sample SD = 9.03
• CI = -0.32 ± t·sqrt(81.62/6) = -0.32 ± t(3.69)
• The 0.95-quantile of t with 5 DFs is 2.015
• 90% confidence interval = (-7.75, 7.11)
• Systems not significantly different, as zero is in the CI
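The paired-observation computation can be reproduced in a few lines. This is a sketch: the function name is made up, and the t-quantile (2.015 for 5 degrees of freedom) is passed in from a table, since Python's standard library has no t-distribution. Small differences in the last digit relative to the slide come from intermediate rounding.

```python
from math import sqrt
from statistics import mean, stdev

def paired_confidence_interval(system_a, system_b, t_quantile):
    """CI for the mean of the per-workload differences (A minus B)."""
    diffs = [a - b for a, b in zip(system_a, system_b)]
    n = len(diffs)
    d_mean = mean(diffs)
    half_width = t_quantile * stdev(diffs) / sqrt(n)  # stdev uses n - 1
    return d_mean - half_width, d_mean + half_width

# The six workloads from the slide, split by system.
system_a = [5.4, 16.6, 0.6, 1.4, 0.6, 7.3]
system_b = [19.1, 3.5, 3.4, 2.5, 3.6, 1.7]

# 0.95-quantile of t with 5 degrees of freedom, read from a table.
low, high = paired_confidence_interval(system_a, system_b, t_quantile=2.015)

# Zero-mean test: the systems differ only if zero lies outside the interval.
significantly_different = not (low <= 0 <= high)
print(f"90% CI: ({low:.2f}, {high:.2f}); different: {significantly_different}")
```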


Approximate Visual Test
• Compute confidence intervals for the means
• If the CIs don’t overlap, one system is better than the other

[Figure: three cases of confidence intervals drawn around the means]

• CIs do not overlap => alternatives different
• CIs overlap and mean of one is in the CI of the other => not significantly different
• CIs overlap but mean of one is not in the CI of the other => need more testing
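The three cases of the visual test translate into a short decision function. This is a sketch with a made-up function name and made-up intervals; each confidence interval is assumed to be given as a (low, high) pair.

```python
def visual_test(mean_a, ci_a, mean_b, ci_b):
    """Classify two alternatives from their means and confidence intervals."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b

    # Case 1: the intervals do not overlap at all.
    if high_a < low_b or high_b < low_a:
        return "different"

    # Case 2: intervals overlap and one mean lies inside the other's interval.
    a_in_b = low_b <= mean_a <= high_b
    b_in_a = low_a <= mean_b <= high_a
    if a_in_b or b_in_a:
        return "not significantly different"

    # Case 3: intervals overlap, but neither mean is inside the other interval.
    return "need more testing"

# Hypothetical intervals, one per case.
print(visual_test(10, (8, 12), 20, (18, 22)))
print(visual_test(10, (8, 12), 11, (9, 13)))
print(visual_test(10, (8, 12), 13.5, (11.5, 15.5)))
```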


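Determining Sample Size
• Goal: find the smallest sample size n that provides the desired confidence in the results
• Method:
  • Take a small set of preliminary measurements
  • Estimate the variance from the measurements
  • Use the estimate to determine the sample size for the desired accuracy
• r% accuracy => ±r% at 100(1-α)% confidence

$$\bar{x} \pm z\,\frac{s}{\sqrt{n}} = \bar{x}\left(1 \pm \frac{r}{100}\right) \quad\Rightarrow\quad n = \left(\frac{100\,z\,s}{r\,\bar{x}}\right)^2$$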
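The sample-size formula can be applied as follows. This is a sketch: the function name and the preliminary measurements (mean 20 s, SD 5 s, 5% accuracy at 95% confidence) are hypothetical values chosen for illustration, and z is taken as the (1-α/2)-quantile of the unit normal.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sample_mean, sample_sd, accuracy_percent, confidence=0.95):
    """Smallest n giving +/- accuracy_percent % at the given confidence."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = (100 * z * sample_sd / (accuracy_percent * sample_mean)) ** 2
    return ceil(n)  # round up to a whole number of observations

# Hypothetical preliminary measurements: mean 20 s, SD 5 s.
n_needed = required_sample_size(20.0, 5.0, accuracy_percent=5.0, confidence=0.95)
print("observations needed:", n_needed)
```

Because n grows with the square of s/(r·x̄), halving the allowed error r roughly quadruples the number of observations needed.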


Transient Removal
• In many simulations, we are interested in steady state performance
  • Remove the initial transient state
• However, defining exactly what constitutes the end of the transient state is difficult!
• Several heuristics developed:
  • Long runs
  • Proper initialization
  • Truncation
  • Initial data deletion
  • Moving average of replications
  • Batch means


Long Runs
• Use very long runs
• Impact of the transient state becomes negligible
• Wasteful use of resources
• How long is “long enough”?
• Raj Jain text recommends that this method not be used in isolation


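Batch Means
• Run the simulation for a long duration
• Divide the observations (N) into m batches, each of size n
• Compute the variance of the batch means using the procedure shown, for n = 2, 3, 4, 5, …
• Plot variance vs. batch size

1) Compute the batch means:

$$\bar{x}_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij}, \quad i = 1, 2, \ldots, m$$

2) Compute the overall mean:

$$\bar{\bar{x}} = \frac{1}{m} \sum_{i=1}^{m} \bar{x}_i$$

3) Compute the variance of the batch means:

$$\mathrm{Var}(\bar{x}) = \frac{1}{m-1} \sum_{i=1}^{m} (\bar{x}_i - \bar{\bar{x}})^2$$

[Figure: variance of batch means plotted against batch size n, with annotations “Ignore” and “Transient interval”]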
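The three-step batch means procedure can be sketched in Python as follows. The function names are made up, and dropping trailing observations that do not fill a complete batch is an assumption (the slide does not say how leftovers are handled). `statistics.variance` computes the overall mean of the batch means internally and divides by (m-1), matching steps 2 and 3.

```python
from statistics import mean, variance

def batch_means_variance(observations, batch_size):
    """Variance of the batch means for one batch size n."""
    m = len(observations) // batch_size   # number of complete batches
    if m < 2:
        raise ValueError("need at least two complete batches")
    # Step 1: mean of each batch (leftover observations are dropped).
    batch_means = [
        mean(observations[i * batch_size:(i + 1) * batch_size])
        for i in range(m)
    ]
    # Steps 2 and 3: sample variance of the batch means around their mean.
    return variance(batch_means)

def variance_by_batch_size(observations, batch_sizes):
    """Values to plot: variance of batch means against batch size."""
    return {n: batch_means_variance(observations, n) for n in batch_sizes}

# Tiny made-up run, only to show the mechanics.
run = [1, 2, 3, 4, 5, 6, 7, 8]
curve = variance_by_batch_size(run, [2, 4])
print(curve)
```

For a real simulation run, the resulting dictionary would be plotted with batch size on the horizontal axis to locate where the variance starts to fall off.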
