Lecture 4. For assignment 2, recode the smoking variable replace smoke=smoke-1.


• For assignment 2, recode the smoking variable

• replace smoke=smoke-1


Recap from probability distributions

• The binomial distribution describes the probability of x successes in n independent trials, each with probability p of success

• The normal distribution may be used to describe cutoffs for some continuous random variables with mean µ and standard deviation σ

– How do you know if your data are normally distributed?

• Histograms (Stata: hist varname, normal)

• QQ plots – next Biostat class

• Other statistical tests

– What to do if my data are not normal?

• Transformations – like taking the log, or the inverse 1/x …

• We will discuss an important property of the normal distribution today that applies to non-normally distributed data


Sampling

• When we cannot measure the entire population we take a sample

• We estimate the population characteristics, i.e. the mean and variance, using the sample

• We use statistical inference to draw conclusions about how the estimates from the sample relate to the population values

Pagano and Gauvreau, Chapter 8


• To make inference from our sample to the population, our sample must be representative of the population

– Random sample – each individual in the population has an equal chance of being selected for the sample

– The larger the sample, the more reliable our estimates about the population parameters will be

• Because we do not have the entire population, there is some level of uncertainty about our data

• Confidence intervals afford us a way to quantify this uncertainty


Sampling distributions

• Imagine you drew a sample of size n from a population and measured a random variable X, say systolic blood pressure

• Calculate the sample mean, X̄1, from the xi

• Take another sample of size n, calculate X̄2

• If you repeat this for a long time you will have a large collection of X̄s generated from the samples of size n

• The X̄s and standard deviations will differ from sample to sample due to sampling variability – each sample will most likely be different


• The collection of all of the possible X̄s that can be obtained can be thought of as random variables that themselves follow a distribution

• This distribution is called the sampling distribution

• Imagine having a data set just of the means, the X̄s, and plotting the histogram to see the shape of their distribution


• Why bother?

– Sampling distributions of sample means have special properties that allow us to make inference about the mean of a single sample

– We need this theory to be able to calculate confidence intervals or calculate p-values


Central limit theorem

• If you have a random variable that comes from a distribution with mean µ and standard deviation σ, the following is true for the sampling distribution of the sample means from samples of size n

– The mean of the sampling distribution (the distribution of all of the possible sample means) is µ

– The standard deviation of the sampling distribution is σ/√n

– If n is large enough, the shape of the sampling distribution is approximately normal

Central limit theorem

• Central limit theorem (CLT)

– If the observations are drawn independently from an underlying population with mean µ and standard deviation σ, then if we take a random sample of size n and n is large enough, the distribution of the sample mean is approximately normal.

– The mean of the sampling distribution will be µ and the standard deviation will be σ/√n.

Central limit theorem

• If we take a sample of size n from any distribution (could be skewed, or discrete, or whatever), take the mean, and do this over and over, the distribution of the means will be approximately normal with mean = the original distribution mean µ and standard deviation = σ/√n, if n is large enough

• The more symmetric the distribution of the raw data (not the means), the smaller the n needed to become normal-like

• Why does this make sense?

– It makes sense that the distribution of means would cluster around the population mean

– It makes sense that the variability in the means is smaller than in the raw data because the extreme values are already averaged out

– The part about the distribution being normal if n is large enough? Mathematical proof that I can’t do… But we can demonstrate it!
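One way to demonstrate the theorem is by simulation. The following is a minimal Stata sketch, not the code behind the lecture's figures. It assumes chi-square draws with 2 degrees of freedom (so µ = 2 and σ = 2), samples of size 30, and 1,000 repeated samples; the seed and the variable names xsum and xbar are arbitrary choices.

```stata
clear
set seed 20240401          // arbitrary seed, only for reproducibility
set obs 1000               // each observation stands for one repeated sample
local n = 30               // assumed sample size

* accumulate the sum of n chi-square(2) draws, then divide by n
generate double xsum = 0
forvalues i = 1/`n' {
    quietly replace xsum = xsum + rchi2(2)
}
generate double xbar = xsum / `n'

* the mean of xbar should be near mu = 2, its SD near sigma/sqrt(n)
summarize xbar
display "theoretical standard error = " %6.4f (2/sqrt(`n'))

* histogram of the sample means with a normal curve overlaid
histogram xbar, fraction normal
```

Changing the local n to 2, 5, 10 or 20 shows how the skewness of the raw draws fades as the sample size grows.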


Note

• σ is the standard deviation of the original distribution

• σ/√n is called the standard error, or more precisely, the standard error of the mean, and it is the standard deviation of the distribution of the sample mean.

Central limit theorem example

• http://www.surveymonkey.com/s/F5VVHQZ

• Download to excel

• Change variable names

• Import into stata

• Histogram using dayofbirth1

• Create average using self and first relative day of birth (n=2), dayofbirth_avg2

• Make a histogram of dayofbirth_avg2

• Repeat for using first 4, then 8 days of birth

Random draws from the chi-square distribution (µ=2)

[Figure: histogram of the random draws. x-axis: x; y-axis: Fraction]

Means of 5 random draws from the chi-square distribution (µ=2)

[Figure: histogram. x-axis: mean of sample of size 5; y-axis: Fraction]

Means of 10 random draws from the chi-square distribution (µ=2)

[Figure: histogram. x-axis: mean of sample of size 10; y-axis: Fraction]

Means of 20 random draws from the chi-square distribution (µ=2)

[Figure: histogram. x-axis: mean of sample of size 20; y-axis: Fraction]

Means of 30 random draws from the chi-square distribution (µ=2)

[Figure: histogram. x-axis: mean of sample of size 30; y-axis: Fraction]

Distributions of the means of chi-square random draws

[Figure: six histogram panels: Random draws; Means of 2 random draws; Means of 5 random draws; Means of 10 random draws; Means of 20 random draws; Means of 30 random draws. y-axis: Fraction]

Distributions of the means of binomial random draws

[Figure: six histogram panels: Random binomial draws (x-axis: x successes); Means of 2, 5, 10, 20 and 30 random binomial draws (x-axis: mean number of successes). y-axis: Fraction]

Using the CLT

• Suppose you sampled from an HIV-infected population with mean CD4 count µ = 250 cells/mm3 and standard deviation σ = 200 cells/mm3.

• If we select repeated samples of size 50, what proportion of the samples will have a mean value of less than 100 cells/mm3?

• Using the CLT, we know that the sample means, X̄, will follow a normal distribution with mean µ = 250 and standard error σ/√n = 200/√50

• Then we know that (X̄ − 250)/(200/√50) ~ N(0,1)

• If the mean cutoff value is 100, we want P(X̄ < 100), so

z = (100 − 250)/(200/√50) = −150/(200/7.07) = −150/28.3 = −5.3

P(Z < −5.3) = ?
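A tail probability this far out is beyond most printed tables, but Stata can evaluate it directly. A minimal sketch using the numbers from this slide; normal() returns the lower-tail area P(Z < z) of the standard normal.

```stata
* z-score for a sample mean of 100 when mu = 250, sigma = 200, n = 50
display ((100 - 250) / (200 / sqrt(50)))

* P(Z < z), i.e. the proportion of sample means expected below 100
display normal((100 - 250) / (200 / sqrt(50)))
```

The result is very close to zero: a sample mean below 100 would be extremely unusual for samples of size 50 from this population.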


Using the CLT

• What level of CD4 count is the lower 10th percentile of the mean values?

• P(Z<=z)=.10 for what value of z?

• Table A.3 gives the value P(Z>=z)

• By symmetry, P(Z<=-z) = P(Z>=z) for the same value of z

• The value of z for which P(Z>=z) = 0.10 is ____

• The lower 10th percentile cutoff for z is ____

• Now we need to transform back to get X̄

• Using

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

Using the CLT

• What level of CD4 count is the lower 2.5th percentile of the mean values?

• P(Z<=z)=.025 for what value of z?

– Remember the tails of the standard normal distribution or look up in Table A.3

• The value of z for which P(Z<=z) = 0.025 is ____

• The value of z for which P(Z>z) = 0.025 is ____

• Now we need to transform back to get X̄

• Using

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

Using the CLT

• What level of CD4 count is the upper 2.5th percentile of the mean values?

• P(Z>=z)=.025 for what value of z?

• The value of z for which P(Z>=z) = 0.025 is ____

• The upper 2.5th percentile cutoff for z is ____

• Now we need to transform back to get X̄

• Using

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

• Now we have the lower and upper 2.5% percentiles of the distribution of the sample means.

• The interior area contains 95% of the sample means.

• 95% of the means from sample size 50 lie within the 95% confidence bounds (194.6, 305.4)

• If we selected a sample of size 50 and the sample mean was outside these percentiles, we might suspect it came from an underlying population with a different population mean and standard deviation, or that a rare (5% probability) event had occurred.


• The confidence interval for the mean depends on the sample size, n. If the sample size were 300, what would be the interval?

• -1.96 <= (X̄ – 250)/(200/√300) <= 1.96

• The lower and upper limits would be:

227.4 <= X̄ <= 272.6

which are narrower than the limits for n=50

(194.6, 305.4)

• As n increases, the width of the interval decreases
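The limits quoted for n = 50 and n = 300 can be checked with a few display commands. This is a sketch using only the slide's values (µ = 250, σ = 200); invnormal() is Stata's inverse of normal() and returns the z cutoff for a given lower-tail area, which is an alternative to looking the value up in Table A.3.

```stata
* standard error of the mean for the two sample sizes
display %6.2f (200 / sqrt(50))
display %6.2f (200 / sqrt(300))

* central 95% range of the sample means, n = 50
display %6.1f (250 - 1.96*200/sqrt(50)) "  " %6.1f (250 + 1.96*200/sqrt(50))

* central 95% range of the sample means, n = 300
display %6.1f (250 - 1.96*200/sqrt(300)) "  " %6.1f (250 + 1.96*200/sqrt(300))

* z cutoff with 2.5% of the area below it
display invnormal(0.025)
```

Any other percentile of the sample means follows the same pattern: 250 + invnormal(p)*200/sqrt(n).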


Confidence intervals for means

• X̄, the sample mean, is a point estimate of µ, the population mean

• Different samples will yield different X̄s, so we cannot be certain how our estimate differs from µ

• Interval estimation provides a range of reasonable values that contain the population parameter (in this case µ) with a certain degree of confidence

• This interval is called a confidence interval

Pagano and Gauvreau, Chapter 9

Confidence intervals for means

• We put together what we learned about the normal distribution and the central limit theorem in order to construct confidence intervals

• By the CLT, X̄ follows a normal distribution if n is sufficiently large: X̄ ~ N(µ, σ/√n)

• So

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

follows a standard normal distribution, Z ~ N(0,1)

Confidence intervals for means

• We know from examining the standard normal distribution that P(-1.96 ≤ Z ≤ 1.96) = 0.95

[Figure: standard normal distribution, x-axis from -5 to 5, with 95% of the area in the center and 2.5% in each tail]

Confidence intervals for means

• P(-1.96 ≤ Z ≤ 1.96) = 0.95

Substituting the formula for Z into the above we get

$$P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$$

Multiplying by σ/√n, subtracting X̄, and multiplying by -1 (which reverses the inequalities), we get

$$P\left(\bar{X} - 1.96\,\sigma/\sqrt{n} \le \mu \le \bar{X} + 1.96\,\sigma/\sqrt{n}\right) = 0.95$$

Confidence intervals for means

Thus the lower 95% confidence limit for µ is

$$\bar{X} - 1.96\,\sigma/\sqrt{n}$$

And the upper 95% confidence limit for µ is

$$\bar{X} + 1.96\,\sigma/\sqrt{n}$$

So we are 95% confident that the interval we calculate using the above includes µ

Confidence intervals for means

An important subtlety:

X̄ is a random variable

µ is a population parameter that is fixed in perpetuity; it has the same value irrespective of the sample

µ is either in the interval you calculate or it is not

What is random is the interval, because it is based on the sample: (X̄ − 1.96σ/√n, X̄ + 1.96σ/√n)

Interpreting confidence intervals for means

• Before the sample is drawn, the probability that the interval will contain the true population mean is 95%

• If we were to select 100 random samples from the population and calculate confidence intervals for each, approximately 95 of them would include the true population mean µ (and 5 would not)
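The repeated-sampling interpretation can be checked by simulation. The sketch below is an illustration under stated assumptions, not part of the lecture: it draws 1,000 hypothetical samples of size 50 from a normal population with µ = 250 and σ = 200 (values borrowed from the CD4 example), builds the known-σ 95% interval for each, and counts how many contain µ. The seed and variable names are arbitrary.

```stata
clear
set seed 4321              // arbitrary seed
set obs 1000               // each observation stands for one repeated sample
local n = 50

* sample mean of n draws from N(250, 200)
generate double xsum = 0
forvalues i = 1/`n' {
    quietly replace xsum = xsum + rnormal(250, 200)
}
generate double xbar = xsum / `n'

* 95% interval for each sample, using the known sigma
generate double lower = xbar - 1.96 * 200 / sqrt(`n')
generate double upper = xbar + 1.96 * 200 / sqrt(`n')

* indicator: does the interval contain the true mean?
generate byte covered = (lower <= 250) & (250 <= upper)
summarize covered
```

The mean of covered is the proportion of intervals that include µ and should be close to 0.95.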


Confidence intervals for means

• 90% confidence interval

– Replace 1.96 in the formula with 1.64

• 99% confidence interval

– Replace 1.96 in the interval with 2.58

Generic formula:

$$\left(\bar{X} - z_{\alpha/2}\,\sigma/\sqrt{n},\ \bar{X} + z_{\alpha/2}\,\sigma/\sqrt{n}\right)$$

Where 100%*(1-α) is the % of the confidence interval

E.g. for a 95% confidence interval, α=0.05, and we use z0.025 = 1.96

Confidence intervals for means

• How to get a tighter interval?

– Decrease the confidence level

– Increase n

Confidence level    α      zα/2
99%                 .01    2.58
95%                 .05    1.96
90%                 .10    1.64
80%                 .20    1.28

n       95% confidence limits              Length of interval
10      X̄ ± 1.96σ/√10 = X̄ ± 0.620σ       1.240σ
100     X̄ ± 1.96σ/√100 = X̄ ± 0.196σ      0.392σ
1000    X̄ ± 1.96σ/√1000 = X̄ ± 0.062σ     0.124σ
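Both tables can be regenerated in Stata rather than read from Table A.3. A small sketch: invnormal(1 - α/2) gives the z cutoff for each confidence level, and 1.96/sqrt(n) is the half-width of the 95% interval in units of σ. The loop variable names are arbitrary.

```stata
* z cutoffs for 99%, 95%, 90% and 80% confidence
foreach a of numlist 0.01 0.05 0.10 0.20 {
    display "alpha = " %4.2f `a' "   z = " %5.3f invnormal(1 - `a'/2)
}

* half-width of the 95% interval, in units of sigma
foreach n of numlist 10 100 1000 {
    display "n = " %4.0f `n' "   1.96/sqrt(n) = " %5.3f (1.96/sqrt(`n'))
}
```

Each tenfold increase in n shrinks the interval by a factor of √10, roughly 3.2.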


• Uniform distribution demonstration


Confidence intervals for means

• What to do when σ is not known? (In practice, always)

• By the central limit theorem,

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

follows a normal distribution, if n is sufficiently large

• Can we substitute s, the sample standard deviation, for σ?

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

• s is not a reliable estimate of σ if n is small

• If X is normally distributed, and a sample of size n is chosen, then

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

follows a Student’s t distribution with n-1 degrees of freedom

This is denoted tn-1

Student’s t distribution

• The mean of the t distribution is 0; its standard deviation is slightly larger than 1 (√(df/(df-2)) for df > 2) and approaches 1 as the degrees of freedom increase

• The t distribution is symmetric and bell-shaped, but has heavier tails than the standard normal – extreme values are more likely to occur

• For small n, the tails are fatter

• For large n, the t distribution approaches (i.e. becomes indistinguishable from) the standard normal distribution

The t-distribution


Student’s t distribution

• There are separate curves for each degree of freedom (df)

– Table A.4 gives t value for selected P(T>t) and selected df

• Better to use Stata:

• P(T>t) is calculated using ttail (note that normal(), in contrast, gives P(Z<z)!)

• The code is ttail(df,t)

• E.g., P(T>1.95) n=20

display ttail(19,1.95)

.03304428

• To find the value t for which P(T>t) = p, use invttail(df,p)

• The t cutoff for df=19 and p=.05 is denoted t19,.05

display invttail(19,.05)

1.7291328

Confidence intervals for means

• So using the t-distribution, the general formula for a 100*(1-α)% confidence interval for a mean is:

$$\left(\bar{X} - t_{df,\alpha/2}\, s/\sqrt{n},\ \bar{X} + t_{df,\alpha/2}\, s/\sqrt{n}\right)$$

• The formula for a 95% confidence interval for a mean is:

$$\left(\bar{X} - t_{df,0.025}\, s/\sqrt{n},\ \bar{X} + t_{df,0.025}\, s/\sqrt{n}\right)$$

Where df = n-1

Confidence intervals for means

• Remember that when n is large, the t distribution approaches the normal distribution

– E.g. z0.025 = 1.96

– While tn-1,0.025 =

n       tn-1,0.025
2       12.706
3       4.303
5       2.776
10      2.262
50      2.010
100     1.984
200     1.972
300     1.968
500     1.965
1000    1.962
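The convergence of the t cutoff to 1.96 can be reproduced with invttail(). A short sketch for a handful of sample sizes; the loop variable name is arbitrary, and the degrees of freedom are n - 1 in each case.

```stata
* upper 2.5% cutoff of the t distribution with n-1 degrees of freedom
foreach n of numlist 2 5 10 50 100 1000 {
    display "n = " %5.0f `n' "   t = " %7.3f invttail(`n' - 1, 0.025)
}

* the corresponding standard normal cutoff, for comparison
display "z = " %5.3f invnormal(0.975)
```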


Confidence intervals for means

• Example

• CD4 cell count among HIV positives diagnosed at Mulago Hospital

– N=270

– Sample mean = 296.6

– Sample SD = 255.4

– t269,.025 = 1.969

– 95% CI = (296.6 – 1.969*255.4/√270, 296.6 + 1.969*255.4/√270) = (266.0, 327.2)
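The interval can be recomputed from the summary statistics alone. This is a minimal sketch using the values that appear in the interval calculation above (n = 270, mean 296.6, SD 255.4); the local macro names are arbitrary.

```stata
* summary statistics from the example
local n    = 270
local xbar = 296.6      // value used in the slide's interval calculation
local s    = 255.4

* t cutoff with n-1 degrees of freedom and the standard error of the mean
local t  = invttail(`n' - 1, 0.025)
local se = `s' / sqrt(`n')

display "t cutoff   = " %6.3f `t'
display "std. error = " %6.2f `se'
display "95% CI     = (" %6.1f (`xbar' - `t'*`se') ", " %6.1f (`xbar' + `t'*`se') ")"
```

With the raw data in memory, the ci command shown later in the lecture produces the same kind of interval directly.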


• Note that some statistical output gives you the SE or the SEM, which stand for standard error or standard error of the mean.

• This is s/√n, which estimates the standard deviation of the distribution of X̄

• Remember, if X is a random variable with mean µ and standard deviation σ, then if n is large enough, X̄ is approximately normally distributed with mean µ and standard deviation σ/√n

. summ ncigs

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       ncigs |       545    .1963303    1.405081          0         20

. mean ncigs

Mean estimation                     Number of obs    =     545

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       ncigs |   .1963303    .060187      .0781028    .3145577
--------------------------------------------------------------

. ci ncigs

    Variable |       Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+--------------------------------------------------------------
       ncigs |       545    .1963303     .060187        .0781028    .3145577

Normal approximation to the binomial distribution

• Remember that binomial distributions are used to describe the number of successes in n trials, P(X=x)

• The parameters of the binomial distribution are n and p, and the mean=np and standard deviation=square root of (np(1-p))

• As n increases, the binomial distribution more closely resembles the normal distribution

[Figure: binomial probability distributions for p=0.05 with n=10, n=20, n=50 and n=100. x-axis: n successes; y-axis: Binomial probability]

[Figure: binomial probability distributions for p=0.35 with n=10, n=20, n=50 and n=100. x-axis: n successes; y-axis: Binomial probability]

Normal approximation to the binomial distribution

• Note that the binomial distribution approaches normality at smaller sample sizes when p is closer to 0.5

• Therefore you could use the normal distribution to look up the probability of observing X or more (or less) successes, using np as the mean and square root[ np(1-p)] as the standard deviation

• Considered valid when np and np(1-p) ≥5


• It is easier to use the normal distribution than to use table A.1. For example, if n=50 and you wanted to know P(X>=30), using table A.1, which gives you P(X=x), you would need to find

P(X=30) + P(X=31) + .... + P(X=50)

(Although in Stata the binomialtail function does actually give you P(X≥x))

• Continuity correction – because you are approximating a discrete distribution with a continuous one, shift the cutoff by 0.5 so that the whole bar for the boundary value is included, e.g. approximate P(X≥30) by the normal area above 29.5 and P(X≤30) by the normal area below 30.5
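To see what the continuity correction does, the exact and approximate probabilities can be compared in Stata. The slide does not give p for this example, so p = 0.5 below is an assumption made purely for illustration; binomialtail(n,k,p) returns P(X ≥ k).

```stata
local n  = 50
local p  = 0.5           // assumed for illustration; the slide does not give p
local mu = `n' * `p'
local sd = sqrt(`n' * `p' * (1 - `p'))

* exact P(X >= 30)
display binomialtail(`n', 30, `p')

* normal approximation without the continuity correction
display 1 - normal((30 - `mu') / `sd')

* normal approximation with the continuity correction (cutoff 29.5)
display 1 - normal((29.5 - `mu') / `sd')
```

Under this assumed p, the corrected approximation lands much closer to the exact value than the uncorrected one.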


Sampling distribution of a proportion

• Previous slides were about estimating X, the number of successes

• We often are more interested in the proportion of successes, rather than the number of successes

• The true population proportion p is estimated by p̂ (p hat)

$$\hat{p} = x/n$$

x = the number of successes/events

n = the number of trials/people/observations

Sampling distribution of a proportion

• In our data set, we use 1s to represent the event and 0 to represent no event for each person or observation

• These are Bernoulli random variables – they have the value 1 with probability p and the value 0 with probability 1-p

• The mean of the population of event/no events is p (this is the Bernoulli mean)

• The standard deviation σ of the population of events/no events is √(p(1-p)) (this is the Bernoulli standard deviation)

Sampling distribution of a proportion

• If we take repeated samples of size n, and calculate p hat = x/n for each of the samples, if n is large enough, the p hats will follow a normal distribution (by the central limit theorem)

– Remember that the sampling distribution is needed to be able to make inference about a single sample

• The mean of this sampling distribution is p

• The standard deviation is

$$\sigma/\sqrt{n} = \sqrt{\frac{p(1-p)}{n}}$$

which is also called the standard error

Sampling distribution of proportions

• Then

$$Z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \sim N(0,1)$$

• This holds when np≥5 and np(1-p)≥5, and also by the CLT

Then we can use the normal distribution to calculate probabilities of observing certain proportions in a sample

• E.g. What proportion of samples of size 50 from a population with p=.10 will have a p hat of .20 or higher?

• What is P(p hat ≥ 0.20)?

• Mean=0.10; SD= √(.10*.90) =0.3; SEM = SD/√50 = 0.0424

• P(Z ≥ ((.20-.10)/.0424)) = P(Z ≥ 2.36)

. display 1-normal(2.36)

.00913747

Pagano and Gauvreau, Chapter 14
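As a check on the approximation, the same question can be asked of the exact binomial distribution: a p hat of .20 or higher in a sample of 50 means 10 or more events. This comparison is an addition for illustration and is not part of the original example.

```stata
* normal approximation, as on the slide (z computed without rounding)
display 1 - normal((0.20 - 0.10) / sqrt(0.10 * 0.90 / 50))

* exact binomial probability of 10 or more events out of 50 when p = 0.10
display binomialtail(50, 10, 0.10)
```

Here np(1-p) = 4.5, just under the rule of thumb of 5, so the two answers need not agree closely.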


Confidence intervals for proportions

$$Z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \sim N(0,1)$$

– So

$$P\left(-1.96 \le \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \le 1.96\right) = 0.95$$

• Rearranging, we get

$$P\left(\hat{p} - 1.96\sqrt{p(1-p)/n} \le p \le \hat{p} + 1.96\sqrt{p(1-p)/n}\right) = 0.95$$

• Lower 95% confidence limit:

$$\hat{p} - 1.96\sqrt{\frac{p(1-p)}{n}}$$

• Upper 95% confidence limit:

$$\hat{p} + 1.96\sqrt{\frac{p(1-p)}{n}}$$

Confidence intervals for proportions

• However we don’t know p (if we did we wouldn’t be calculating these intervals). So we substitute p hat into the formula for the SEM.

• Lower 95% confidence limit:

$$\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

• Upper 95% confidence limit:

$$\hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

• This interval has a 95% chance of containing the true population parameter p

Confidence intervals for proportions

• HIV prevalence in those testing at Mulago Hospital

– N=933

– n HIV+ = 269

– Prevalence = 269/933 = 0.288

– Standard error estimate = sqrt[ .288*(1-.288)/933 ] = 0.015

– 95% CI: (.288 – 1.96*.015, .288 + 1.96*.015) = (.259, .317)

– Interpretation: we are 95% confident that the interval (0.259, 0.317) includes the true HIV prevalence in the population
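The same calculation in Stata, as a minimal sketch from the counts on the slide (269 positives out of 933 tested); the local macro names are arbitrary.

```stata
* counts from the example
local n = 933
local x = 269

* estimated proportion and its standard error
local phat = `x' / `n'
local se   = sqrt(`phat' * (1 - `phat') / `n')

display "p hat  = " %5.3f `phat'
display "SE     = " %5.3f `se'
display "95% CI = (" %5.3f (`phat' - 1.96*`se') ", " %5.3f (`phat' + 1.96*`se') ")"
```

Carrying the unrounded standard error through the calculation gives the same limits to three decimals as the hand calculation.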


Key points

• It is not practical or feasible to study an entire population, so we take a sample

• We need to make inference from our sample to the population

• We use the properties of repeated samples to do so

• For any random variable X with mean µ and standard deviation σ, if the sample size n is large enough, the distribution of the sample mean is approximately normal with mean µ and standard deviation σ/√n

• We use this to calculate intervals with known probability of containing the population mean


For next time

• Read Pagano and Gauvreau

– Chapter 8, 9, and 14 (pages 324-329) (Review of today’s material)

– Chapter 10 and 14 (pages 329-330)
