STATA intro
Tutorial: Life Tables in Stata
Life tables list the death rates experienced by a population over a given period of time. They have many practical uses. For example, insurance companies use them to determine premiums and annuities; the government uses them to plan for social security.
Life tables are easy to compute in Stata through the use of the ltable command. To begin, download the lifetable.dta dataset from the course website and open it in Stata. This dataset was generated from one of the first life tables recorded, dating back to the late __th century.
What is the mean life span? What is the median?
Command: summ age, detail
Get the mean and the median from the output; the median is the 50% value in the percentile column.
What does the histogram of age at death look like? Is it symmetric?
Graphics > Histogram > Select age as variable
Command: histogram age
Symmetric after an initial peak in death times around age __.
Use the ltable command to generate a life table.
a) What is the chance of surviving from birth until age __?
Command: ltable age
But if we use this command, all the intervals are of length __, which isn't very helpful. So we will use the interval option. We want intervals of length __.
Command: ltable age, interval(__) (start at __, end at __, in steps of __). SEE BELOW.
b) What is the proportion of individuals alive on their __th birthday who die before their __th birthday?
People alive at age __ = value in __ = __ people. People who died at age __ = __.
Therefore, proportion = __.
c) What is the chance that a __-year-old will survive __ years?
How many people were alive at age __? = value in row __
How many people were alive at age __? = value in row __
Proportion = __
d) What is the chance that a __-year-old will survive to age __?
For a) above, find the number of people alive at age __ (the __ row). In this case the value was __.
Hence our value for survival until age __ = __. Alternatively, the value in the survival column at the age __ row is __ = answer for a).
d) Alive at age __ = __; alive at age __ = __. Proportion = __.
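The calculations in parts b) to d) each divide one life-table count by another. The following is a minimal sketch of that arithmetic in Stata. The three counts are hypothetical values chosen for illustration; they are not taken from lifetable.dta.

```stata
* Hypothetical life-table counts (assumptions, NOT from lifetable.dta)
scalar alive_start = 1000   // alive at birth
scalar alive_x     = 800    // alive at an earlier age x
scalar alive_y     = 600    // alive at a later age y

* Chance of surviving from birth to age y
display "P(survive birth to y)     = " alive_y / alive_start
* Chance that someone alive at age x survives to age y
display "P(survive to y | alive x) = " alive_y / alive_x
* Proportion of those alive at age x who die before age y
display "P(die before y | alive x) = " (alive_x - alive_y) / alive_x
```

The last two quantities sum to 1, which is a quick check that the right rows of the table were used.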
Example: Probability of hypertension at baseline
In the Framingham dataset, __ of the participants did not have hypertension at baseline and __ did have hypertension at baseline. Using this information, what is the probability that a randomly selected participant in the Framingham study had hypertension at baseline?
What is the probability that this participant did not have hypertension at baseline? Are these events mutually exclusive, exhaustive, neither, or both? What is the probability that three randomly selected participants all do not have hypertension at baseline?
Suppose we again randomly select two participants from this population. What is the probability that both participants have hypertension at baseline, given that at least one of the participants had hypertension?
Example: Relationship between hypertension and CHD using probability laws
We examine the relationship between hypertension and CHD at baseline in the Framingham study population using the concepts of probability learned this week.
a) What is the probability that a Framingham participant has hypertension or CHD at baseline?
b) Are these two events independent? Would you expect these events to be independent?
c) What is the probability that a participant has CHD at baseline? What is the probability that a participant has CHD at baseline, given that he/she has hypertension?
Tutorial: ROC Curves in Stata
ROC curves illustrate the inherent tradeoff between sensitivity and specificity. We examine ROC curves in the context of risk prediction.
Consider the following scenario: you are responsible for telling a patient that they are at high or low risk for CHD, given some baseline prognostic factors. Using the Framingham dataset, you can predict the probability that an individual gets CHD, given their baseline prognostic factors.
Constructing an ROC curve to evaluate a risk prediction model:
1. Using systolic blood pressure, number of cigarettes smoked per day, total cholesterol, sex, and BMI at baseline, predict the probability that each individual in the Framingham dataset had CHD. Call this probability p.
2. As in the diagnostic testing setting, select a cutoff probability c to distinguish high and low risk patients. If p < c, the patient is low risk. If p ≥ c, the patient is high risk.
3. Classify all patients in the dataset as high risk or low risk using the cutoff c.
4. Calculate P(high risk | CHD) = sensitivity. Calculate P(high risk | no CHD) = 1 - specificity.
Steps 1-4 are beyond the scope of this module. These values are provided for you in the dataset roc.dta. Open the dataset roc.dta in Stata.
5. For the various values of c, plot the false positive rate versus sensitivity. Connect the lines to generate your ROC curve.
Consider the following questions:
How do the sensitivity and specificity change as the cutoff increases from 0 to 1?
What value of c would you choose in distinguishing high-risk versus low-risk patients? Why?
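Although the classification steps are outside the scope of the module, a small illustration may make the ROC points easier to interpret. The sketch below uses six made-up patients and a made-up cutoff of 0.5, and assumes a patient is called high risk when the predicted probability is at least the cutoff. The variable names p, chd and highrisk are hypothetical.

```stata
clear
* Toy data (hypothetical): predicted probability p and observed CHD status
input p chd
0.10 0
0.20 0
0.35 1
0.40 0
0.60 1
0.80 1
end

scalar cutoff = 0.5                      // hypothetical cutoff
generate byte highrisk = (p >= cutoff)   // classify each patient

* Sensitivity = P(high risk | CHD)
quietly count if highrisk == 1 & chd == 1
scalar tp = r(N)
quietly count if chd == 1
scalar sens = tp / r(N)

* Specificity = P(low risk | no CHD)
quietly count if highrisk == 0 & chd == 0
scalar tn = r(N)
quietly count if chd == 0
scalar spec = tn / r(N)

display "sensitivity = " %6.4f sens "   false positive rate = " %6.4f 1 - spec
```

Repeating this for a grid of cutoffs yields one (false positive rate, sensitivity) pair per cutoff, which is the kind of table stored in roc.dta.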
Table: Points on ROC curve for risk prediction model
Cutoff c   Sensitivity   Specificity   False Positive (1 - Specificity)
1          0             1.0000        0.0000
0.7901     0.0051        1.0000        0.0000
0.7152     0.0071        0.9988        0.0012
0.6592     0.0111        0.9966        0.0034
0.6055     0.0547        0.9931        0.0069
0.5695     0.0993        0.9875        0.0125
0.5055     0.1682        0.9654        0.0346
0.4595     0.2381        0.9480        0.0520
0.4049     0.3506        0.9221        0.0779
0.3555     0.4205        0.8794        0.1206
0.3029     0.5228        0.7997        0.2003
0.2545     0.6383        0.6963        0.3037
0.2031     0.7285        0.5723        0.4277
0.1559     0.8379        0.4184        0.5816
0.1044     0.9119        0.2408        0.7592
0.0571     0.9899        0.0551        0.9449
0          1.0000        0             1
[Plot: ROC curve for risk prediction model. Axis labels: Specificity, False positive rate.]
Tutorial: More on ROC curves and complicated graphs in Stata
We construct a new, simpler risk prediction model using only systolic blood pressure, diastolic blood pressure, and age as our prognostic factors. We compare this risk prediction model to the model in the previous tutorial.
Open the dataset roc.dta on the course website.
We construct a plot that includes:
the ROC curve for the first model from the previous tutorial with many prognostic factors, called Model 1,
the second model with only sex and blood pressure (Model 2), and
a reference line representing arbitrary classification as high or low risk.
Overlaying lines in Stata is relatively easy within the Twoway graph window. Using the ROC plot, consider the following questions:
Model 1 outperforms Model 2. How can you tell this from the ROC curve?
Which model would you recommend?
Later in the course, we learn how to fit the model to obtain the predicted risks. With new biomarkers and genetic risk factors popping up all the time, risk prediction is a hot topic in statistics right now, and ROC curves are used frequently.
Table: Points on ROC curve for Model 2
Cutoff c   Sensitivity   Specificity   False Positive (1 - Specificity)
1          0.0000        1.0000        0.0000
0.7901     0.0273        0.9997        0.0003
0.7152     0.0298        0.9991        0.0009
0.6592     0.0324        0.9975        0.0025
0.6055     0.0434        0.9942        0.0058
0.5695     0.0792        0.9804        0.0196
0.5055     0.1014        0.9699        0.0301
0.4595     0.2002        0.9460        0.0540
0.4049     0.2666        0.9064        0.0936
0.3555     0.4370        0.8224        0.1776
0.3029     0.5860        0.7006        0.2994
0.2545     0.6959        0.6046        0.3954
0.2031     0.7641        0.4736        0.5264
0.1559     0.8739        0.3371        0.6629
0.1044     0.9838        0.0865        0.9135
0.0571     1.0000        0.0000        1.0000
0          1.0000        0.0000        1.0000
[Plot: ROC curves for Models 1 and 2 with reference line. Axes: Sensitivity versus False positive rate. Legend: Model 1, Model 2.]
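The same overlay can be produced from the command line instead of the Twoway graph window. The sketch below types in a handful of rows from the two ROC tables so that it runs on its own; the variable names cutoff, sens1, fpr1, sens2 and fpr2 are assumptions and need not match the names used in roc.dta.

```stata
clear
* A few ROC points copied from the two tables (variable names are hypothetical)
input cutoff sens1 fpr1 sens2 fpr2
1      0      0      0      0
0.5055 0.1682 0.0346 0.1014 0.0301
0.3029 0.5228 0.2003 0.5860 0.2994
0.1044 0.9119 0.7592 0.9838 0.9135
0      1      1      1      1
end

* Overlay both ROC curves and a 45-degree reference line
twoway (line sens1 fpr1, sort)                          ///
       (line sens2 fpr2, sort)                          ///
       (function y = x, range(0 1) lpattern(dash)),     ///
       ytitle("Sensitivity") xtitle("False positive rate") ///
       legend(order(1 "Model 1" 2 "Model 2" 3 "Reference"))
```

The dashed line corresponds to classifying patients at random; a curve that lies further above it indicates better discrimination.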
Example: Sensitivity, Specificity, PPV, NPV, and Bayes Theorem
The World Health Organization conducts surveys in countries to declare neonatal tetanus (NT) elimination. To diagnose NT deaths in rural locations, women are interviewed using the oral autopsy method.
Notation:
D+ = woman had a live infant who died of neonatal tetanus
D- = woman had a live infant who did not die of NT
T+ = the oral autopsy concluded that an NT death occurred
T- = the oral autopsy concluded that an NT death did not occur
Using data from Kenya, the sensitivity of the oral autopsy method is __; the specificity was found to be __. Suppose __ of the women surveyed had an infant die of neonatal tetanus.
a) What is the probability that the oral autopsy method declares a neonatal tetanus death when the woman had an infant die of neonatal tetanus?
b) What is the probability that the oral autopsy method does not declare a neonatal tetanus death when the woman did not have an infant die of neonatal tetanus? What is this value called?
c) What is the probability that a woman had an infant die of neonatal tetanus, given that the oral autopsy method declared a neonatal tetanus death? What is this value called?
d) What is the probability that a woman did not have an infant die of neonatal tetanus when the oral autopsy method does not declare a neonatal tetanus death? What is this value called?
e) What are the implications of parts c and d for the neonatal tetanus survey?
For more information, see http://www.who.int/immunization_monitoring/diseases/MNTE_initiative/en/index.html. Snow R, Armstrong JRM, Forster D, et al. Childhood deaths in Africa: Uses and limitations of verbal autopsies. Lancet.
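Parts c and d are applications of Bayes' theorem. As a rough sketch of the arithmetic in Stata, the code below uses hypothetical values for sensitivity, specificity and prevalence (0.90, 0.80 and 0.10); these are illustrative assumptions, not the Kenyan estimates.

```stata
* Hypothetical inputs (assumptions, NOT the values from the Kenyan data)
scalar sens = 0.90    // P(T+ | D+)
scalar spec = 0.80    // P(T- | D-)
scalar prev = 0.10    // P(D+)

* Positive predictive value: P(D+ | T+)
scalar ppv = sens*prev / (sens*prev + (1 - spec)*(1 - prev))
* Negative predictive value: P(D- | T-)
scalar npv = spec*(1 - prev) / (spec*(1 - prev) + (1 - sens)*prev)

display "PPV = " %6.4f ppv "   NPV = " %6.4f npv
```

With these made-up inputs the PPV is only about one third even though sensitivity is high, because the condition is rare.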
Tutorial: Binomial distribution in Stata
Using Stata to calculate binomial probabilities
Suppose X is a random variable that follows a binomial distribution; thus X represents the
number of successes out of n trials with success probability p.
binomialp(n,k,p) returns the probability of observing floor(k) successes
in floor(n) trials when the probability of a success on one trial is p.
binomial(n,k,p) returns the probability of observing floor(k) or fewer successes
in floor(n) trials when the probability of a success on one trial is p.
binomialtail(n,k,p) returns the probability of observing floor(k) or more successes
in floor(n) trials when the probability of a success on one trial is p.
Example: Uzbeki Flour Fortification Program
In 2003, a flour fortification program was implemented in Uzbekistan to attempt to lower the
rates of anemia among women of reproductive age. Before the program was implemented, the
prevalence of anemia was 60%. In 2007, four years after implementing the fortification program,
suppose 100 women of reproductive age were randomly selected to provide blood samples to test
for anemia. Let X be the random variable denoting how many of the 100 women were anemic.
Suppose that the prevalence of anemia in Uzbekistan did not change between 2003 and 2007.
1. Would the binomial distribution provide an appropriate model?
B binary outcome
I independent because women were randomly selected
N sample size is fixed
S same p
2. What is the expected number of women with anemia?
µ = np = 100 × 0.6 = 60
3. In a random sample of women in Uzbekistan, what is the typical departure of the number of
women with anemia from this mean number?
$\mathrm{sd}(X) = \sqrt{\mathrm{var}(X)} = \sqrt{np(1-p)} = \sqrt{100 \times 0.6 \times 0.4} = \sqrt{24} = 4.9$
4. What is the probability that exactly 60 women develop the disease? (use the formula)
$\binom{n}{k} p^{k} (1-p)^{n-k} = \binom{100}{60}\, 0.6^{60}\, 0.4^{40} = 0.081$
. di comb(100, 60)*0.6^60*0.4^40
.08121914
5. What is the probability that exactly 50 women are anemic?
. di binomialp(100, 50, 0.6)
.01033751
6. What is the probability that at least 50 women are anemic?
. di binomialtail(100, 50, 0.6)
.98323831
Alternatively, we could use the binomial function to calculate this probability, since P(X ≥ 50) = 1 - P(X ≤ 49).
. di binomial(100, 49, 0.6)
.01676169
. di 1 - binomial(100, 49, 0.6)
.98323831
7. Now, assume that the prevalence of anemia actually dropped after implementation of the
program, and the prevalence of anemia was 40% in 2007. Now, what is the probability that
at least 50 women are anemic?
. di binomialtail(100, 50, 0.4)
.0270992
Note that under the assumption of no change in prevalence between 2003 and 2007, the
probability that at least 50 women had anemia was very high (0.98). If the prevalence of anemia
dropped to 40%, the probability that at least 50 women were anemic was then very low (0.03). So, if we
collected data on 100 women and observed fewer than 50 cases of anemia, this would suggest
that anemia prevalence dropped over time!
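The mean, standard deviation and tail probabilities above can be collected into a few lines. This is a small sketch rather than part of the original handout; the scalar names n_women and prev, and the intermediate prevalence of 0.5, are illustrative assumptions.

```stata
* X ~ Binomial(100, 0.6): mean and standard deviation
scalar n_women = 100
scalar prev    = 0.6
display "mean = " n_women*prev "   sd = " %5.2f sqrt(n_women*prev*(1 - prev))

* P(X >= 50) under three prevalences (0.5 is an added illustrative value)
foreach q of numlist 0.6 0.5 0.4 {
    display "prevalence = " `q' "   P(X >= 50) = " %9.7f binomialtail(100, 50, `q')
}

* The lower and upper tails split at 49/50 account for all the probability
display binomial(100, 49, 0.6) + binomialtail(100, 50, 0.6)
```

The loop shows how quickly the probability of seeing 50 or more anemic women falls as the assumed prevalence drops.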
Tutorial: Poisson distribution in Stata
Using Stata to calculate Poisson probabilities
Suppose X is a random variable that follows a Poisson distribution; X is a count of breast cancer cases.
When X ~ Poisson(m),
poissonp(m,k) returns the probability of observing floor(k) successes
poisson(m,k) returns the probability of observing floor(k) or fewer successes
poissontail(m,k) returns the probability of observing floor(k) or more successes
Example: Ecological Cancer Study
In the United States, the National Cancer Institute (NCI) tracks cancer incidence through the Surveillance Epidemiology and End Results (SEER) database. At various SEER sites, incident cases of cancer, cancer type, and location are tracked. Using data from SEER, epidemiologists can monitor patterns in disease risk and find factors, such as socioeconomic status, that are correlated with disease.
For instance, Los Angeles County is divided into 2,056 census tracts in the 2000 census. Using the SEER database, we can estimate the number of expected breast cancer cases in each census tract, based on breast cancer incidence rates in California and the age distribution within each tract (see standardization lectures). Then, we can compare the number of observed cases in each census tract to the expected, to determine if census tracts have more cases of cancer than expected. We can then try to correlate excess breast cancer cases with other area-level variables, in an ecological study.
Below, we have data on breast cancer incidence for the African-American female population in a census tract in LA County. We choose to model the observed number of breast cancer cases in a census tract using the Poisson distribution, with mean equal to the expected number of breast cancer cases in the census tract.
Age group   Observed   Population   Cancer rate (per 1,000 p-y)   Expected
15-24       0          188          0.008                         0.001
25-34       0          163          0.200                         0.033
35-44       0          216          0.875                         0.189
45-54       0          157          1.868                         0.293
55-64       0          137          2.633                         0.361
65-74       0          151          3.165                         0.478
75-84       0          121          3.452                         0.418
84+         0          57           3.313                         0.189
Total       0          1,190        1.648                         1.962
Table 1: Census tract 1
1. What is the expected number of women with breast cancer in the census tract 1?
1.962
2. What is the typical departure of the number of women with breast cancer from this mean number?
$\mathrm{sd}(X) = \sqrt{\mathrm{var}(X)} = \sqrt{\mu} = \sqrt{1.962} = 1.400714$
3. Does the Poisson distribution provide an appropriate model?
Count data, so the Poisson distribution seems reasonable. It is difficult to assess any more information about model fit without data on many census tracts.
4. What is the probability that exactly 0 women develop breast cancer in census tract 1? (use the formula)
$\frac{e^{-1.962}\, 1.962^{0}}{0!} = e^{-1.962} = 0.1406$
Consider another census tract, with a similar total African-American female population
to the previous, but with 5 observed breast cancer cases.
Age group   Observed   Population   Cancer rate (per 1,000 p-y)   Expected
15-24       0          187          0.008                         0.001
25-34       0          187          0.200                         0.037
35-44       1          218          0.875                         0.191
45-54       0          193          1.868                         0.361
55-64       1          175          2.633                         0.461
65-74       1          141          3.165                         0.446
75-84       2          66           3.452                         0.228
84+         0          17           3.313                         0.056
Total       5          1,184        1.504                         1.781
Table 2: Census tract 2
5. What is the probability that exactly 5 women have breast cancer in census tract 2?
. di poissonp(1.781, 5)
.02515706
6. What is the probability that at least 5 women have breast cancer in census tract 2?
. di poissontail(1.781, 5)
.03504886
Alternatively, we could use the poisson function to calculate this probability, since P(X ≥ 5) = 1 - P(X ≤ 4).
. di 1 - poisson(1.781, 4)
.03504886
Takeaway: Census tracts 1 and 2 have similar population sizes and consequently similar expected breast cancer case counts. However, in census tract 1, we observe no cases; in census tract 2, we observe 5 cases. Using the Poisson distribution, we can calculate the probability of observing case counts as extreme as 0 or 5 in these tracts.
Remember that there are about 2,000 total tracts, so we expect to see some extreme observations. We could also incorporate ecological covariates into our analysis, such as median household income or land-use data, to try to explain some of the differences between observed and expected breast cancer rates.
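To make the point about extreme observations concrete, the sketch below checks the hand calculation for tract 1 and then asks how many of the 2,056 tracts would show 5 or more cases by chance alone. The second calculation rests on a simplifying assumption that is not in the original: that every tract has the same expected count as tract 2 (1.781).

```stata
* Tract 1: P(X = 0) from the built-in function and from the formula
display poissonp(1.962, 0)
display exp(-1.962)

* Tract 2: P(X >= 5)
scalar p_tail = poissontail(1.781, 5)
display p_tail

* Rough count of tracts expected to show >= 5 cases by chance,
* ASSUMING all 2,056 tracts share tract 2's mean of 1.781
display 2056 * p_tail
```

Under that assumption roughly 70 tracts would look as extreme as tract 2, so a single tract with 5 cases is weak evidence of excess risk on its own.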
Tutorial: Normal distribution in Stata
Using Stata to calculate Normal probabilities
Suppose Z is a standard normal random variable. When Z ~ Normal(0, 1),
normal(z) returns the cumulative standard normal distribution
normalden(z) returns the standard normal density
Example: Ozone Designation Following the Clean Air Act Amendments of 1997
From 2001-2003, the Environmental Protection Agency (EPA) monitored ozone levels at monitors across the United States. One criterion for ozone was that the ozone levels (defined as the average fourth highest daily maximum ozone over the three year period) could not exceed 80 ppb. Regulatory actions were taken if the ozone levels exceeded this threshold.
Among monitors in the Southeast, the average ozone level was 45.2 ppb, with standard deviation 6.3 ppb. Ozone levels are usually modeled using the normal distribution. We assume that this distribution is reasonable in our application.
Define X as ozone level at a monitor. X ~ N(45.2, 39.7), or, equivalently, X ~ N(45.2, 6.3²).
1. What is the expected ozone level at a randomly sampled monitor?
45.2 ppb
2. What is the typical departure of ozone levels from this mean number?
6.3 ppb
3. Why do you think Stata named the normal density function normalden, rather than normalp, which would seemingly be more consistent with the binomial and Poisson commands?
The normal distribution is continuous, and therefore normalden does not return a probability, but rather the value of a density function.
4. Why do you think Stata only calculates probabilities with respect to the standard normal, or N(0,1), distribution?
I don’t know the answer to this. Seems pretty inconvenient.
5. What is the probability that a randomly selected monitor has ozone levels exceeding 80ppb?
First, standardize:
$P\left(\frac{X - 45.2}{6.3} > \frac{80 - 45.2}{6.3}\right) = P(Z > 5.524)$
. di 1 - normal(5.524)
1.657e-08
6. Provide an interpretation of the following command:
. di normalden(0)
.39894228
0.399 is the value of the normal density function at 0. It has no interpretation in terms of probability.
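Because normal() only handles the standard normal, the standardization has to be done by the user, but it can be written inside the function call rather than by hand. A short sketch using the ozone numbers from this example (the scalar names are illustrative):

```stata
scalar mu    = 45.2   // mean ozone level (ppb)
scalar sigma = 6.3    // standard deviation (ppb)

* P(X > 80), standardizing inside the call
display 1 - normal((80 - mu)/sigma)

* Equivalent upper-tail form using symmetry, which avoids subtracting
* two numbers that are both very close to 1
display normal(-(80 - mu)/sigma)

* normalden(0) is the height of the density at 0, i.e. 1/sqrt(2*pi)
display normalden(0)
display 1/sqrt(2*_pi)
```

The result differs slightly from the value above only because z is not rounded to 5.524 first.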
Example Problem: HIV prevalence in South Africa
According to UNAIDS, HIV prevalence in South Africa was 17.8% among adults __ to __ years old in __. Assume this prevalence estimate is accurate today, and we randomly sample 500 individuals in South Africa. Suppose X is the number of HIV-positive individuals in the sample.
Model X using the binomial distribution.
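How many individuals do we expect to be HIV positive in the sample?
$E(X) = np = 500 \times 0.178 = 89$
What is the standard deviation of the number of HIV-positive individuals in the sample?
$\mathrm{sd}(X) = \sqrt{np(1-p)} = \sqrt{500 \times 0.178 \times 0.822} = 8.553$
What is the probability of observing more than 100 HIV-positive individuals?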
. di 1 - binomial(500, 100, 0.178)
.09089616
. di binomialtail(500, 101, 0.178)
.09089616
What is the probability of observing between 85 and 95 HIV-positive individuals?
. di binomial(500, 95, 0.178) - binomial(500, 84, 0.178)
.47533949
. di binomialtail(500, 85, 0.178) - binomialtail(500, 96, 0.178)
.47533949
Now, model X using the normal distribution instead.
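What is E(X)?
$E(X) = np = 89$
What is sd(X)?
$\mathrm{sd}(X) = \sqrt{np(1-p)} = 8.553$
What is the probability of observing more than 100 HIV-positive individuals?
$P(X > 100) = P\left(Z > \frac{100 - 89}{8.553}\right) = P(Z > 1.286)$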
. di 1-normal(1.286)
.09922153
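What is the probability of observing between 85 and 95 HIV-positive individuals?
$P(85 \le X \le 95) = P(X \le 95) - P(X \le 85)$
$= P\left(Z \le \frac{95 - 89}{8.553}\right) - P\left(Z \le \frac{85 - 89}{8.553}\right)$
$= P(Z \le 0.702) - P(Z \le -0.468)$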
. di normal(0.702) - normal(-0.468)
.43876812
Do the normal and binomial models give similar results?
What is the probability of observing more than 100 HIV-positive individuals?
Binomial: 0.0909
Normal: 0.0992
What is the probability of observing between 85 and 95 HIV-positive individuals?
Binomial: 0.4753
Normal: 0.4388
Yes, they give similar results. The approximation is better "in the tails" (i.e., for calculating the probability of observing more than 100 HIV-positive individuals) than in the center of the distribution (between 85 and 95 HIV-positive individuals).
http://www.unaids.org/en/regionscountries/countries/southafrica/
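The comparison between the two models can be run in one pass by letting Stata compute the z-scores. This is a sketch using the sample size and prevalence from the commands above; the scalar names n_samp, prev, mu and sigma are illustrative.

```stata
scalar n_samp = 500
scalar prev   = 0.178
scalar mu     = n_samp*prev                    // expected count
scalar sigma  = sqrt(n_samp*prev*(1 - prev))   // standard deviation
display "E(X) = " mu "   sd(X) = " %6.3f sigma

* P(X > 100): exact binomial versus normal approximation
display "binomial: " %6.4f 1 - binomial(n_samp, 100, prev)
display "normal:   " %6.4f 1 - normal((100 - mu)/sigma)

* P(85 <= X <= 95): exact binomial versus normal approximation
display "binomial: " %6.4f binomial(n_samp, 95, prev) - binomial(n_samp, 84, prev)
display "normal:   " %6.4f normal((95 - mu)/sigma) - normal((85 - mu)/sigma)
```

The normal figures differ in the last digit or so from the hand calculation because the z-scores are not rounded to three decimals.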
Tutorial: Central Limit Theorem in Stata
We examine BMI at baseline using the Framingham cohort as our reference population.
Specifically, we can think of the Framingham population as ‘the population of interest’ and
consider sampling from this population to examine how statistics behave in samples from a
population where we know about everyone.
1. Calculate the mean µ and standard deviation σ of BMI in the Framingham dataset at baseline.
. summarize bmi1
µ = 25.8 and σ = 4.1.
2. Take a sample of size 20 from the Framingham dataset. Calculate a sample mean BMI
at baseline, $\bar{x}_1$. Then take a second sample from the same population and calculate the
sample mean, $\bar{x}_2$. Would you expect $\bar{x}_1$ and $\bar{x}_2$ to be exactly the same? Why or why not?
use "fhs.dta", clear
drop if bmi1 == .
keep bmi1
preserve
sample 20, count
mean bmi1
restore
preserve
sample 20, count
mean bmi1
We don’t expect $\bar{x}_1$ and $\bar{x}_2$ to be exactly the same, because the sample mean varies
from sample to sample.
3. Repeat this exercise, but with a sample size of 100. Are $\bar{x}_1$ and $\bar{x}_2$ closer together than
those from the samples of size 20? Are $\bar{x}_1$ and $\bar{x}_2$ always going to be closer together
using a sample size of 100 versus 20?
restore
preserve
sample 100, count
mean bmi1
restore
preserve
sample 100, count
mean bmi1
In my sample, the values of the sample mean are closer with the larger sample. This will
usually, but not always, be true.
4. Compare histograms of BMI at baseline and prevalent MI at baseline. Would the central
limit theorem apply to the binary indicator prevalent MI at baseline?
[Figure: density histograms of body mass index, exam 1 (x-axis from 10 to 60) and prevalent myocardial infarction, exam 1 (x-axis from 0 to 1).]
Yes, but the more skewed a distribution is, the larger sample size we need to collect before
the CLT “kicks in”.
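The effect of skewness on the CLT can be seen in a short simulation that does not need the Framingham data. The sketch below assumes a hypothetical binary trait with prevalence 0.03 (an illustrative value, not the prevalence of MI in the cohort) and simulates 2,000 sample means for samples of size 20 and of size 500. It uses the fact that the sum of n independent 0/1 draws is Binomial(n, p).

```stata
clear
set seed 12345
set obs 2000                                   // 2,000 simulated samples

* Sample mean of a 0/1 variable = (binomial count) / (sample size)
generate xbar20  = rbinomial(20, 0.03) / 20    // samples of size 20
generate xbar500 = rbinomial(500, 0.03) / 500  // samples of size 500

* Compare spread and skewness of the two sampling distributions
summarize xbar20 xbar500, detail

histogram xbar20,  name(n20, replace)
histogram xbar500, name(n500, replace)
```

With n = 20 most simulated means are exactly 0 and the histogram is far from bell shaped; with n = 500 it is much closer to normal.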
Tutorial: Confidence and Predictive Intervals in Stata
1. Let X denote a random variable that represents BMI at baseline for the Framingham
cohort. Assume that X is normally distributed. What is the mean of X? The standard
deviation?
. summarize bmi1
2. Construct a 95% predictive interval for X. Pick a random observation from the dataset.
Does your interval contain the BMI for the randomly selected observation?
95% predictive interval for X is defined as µ ± 1.96σ.
3. Suppose we now draw repeated samples of size 100 from the Framingham cohort. What
is a 95% predictive interval for the sample mean $\bar{X}$?
95% predictive interval for $\bar{X}$ is defined as $\mu \pm 1.96\,\sigma/\sqrt{n}$.
4. Take a sample of size 100 from the Framingham dataset. Does your predictive interval
for $\bar{X}$ contain the mean from the 100 person subsample?
. sample 100, count
. sum bmi1
5. Construct a 95% confidence interval for the mean BMI in this sample. Does the 95%
confidence interval contain the mean BMI for the entire cohort?
A 95% CI for µ is defined as $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$.
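The three intervals differ only in their center and in whether σ is divided by √n. A minimal sketch of the arithmetic, using µ = 25.8 and σ = 4.1 from the central limit theorem tutorial; the subsample mean of 26.1 is a made-up value for illustration.

```stata
scalar mu    = 25.8   // cohort mean BMI
scalar sigma = 4.1    // cohort standard deviation
scalar n     = 100    // subsample size
scalar xbar  = 26.1   // HYPOTHETICAL subsample mean

* 95% predictive interval for one observation X
display mu - 1.96*sigma           "  to  " mu + 1.96*sigma
* 95% predictive interval for the mean of 100 observations
display mu - 1.96*sigma/sqrt(n)   "  to  " mu + 1.96*sigma/sqrt(n)
* 95% confidence interval for mu, centered at the observed sample mean
display xbar - 1.96*sigma/sqrt(n) "  to  " xbar + 1.96*sigma/sqrt(n)
```

The interval for the sample mean is ten times narrower than the interval for a single observation, since √100 = 10.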
Tutorial: Confidence intervals with the t-distribution in Stata
Suppose t is a random variable that follows a t-distribution with n degrees of freedom.
tden(n,t) returns the probability density function
of Student's t distribution
ttail(n,t) returns the reverse cumulative
(upper tail or survivor) Student's t distribution
invttail(n,p) returns the inverse reverse cumulative
(upper tail or survivor) Student's t distribution
Note that if ttail(n,t)= p, then invttail(n,p) = t.
Stata will calculate confidence intervals for you:
Calculator: cii n mean sd, level(95)
Function: ci varlist, level(95)
There is no Stata function for calculating confidence intervals for normally distributed data
when the standard deviation is known, since this scenario doesn’t really happen in practice.
1. Calculate the mean and standard deviation of BMI at baseline.
. summarize bmi1
2. Take a sample of size 20 from the Framingham cohort. Calculate the mean and
standard deviation of BMI at baseline in the subsample (I use set seed 2, if you want
to get the same sample as me). We are interested in making inference about BMI at
baseline in the total Framingham cohort using only the sample of size 20.
. set seed 2
. drop if bmi1 == .
. sample 20, count
. sum bmi1
3. Assume that the sample standard deviation is known (and equal to the standard deviation
in the Framingham cohort). Construct a 95% confidence interval for the mean BMI in your
subsample. Note that if normal(z)= p, then invnormal(p) = z.
95% CI: $\bar{x} \pm z_{0.975}\,\sigma/\sqrt{n}$
. di 25.0 - invnormal(0.975)*4.1/sqrt(20)
. di 25.0 + invnormal(0.975)*4.1/sqrt(20)
4. Use invttail to construct a 95% confidence interval for the mean BMI in your subsample
by hand, now assuming that the sample standard deviation is unknown.
. di 25.0 - invttail(19, 0.025)*3.2/sqrt(20)
. di 25.0 + invttail(19, 0.025)*3.2/sqrt(20)
5. Use cii to construct a 95% confidence interval for the mean BMI in your subsample.
. cii 20 25.0 3.2
6. Use ci to construct a 95% confidence interval for the mean BMI in your subsample.
. ci bmi1
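The only difference between the known-sd and unknown-sd intervals, apart from which standard deviation is used, is the critical value. The sketch below compares the t critical value with the normal one for several degrees of freedom (5, 60 and 1000 are illustrative choices; 19 matches the sample of size 20), and checks the inverse relationship between ttail and invttail noted above.

```stata
* Upper 2.5% critical values of the t distribution
foreach df of numlist 5 19 60 1000 {
    display "df = " %4.0f `df' "   t critical value = " %7.4f invttail(`df', 0.025)
}
* Corresponding standard normal critical value
display "normal critical value    = " %7.4f invnormal(0.975)

* ttail and invttail undo each other
display ttail(19, invttail(19, 0.025))
```

The t critical value is always larger than 1.96 and approaches it as the degrees of freedom grow, which is why the t interval is wider when the two standard deviations are similar.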
Tutorial: Hypothesis testing in Stata
In adults over 15 years of age, a resting heart rate around 80bpm is usually considered average. Using a subset of the Framingham cohort, we are going to attempt to make inference about heart rate among “healthy young” adults.
Specifically, we restrict our analysis to adults with the following characteristics at baseline: non-smoker, younger than 40, BMI less than 25, diastolic blood pressure less than 80, and systolic blood pressure less than 120. There are 61 participants who meet our criteria. We hypothesize that heart rate at the follow up exam in 1962 would be lower than 80bpm, the resting heart rate for adults with average health.
We are making the somewhat strong assumption that these Framingham participants are generalizable to the broader population of healthy young adults (this assumption is necessary if we want to make inference about heart rate in healthy young adults). Use the dataset on this webpage to answer the following questions:
1. Make a histogram of heart rate at exam 2. Is the normality assumption reasonable?
histogram heartrte2
histogram heartrte2 if heartrte2 < 200
2. You are interested in whether the mean heart rate at exam 2 among healthy young adults is equal to 80bpm. Perform a hypothesis test at the α = 0.05 level.
(a) What test are you using? One-sample t-test
(b) State your null and alternative hypothesis.
H0 : µ = 80, HA : µ ≠ 80
(c) Perform the hypothesis test.
Hypothesis testing in Stata: To examine options for t-tests in Stata, type db ttest.
Or, using the dropdown menu, explore the options in Summaries, tables, and tests / Classical tests of hypotheses /.
. ttest heartrte2 == 80
One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
heartr~2 |      61    76.55738    2.800032    21.86895    70.95648    82.15827
------------------------------------------------------------------------------
    mean = mean(heartrte2)                                        t =  -1.2295
Ho: mean = 80                                    degrees of freedom =       60
    Ha: mean < 80               Ha: mean != 80                 Ha: mean > 80
 Pr(T < t) = 0.1118         Pr(|T| > |t|) = 0.2237          Pr(T > t) = 0.8882
. ttesti 61 76.557 21.869 80
One-sample t test
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      61      76.557    2.800039      21.869    70.95609    82.15791
------------------------------------------------------------------------------
    mean = mean(x)                                                t =  -1.2296
Ho: mean = 80                                    degrees of freedom =       60
    Ha: mean < 80               Ha: mean != 80                 Ha: mean > 80
 Pr(T < t) = 0.1118         Pr(|T| > |t|) = 0.2236          Pr(T > t) = 0.8882
What are:
i. your test statistic, t = -1.23
ii. the distribution of your test statistic under the null hypothesis, t ~ t60
iii. the p-value, 0.2236
iv. your decision, and Fail to reject the null hypothesis.
v. your interpretation? We do not have enough evidence to suggest that the heart rate is different from 80 in healthy young adults at follow up.
3. As a diligent statistician, you decide to investigate the issue of the outlier in your dataset. List the information for the outlier.
. list if heartrte2 > 200
4. Repeat the hypothesis test, excluding this observation. What do you find?
. ttest heartrte2 == 80 if heartrte2 < 200
5. As the statistician, what results should you present in your analysis?
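The test statistic and p-value that ttest reports can be reproduced from the summary statistics in the output for question 2. A short sketch (the scalar names are illustrative):

```stata
scalar xbar = 76.55738   // sample mean from the ttest output
scalar s    = 21.86895   // sample standard deviation
scalar n    = 61         // sample size

* t = (sample mean - null value) / standard error
scalar se = s / sqrt(n)
scalar t  = (xbar - 80) / se
display "std. err. = " %8.6f se "   t = " %7.4f t

* Two-sided p-value from the t distribution with n - 1 degrees of freedom
display "two-sided p = " %6.4f 2*ttail(n - 1, abs(t))
* One-sided p-value for Ha: mean < 80
display "lower-tail p = " %6.4f 1 - ttail(n - 1, t)
```

Writing the calculation out this way makes it clear that a single very large heart rate affects the result through both the mean and the standard deviation.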
Example: Atherosclerosis and Physical Activity
Oxidation of components of LDL cholesterol (the bad cholesterol) can result in atherosclerosis, or hardening of the arteries. Elosua et al. (2003) examine the impact of a 16 week physical activity program on LDL resistance to oxidation in 17 healthy young adults. After completing the program, the average maximum oxidation rate in the study participants, $\bar{x}$, was 8.2 µmol/min/g, and the sample standard deviation of the maximum oxidation rate was s = 2.5 µmol/min/g. Assume that the oxidation rate is normally distributed.
• What is the distribution of $\bar{x}$?
$\frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{16}$
• Suppose the average maximum oxidation rate in healthy young adults who did not complete the program was µ0 = 11.3 µmol/min/g and the standard deviation was σ = 2.3. Define $\bar{x}_0$ as the sample mean maximum oxidation rate from a sample of size 17 from this population. Construct a 99% predictive interval for $\bar{x}_0$. Is $\bar{x}$ in this interval?
. di 11.3 - invnormal(0.995)*2.3/sqrt(17)
. di 11.3 + invnormal(0.995)*2.3/sqrt(17)
• Construct a 99% confidence interval for µ.
. cii 17 8.2 2.5, level(99)
• If you constructed the 99% confidence interval for µ assuming that the standard deviation was known and equal to σ = 2.3, would your confidence interval be wider or narrower? Will this result always be true?
Standard deviation known: $\bar{x} \pm z_{0.995}\,\sigma/\sqrt{n}$
Standard deviation unknown: $\bar{x} \pm t_{0.995,16}\,s/\sqrt{n}$
• Let µ denote the mean maximum oxidation rate in young adults who participate in the program. Test the hypothesis that µ = µ0 against the alternative that µ ≠ µ0 at the α = 0.01 level. What do you conclude?
H0 : µ = µ0, HA : µ ≠ µ0
. ttesti 17 8.2 2.5 11.3, level(99)
Using a one-sample t-test, we obtain a test statistic of -5.11, which follows a t-distribution with 16 degrees of freedom under the null hypothesis, corresponding to a p-value of 0.0001. We reject the null at the α = 0.01 level and conclude that the data suggest that the 16 week physical activity program lowers the maximum oxidation rate in healthy young individuals.
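The numbers quoted in this example can be checked directly from the summary statistics. The sketch below recomputes the test statistic and p-value, and compares the half-widths of the two 99% confidence intervals discussed above.

```stata
* Test statistic for H0: mu = 11.3, using xbar = 8.2, s = 2.5, n = 17
scalar t = (8.2 - 11.3) / (2.5 / sqrt(17))
display "t = " %6.2f t
display "two-sided p = " %6.4f 2*ttail(16, abs(t))

* Half-widths of the 99% confidence intervals
scalar hw_known   = invnormal(0.995) * 2.3 / sqrt(17)    // sd known (2.3)
scalar hw_unknown = invttail(16, 0.005) * 2.5 / sqrt(17) // sd estimated (2.5)
display "half-width, sd known:   " %6.4f hw_known
display "half-width, sd unknown: " %6.4f hw_unknown
```

Here the interval with known standard deviation is narrower, both because the normal critical value is smaller and because 2.3 is less than 2.5; the ordering could reverse in a sample whose s happens to fall well below σ.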
Elosua R., Molina L., Fito M., Arquer A., Sanchez-Quesada J.L., Covas M.I., Ordonez-Llanos J., Marrugat J. (2003). Response of
oxidative stress biomarkers to a 16-week aerobic physical activity program, and to acute physical activity, in healthy young men and
women. Atherosclerosis 167(2), 327-334.
Two Sample t-tests in Stata
Example: In the Framingham cohort, we want to examine the distribution of heart rate at exams 1 and 2. Specifically, we wish to test whether there is a difference in mean heart rate between exam 1 and exam 2. Additionally, we are interested in whether the mean heart rate differs between men and women at exam 2. We sample 100 people from the Framingham cohort. For this example, use the dataset heartrate.dta on this webpage, which contains the random sample of 100 participants.
Hypothesis testing with paired data in Stata:
. ttest heartrte1 == heartrte2
Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
heartr~1 |     100       75.03    1.290247    12.90247    72.46987    77.59013
heartr~2 |     100       76.17    1.293031    12.93031    73.60435    78.73565
---------+--------------------------------------------------------------------
    diff |     100       -1.14    1.344125    13.44125   -3.807035    1.527035
------------------------------------------------------------------------------
     mean(diff) = mean(heartrte1 - heartrte2)                     t =  -0.8481
 Ho: mean(diff) = 0                              degrees of freedom =       99
 Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
 Pr(T < t) = 0.1992         Pr(|T| > |t|) = 0.3984          Pr(T > t) = 0.8008
. gen hdiff = heartrte2 - heartrte1
. ttest hdiff== 0
One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   hdiff |     100        1.14    1.344125    13.44125   -1.527035    3.807035
------------------------------------------------------------------------------
    mean = mean(hdiff)                                            t =   0.8481
Ho: mean = 0                                     degrees of freedom =       99
    Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 0.8008         Pr(|T| > |t|) = 0.3984          Pr(T > t) = 0.1992
The commands ttest heartrte2 == heartrte1 and ttest hdiff==0 lead to the same test.
This command can be found through the following drop-down menus: Statistics / Summaries, tables, and tests / Classical tests of hypotheses / Mean-comparison test, paired data.
Hypothesis testing with unpaired data and equal variances in Stata:
. ttest heartrte2, by(sex1)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |      39    76.82051    2.042025    12.75244    72.68665    80.95438
  Female |      61     75.7541    1.681246    13.13095    72.39111    79.11709
---------+--------------------------------------------------------------------
combined |     100       76.17    1.293031    12.93031    73.60435    78.73565
---------+--------------------------------------------------------------------
    diff |            1.066414    2.662326               -4.216884    6.349713
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =   0.4006
Ho: diff = 0                                     degrees of freedom =       98
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.6552         Pr(|T| > |t|) = 0.6896          Pr(T > t) = 0.3448
Hypothesis testing with unpaired data and unequal variances in Stata:
. ttest heartrte2, by(sex1) unequal
Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |      39    76.82051    2.042025    12.75244    72.68665    80.95438
  Female |      61     75.7541    1.681246    13.13095    72.39111    79.11709
---------+--------------------------------------------------------------------
combined |     100       76.17    1.293031    12.93031    73.60435    78.73565
---------+--------------------------------------------------------------------
    diff |            1.066414    2.645081               -4.194674    6.327503
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =   0.4032
Ho: diff = 0                     Satterthwaite's degrees of freedom =  82.8637

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.6561         Pr(|T| > |t|) = 0.6879          Pr(T > t) = 0.3439
This command can be found through the following drop-down menus: Statistics / Summaries, tables, and tests / Classical tests of hypotheses / Two-group mean-comparison test.
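Both two-sample tests can also be reproduced from the group summaries alone. The sketch below uses the immediate form ttesti with the group sizes, means, and standard deviations reported for men (39; 76.82051; 12.75244) and women (61; 75.7541; 13.13095); leaving out the unequal option gives the pooled (equal-variance) test.

```stata
* ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2 [, unequal]
* Group 1 = Male, group 2 = Female
ttesti 39 76.82051 12.75244 61 75.7541 13.13095
ttesti 39 76.82051 12.75244 61 75.7541 13.13095, unequal

* Standard error of the difference without pooling the variances
display sqrt(12.75244^2/39 + 13.13095^2/61)
```

The only differences between the two runs should be the standard error of the difference and the degrees of freedom, which is where the equal-variance assumption enters.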
Instead of the data structure above, suppose that, in your dataset, you have heart rate for men in one variable/column and heart rate for women in another variable/column (instead of our situation where we have heart rate in one variable and sex as another variable). How do you perform a t-test then? Use the command ttest heartratew == heartratem, unpaired unequal, where heartratew is the heart rate variable for women and heartratem is the heart rate variable for men. It is important to use the option unpaired. If you do not use this option, Stata will perform a paired t-test. You may also choose to leave out the unequal option if you wish to assume equal variances.
The following 4 lines of code transform the data to the situation where we have heart rate for men in one variable (heartrtem) and heart rate for women in another variable (heartrtew). It is not necessary to memorize or understand this portion of code. It is simply included for completeness. The fifth line of code runs the two-sample t-test.
. gen id = _n
. reshape wide heartrte2, i(id) j(sex1)
. rename heartrte21 heartrtem
. rename heartrte22 heartrtew
. ttest heartrtew = heartrtem, unpaired unequal
Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
heartr~w |      61     75.7541    1.681246    13.13095    72.39111    79.11709
heartr~m |      39    76.82051    2.042025    12.75244    72.68665    80.95438
---------+--------------------------------------------------------------------
combined |     100       76.17    1.293031    12.93031    73.60435    78.73565
---------+--------------------------------------------------------------------
    diff |           -1.066414    2.645081               -6.327503    4.194674
------------------------------------------------------------------------------
    diff = mean(heartrtew) - mean(heartrtem)                      t =  -0.4032
Ho: diff = 0                     Satterthwaite's degrees of freedom =  82.8637

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.3439         Pr(|T| > |t|) = 0.6879          Pr(T > t) = 0.6561
This command can be found through the following drop-down menus: Statistics / Summaries, tables, and tests / Classical tests of hypotheses / Two-sample mean-comparison test.
Exercises
1. Calculate the sample mean and sample standard deviation of heart rate at exam 1 and exam 2 in the Framingham cohort.
2. Are these data dependent or independent?
3. Generate a new variable for the difference in heart rate between exam 1 and exam 2. Make a histogram of this new variable.
4. Perform a hypothesis test at the α = 0.05 level.
(a) What test are you using?
(b) State your null and alternative hypothesis.
(c) Perform the hypothesis test. What are:
i. your test statistic,
ii. the degrees of freedom,
iii. the p-value,
iv. your decision, and
v. your interpretation?
Now, assume that you are interested in whether the mean heart rate differs between men and women at exam 2.
5. Are these data dependent or independent?
6. Calculate the sample mean and sample standard deviation of heart rate at exam 2 for men and women.
7. Perform a hypothesis test at the α = 0.05 level, assuming unequal variances.
(a) What test are you using?
(b) State your null and alternative hypothesis.
(c) Perform the hypothesis test. What are:
i. your test statistic,
ii. the degrees of freedom,
iii. the p-value,
iv. your decision, and
v. your interpretation?
8. Given the 95% confidence intervals, would you expect the hypothesis test to be significant?
Power and Sample Size in Stata
sampsi - Sample size and power for means and proportions
Power
sampsi 18.4 20.4, sd1(2.8) n1(20) onesample
Sample Size
sampsi 18.4 20.4, sd1(2.8) power(.90) onesample
The notation changes slightly for two-sample or one-sided tests. Type db sampsi to see all options available within the sampsi command or select from the drop-down menus: Statistics / Power and sample size / Tests of means and proportions.
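For intuition, the one-sample calculations can be approximated by hand with the usual normal-theory formulas. The sketch below assumes a two-sided test at α = 0.05, a known standard deviation of 2.8, a null mean of 18.4, and an alternative mean of 20.4; it is an approximation for illustration, and the figures reported by sampsi may differ slightly.

```stata
* Assumed inputs: null mean 18.4, alternative mean 20.4, sd 2.8
scalar delta = 20.4 - 18.4
scalar sigma = 2.8
* Critical value for a two-sided test at alpha = 0.05
scalar zcrit = invnormal(0.975)

* Approximate power for n = 20
display normal(delta*sqrt(20)/sigma - zcrit)

* Approximate sample size for 90% power, rounded up
display ceil(((zcrit + invnormal(0.90))*sigma/delta)^2)
```

In Stata 13 and later, the power command (for example, power onemean) supersedes sampsi; its help file lists the exact options.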
Example: Suppose we aim to implement a new physical activity program among school-aged children between 6 and 11 years old at high risk for obesity. We define high-risk children as those children who do less than 2 hours of physical activity per week. According to Ogden (2012), mean BMI among children 6-11 years old in the United States was 18.4 between 2009 and 2010, with standard deviation 2.8. Before implementing this program, we want to perform a baseline survey to evaluate the state of the obesity epidemic among the high-risk children. We plan to design the survey to test whether the mean BMI in the high-risk children is equal to the mean BMI among 6-11 year olds in the United States at the α = 0.05 level. To design the study, assume the standard deviation of BMI is equal in the general population and the high-risk children.
Ogden C.L., Carroll M.D., Kit B.K., and Flegal K.M. (2012). Prevalence of Obesity and Trends in Body Mass Index Among US Children and Adolescents, 1999-2010. JAMA: The Journal of the American Medical Association. 307 (5). 483–490.
1. State the null and alternative hypothesis for the test above.
H0 : µ = 18.4
HA : µ ≠ 18.4
2. Fill in the table below:

Sample Size    µA      Power
100            19.4    ____
200            18.9    ____
10,000         18.4    ____
____           20.4    0.9
____           19.4    0.8
____           19.4    0.9
Now, suppose we powered our study for the one-sided test that the mean BMI is equal
to 18.4 versus the alternative that the mean is higher in the high risk children. Repeat
the calculations above and compare to the two-sided calculations.
Power: sampsi 18.4 20.4, sd1(2.8) n1(20) onesample onesided
Sample Size: sampsi 18.4 20.4, sd1(2.8) power(.90) onesample onesided
1. State the null and alternative hypothesis for the test above.
H0 : µ = 18.4
HA : µ > 18.4
2. Fill in the table below:
Sample Size    µA      Power
100            19.4    ____
200            18.9    ____
10,000         18.4    ____
____           20.4    0.9
____           19.4    0.8
____           19.4    0.9

Suppose we also wanted to investigate whether the BMI among high-risk children differed between boys and girls. Let us assume that the standard deviations of BMI among high-risk children are both equal to 2.8.
Power: sampsi 18.4 20.4, sd1(2.8) sd2(2.8) n1(20) n2(20)
Sample Size: sampsi 18.4 20.4, sd1(2.8) sd2(2.8) power(.90)
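The two-sample version of the normal-theory calculation differs from the one-sample case only in the variance of the difference, which is 2σ²/n with equal group sizes. The sketch below assumes means of 18.4 and 20.4, a common standard deviation of 2.8, and a two-sided test at α = 0.05; treat it as an approximation for checking sampsi, not a replacement for it.

```stata
* Assumed inputs: means 18.4 and 20.4, common sd 2.8, two-sided alpha = 0.05
scalar delta = 20.4 - 18.4
scalar sigma = 2.8
scalar zsum  = invnormal(0.975) + invnormal(0.90)

* Approximate sample size per group for 90% power, rounded up
display ceil(2*(zsum*sigma/delta)^2)

* Approximate power with 20 children per group
display normal(delta/(sigma*sqrt(2/20)) - invnormal(0.975))
```

Compared with the one-sample design, roughly twice as many children are needed in each group to reach the same power.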
1. State the null and alternative hypothesis for the test above.
H0 : µB = µG
HA : µB ≠ µG
2. Let µB and µG denote the mean BMI in boys and girls, respectively; let nB and nG denote the sample size required for boys and girls. Fill in the table below:

nB      nG      µB      µG      Power
20      20      20.4    18.4    ____
20      20      19.4    18.4    ____
____    ____    20.4    19.4    0.9
____    ____    22.4    18.4    0.8
Tutorial: ANOVA in Stata
In this example, we will use data from the California Health Interview Survey (CHIS). From their website (http://www.chis.ucla.edu): CHIS is the nation’s largest state health survey. Conducted every two years on a wide range of health topics, CHIS data gives a detailed picture of the health and health care needs of California’s large and diverse population. CHIS is conducted by the UCLA Center for Health Policy Research in collaboration with many public agencies and private organizations.
In 2009, CHIS surveyed more than 47,000 adults, more than 12,000 teens and children and more than 49,000 households. We will use a sample of 500 adults for this lab (CHISANOVA.dta). Suppose we are interested in the relationship between number of hours worked (per week) and health, as measured by BMI. Would we expect those who worked longer hours to be healthier than those who worked shorter hours, or vice versa? Number of hours worked per week is divided into 5 categories: 0-10, 10-25, 25-35, 35-45, 45+.
1. How many people are in each category?
2. We now wish to run an ANOVA. Are the assumptions for ANOVA met?
3. What are the null and alternative hypotheses for this test?
4. Perform the hypothesis test at the α = 0.05 level.
Conduct a oneway ANOVA in Stata using the oneway command:
. oneway bmi work_cat, tabulate
| Summary of bmi
work_cat | Mean Std. Dev. Freq.
------------+------------------------------------
0-10 | 26.431579 5.9410147 38
10-25 | 26.429189 5.7075504 74
25-35 | 24.3495 4.1477871 60
35-45 | 27.128351 5.647101 188
45+ | 27.854928 6.1797228 140
------------+------------------------------------
Total | 26.8419 5.7540637 500
Analysis of Variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 550.823688 4 137.705922 4.27 0.0021
Within groups 15970.6916 495 32.2640234
------------------------------------------------------------------------
Total 16521.5153 499 33.1092491
Bartlett’s test for equal variances: chi2(4) = 11.7543 Prob>chi2 = 0.019
You may also use the following drop-down menus to access the oneway command: Statistics / Linear models and related / ANOVA/MANOVA / One-way ANOVA.
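The F statistic in the ANOVA table is the ratio of the between-group to the within-group mean square, and its p-value is the upper tail of an F distribution with 4 and 495 degrees of freedom. A short sketch of that arithmetic, using the mean squares 137.705922 and 32.2640234 from the oneway output:

```stata
* F = MS(between) / MS(within), values copied from the oneway output
scalar F = 137.705922 / 32.2640234
display F

* Upper-tail probability of F with 4 and 495 degrees of freedom
display Ftail(4, 495, F)
```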
What are:
(a) your test statistic,
(b) the degrees of freedom,
(c) the p-value,
(d) your decision, and
(e) your interpretation?
5. We have rejected the null hypothesis, thus we have evidence that at least one pair of means is not equal. Perform all possible pairwise comparisons using the Bonferroni correction.
6. Which pairs of means are significantly different?
7. A colleague of yours, who has the same dataset, calculates the means for each work category. After looking at these means he takes the group with the largest mean (45+) and the group with the smallest mean (25-35) and performs a t-test (without a Bonferroni correction). He tells you that since he only did one test, he does not need to correct for multiple comparisons and that his method is valid. Do you agree? Why or why not?
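For the pairwise comparisons, it helps to see how many tests the Bonferroni correction is accounting for. The sketch below counts the comparisons among the 5 work categories and the resulting per-comparison significance level; the oneway call is shown as a comment because it needs CHISANOVA.dta in memory.

```stata
* Number of pairwise comparisons among 5 groups
display comb(5, 2)

* Bonferroni per-comparison significance level for an overall alpha of 0.05
display 0.05 / comb(5, 2)

* With the data in memory, Bonferroni-adjusted pairwise comparisons:
* oneway bmi work_cat, bonferroni
```

The bonferroni option reports adjusted p-values, so they are compared against 0.05 directly rather than against the per-comparison level.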
Tutorial: Methods for one-sample proportion inference
In this tutorial, we learn about Stata commands for one-sample proportion inference.
Confidence intervals: ci and cii - calculate binomial confidence intervals
Hypothesis tests: bitest and bitesti - exact binomial one-sample proportion hypothesis test
prtest and prtesti - large-sample one-sample proportion hypothesis test
Recall that the extra "i" at the end of a Stata command name denotes that the command is "immediate" and does not use the data in memory.
Exercises
1. Estimate the proportion of California residents who visit the doctor at least once in the previous year, denoted p.
. tabulate doctor
2. Construct a confidence interval for p using three different methods. Can we use the normal approximation to the binomial distribution? How do the widths of these three CIs compare?
. ci doctor, binomial
. ci doctor, binomial wald
. ci doctor, binomial wilson
Exact: never has lower than expected coverage, but is sometimes too conservative.
Wald: large sample; bad coverage; easy to calculate; flexible.
Wilson: large sample; good coverage; less flexible.
3. Using the [...] confidence level, is there evidence in the data that less than 80% of the population visits the doctor once per year? Repeat this analysis, stratifying by above/below poverty groups.
. bysort poverty: ci doctor, binomial
4. Let's formalize question 3 using a hypothesis test. Let p denote the proportion of California residents below the federal poverty level who visited the doctor at least once in the past year. Test the hypothesis that p = 0.8 versus the alternative that p [...] at the α = [...] level.
(a) First, use the exact binomial test. What is the p-value?
. bitest doctor == 0.8 if poverty == 1
(b) Next, use the normal approximation to the binomial distribution.
. prtest doctor == 0.8 if poverty==1
Is the normal approximation appropriate?
np > [...], n(1 - p) > [...]
Therefore, the normal approximation to the binomial is appropriate.
What is the value of your test statistic?
Z = [...]
What is the distribution of your test statistic under the null hypothesis?
Z ~ N(0, 1)
What is the p-value of your test?
p = [...]
Do you reject or not reject the null hypothesis?
We reject the null hypothesis.
What do you conclude?
We conclude that there is evidence in the data that p is less than 0.8.
Given that you got different results using the exact and large-sample hypothesis tests, what would you do if you were writing a paper?
There are no meaningful differences between a p-value of [...] and [...]. Try to include confidence intervals in practice, as p-values don't tell you anything about the magnitude of an effect.
Two Sample Proportion Tests in Stata
Before delving into two-way associations using contingency (two-by-two) tables, we first examine the structure of the two-sample test of proportions, using the normal approximation to the binomial.
Exercises:
1. How might we define a test statistic for comparing two proportions? Specifically, we would like to test the hypothesis that H0 : p1 = p0 versus the alternative that p1 ≠ p0 at the α = 0.05 level. How does this test compare to the two-sample mean test for normally distributed data from last week?
Recall the two-sample t-test for equal variances:
Assume $X_1 \sim N(\mu_1, \sigma^2)$, and the sample mean of multiple realizations of $X_1$ is $\bar{x}_1$ and sample standard deviation is $s_1$; and $X_2 \sim N(\mu_2, \sigma^2)$, and the sample mean of multiple realizations of $X_2$ is $\bar{x}_2$ and sample standard deviation is $s_2$.
To test $H_0: \mu_1 = \mu_2$ vs. $H_A: \mu_1 \neq \mu_2$, our test statistic for the two-sample t-test with equal variances was:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \overset{H_0}{\sim} t_{n_1 + n_2 - 2}$$
Remember: the variance is independent of the mean for normally distributed data. For binomial data, the variance is a function of the mean.
For binomial data:
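• Assume $X_1 \sim \text{Binomial}(n_1, p_1)$ and $X_0 \sim \text{Binomial}(n_0, p_0)$.
• Define $\hat{p}_1 = X_1/n_1$ and $\hat{p}_0 = X_0/n_0$.
• Using the Central Limit Theorem, we know that $\hat{p}_1 \sim N(p_1, p_1(1 - p_1)/n_1)$ and $\hat{p}_0 \sim N(p_0, p_0(1 - p_0)/n_0)$.
• Under the null hypothesis that $p_1 = p_0$, $\hat{p}_1 - \hat{p}_0 \sim N(0, V)$, where $V = \hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_0}\right)$ and $\hat{p} = \frac{X_1 + X_0}{n_1 + n_0}$.
• Therefore, a natural test statistic for testing $H_0: p_1 = p_0$, $H_A: p_1 \neq p_0$ is:
$$\frac{\hat{p}_1 - \hat{p}_0}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_0}\right)}} \overset{H_0}{\sim} N(0, 1)$$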
For binomial data, the structure of the test statistic is similar to the two-sample t-test with equal variances, because, under the null, the variances are equal in both groups.
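To make the pooled test statistic concrete, the sketch below computes it by hand and then with the immediate command prtesti. The counts are hypothetical and chosen only for illustration (70 of 100 in group 1, 340 of 400 in group 0); they are not the CHIS data.

```stata
* Hypothetical counts, for illustration only (not the CHIS data)
scalar n1 = 100
scalar x1 = 70
scalar n0 = 400
scalar x0 = 340

* Pooled proportion and z statistic under H0: p1 = p0
scalar pbar = (x1 + x0)/(n1 + n0)
scalar z = (x1/n1 - x0/n0)/sqrt(pbar*(1 - pbar)*(1/n1 + 1/n0))
display "z = " z "   two-sided p = " 2*normal(-abs(z))

* Immediate two-sample test of proportions; count says the inputs are counts
prtesti 100 70 400 340, count
```

The z statistic reported by prtesti should match the hand calculation, since both use the pooled proportion in the standard error under the null.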
2. Let p1/p0 denote the proportion of CA residents below/above the federal poverty level who visited the doctor at least once in the past year. Test the hypothesis that p1 = p0 versus the alternative that p1 ≠ p0 at the α = 0.05 level. What do you conclude? Report a 95% CI along with your results.
• What test are you using? Is normality reasonable?
tabulate doctor
Check that $n_1\hat{p} > 5$, $n_1(1 - \hat{p}) > 5$, $n_0\hat{p} > 5$, $n_0(1 - \hat{p}) > 5$, where $\hat{p} = 0.804$.
Two-group proportion test in Stata
. prtest doctor, by(poverty)
• What is the value of your test statistic?
Z = 2.3
• What is the distribution of your test statistic?
Z ~ N(0, 1)
• What is the p-value of your test?
p = 0.024
• Do you reject or not reject the null hypothesis?
Reject H0
• What do you conclude?
There is evidence in the data that individuals in CA who are below poverty are less likely to go to the doctor.
3. Based on these data, you decide to conduct an intervention among those below the poverty line. You randomize individuals to intervention or no intervention. Suppose you power your study to detect a 15% risk difference with 90% power, assuming the proportion in the control group would equal the estimated proportion among those below poverty (70%) in this study. What sample size would you need, with equal numbers of individuals per arm, if you plan to conduct your test at the α = 0.05 level?
. sampsi 0.7 0.85, power(0.9) alpha(0.05)
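As a rough cross-check on the sample size, the standard normal-approximation formula for comparing two proportions can be evaluated directly. This sketch assumes proportions of 0.70 and 0.85, 90% power, a two-sided α of 0.05, equal allocation, and no continuity correction. Because sampsi applies a continuity correction by default, its answer is expected to be somewhat larger than this figure.

```stata
* Assumed inputs from the exercise: p1 = 0.70, p2 = 0.85
scalar p1 = 0.70
scalar p2 = 0.85
scalar pbar = (p1 + p2)/2

* Numerator of the sample size formula (two-sided alpha 0.05, power 0.90)
scalar num = invnormal(0.975)*sqrt(2*pbar*(1 - pbar)) + invnormal(0.90)*sqrt(p1*(1 - p1) + p2*(1 - p2))

* Sample size per arm without a continuity correction, rounded up
display ceil((num/(p1 - p2))^2)
```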
Tutorial: Contingency Tables
A well-known statistician once said, "A PhD student could write an entire dissertation on two-by-two tables only." Continuing our health disparities research, we now consider the odds ratio and the Pearson chi-square test.
Exercises
Using data from the respondents of the CHIS survey, construct a 2x2 table comparing poverty level versus past doctor visit. Display the row frequencies and the expected cell counts.
. tabulate poverty doctor, row expected
Construct the odds ratio and corresponding confidence interval (CI) for visiting the doctor in the past 12 months for those above and below the poverty line.
. gen nopov = 1-pov
. cs doctor nopov, or woolf
Notice that the woolf option is used, denoting that we want standard errors calculated using the formula presented in class.
Conduct a Pearson's chi-square test to examine the association between poverty and prior doctor visit.
. cs doctor poverty, or woolf
Or use tabulate:
. tabulate poverty doctor, expected chi2
Note that tabulate extends nicely to RxC tables:
. tabulate racecat doctor, expected chi2
What are the null and alternative hypotheses?
Null: no association between above/below poverty line and whether an individual visited the doctor in the past 12 months. Alternative: there is an association.
Null: OR = 1. Alternative: OR ≠ 1.
Are the expected cell counts sufficiently large?
All expected cell counts are greater than [...].
What is the value and distribution of the test statistic under the null hypothesis?
χ² = [...] ~ χ²(1)
What is the p-value?
p = [...]
Do you reject the null hypothesis? What is your conclusion?
We reject the null hypothesis and conclude that there is evidence in the data that the odds of visiting the doctor in the past 12 months are higher in those who are above the poverty line.
For the above and below poverty groups, compare the following:
• CI for the odds ratio
• CI for the difference in two proportions from the previous tutorial
• the p-value from the two-sample proportion test from the previous tutorial
• the p-value from the Pearson chi-square test
(a) Do you get the same general conclusion with each test?
Difference in proportions between above and below poverty groups: [...] with CI [...]
Odds ratio for above and below groups: [...] with CI [...]
Pearson chi-square: p = [...]
Two-sample proportion test: p = [...]
(b) Which test do you find most useful?
It's always good to show both a p-value and a confidence interval. Note that you can't show a confidence interval for the risk difference if you have a case-control study.
Tutorial: Inference for Paired Data using McNemar's Test, Part [...]
Consider the following study from Dekkers et al. ([...]) that compared two different screening tests for determining adrenal insufficiency. Adrenal insufficiency is a condition in which the adrenal glands do not produce adequate amounts of certain hormones. The screening test involves measuring a patient's cortisol response after administration of an intravenous bolus of adrenocorticotropic hormone (ACTH). Currently, two doses of ACTH are used for diagnostic purposes in patients with suspected adrenal insufficiency: 1 μg and 250 μg (Dekkers et al., [...]). There is an ongoing debate about which dose should be used for the initial assessment of adrenal function (Dekkers et al., [...]). The goal of this study was to compare the cortisol response of the 1 μg and 250 μg ACTH test among patients with suspected adrenal insufficiency. Patients with cortisol concentrations of [...] nmol/l after ACTH stimulation (considered normal cortisol response) were classified as not having adrenal insufficiency. This was a retrospective cohort study whereby patients who received both the 1 μg and 250 μg ACTH test between January [...] and December [...] were included for analysis. The data can be found in the AI.dta dataset.
Source: Dekkers OM, Timmermans JM, Smit JW, Romijn JA, Pereira AM. Comparison of the cortisol responses to testing with two doses of ACTH in patients with suspected adrenal insufficiency. Eur J Endocrinol. [...] Jan; [...].
Since this is paired data, we decide to use McNemar's test. State the null and alternative hypothesis for McNemar's test.
Null: The proportion of patients classified as having adrenal insufficiency using the 1 μg test is the same as the proportion of patients classified as having adrenal insufficiency using the 250 μg test.
Alternative: Those proportions are not equal.
Is this the same as testing that the proportion of patients classified as not having adrenal insufficiency using the 1 μg test is the same as the proportion of patients classified as not having adrenal insufficiency using the 250 μg test?
Use the tabulate command to summarize the data.
. tabulate one two

           |          two
       one |  Abnormal     Normal |     Total
-----------+----------------------+----------
  Abnormal |        42         19 |        61
    Normal |        14        132 |       146
-----------+----------------------+----------
     Total |        56        151 |       207

(a) How many discordant pairs are there?
Carry out McNemar's test in Stata at the α = [...] significance level.
. mcc one two

                 | Controls               |
Cases            |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
         Exposed |       132          14  |        146
       Unexposed |        19          42  |         61
-----------------+------------------------+------------
           Total |       151          56  |        207

McNemar's chi2(1) =      0.76    Prob > chi2 = 0.3841
Exact McNemar significance probability       = 0.4869

Proportion with factor
        Cases       .705314
        Controls   .7294686     [95% Conf. Interval]
                  ---------     --------------------
        difference -.0241546     -.0832778   .0349687
        ratio       .9668874      .8962794   1.043058
        rel. diff. -.0892857     -.2991256   .1205541

        odds ratio  .7368421      .3418529   1.550025   (exact)
(a) What is the test statistic? Null distribution? P-value?
The test statistic is 0.76. The null distribution of the test statistic is chi-squared with 1 degree of freedom. The p-value is 0.3841. Note there is an exact test version of McNemar's test based on the binomial distribution, leading to a p-value of 0.4869, which was the p-value reported in the paper.
(b) What is your conclusion?
Since our p-value is greater than [...], we fail to reject the null hypothesis. Thus, we have no evidence that the proportion of patients classified as having adrenal insufficiency using the 1 μg test is different from the proportion of patients classified as having adrenal insufficiency using the 250 μg test.
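McNemar's statistic depends only on the two discordant cells, 19 and 14 in the table above. The sketch below reproduces the large-sample statistic and its p-value by hand, then runs the immediate commands mcci and bitesti on the same counts; the cell order for mcci follows the mcc output (132, 14, 19, 42).

```stata
* Number of discordant pairs
display 19 + 14

* McNemar's chi-squared statistic and its p-value on 1 degree of freedom
scalar mcnemar = (19 - 14)^2 / (19 + 14)
display mcnemar
display chi2tail(1, mcnemar)

* Immediate form of mcc, cells entered row by row
mcci 132 14 19 42

* Exact version: 14 of the 33 discordant pairs, tested against p = 0.5
bitesti 33 14 0.5
```

The concordant pairs (132 and 42) do not enter the test statistic at all; they only affect the reported proportions.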
Tutorial: Inference for Matched Data using McNemar’s Test
To incorporate more individual information into our analysis, we match individuals who were below the poverty line to an individual who was above the poverty line based on age, urban vs. rural location, race, and gender. Note that we could incorporate more covariates to improve the matches. We conduct McNemar's test to examine the relationship between poverty and doctor visits among matched pairs. Open the dataset chis_matched.dta.
. mcc doctor_0 doctor_1
State the null and alternative hypothesis for McNemar's test.
Null: there is no association between poverty and visiting the doctor in the past 12 months.
Alternative: there is an association between poverty and visiting the doctor in the past 12 months.
A subtle side note: we are now generalizing to a slightly different population. Because of the way we implemented our matching scheme, we are no longer making inference about all California residents. Rather, we are making inference with respect to the population with a covariate pattern (age, race, location, and gender) similar to the population below the poverty level.
How many pairs contribute to the test statistic?
Only [...] discordant pairs contribute to the test statistic.
Due to the small sample size (number of discordant pairs less than [...]), the normal approximation is dubious in this instance. There is an exact test based on the binomial distribution, which does not rely on large-sample approximations.
Using a large-sample test, what is the test statistic? Null distribution? P-value? Compare to the exact test.
χ² = [...] ~ χ²(1)
p = [...]
Note that the more conservative exact test results in a p-value of [...], similar to the large-sample result.
What is the odds ratio? Compare to the OR from the non-matched analysis.
From the non-matched analysis, the OR was [...] with CI [...].
Comparing the results of McNemar's test to the Pearson chi-square test, consider the following question: Even though we have decreased the sample size, do we gain power by matching? Which test provides stronger evidence that poverty impacts whether or not someone goes to the doctor each year?
Because we are comparing doctor visits among similar individuals, we gain some power by matching.
Survey Data Analysis in Stata
Example
Real-world, publicly available survey data is often very complex (see the DHS example). Consequently, we will contrive an example for this tutorial, estimating p, the prevalence of a disease, say malaria, in a hypothetical country, called “Inventia”.
Country profile:

Province    Population size    Number of districts
1           225,000            50
2           150,000            42
3           100,000            32
4           25,000             23
Total       500,000            146

In Inventia, the climate differs between provinces; for instance, province 4 is more arid and at a higher altitude than the rest of the country. Consequently, the prevalence of malaria p differs between different provinces. Also, access to malaria prevention is not consistent across the country, and subsequently p may also vary somewhat between districts. (For instance, urban populations may have more resources to prevent malaria, and thus a lower prevalence.) The true prevalence of malaria in Inventia is 13.1%.
Today, we review how to analyze data from several different survey designs:
• Simple Random Sampling - We randomly sample 1,000 people from Inventia.
• Stratified Sampling - We randomly sample 250 people from each of the 4 provinces of Inventia.
• Cluster Sampling - We randomly sample 25 districts from Inventia and randomly sample 40 people within each district.
• Stratified Cluster Sampling - For each of the 4 provinces, we randomly sample 5 districts. Within each of these 20 districts, we randomly sample 50 people.
Analyzing Survey Data in Stata
In order to analyze survey data in Stata, you must first svyset your data. This command tells Stata what survey design was used to obtain the data. This includes specification of survey weights, the finite population correction(s), and levels of clustering and stratification.
Once Stata has this information, it incorporates the specified design elements into its calculations. You can then use the survey estimation procedures in Stata. For example, svy: mean varname, svy: proportion varname, svy: regress ....
Before analyzing your survey data, you need to be able to answer the following questions:
1. What is the design of my survey?
2. Am I using a finite population correction? At which stage of the design?
3. What are the survey weights used in the design?
Once you know these things, you can start analyzing your data in Stata.
1 Simple Random Sampling
Design: We randomly sample 1,000 people from the entire country of Inventia.
Notation:
• N is the total population size
• n is the number of individuals sampled from the population without replacement
In our case, n = 1,000, N = 500,000.
Finite Population Correction: $1 - f = 1 - \frac{n}{N}$
Survey Weights: $w_i = P(\text{individual } i \text{ is included in the survey})^{-1} = \frac{N}{n}$
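Plugging the Inventia numbers (N = 500,000 and n = 1,000) into the weight and the finite population correction shows their scale. This is a small arithmetic sketch only; it does not touch the dataset.

```stata
* Inventia: N = 500,000 people, n = 1,000 sampled
scalar N = 500000
scalar n = 1000

* Survey weight: each sampled person represents N/n people
display N/n

* Finite population correction 1 - f, which is very close to 1 here
display 1 - n/N
```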
Exercise: Estimate the prevalence of malaria in Inventia.
use "srs.dta", clear
generate weight_srs = pop_size/1000
generate fpc = 1000/pop_size // note that this does not match the definition above
svyset id [pweight=weight_srs], fpc(fpc)
svy: proportion malaria
svyset id [pweight=weight_srs]
svy: proportion malaria
estat effects, deff
proportion malaria
Under simple random sampling (SRS), when will proportion malaria and svy: proportion malaria give you the same results? Why?
Why does it not matter much if you use the finite population correction in this example?
Exercise: Estimate the prevalence of malaria in each of the four provinces.
svy, sub(if province==1): proportion malaria
svy, sub(if province==2): proportion malaria
svy, sub(if province==3): proportion malaria
svy, sub(if province==4): proportion malaria
Is there evidence of province-level variation in malaria prevalence?
2 Stratified Sampling
Design: We randomly sample 250 people from each of the 4 provinces of Inventia.
Notation:
• N is the total population size
• Nj is the population in province j, j = 1, 2, 3, 4
• nj individuals are sampled from province j
The important design question in stratified sampling is how to choose the sample size within each stratum. In our case, N1 = 225,000, N2 = 150,000, N3 = 100,000 and N4 = 25,000. nj = 250 for each j.
Finite Population Correction: $1 - f_j = 1 - \frac{n_j}{N_j}$
Survey Weights: $w_{ij} = P(\text{individual } i \text{ in stratum } j \text{ is in the survey})^{-1} = \frac{N_j}{n_j}$
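Because the same number of people (250) is sampled from provinces of very different sizes (225,000, 150,000, 100,000 and 25,000), the weights differ a great deal across strata. The loop below is a small sketch that prints the weight and the finite population correction for each province.

```stata
* Province population sizes from the country profile; 250 sampled per province
foreach Nj in 225000 150000 100000 25000 {
    display "N_j = " `Nj' "   weight = " `Nj'/250 "   1 - f_j = " 1 - 250/`Nj'
}
```

A person sampled in province 1 stands for nine times as many people as one sampled in province 4, which is why an unweighted analysis over-represents the smallest province.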
Exercise: Estimate the prevalence of malaria in Inventia.
use "stratified.dta", clear
proportion malaria
proportion malaria, over(province)
generate weight_stratified = prov_size/250
generate fpc_stratified = 1/weight_stratified
svyset id [pweight=weight_stratified], strata(province) fpc(fpc_stratified)
svydescribe weight
svy: proportion malaria
estat effects, deff
Exercise: Why is our estimate of p too low when we do not specify the survey design?
3 Cluster Sampling
Design: We randomly sample 25 districts (clusters) from Inventia; within each district, we randomly sample 40 people.
Notation:
• N is the total population size
• Nk is the population size in district k, k = 1, ..., 146
• nI out of NI total districts are sampled for inclusion in the survey (primary sampling unit)
• nk individuals in district k are selected for inclusion in the survey (secondary sampling unit)
In our survey, nI = 25, NI = 146, nk = 40, and Nk is the population size in district k.
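Finite Population Correction:
Stage I: $1 - f_I = 1 - \frac{n_I}{N_I}$
Stage II: $1 - f_k = 1 - \frac{n_k}{N_k}$
Survey Weights:
$$w_{ik} = P(\text{individual } i \text{ in cluster } k \text{ is in the survey})^{-1} = \left[P(\text{cluster } k \text{ selected}) \times P(\text{individual } i \text{ in cluster } k \text{ selected} \mid \text{cluster } k \text{ selected})\right]^{-1} = \frac{N_I}{n_I} \times \frac{N_k}{n_k}$$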
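To see how the two-stage weight is built up, the sketch below evaluates it for a single district, with 25 of 146 districts sampled and 40 people sampled per district. The district size of 3,400 is hypothetical and used only for illustration; in the data it would come from the district size variable.

```stata
* Hypothetical district size N_k = 3,400 (illustrative only, not from the data)
scalar Nk = 3400

* Stage I: probability the district is sampled; Stage II: person within district
scalar p_district = 25/146
scalar p_person   = 40/Nk

* Weight = inverse of the overall inclusion probability
display 1/(p_district*p_person)

* Equivalent form (N_I/n_I) * (N_k/n_k)
display (146/25)*(Nk/40)
```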
Exercise: Estimate the prevalence of malaria in Inventia, using only the first stage finite population correction.
use "cluster.dta", clear
generate fpc1 = 25/146
generate fpc2 = 40/districtsize
generate weight_cluster = (fpc1*fpc2)^-1
svyset district [pweight=weight_cluster], fpc(fpc1) || id, fpc(fpc2)
svy: proportion malaria
estat effects, deff
4 Stratified Cluster Sampling
We could combine stratified, cluster and simple random sampling all into one design!
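Design: For each of the 4 provinces, we randomly sample 5 districts. Within each of the 20 districts, we randomly sample 50 people.
Survey weights: As an example, for province 2:
$$P(\text{person } i \text{ in district } j \text{ in province 2 in survey}) = P(\text{district } j \text{ in survey} \mid \text{province 2}) \times P(\text{person } i \text{ in survey} \mid \text{district } j) = \frac{5}{42} \times \frac{50}{\text{districtsize}_j}$$
The survey weight is the inverse of this probability.
Finite population correction:
Stage I: $\frac{\#\text{ sampled districts}}{\text{total } \#\text{ districts in the province}}$
Stage II: $\frac{\#\text{ sampled per district}}{\text{district population}} = \frac{50}{\text{districtsize}_j}$ for district j.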
Exercise: Estimate the prevalence of malaria in Inventia.
use "stratifiedcluster.dta", clear
generate fpc1 = 5/ndistrict
generate fpc2 = 50/districtsize
generate weight_stratcluster = (fpc1*fpc2)^-1
svyset district [pweight=weight_stratcluster], fpc(fpc1) strata(province) || id, fpc(fpc2)
svy: proportion malaria
estat effects, deff
6
Survey Data Analysis in Stata
Example
Real-world, publicly available survey data is often very complex (see the DHS example).Consequently, we will contrive an example for this tutorial, estimating p, the prevalence of adisease, say malaria, in a hypothetical country, called “Inventia”.
Country profile:
Province Population size Number of districts1 225,000 502 150,000 423 100,000 324 25,000 23Total 500,000 146
In Inventia, the climate differs between provinces; for instance, province 4 is more arid andat a higher altitudes than the rest of the country. Consequently, the prevalence of malaria pdiffers between different provinces. Also, access to malaria prevention is not consistent acrossthe country, and subsequently p may also vary somewhat between districts. (For instance, ur-ban populations may have more resources to prevent malaria, and thus a lower prevalence.)The true prevalence of malaria in Inventia is 13.1%.
Today, we review how to analyze data from several different survey designs:
• Simple Random Sampling - We randomly sample 1,000 people from Inventia.
• Stratified Sampling - We randomly sample 250 people from each of the 4 provinces ofInventia.
• Cluster Sampling - We randomly sample 25 districts from Inventia and randomly sample40 people within each district.
• Stratified Cluster sampling - For each of the 4 provinces, we randomly sample 5 dis-tricts. Within these 20 districts, we randomly sample 50 people.
1
Analyzing Survey Data in Stata
In order to analyze survey data in Stata, you must first svyset your data. This commandtells Stata what survey design was used to obtain the data. This includes specification of surveyweights, the finite population correction(s), and levels of clustering and stratification.
Once Stata has this information, it incorporates the specified design elements into its calcu-lations. You can then use the survey estimation procedures in Stata. For example, svy: mean
var name, svy: proportion var name, svy: regress ....
Before analyzing your survey data, you need to be able to answer the following questions:
1. What is the design of my survey?
2. Am I using a finite population correction? At which stage of the design?
3. What are the survey weights used in the design?
Once you know these things, you can start analyzing your data in Stata.
1 Simple Random Sampling
Design: We randomly sample 1,000 people from the entire country of Inventia.
Notation:
• N is the total population size
• n is the number of individuals sampled from the population without replacement
In our case, n = 1,000 and N = 500,000.
Finite Population Correction: $1 - f = 1 - \frac{n}{N}$
Survey Weights: $w_i = P(\text{individual } i \text{ is included in the survey})^{-1} = \frac{N}{n}$
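To make these definitions concrete, the following minimal sketch plugs the Inventia values N = 500,000 and n = 1,000 into the two formulas. It uses only scalar arithmetic and needs no dataset; the scalar names (Npop, nsamp, and so on) are illustrative and are not variables in the tutorial files.

```stata
* Illustrative scalars (names are assumptions, not tutorial variables)
scalar Npop  = 500000
scalar nsamp = 1000
* Sampling fraction f = n/N and finite population correction 1 - f
scalar fsamp  = nsamp/Npop
scalar fpcval = 1 - fsamp
* Survey weight N/n: the number of people each sampled person represents
scalar wsrs = Npop/nsamp
* When the FPC is applied, the standard error is scaled by sqrt(1 - f)
scalar sescale = sqrt(fpcval)
display "sampling fraction f = " fsamp
display "FPC 1 - f           = " fpcval
display "survey weight N/n   = " wsrs
display "SE scaling factor   = " sescale
```

Each sampled person stands in for 500 people, and 1 - f = 0.998, so applying the correction changes the standard error by only about 0.1%.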
Exercise: Estimate the prevalence of malaria in Inventia.
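use "srs.dta", clear
* Sampling weight N/n and sampling fraction n/N
generate weight_srs = pop_size/1000
* Note: Stata's fpc() takes the sampling fraction f = n/N, not 1 - f as defined above
generate fpc = 1000/pop_size
* With the finite population correction
svyset id [pweight=weight_srs], fpc(fpc)
svy: proportion malaria
* Without the finite population correction
svyset id [pweight=weight_srs]
svy: proportion malaria
estat effects, deff
* Estimate that ignores the survey design
proportion malaria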
Under simple random sampling (SRS), when will proportion malaria and svy: proportion malaria give you the same results? Why?
Why does it not matter much if you use the finite population correction in this example?
Exercise: Estimate the prevalence of malaria in each of the four provinces.
svy, sub(if province==1): proportion malaria
svy, sub(if province==2): proportion malaria
svy, sub(if province==3): proportion malaria
svy, sub(if province==4): proportion malaria
Is there evidence of province-level variation in malaria prevalence?
2 Stratified Sampling
Design: We randomly sample 250 people from each of the 4 provinces of Inventia.
Notation:
• N is the total population size
• $N_j$ is the population in province j, j = 1, 2, 3, 4
• $n_j$ individuals are sampled from province j
The important design question in stratified sampling is how to choose the sample size within each stratum. In our case, $N_1$ = 225,000, $N_2$ = 150,000, $N_3$ = 100,000, and $N_4$ = 25,000, with $n_j$ = 250 for each j.
Finite Population Correction: $1 - f_j = 1 - \frac{n_j}{N_j}$
Survey Weights: $w_{ij} = P(\text{individual } i \text{ in stratum } j \text{ is in the survey})^{-1} = \frac{N_j}{n_j}$
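As a quick illustration of how unequal these weights are, the sketch below types in the four province sizes from the country profile and computes the weight and sampling fraction for each stratum. The variable name prov_size mirrors the one used in the exercise code; n_sampled, weight, and samp_frac are names made up for this example.

```stata
clear
input province prov_size
1 225000
2 150000
3 100000
4  25000
end
* n_j is the same in every province
generate n_sampled = 250
* Survey weight w_ij = N_j/n_j
generate weight = prov_size/n_sampled
* Sampling fraction f_j = n_j/N_j
generate samp_frac = n_sampled/prov_size
list province prov_size weight samp_frac, noobs
```

A respondent from province 1 represents 900 people while a respondent from province 4 represents only 100, so an unweighted analysis gives province 4 nine times more influence per resident than province 1.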
Exercise: Estimate the prevalence of malaria in Inventia.
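use "stratified.dta", clear
* Estimates that ignore the survey design
proportion malaria
proportion malaria, over(province)
* Weight N_j/n_j and sampling fraction n_j/N_j for each province
generate weight_stratified = prov_size/250
generate fpc_stratified = 1/weight_stratified
svyset id [pweight=weight_stratified], strata(province) fpc(fpc_stratified)
* Describe the declared design and check the weight variable for missing values
svydescribe weight_stratified
svy: proportion malaria
estat effects, deff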
Exercise: Why is our estimate of p too low when we do not specify the survey design?
3 Cluster Sampling
Design: We randomly sample 25 districts (clusters) from Inventia; within each district, we randomly sample 40 people.
Notation:
• N is the total population size
• $N_k$ is the population size in district k, k = 1, ..., 146
• $n_I$ out of $N_I$ total districts are sampled for inclusion in the survey (primary sampling units)
• $n_k$ individuals in district k are selected for inclusion in the survey (secondary sampling units)
In our survey, $n_I$ = 25, $N_I$ = 146, $n_k$ = 40, and $N_k$ is the population size in district k.
Finite Population Correction:
Stage I: $1 - f_I = 1 - \frac{n_I}{N_I}$
Stage II: $1 - f_k = 1 - \frac{n_k}{N_k}$
Survey Weights:
$w_{ik} = P(\text{individual } i \text{ in cluster } k \text{ is in the survey})^{-1}$
$= [P(\text{cluster } k \text{ selected}) \times P(\text{individual } i \text{ in cluster } k \text{ selected} \mid \text{cluster } k \text{ selected})]^{-1}$
$= \frac{N_I}{n_I} \times \frac{N_k}{n_k}$
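The two-stage weight can be checked by hand for a single district. The sketch below assumes a hypothetical district of 3,000 people; that size is invented for illustration and is not taken from cluster.dta. The scalar names are likewise illustrative.

```stata
* Design constants from the tutorial
scalar n_I = 25
scalar N_I = 146
scalar n_k = 40
* Hypothetical district population, chosen only for illustration
scalar N_k = 3000
* P(cluster selected) * P(individual selected | cluster selected)
scalar p_incl = (n_I/N_I)*(n_k/N_k)
* Survey weight as the inverse inclusion probability
scalar w_inverse = 1/p_incl
* The same weight written as (N_I/n_I) * (N_k/n_k)
scalar w_product = (N_I/n_I)*(N_k/n_k)
display "inclusion probability = " p_incl
display "weight (inverse)      = " w_inverse
display "weight (product form) = " w_product
```

Both forms give a weight of 438: each sampled person in such a district would represent 438 people in Inventia.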
Exercise: Estimate the prevalence of malaria in Inventia, using only the first stage finite population correction.
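use "cluster.dta", clear
* Stage-wise sampling fractions
generate fpc1 = 25/146
generate fpc2 = 40/districtsize
* Weight = inverse of the overall inclusion probability
generate weight_cluster = 1/(fpc1*fpc2)
* Stage 1: districts; stage 2: individuals within districts
* As written, both stages are corrected; omit "|| id, fpc(fpc2)" to apply only the first-stage correction
svyset district [pweight=weight_cluster], fpc(fpc1) || id, fpc(fpc2)
svy: proportion malaria
estat effects, deff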
4 Stratified Cluster Sampling
We could combine stratified, cluster and simple random sampling all into one design!
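Design: For each of the 4 provinces, we randomly sample 5 districts. Within each of the 20 districts, we randomly sample 50 people.
Survey weights: As an example, for province 2:
$P(\text{person } i \text{ in district } j \text{ in province 2 is in the survey})$
$= P(\text{district } j \text{ in survey} \mid \text{province 2}) \times P(\text{person } i \text{ in survey} \mid \text{district } j \text{ in survey})$
$= \frac{5}{42} \times \frac{50}{\text{districtsize}_j}$
The survey weight is the inverse of this probability.
Finite population correction (given here as sampling fractions, the form passed to fpc() in the code):
Stage I: (# sampled districts) / (total # districts in the province)
Stage II: (# sampled per district) / (district population) = 50 / districtsize_j for district j.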
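The first-stage factor differs by province because the provinces contain different numbers of districts. The sketch below types in the district counts from the country profile and computes the first-stage sampling fraction and its inverse; it then evaluates the full weight for a hypothetical province 2 district of 3,500 people (an invented size, used only for illustration). The variable name ndistrict mirrors the exercise code; stage1_weight and w_example are made-up names.

```stata
clear
input province ndistrict
1 50
2 42
3 32
4 23
end
* First-stage sampling fraction: 5 sampled districts per province
generate fpc1 = 5/ndistrict
* Number of districts represented by each sampled district
generate stage1_weight = 1/fpc1
list province ndistrict fpc1 stage1_weight, noobs
* Full weight for a hypothetical province 2 district of 3,500 people
scalar w_example = (42/5)*(3500/50)
display "example weight = " w_example
```

For the hypothetical district the weight is 8.4 x 70 = 588.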
Exercise: Estimate the prevalence of malaria in Inventia.
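use "stratifiedcluster.dta", clear
* Stage-wise sampling fractions (ndistrict is presumably the number of districts in the province)
generate fpc1 = 5/ndistrict
generate fpc2 = 50/districtsize
* Weight = inverse of the overall inclusion probability
generate weight_stratcluster = 1/(fpc1*fpc2)
* Districts are sampled within provinces (strata); individuals are sampled within districts
svyset district [pweight=weight_stratcluster], strata(province) fpc(fpc1) || id, fpc(fpc2)
svy: proportion malaria
estat effects, deff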
Tutorial: Non-response bias in surveys
Non-response is a huge issue in many surveys (Groves and Peytcheva, 2008). Survey non-response leads to significant bias if response is correlated with the survey indicators of interest. We use a simple example from the Framingham study to illustrate this concept.
Source: Groves, R.M. and Peytcheva, E. (2008). The impact of nonresponse rates on nonresponse bias. Public Opinion Quarterly, 72(2): 167-189. (I found a free draft via Google.)
Example:
• Suppose blood samples from the participants at baseline got lost; rather than measure everyone in the population again, the study investigators decided to try to estimate the baseline prevalence of high cholesterol (cholesterol > 240 mg/dL). They randomly sampled 400 individuals and asked them to return to the study center to have their cholesterol measured again, knowing that not all 400 would return for the re-test.
• The willingness of a participant to revisit the lab was correlated with the frailty of the individual, sex, and prior knowledge of high cholesterol. With a lot of missing data, we would expect to obtain biased estimates of high cholesterol prevalence.
• Prevalence of high cholesterol at baseline was 43.1% in the Framingham cohort.
We consider three different scenarios:
A. Low response rate
B. Moderate response rate
C. High response rate
Exercise: Calculate the prevalence of high cholesterol for each of the three response rate settings, as well as for the complete sample of 400 individuals.
proportion highchol
proportion highcholA highcholB highcholC
proportion highcholA
proportion highcholB
proportion highcholC
As suspected, bias increases with the amount of missingness.
We have baseline covariate data from the Framingham study. We can estimate the probability that a sampled individual returns to have his/her cholesterol tested again as a function of these covariates.
If we knew these probabilities exactly, we could obtain an unbiased estimate of high cholesterol prevalence at baseline. In this example, we do have these probabilities (pA, pB, and pC in the dataset).
Exercise: Calculate the prevalence of high cholesterol for each of the three response rate settings using the survey weights, and compare to the complete-case data.
gen wA = 1/pA
gen wB = 1/pB
gen wC = 1/pC
proportion highchol
proportion highcholA [pweight=wA]
proportion highcholB [pweight=wB]
proportion highcholC [pweight=wC]
Here we recovered unbiased estimates. In practice, however, we will never know pA, pB, and pC exactly. Many methods have been developed to address survey non-response, including multiple imputation and weighting for non-response. Maximizing the response rate is always the best policy.
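The weighting idea can be seen in a deliberately tiny example. The sketch below assumes a made-up sample of 10 people (4 with high cholesterol, 6 without) in which people with high cholesterol respond with probability 0.5 and everyone else responds with probability 1; only the 8 respondents are entered. None of these numbers come from the Framingham data, and the variable names are illustrative. The mean of a 0/1 indicator is used because it equals the proportion.

```stata
clear
input highchol p_respond
1 0.5
1 0.5
0 1
0 1
0 1
0 1
0 1
0 1
end
* Inverse-probability-of-response weight
generate w = 1/p_respond
* Complete-case estimate: 2 of 8 respondents, i.e. 0.25
mean highchol
* Weighted estimate: (2*2)/(2*2 + 6*1) = 0.40, the prevalence in the full sample of 10
mean highchol [pweight=w]
```

Each responding person with high cholesterol counts twice, which restores the two similar people who did not respond.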
Tutorial: Correlation in Stata
The World Bank (http://data.worldbank.org) is a great source of free public data on trends in health and economics around the world. In this example, we use public data from the World Bank to examine trends in immunization coverage for measles and DPT over time in low-income countries. Open the dataset WorldBank.dta.
1. Calculate the pairwise correlations between measles vaccination coverage, DPT vaccination coverage, and time.
pwcorr measles dpt year
2. Make a scatter plot including both measles and DPT immunization coverage on the plot. Does the plot explain the results above?
twoway (scatter dpt year) (scatter measles year)
Yes, there is a very strong positive relationship between time and immunization coverage. Further, it seems evident that trends in scaling up immunization were similar for measles and DPT.
3. Test whether there is a linear relationship between time and measles vaccination coverage. What are the null and alternative hypotheses? What is your conclusion?
pwcorr measles year, sig
Statistics > Summaries, Tables, and Tests > Summaries and Descriptive Statistics > Pairwise Correlations
4. Test for a monotonic relationship between time and measles vaccination coverage. What are the null and alternative hypotheses? What is your conclusion?
spearman measles year
Statistics > Summaries, tables, and tests > Nonparametric tests of hypotheses > Spearman's rank correlation
5. Why do you think the correlations are so high in this example? Should you always have such high aspirations regarding the magnitude of your correlation coefficients when analyzing public health data?
Source: Created from World Bank, World Development Indicators and Global Development Finance. Vaccination coverage from WHO and UNICEF.
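The significance level that pwcorr reports with the sig option comes from a t test of the null hypothesis that the correlation is zero, with t = r*sqrt((n-2)/(1-r^2)) on n-2 degrees of freedom. The sketch below reproduces that calculation. Because WorldBank.dta is a course file, it uses the auto dataset that ships with Stata as a stand-in; price and mpg are simply two convenient numeric variables, and the scalar names are illustrative.

```stata
* Example data shipped with Stata, standing in for WorldBank.dta
sysuse auto, clear
quietly correlate price mpg
scalar rho_hat = r(rho)
scalar n_obs   = r(N)
* t statistic for H0: correlation = 0
scalar t_stat = rho_hat*sqrt((n_obs - 2)/(1 - rho_hat^2))
scalar p_two  = 2*ttail(n_obs - 2, abs(t_stat))
display "r = " rho_hat
display "t = " t_stat
display "two-sided p = " p_two
* The significance level printed here should match p_two
pwcorr price mpg, sig
* Rank-based alternative for a monotonic relationship
spearman price mpg
```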
Tutorial: Non-Parametric Tests in Stata
The Sign Test and Wilcoxon Signed-Rank Test
Consider the following table taken from Whitley and Ball (2002) showing central venous oxygen saturation in 10 patients at admission and 6 hours after admission to an intensive care unit (ICU).
Table 1: Central Venous Oxygen Saturation (%)
Subject   At admission   6 hours after admission to ICU
 1        39.7           52.9
 2        59.1           56.7
 3        56.1           61.9
 4        57.7           71.4
 5        60.6           67.7
 6        37.8           50.0
 7        58.2           60.7
 8        33.6           51.3
 9        56.0           59.5
10        65.3           59.8
E. Whitley and J. Ball. Statistics review 6: Nonparametric methods. Crit Care. 2002; 6(6): 509–513.
It is hypothesized that after 6 hours in the ICU central venous oxygen saturation should increase. The authors are interested in whether the apparent increase in central venous oxygen saturation is likely to reflect a genuine effect of admission and treatment or whether it is simply due to chance.
The data are located in the CVOS.dta data set. In this example we want to know whether there is a difference in central venous oxygen saturation at admission compared to 6 hours after admission to the ICU. That is, we want to know whether 6 hours in the ICU has an effect on central venous oxygen saturation.
1. Are the data independent or dependent? What parametric and nonparametric tests are available for this type of data?
Dependent: we measure central venous oxygen saturation at admission and 6 hours after admission on the same subjects.
Parametric test: paired t-test
Non-parametric tests: sign test, Wilcoxon signed-rank test
2. What type of statistical test is most appropriate for this data and why?
We should probably use a non-parametric test since we have a small sample size. Furthermore, we have no information to suggest that the differences are normally distributed. You could also make a histogram of the differences to inspect normality.
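One way to inspect the differences is sketched below. It types in the ten pairs from Table 1 so that it runs without CVOS.dta; the variable names t0 and t6 follow the commands used later in this tutorial, while subject and diff are names chosen for this example.

```stata
clear
input subject t0 t6
 1 39.7 52.9
 2 59.1 56.7
 3 56.1 61.9
 4 57.7 71.4
 5 60.6 67.7
 6 37.8 50.0
 7 58.2 60.7
 8 33.6 51.3
 9 56.0 59.5
10 65.3 59.8
end
* Paired differences: 6 hours minus admission
generate diff = t6 - t0
summarize diff, detail
* With only 10 observations the shape is hard to judge from a histogram
histogram diff, frequency
* Normal quantile plot of the differences
qnorm diff
```

With so few observations neither plot can confirm normality, which is part of the argument for a non-parametric test here.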
3. Suppose we decide to use the sign test. What are the null and alternative hypotheses?
The null hypothesis is that the median of the differences is equal to zero. The alternative is that the median of the differences is not equal to zero.
4. Perform a sign test in Stata at alpha = 0.05. What is the value of your test statistic? Your p-value? Your decision? Your interpretation?
You may use the following drop-down menus to access the signtest command: Statistics / Summaries, tables, and tests / Nonparametric tests of hypotheses / Test equality of matched pairs.

. signtest t6=t0

Sign test

        sign |    observed    expected
-------------+------------------------
    positive |           8           5
    negative |           2           5
        zero |           0           0
-------------+------------------------
         all |          10          10

One-sided tests:
  Ho: median of t6 - t0 = 0 vs.
  Ha: median of t6 - t0 > 0
      Pr(#positive >= 8) =
         Binomial(n = 10, x >= 8, p = 0.5) =  0.0547

  Ho: median of t6 - t0 = 0 vs.
  Ha: median of t6 - t0 < 0
      Pr(#negative >= 2) =
         Binomial(n = 10, x >= 2, p = 0.5) =  0.9893

Two-sided test:
  Ho: median of t6 - t0 = 0 vs.
  Ha: median of t6 - t0 != 0
      Pr(#positive >= 8 or #negative >= 8) =
         min(1, 2*Binomial(n = 10, x >= 8, p = 0.5)) =  0.1094

Our test statistic is D = 8, since eight of the ten differences are positive. Stata uses the binomial distribution to generate the p-value. Our two-sided p-value is 0.1094. Thus, we fail to reject the null hypothesis: using the sign test, we do not find evidence that median central venous oxygen saturation is different at admission and 6 hours after admission to the ICU.
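The p-values in the sign test output are exact binomial tail probabilities with n = 10 and p = 0.5, so they can be reproduced directly with Stata's binomialtail() function. The scalar names below are illustrative.

```stata
* Pr(#positive >= 8) under H0
scalar p_upper = binomialtail(10, 8, 0.5)
* Pr(#negative >= 2) under H0
scalar p_lower = binomialtail(10, 2, 0.5)
* Two-sided p-value, capped at 1
scalar p_two = min(1, 2*binomialtail(10, 8, 0.5))
display "one-sided (Ha: median > 0) = " p_upper
display "one-sided (Ha: median < 0) = " p_lower
display "two-sided                  = " p_two
```

These evaluate to 56/1024 = 0.0547, 1013/1024 = 0.9893, and 112/1024 = 0.1094, matching the output.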
5. Suppose that instead of conducting the sign test we conduct the Wilcoxon signed-rank test. Which test has more power? Why? The signed-rank test has more power since it incorporates the magnitude of differences via the ranks.
6. State the null and alternative hypothesis for the Wilcoxon signed-rank test.
The null hypothesis is that the median of the differences is equal to zero. The alternative is that the median of the differences is not equal to zero.
7. Perform a signed-rank test in Stata at the alpha = 0.05 level. What is the value of your test statistic? Your p-value? Your decision? Your interpretation? You may use the following drop-down menus to access the signrank command: Statistics / Summaries, tables, and tests / Nonparametric tests of hypotheses / Wilcoxon matched-pairs signed-rank test
. signrank t6=t0

Wilcoxon signed-rank test

        sign |      obs   sum ranks    expected
-------------+---------------------------------
    positive |        8          50        27.5
    negative |        2           5        27.5
        zero |        0           0           0
-------------+---------------------------------
         all |       10          55          55

unadjusted variance       96.25
adjustment for ties        0.00
adjustment for zeros       0.00
                     ----------
adjusted variance         96.25

Ho: t6 = t0
             z =   2.293
    Prob > |z| =   0.0218

Our test statistic is 2.293. The p-value is 0.0218. Therefore, we reject the null hypothesis. Thus, we have evidence that median central venous oxygen saturation is different at admission and 6 hours after admission to the ICU. It appears that central venous oxygen saturation is higher after 6 hours in the ICU.
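The z statistic in the signed-rank output follows from the rank sums by the usual normal approximation: with n pairs and no ties or zeros, the expected positive rank sum is n(n+1)/4 and its variance is n(n+1)(2n+1)/24. A minimal sketch of that arithmetic, with illustrative scalar names:

```stata
scalar n_pairs = 10
* Sum of the positive ranks, taken from the output
scalar sum_pos = 50
* Expected rank sum under H0: n(n+1)/4
scalar exp_sum = n_pairs*(n_pairs + 1)/4
* Variance under H0 with no ties or zeros: n(n+1)(2n+1)/24
scalar var_sum = n_pairs*(n_pairs + 1)*(2*n_pairs + 1)/24
scalar z_stat = (sum_pos - exp_sum)/sqrt(var_sum)
scalar p_two  = 2*(1 - normal(abs(z_stat)))
display "expected = " exp_sum
display "variance = " var_sum
display "z = " z_stat
display "Prob > |z| = " p_two
```

This gives an expected sum of 27.5, a variance of 96.25, z of about 2.293, and a two-sided p-value of about 0.0218, in agreement with the output.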
Tutorial: Non-Parametric Tests in Stata
The Wilcoxon Rank-Sum Test
In this tutorial we will use data from the Digitalis Investigation Group (DIG). Please read the provided data documentation before continuing with this tutorial (see DIG_Documentation.pdf). We will replicate one of the analyses from the New England Journal of Medicine paper (see NEJM_DIG.pdf).
Garg R, Gorlin R, Smith T, Yusuf S, for the Digitalis Investigation Group. The effect of digoxin on mortality and morbidity in patients with heart failure. N Engl J Med 1997.
In this trial, patients were randomized to either digoxin or placebo. The Wilcoxon rank-sum test was used to determine if there were any differences between groups in the number of hospitalizations. The data are located in the dig.dta dataset.
1. Examine the distribution of number of hospitalizations by treatment group. Are they similar? Are they symmetric?
The two distributions are very similar. They are not symmetric, but rather right skewed.
[Figure: density histograms of number of hospitalizations, graphed by treatment group (placebo, treatment).]
2. Does the rank-sum test require any assumptions?
Yes. The two samples must be independent and the distributions should have the same general shape.
3. What is the null hypothesis for the rank-sum test? What is the alternative?
The null hypothesis is that the median numbers of hospitalizations for the two treatment groups are identical. Thus, the alternative is that the median numbers of hospitalizations for the two treatment groups are not identical. Since we assume the two distributions have the same general shape, a difference of the medians would imply that the two distributions have the same shape but are shifted in location.
4. Perform a rank-sum test in Stata at significance level alpha. What is your test statistic? Your p-value? Your decision? Your interpretation?
You may use the following drop-down menus to access the ranksum command: Statistics / Summaries, tables, and tests / Nonparametric tests of hypotheses / Wilcoxon rank-sum test

. ranksum nhosp, by(trtmt)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

       trtmt |      obs    rank sum    expected
-------------+---------------------------------
           0 |     3403    11767615    11571902
           1 |     3397    11355786    11551499
-------------+---------------------------------
    combined |     6800    23123400    23123400

unadjusted variance    6.552e+09
adjustment for ties   -3.811e+08
                      ----------
adjusted variance      6.171e+09

Ho: nhosp(trtmt==0) = nhosp(trtmt==1)
             z =   2.491
    Prob > |z| =   0.0127

Our test statistic is z = 2.491. The p-value is 0.0127. Note that this was the p-value reported in the paper. We reject the null hypothesis. Thus, we conclude that we have evidence that the median number of hospitalizations differs by treatment group. In fact, we have evidence that there are significantly more hospitalizations in the placebo group.
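Most of the numbers in the rank-sum output can be recovered from the two group sizes alone. The sketch below recomputes the expected rank sum and the unadjusted variance, then forms z using the ties-adjusted variance copied from the output (the ties adjustment itself cannot be rebuilt without the data). Scalar names are illustrative.

```stata
scalar n0   = 3403
scalar n1   = 3397
scalar ntot = n0 + n1
* Expected rank sum for trtmt == 0: n0*(N + 1)/2
scalar exp0 = n0*(ntot + 1)/2
* Variance before the ties adjustment: n0*n1*(N + 1)/12
scalar var_unadj = n0*n1*(ntot + 1)/12
* Observed rank sum and adjusted variance are taken from the output
scalar z_stat = (11767615 - exp0)/sqrt(6.171e+09)
scalar p_two  = 2*(1 - normal(abs(z_stat)))
display "expected rank sum   = " exp0
display "unadjusted variance = " var_unadj
display "z = " z_stat
display "Prob > |z| = " p_two
```

The observed rank sum for group 0 lies above its expectation, which is why the positive z statistic points to more hospitalizations in that group.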
Tutorial: Simple Linear Regression
Open the dataset hospitaldata.dta.
Exercises:
1. Calculate the Pearson correlation for the percent of patients who say their nurse always communicated well (nursealways) and the percent of patients who would always recommend the hospital (recommendyes).
pwcorr recommendyes nursealways, sig
These two variables are correlated. However, simple linear regression gives us a more intuitive measure of the relationship between the two variables. Specifically, we can state: “For a one percentage point increase in the percent of patients who say their nurse always communicated well, we would, on average, expect to see a corresponding increase of B percentage points in the percent of patients who would always recommend the hospital.” Here B is determined by fitting an appropriate linear regression model.
2. Now that you have established that these variables are correlated, you decide to fit a linear regression model to assess the relationship between recommendyes and nursealways. State your model.
$Y_i$ = percent of patients who always recommend the hospital
$X_i$ = percent of patients who say that the nurse always communicates well
$Y_i = \alpha + \beta X_i + \epsilon_i$
where $\epsilon_i \sim N(0, \sigma^2)$. Equivalently, we could write:
$\mu_{y_i} = E(Y_i \mid X_i) = \alpha + \beta X_i$
where $Y_i \sim N(\mu_{y_i}, \sigma^2)$.
The goal is to estimate $\alpha$ and $\beta$ and to obtain measures of uncertainty for them. We use the method of least squares for estimation.
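For a single predictor, the least squares estimates have a closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of X, and the intercept is the mean of Y minus the slope times the mean of X. The sketch below computes both by hand and compares them with regress. It uses the auto dataset shipped with Stata as a stand-in for hospitaldata.dta, with price as the response and mpg as the predictor; all scalar and generated variable names are illustrative.

```stata
* Example data shipped with Stata, standing in for hospitaldata.dta
sysuse auto, clear
quietly summarize mpg
scalar xbar = r(mean)
quietly summarize price
scalar ybar = r(mean)
* Cross-products and squared deviations
generate double sxy = (mpg - xbar)*(price - ybar)
generate double sxx = (mpg - xbar)^2
quietly summarize sxy
scalar sum_xy = r(sum)
quietly summarize sxx
scalar sum_xx = r(sum)
* Slope and intercept from the closed-form least squares solution
scalar beta_hat  = sum_xy/sum_xx
scalar alpha_hat = ybar - beta_hat*xbar
display "slope     = " beta_hat
display "intercept = " alpha_hat
* The coefficients reported here should match the values above
regress price mpg
```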
3. Construct a scatter plot with nursealways on the x-axis and recommendyes on the y-axis. Use the scatter plot to evaluate the assumptions of simple linear regression.
twoway (scatter recommendyes nursealways)
Assumptions:
• Independent observations
• Y | X is normally distributed
• Homoscedasticity (constant variance)
• Linearity
4. Fit the linear regression model. Provide estimates, confidence intervals, and interpretations of the regression coefficients $\alpha$ and $\beta$.
. regress recommendyes nursealways
Source | SS df MS Number of obs = 3570
-------------+------------------------------ F( 1, 3568) = 2723.72
Model | 144368.851 1 144368.851 Prob > F = 0.0000
Residual | 189118.972 3568 53.0041962 R-squared = 0.4329
-------------+------------------------------ Adj R-squared = 0.4327
Total | 333487.823 3569 93.4401297 Root MSE = 7.2804
------------------------------------------------------------------------------
recommendyes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nursealways | 1.159487 .0222169 52.19 0.000 1.115928 1.203046
_cons | -19.21559 1.712829 -11.22 0.000 -22.57381 -15.85737
------------------------------------------------------------------------------
Our fitted regression line is:
$Y_i = -19.2 + 1.16 X_i + \varepsilon_i$
where $\varepsilon_i \sim N(0, 7.3^2)$.
The 95% confidence intervals for $\alpha$ and $\beta$, respectively, are (-22.57, -15.86) and (1.12, 1.20).
For a one percentage point increase in the percent of patients reporting that their nurse always communicated well, the corresponding average increase in the percent of patients who would always recommend the hospital is 1.16 percentage points.
$\alpha$ is the mean value of the response $Y_i$ when $X_i = 0$, and for this example it has no meaningful interpretation. (However, it is necessary for constructing the regression line and making subsequent predictions.)
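The confidence intervals in the regress output can be reproduced by hand as estimate ± t × standard error. The following is a minimal sketch that copies the coefficients and standard errors from the output above and uses the 3568 residual degrees of freedom; the scalar name tcrit is only illustrative.

```stata
* critical value t(0.975; 3568)
scalar tcrit = invttail(3568, 0.025)

* 95% CI for the slope (nursealways)
display 1.159487 - tcrit*.0222169
display 1.159487 + tcrit*.0222169

* 95% CI for the intercept (_cons)
display -19.21559 - tcrit*1.712829
display -19.21559 + tcrit*1.712829
```

The results should agree with the [95% Conf. Interval] columns of the regression table up to rounding.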
5. Test the hypothesis that $H_0: \beta = 0$ versus the alternative that $H_A: \beta \neq 0$.
We find that $\hat{\beta} = 1.16$, $se(\hat{\beta}) = 0.02$, and $t = 52.2$. Under $H_0$, $t = \hat{\beta}/se(\hat{\beta}) \sim t_{3570-2}$, and our p-value < 0.0001. Therefore, we reject the null hypothesis and conclude that the percent of patients who say a nurse always communicates well is positively correlated with the percent of patients who would always recommend a hospital.
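As a check on the test above, the t statistic and its two-sided p-value can be computed directly from the reported estimate and standard error. This is a sketch using values copied from the regression table; tstat is an illustrative scalar name.

```stata
* t statistic = estimate / standard error, with n - 2 = 3568 degrees of freedom
scalar tstat = 1.159487/.0222169
display tstat

* two-sided p-value: twice the upper-tail probability beyond |t|
display 2*ttail(3568, abs(tstat))
```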
6. What is the value of $R^2$? Interpret this quantity.
0.433
43% of the variability among the observed values of recommendyes is explained by the linear relationship with nursealways.
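To see where this number comes from, $R^2$ can be rebuilt from the sums of squares in the ANOVA block of the regress output. The sketch below copies those values; the last two commands assume the hospital dataset is still in memory.

```stata
* R-squared = Model SS / Total SS
display 144368.851/333487.823

* equivalently, 1 - Residual SS / Total SS
display 1 - 189118.972/333487.823

* in simple linear regression, R-squared is the squared Pearson correlation
correlate recommendyes nursealways
display r(rho)^2
```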
7. Examine a residual plot. Using $R^2$ and the plot, does the model appear to fit well? (Are there any outliers?)
rvfplot
rvpplot nursealways
We don’t see any strong trends or outliers in the residual plots.
8. Using the regression line, predict the expected percent of patients who always recommend the hospital when the reported percent of patients who say nurses always communicate well is 80%. Construct the corresponding 95% confidence interval.
Denote $\hat{Y}_{80}$ as the predicted average percent of patients who always recommend a hospital among hospitals where 80% of patients report that nurses always communicate well.
$\hat{Y}_{80} = -19.2 + 1.16 \times 80 = 73.6$
. lincom _cons + 80*nursealways
( 1) 80*nursealways + _cons = 0
------------------------------------------------------------------------------
recommendyes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 73.54339 .1399631 525.45 0.000 73.26897 73.8178
------------------------------------------------------------------------------
A 95% confidence interval for the predicted mean $\hat{Y}_{80}$ is (73.27, 73.82).
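The same point estimate can be checked by hand, and margins offers an alternative to lincom. This is a sketch under two assumptions: the hospital dataset is in memory, and the installed Stata is version 11 or later (margins is not available in earlier releases).

```stata
* point estimate by hand from the unrounded coefficients
display -19.21559 + 80*1.159487

* refit the model, then request the predicted mean at nursealways = 80
regress recommendyes nursealways
margins, at(nursealways = 80)
```

The margins estimate and standard error should match the lincom output; the interval may differ slightly in the last digits depending on whether the release uses a t or a normal reference distribution.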
9. For a new hospital with 80% of patients reporting that nurses always communicate well, predict the percent of patients who will always recommend the hospital. Construct the corresponding 95% confidence interval.
Denote $\tilde{Y}_{80}$ as the predicted percent of patients who always recommend the hospital in a new hospital where 80% of patients report that nurses always communicate well.
$\tilde{Y}_{80} = 73.54339$. To find a confidence interval, we need to account for the additional uncertainty associated with predicting a new outcome.
$se(\tilde{Y}_{80}) = \sqrt{\widehat{var}(\hat{Y}_{80}) + \hat{\sigma}^2} = \sqrt{0.1399631^2 + 7.2804^2} = 7.281745$
Here 0.1399631 is the standard error of the estimated mean at $X = 80$ (from lincom) and 7.2804 is the Root MSE from the regression output.
. di 73.54339 - invttail(3568, 0.025)*7.281745
59.266589
. di 73.54339 + invttail(3568, 0.025)*7.281745
87.820191
So, a 95% confidence interval for $\tilde{Y}_{80}$ is $73.54339 \pm t_{3568, 0.975} \times 7.281745 = (59.27, 87.82)$.
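Stata can also produce the standard error of a new prediction directly through the stdf option of predict. The following is a sketch, assuming the hospital dataset is in memory; the variable names yhat, se_f, pi_lo, and pi_hi are hypothetical.

```stata
* by-hand standard error of the forecast, as in the calculation above
display sqrt(.1399631^2 + 7.2804^2)

regress recommendyes nursealways

* fitted values and standard error of the forecast for every hospital
predict yhat, xb
predict se_f, stdf

* 95% limits for a new observation at each observed value of nursealways
generate pi_lo = yhat - invttail(e(df_r), 0.025)*se_f
generate pi_hi = yhat + invttail(e(df_r), 0.025)*se_f

* inspect hospitals with nursealways equal to 80, if any exist in the data
list nursealways yhat se_f pi_lo pi_hi if nursealways == 80
```

If the dataset contains a hospital with nursealways equal to 80, its limits should match the interval computed by hand.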
Indicator Variables and Regression
Suppose a hospital is trying to set a benchmark goal of having patients report that nurses always communicate well at least 75% of the time. We now define a nurse communication indicator variable and use simple linear regression to further examine the relationship between nurse communication and the percentage of patients always recommending the hospital.
Open the dataset hospitaldata.dta.
Exercises:
1. Generate a new variable, highnurse, that equals 1 if a hospital had nursealways ≥ 75%, and equals 0 if nursealways < 75%.
gen highnurse = .
replace highnurse = 1 if nursealways >= 75 & nursealways <= 100
replace highnurse = 0 if nursealways < 75
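The three commands above can be written more compactly with a logical expression, and it is worth checking the coding afterwards. A minimal sketch: highnurse2 is a hypothetical variable name used only for comparison, and unlike the original code this version does not set values of nursealways above 100 to missing.

```stata
* a logical expression evaluates to 1 or 0; the if qualifier keeps missing values missing
generate byte highnurse2 = (nursealways >= 75) if !missing(nursealways)

* the two codings should agree on every hospital
tabulate highnurse highnurse2, missing

* confirm the cutoff: the ranges of nursealways should not overlap across groups
summarize nursealways if highnurse == 0
summarize nursealways if highnurse == 1
```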
2. State your model and evaluate the model assumptions.
$Y_i$ = percent of patients who always recommend the hospital
$D_i$ = 1 if at least 75% of patients at the hospital report that nurses always communicate well, and 0 otherwise
$Y_i = \alpha + \beta D_i + \varepsilon_i$
where $\varepsilon_i \sim N(0, \sigma^2)$.
The model is identical to a one-way ANOVA; therefore, the assumptions we make are the same. When we have only two groups, the assumptions are identical to those of the two-sample t-test with equal variances.
3. Fit the model.
xi: regress recommendyes i.highnurse
or
regress recommendyes highnurse
Source | SS df MS Number of obs = 3570
-------------+------------------------------ F( 1, 3568) = 1004.37
Model | 73254.0735 1 73254.0735 Prob > F = 0.0000
Residual | 260233.749 3568 72.9354678 R-squared = 0.2197
-------------+------------------------------ Adj R-squared = 0.2194
Total | 333487.823 3569 93.4401297 Root MSE = 8.5402
------------------------------------------------------------------------------
recommendyes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
highnurse | 9.980834 .3149346 31.69 0.000 9.363364 10.5983
_cons | 62.86486 .2653319 236.93 0.000 62.34465 63.38508
------------------------------------------------------------------------------
So, our fitted model is $Y_i = 62.9 + 10.0 D_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, 8.5^2)$.
4. Interpret the coefficients.
$\hat{\alpha} = 62.9$ estimates $E(Y_i \mid D_i = 0)$. The average percent of patients who always recommend a hospital when less than 75% of patients say nurses always communicated well is 62.9%.
$\hat{\beta} = 10.0$ estimates $E(Y_i \mid D_i = 1) - E(Y_i \mid D_i = 0)$. Comparing hospitals where at least 75% of patients say nurses always communicated well with those where less than 75% of patients say so, the average difference in the percent of patients who always recommend a hospital is 10.0 percentage points.
$\hat{\alpha} + \hat{\beta} = 72.8$ estimates $E(Y_i \mid D_i = 1)$. The average percent of patients who always recommend a hospital when at least 75% of patients say nurses always communicated well is 72.8%.
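The sum of the two coefficients, together with a standard error and confidence interval, can be obtained with lincom after fitting the model. A sketch, assuming the hospital dataset with highnurse is in memory:

```stata
regress recommendyes highnurse

* estimated mean of recommendyes for hospitals with highnurse = 1
lincom _cons + highnurse

* the two group means computed directly, for comparison
tabulate highnurse, summarize(recommendyes)
```

With a single binary covariate, the intercept and the intercept plus slope should equal the two sample means.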
5. Test the null hypothesis that there is no difference in the average percent of patients who always recommend a hospital between hospitals with less than 75% and hospitals with at least 75% of patients reporting that nurses always communicate well.
We test $H_0: \beta = 0$ versus $H_A: \beta \neq 0$ using a two-sided test at the 0.05 level of significance.
We find that $\hat{\beta} = 10.0$, $se(\hat{\beta}) = 0.3$, and $t = 31.7$. Under $H_0$, $t \sim t_{n-2}$, and p < 0.0001. We conclude that the average percent of patients who always recommend a hospital is greater when at least 75% of patients report that nurses always communicate well.
6. Compare the results of the test above to a two-sample t-test with equal variances.
. ttest recommendyes, by(highnurse)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
0 | 1036 62.86486 .272132 8.759099 62.33087 63.39886
1 | 2534 72.8457 .1678457 8.449162 72.51657 73.17483
---------+--------------------------------------------------------------------
combined | 3570 69.9493 .1617829 9.666443 69.6321 70.2665
---------+--------------------------------------------------------------------
diff | -9.980834 .3149346 -10.5983 -9.363364
------------------------------------------------------------------------------
diff = mean(0) - mean(1) t = -31.6918
Ho: diff = 0 degrees of freedom = 3568
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000
You should notice some striking similarities!
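The similarities can be made explicit with two short calculations, using only numbers copied from the two outputs above: the squared t statistic reproduces the regression F statistic, and the pooled standard deviation from the two groups reproduces the Root MSE.

```stata
* squared t statistic from the t-test; compare with F(1, 3568) = 1004.37
display (-31.6918)^2

* pooled standard deviation across the two groups; compare with Root MSE = 8.5402
display sqrt((1035*8.759099^2 + 2533*8.449162^2)/3568)
```

The difference in means also has the same magnitude and standard error as the highnurse coefficient; the sign flips only because ttest reports mean(0) - mean(1).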
Multiple Linear Regression
Now, the hospital aims to assess the impact of nurse communication and hospital noise level on the percentage of patients who would always recommend the hospital.
Fit a linear regression model with recommendyes as the outcome and nursealways and quietalways as the covariates.
1. Make a scatter plot of quietalways versus recommendyes.
twoway (scatter recommendyes quietalways)
While the relationship appears linear, note that we cannot assess any of the assumptions of multiple linear regression using this plot.
2. State your model.
$Y_i$ = percent of patients who always recommend the hospital
$X_{1i}$ = percent of patients in a hospital who say that the nurse always communicates well
$X_{2i}$ = percent of patients who report that the hospital is always quiet
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
where $\varepsilon_i \sim N(0, \sigma^2)$. Equivalently, we could write:
$\mu_{y_i} = E(Y_i \mid X_{1i}, X_{2i}) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i}$
where $Y_i \sim N(\mu_{y_i}, \sigma^2)$.
3. Fit the model.
regress recommendyes nursealways quietalways
Source | SS df MS Number of obs = 3570
-------------+------------------------------ F( 2, 3567) = 1363.40
Model | 144484.252 2 72242.126 Prob > F = 0.0000
Residual | 189003.571 3567 52.9867033 R-squared = 0.4333
-------------+------------------------------ Adj R-squared = 0.4329
Total | 333487.823 3569 93.4401297 Root MSE = 7.2792
------------------------------------------------------------------------------
recommendyes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nursealways | 1.133725 .0282517 40.13 0.000 1.078334 1.189116
quietalways | .0229694 .0155642 1.48 0.140 -.0075463 .053485
_cons | -18.58225 1.7655 -10.53 0.000 -22.04374 -15.12075
------------------------------------------------------------------------------
Our fitted model is $Y_i = -18.58 + 1.13 X_{1i} + 0.02 X_{2i} + \varepsilon_i$, where $\varepsilon_i \sim N(0, 7.28^2)$.
4. Evaluate the model assumptions.
The adjusted $R^2$ is 0.43 (compared to 0.43 from the simple linear regression model with only nursealways).
rvfplot
rvpplot nursealways
rvpplot quietalways
5. Interpret the coefficients.
We estimate $\hat{\beta}_1 = 1.13$, with 95% confidence interval (1.08, 1.19). For a one percentage point increase in the percent of patients who say that the nurses always communicate well, we see on average a 1.13 percentage point increase in the percent of patients who would always recommend the hospital, when the percent of patients who say the hospital is always quiet is fixed (does not vary).
We estimate $\hat{\beta}_2 = 0.02$, with 95% confidence interval (-0.01, 0.05). For a one percentage point increase in the percent of patients who say the hospital is always quiet, we see on average a 0.02 percentage point increase in the percent of patients who would always recommend the hospital, fixing the percent of patients who say their nurse always communicates well.
We estimate that $\hat{\alpha} = -18.58$. $\alpha$ is the value of $E(Y_i)$ when $X_{1i}$ and $X_{2i}$ are set to 0. In our dataset, the covariates never drop below 48% and 30% respectively, and therefore $\alpha$ does not have a meaningful interpretation for this study.
6. Suppose we consider a new hospital, where the percentage of patients who say nurses always communicate well is 90% and the percentage of those who say the hospital is always quiet is 70%. What is the expected percent of patients who would always recommend this hospital?
$E(Y_i \mid X_{1i} = 90, X_{2i} = 70) = -18.58 + 1.13 \times 90 + 0.02 \times 70 = 84.5\%$. (With the unrounded coefficients from the regression output, the estimate is 85.1%.)
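Because the slopes are multiplied by 90 and 70, rounding the coefficients before plugging in shifts the answer by about half a percentage point. A sketch of the calculation with the unrounded coefficients, followed by the lincom equivalent (which assumes the hospital dataset is in memory):

```stata
* by hand, using the coefficients as printed in the regression table
display -18.58225 + 90*1.133725 + 70*.0229694

* the same linear combination with a standard error and 95% CI
regress recommendyes nursealways quietalways
lincom _cons + 90*nursealways + 70*quietalways
```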
7. Using the regression results above, perform the following three hypothesis tests at the 0.05 level of significance.
• $H_0: \beta_1 = 0$, $H_A: \beta_1 \neq 0$
$\hat{\beta}_1 = 1.13$, $se(\hat{\beta}_1) = 0.03$, $t = 40.1$. Under $H_0$, $t \sim t_{3570-2-1}$, and p < 0.0001. We reject $H_0$ and conclude that an increase in the percent of patients who say nurses always communicate well is associated with an increase in the percent of patients who always recommend the hospital, fixing the percent of patients who say the hospital is always quiet.
• $H_0: \beta_2 = 0$, $H_A: \beta_2 \neq 0$
$\hat{\beta}_2 = 0.02$, $se(\hat{\beta}_2) = 0.02$, $t = 1.5$. Under $H_0$, $t \sim t_{3570-2-1}$, and p = 0.14. We fail to reject $H_0$ and conclude that we do not have evidence in the data that the percent of patients who say the hospital is always quiet is correlated with the percent of patients who always recommend the hospital, fixing the percent of patients who say that the nurses always communicate well.
• $H_0: \beta_1 = \beta_2 = 0$, $H_A$: at least one of $\beta_1, \beta_2 \neq 0$
. test nursealways quietalways
( 1) nursealways = 0
( 2) quietalways = 0
F( 2, 3567) = 1363.40
Prob > F = 0.0000
Our F-statistic equals 1363.4. Under $H_0$, $F \sim F_{2,3567}$, and p < 0.0001. We reject $H_0$ and conclude that at least one of $\beta_1$ or $\beta_2$ is non-zero.
8. Do we observe any collinearity between $X_{1i}$ and $X_{2i}$? How does this impact the results?
twoway (scatter nursealways quietalways)
Yes, the covariates are collinear. We would likely see an association between $X_{2i}$ and $Y_i$ if $X_{1i}$ were excluded from the model.
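The scatter plot can be supplemented with numerical checks. This is a sketch, assuming the hospital dataset is in memory; it quantifies the correlation between the covariates, reports variance inflation factors, and refits the model without nursealways to see whether quietalways picks up the association.

```stata
* pairwise correlation between the two covariates
correlate nursealways quietalways

* variance inflation factors for the two-covariate model
regress recommendyes nursealways quietalways
estat vif

* quietalways alone: compare its coefficient and p-value with the model above
regress recommendyes quietalways
```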
More Multiple Linear Regression
For those interested in delving a bit deeper into the world of linear regression, a few additional examples are included below. In the first example, you can work through a multiple linear regression model with one binary covariate and one continuous covariate. In the second example, we add an interaction between these covariates to examine effect modification/interaction between covariates in the context of linear regression. It is important to think about how the interpretation of the regression coefficients changes in the presence of an interaction term.
Example 1:
Fit a linear regression model with recommendyes as the outcome and highnurse and quietalways as the covariates.
1. Make a scatter plot with quietalways on the x-axis and recommendyes on the y-axis. Stratify by highnurse when you are plotting, so that you can distinguish between hospitals with highnurse = 0 and highnurse = 1. Overlay a linear prediction line for highnurse = 0 and highnurse = 1.
Via the dropdown menus, go to Graphics → Two-way graph. Within the two-way window, you will need to create four different plots: two scatter plots (go to Basic plots → Scatter and then fill in Y and X variables) and two linear prediction lines (go to Fit plots → Linear Prediction and then fill in Y and X variables). Or, via command line:
twoway (scatter recommendyes quietalways if highnurse==1) ///
(scatter recommendyes quietalways if highnurse==0) ///
(lfit recommendyes quietalways if highnurse==1) ///
(lfit recommendyes quietalways if highnurse==0)
2. State your model.
$Y_i$ = percent of patients who always recommend the hospital
$X_{1i}$ = percent of patients in a hospital who say that the hospital is always quiet
$D_i$ = 1 if at least 75% of patients at the hospital report that nurses always communicate well, and 0 otherwise
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 D_i + \varepsilon_i$
where $\varepsilon_i \sim N(0, \sigma^2)$.
3. Fit the model.
regress recommendyes highnurse quietalways
4. Evaluate the model assumptions.
Suggestions: Check the residual plots to look for outliers and heteroskedasticity. Do the residuals look approximately normal? Patterns in the residual plot could suggest that your model for the mean of the outcome is misspecified (linearity is violated).
5. Interpret the coefficients.
• $\alpha$ - the average percent of patients who always recommend a hospital when highnurse is 0 and quietalways is 0. $\alpha$ does not have a meaningful interpretation for this study since quietalways never drops to 0.
• $\beta_1$ - the average increase in the percent of patients who always recommend a hospital for a one percentage point increase in quietalways, for a given value of highnurse.
• $\beta_2$ - the average increase in the percent of patients who always recommend a hospital for hospitals with highnurse = 1 compared to hospitals with highnurse = 0, fixing quietalways.
• $\alpha + 80\beta_1 + \beta_2$ - the average percent of patients who recommend a hospital with highnurse = 1 and quietalways = 80.
• $\alpha + 80\beta_1$ - the average percent of patients who recommend a hospital with highnurse = 0 and quietalways = 80.
Example 2 - Multiple Linear Regression with an Interaction
Now, we examine whether there is an interaction between highnurse and quietalways on recommendyes. Equivalently, we look for evidence of effect modification of the relationship between quietalways and recommendyes by highnurse.
1. Check out the scatter plot from the previous example. Does the plot suggest that an interaction term might improve the model?
Yes. We can look for evidence of effect modification by comparing the slopes of the overlaid lines in the scatter plot. Because the slopes appear to differ by highnurse, there is evidence of effect modification.
2. State your model.
$Y_i$ = percent of patients who always recommend the hospital
$X_{1i}$ = percent of patients in a hospital who say that the hospital is always quiet
$D_i$ = 1 if at least 75% of patients at the hospital report that nurses always communicate well, and 0 otherwise
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 D_i + \beta_3 X_{1i} D_i + \varepsilon_i$
where $\varepsilon_i \sim N(0, \sigma^2)$.
3. Fit the model.
. xi: regress recommendyes i.highnurse*quietalways
4. Evaluate the model assumptions.
You can use the same approach as the previous question.
5. Interpret the coefficients.
• $\alpha$ - the average percent of patients who always recommend a hospital when highnurse is 0 and quietalways is 0. $\alpha$ does not have a meaningful interpretation for this study since quietalways never drops to 0.
• $\beta_1$ - the average increase in the percent of patients who always recommend a hospital for a one percentage point increase in quietalways, when highnurse = 0.
• $\beta_1 + \beta_3$ - the average increase in the percent of patients who always recommend a hospital for a one percentage point increase in quietalways, when highnurse = 1.
• $\beta_2$ - the average increase in the percent of patients who always recommend a hospital for hospitals with highnurse = 1 compared to hospitals with highnurse = 0, when quietalways equals 0. $\beta_2$ does not have a meaningful interpretation in this analysis. Note that we could have centered the covariate quietalways around its mean, so that this coefficient would be more interpretable.
• $\beta_2 + 70\beta_3$ - the average increase in the percent of patients who always recommend a hospital for hospitals with highnurse = 1 compared to hospitals with highnurse = 0, when quietalways equals 70.
• $\alpha + 80\beta_1 + \beta_2 + 80\beta_3$ - the average percent of patients who recommend a hospital with highnurse = 1 and quietalways = 80.
• $\alpha + 80\beta_1$ - the average percent of patients who recommend a hospital with highnurse = 0 and quietalways = 80.
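Each combination of coefficients listed above can be estimated with lincom. The sketch below builds the product term by hand rather than through xi, so that the coefficient names are predictable; quietXhigh, quiet_c, and quiet_cXhigh are hypothetical variable names, and the hospital dataset with highnurse is assumed to be in memory.

```stata
* interaction model with an explicitly generated product term
generate quietXhigh = quietalways*highnurse
regress recommendyes quietalways highnurse quietXhigh

* slope of quietalways when highnurse = 1 (beta1 + beta3)
lincom quietalways + quietXhigh

* difference between highnurse groups at quietalways = 70 (beta2 + 70*beta3)
lincom highnurse + 70*quietXhigh

* mean response at highnurse = 1 and quietalways = 80
lincom _cons + 80*quietalways + highnurse + 80*quietXhigh

* centering quietalways makes the highnurse coefficient the group difference
* at the mean of quietalways instead of at quietalways = 0
summarize quietalways, meanonly
generate quiet_c = quietalways - r(mean)
generate quiet_cXhigh = quiet_c*highnurse
regress recommendyes quiet_c highnurse quiet_cXhigh
```

Centering changes the intercept and the highnurse coefficient but leaves the two slope coefficients and the fitted values unchanged.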
Simple Logistic Regression
Think back to Week 7, when we used the sample from the California Health Indicator Survey (CHIS) to examine the relationship between poverty and visiting the doctor within the past 12 months. This week, we use logistic regression to examine this relationship. Open the chis healthdisparities.dta dataset.
Fit a logistic regression model with visiting the doctor in the past 12 months as the outcome and the poverty indicator as your covariate.
1. List the assumptions for performing logistic regression.
We assume the responses are Bernoulli, and we assume linearity in the parameters on the logit scale.
2. State your model.
Define $Y_i = 1$ if individual $i$ visited the doctor in the last 12 months, 0 otherwise. Define $X_i = 1$ if the individual is above the poverty line, 0 otherwise. Then, our model is $Y_i \sim \text{Bernoulli}(p_i)$, where
$\text{logit}(p_i) = \alpha + \beta X_i$
3. Fit the model.
. logit doctor nopov
Iteration 0: log likelihood = -247.4035
Iteration 1: log likelihood = -245.14765
Iteration 2: log likelihood = -245.08244
Iteration 3: log likelihood = -245.08242
Logistic regression Number of obs = 500
LR chi2(1) = 4.64
Prob > chi2 = 0.0312
Log likelihood = -245.08242 Pseudo R2 = 0.0094
------------------------------------------------------------------------------
doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nopov | .6713351 .3013476 2.23 0.026 .0807047 1.261965
_cons | .83975 .2745156 3.06 0.002 .3017093 1.377791
------------------------------------------------------------------------------
The fitted regression model is $\text{logit}(\hat{p}_i) = 0.840 + 0.671 X_i$.
4. Interpret the coefficients.
• $\alpha$ = log(odds of visiting the doctor when $X_i = 0$)
• $\beta$ = log(odds ratio of visiting the doctor for no poverty versus poverty) = log(odds of visiting the doctor when $X_i = 1$) - log(odds of visiting the doctor when $X_i = 0$)
• $\alpha + \beta$ = log(odds of visiting the doctor when $X_i = 1$)
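Exponentiating the coefficients moves these interpretations from the log-odds scale to the odds scale. A minimal sketch using the estimates copied from the logit output above:

```stata
* odds of visiting the doctor when X = 0: exp(alpha)
display exp(.83975)

* odds of visiting the doctor when X = 1: exp(alpha + beta)
display exp(.83975 + .6713351)

* odds ratio: exp(beta), which equals the ratio of the two odds above
display exp(.6713351)
display exp(.83975 + .6713351)/exp(.83975)
```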
5. Provide an OR and a 95% confidence interval.
Hard way: $\exp(\hat{\beta}) = 1.957$ with 95% CI (exp(0.0807047), exp(1.261965)) = (1.084, 3.532).
Easy way:
. lincom nopov, eform
( 1) [doctor]nopov = 0
------------------------------------------------------------------------------
doctor | exp(b) Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 1.956848 .5896914 2.23 0.026 1.084051 3.532357
------------------------------------------------------------------------------
Another easy way:
. logistic doctor nopov
Logistic regression Number of obs = 500
LR chi2(1) = 4.64
Prob > chi2 = 0.0312
Log likelihood = -245.08242 Pseudo R2 = 0.0094
------------------------------------------------------------------------------
doctor | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nopov | 1.956848 .5896914 2.23 0.026 1.084051 3.532357
_cons | 2.315788 .63572 3.06 0.002 1.352168 3.96613
------------------------------------------------------------------------------
6. What is the probability of visiting the doctor in the past 12 months for those above poverty? Below poverty?
. predict phat
(option pr assumed; Pr(doctor))
Below poverty: 0.6984126
Above poverty: 0.819222
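These two probabilities follow from the fitted coefficients through the inverse logit, $p = e^{\eta}/(1 + e^{\eta})$. A sketch using the estimates from the logit output; the last command assumes phat has been created with predict as above.

```stata
* below poverty (X = 0): inverse logit of the intercept
display invlogit(.83975)

* above poverty (X = 1): inverse logit of intercept + slope
display invlogit(.83975 + .6713351)

* the same two values, read off the predicted probabilities
tabulate nopov, summarize(phat)
```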
7. Test the hypothesis that $H_0: \beta = 0$ versus $H_A: \beta \neq 0$ at the 0.05 level of significance.
$\hat{\beta} = 0.6713351$, $se(\hat{\beta}) = 0.3013476$, $Z = 2.23$.
Under $H_0$, $Z \sim N(0, 1)$, and p = 0.026. We reject $H_0$ and conclude that being above the poverty level is associated with higher odds of visiting the doctor within the past 12 months.
Note that the 95% CI for $\beta$ excludes 0 and the 95% CI for the odds ratio excludes 1, leading to the same conclusion (as will always be the case).
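The Wald statistic and its p-value can be reproduced from the reported estimate and standard error. A sketch with values copied from the logit output; the scalar name zstat is illustrative.

```stata
* Wald z statistic = estimate / standard error
scalar zstat = .6713351/.3013476
display zstat

* two-sided p-value from the standard normal distribution
display 2*normal(-abs(zstat))
```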
8. Compare your results to the 2 × 2 table analysis from week 7.
Yes, our results match up to the contingency table analysis, as they should! The beauty of logistic regression is in its flexibility, as we see next.
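One way to rerun the contingency table analysis for this comparison is sketched below. The Week 7 commands are not reproduced in this section, so the exact commands used there are an assumption; the sketch uses tabulate for the chi-squared test and cc for the odds ratio, with the woolf option requesting a Woolf-type confidence interval.

```stata
* 2 x 2 table with column percentages and Pearson chi-squared test
tabulate doctor nopov, chi2 column

* odds ratio for doctor (outcome) by nopov (exposure)
cc doctor nopov, woolf
```

The odds ratio from cc should equal the exp(b) value reported by logistic.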
Multiple Logistic Regression
Now, we expand the regression model, adding in more covariates. Add gender to your model.
1. First, assume no effect modification by gender. State the model.
Define $Y_i = 1$ if individual $i$ visited the doctor in the last 12 months, 0 otherwise; $X_{1i} = 1$ if the individual is above the poverty line, 0 otherwise; $X_{2i} = 1$ if female, 0 if male. Then, our model is $Y_i \sim \text{Bernoulli}(p_i)$, where
$\text{logit}(p_i) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i}$
2. Fit the model.
. logit doctor nopov female
Iteration 0: log likelihood = -247.4035
Iteration 1: log likelihood = -229.36247
Iteration 2: log likelihood = -228.56747
Iteration 3: log likelihood = -228.56462
Iteration 4: log likelihood = -228.56462
Logistic regression Number of obs = 500
LR chi2(2) = 37.68
Prob > chi2 = 0.0000
Log likelihood = -228.56462 Pseudo R2 = 0.0761
------------------------------------------------------------------------------
doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nopov | .997763 .3245721 3.07 0.002 .3616134 1.633913
female | 1.384033 .2549714 5.43 0.000 .8842978 1.883767
_cons | -.0321554 .3246122 -0.10 0.921 -.6683837 .6040729
------------------------------------------------------------------------------
The fitted regression model is $\text{logit}(\hat{p}_i) = -0.032 + 0.998 X_{1i} + 1.384 X_{2i}$.
3. Is there evidence of effect modification by gender?
Now, we fit the model
$\text{logit}(p_i) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i}$
and test whether $\beta_3 = 0$.
. xi: logit doctor i.nopov*female
i.nopov _Inopov_0-1 (naturally coded; _Inopov_0 omitted)
i.nopov*female _InopXfemal_# (coded as above)
Iteration 0: log likelihood = -247.4035
Iteration 1: log likelihood = -229.89318
Iteration 2: log likelihood = -228.56544
Iteration 3: log likelihood = -228.55916
Iteration 4: log likelihood = -228.55916
Logistic regression Number of obs = 500
LR chi2(3) = 37.69
Prob > chi2 = 0.0000
Log likelihood = -228.55916 Pseudo R2 = 0.0762
-------------------------------------------------------------------------------
doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------------+----------------------------------------------------------------
_Inopov_1 | .9619012 .472267 2.04 0.042 .036275 1.887528
female | 1.329136 .5835434 2.28 0.023 .1854119 2.47286
_InopXfemal_1 | .0678287 .6489728 0.10 0.917 -1.204135 1.339792
_cons | -3.76e-15 .4472136 -0.00 1.000 -.8765225 .8765225
-------------------------------------------------------------------------------
There is no evidence of effect modification by gender: the estimated interaction coefficient is 0.068, with p = 0.917.
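A likelihood ratio test gives the same answer as the Wald test on the interaction term. The sketch below first computes the statistic by hand from the two log likelihoods printed above, then repeats the comparison with lrtest; the stored-estimate names main and inter are hypothetical.

```stata
* LR statistic = 2 * (log likelihood with interaction - log likelihood without)
scalar lr = 2*(-228.55916 - (-228.56462))
display lr

* p-value from a chi-squared distribution with 1 degree of freedom
display chi2tail(1, lr)

* the same test using stored estimates
logit doctor nopov female
estimates store main
xi: logit doctor i.nopov*female
estimates store inter
lrtest inter main
```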
4. Is there evidence of confounding by gender?
Without gender: $\hat{\beta}_1 = 0.671$
With gender: $\hat{\beta}_1 = 0.998$
Yes, there is evidence of confounding by gender.
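The size of the change can be quantified as a percent change in the coefficient. This is a sketch using the estimates from the two logit outputs; the commonly used threshold of roughly 10% for flagging confounding is a rule of thumb assumed here, not something stated in this tutorial.

```stata
* percent change in the nopov coefficient after adding female
display 100*(.997763 - .6713351)/.6713351

* the corresponding odds ratios, unadjusted and adjusted for gender
display exp(.6713351)
display exp(.997763)
```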
Logistic Regression with a Continuous Covariate
As in the previous tutorial, we fit a model to examine the relationship between visiting the doctor in the past 12 months and whether an individual is above or below the federal poverty level, conditional on gender. We fit a logistic regression model with doctor as the outcome, and with nopov and female as covariates. But now we add a continuous covariate, age, to the model!
Open the chis healthdisparities.dta dataset.
1. Assume that, conditional on poverty status and gender, the probability of visiting the doctor varies linearly on the logit scale with age. State your model.
Define $Y_i = 1$ if individual $i$ visited the doctor in the last 12 months, 0 otherwise; $X_{1i} = 1$ if the individual is above the poverty line, 0 otherwise; $X_{2i} = 1$ if female, 0 if male; and $X_{3i}$ = age in years. Then, our model is $Y_i \sim \text{Bernoulli}(p_i)$, where
$\text{logit}(p_i) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i}$
2. Fit the model.
. logit doctor nopov female age
Iteration 0: log likelihood = -247.4035
Iteration 1: log likelihood = -226.31928
Iteration 2: log likelihood = -225.22574
Iteration 3: log likelihood = -225.2222
Iteration 4: log likelihood = -225.2222
Logistic regression Number of obs = 500
LR chi2(3) = 44.36
Prob > chi2 = 0.0000
Log likelihood = -225.2222 Pseudo R2 = 0.0897
------------------------------------------------------------------------------
doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nopov | .9882978 .3271762 3.02 0.003 .3470442 1.629551
female | 1.334568 .2568062 5.20 0.000 .8312367 1.837899
age | .0187776 .0074311 2.53 0.012 .0042129 .0333423
_cons | -.8066067 .4469253 -1.80 0.071 -1.682564 .0693507
------------------------------------------------------------------------------
The fitted model is $\text{logit}(\hat{p}_i) = -0.807 + 0.988 X_{1i} + 1.335 X_{2i} + 0.019 X_{3i}$
3. Is there evidence that age is a confounder of the doctor-poverty relationship? Would you expect age to be a confounder?
With gender only: $\hat{\beta}_1 = 0.998$
With age and gender: $\hat{\beta}_1 = 0.988$
No, there is no evidence of confounding by age.
4. Interpret the odds ratio.
. lincom nopov, eform
( 1) [doctor]nopov = 0
------------------------------------------------------------------------------
doctor | exp(b) Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 2.686657 .8790104 3.02 0.003 1.414879 5.101586
------------------------------------------------------------------------------
Conditioning on age and gender, the odds of visiting the doctor are 2.69 times as high (95% CI 1.41, 5.10) in those above the poverty line as in those below the poverty line.
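The same approach gives an odds ratio for the continuous covariate. Because a one-year change in age is small, a ten-year contrast is often easier to interpret; the choice of ten years is only an illustration. A sketch, assuming the model with nopov, female, and age is the most recently fitted model:

```stata
* odds ratio for a 10-year increase in age, holding poverty and gender fixed
lincom 10*age, eform

* by hand from the printed coefficient
display exp(10*.0187776)

* all three odds ratios at once
logistic doctor nopov female age
```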
5. Test for an association between poverty and visiting the doctor in the past 12 months, conditioning on age and gender, at the 0.05 level of significance.
We test $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$.
$\hat\beta_1 = .988$, $se(\hat\beta_1) = .327$, $Z = 3.02$.
Under $H_0$, $Z \sim N(0, 1)$, and $p = 0.003$. (Note: the 95% CI for $\beta_1$ excludes 0, and the 95% CI for the OR therefore excludes 1.)
We reject $H_0$ and conclude that there is evidence in the data that being above the poverty line is associated with higher odds of visiting the doctor in the past 12 months, conditioning on age and gender.
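The Z statistic and p-value for this test can be recomputed by hand from the coefficient and its standard error. A minimal Python sketch, assuming the standard normal reference distribution used above (variable names are ours):

```python
import math

# Estimate and standard error for nopov, copied from the logit output.
beta_hat, se = 0.9882978, 0.3271762

# Wald statistic: estimate divided by its standard error.
z = beta_hat / se

# Two-sided p-value under N(0, 1): P(|Z| > |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"Z = {z:.2f}, p = {p_value:.3f}")
```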
6. Predict the probability of visiting the doctor for everyone in your dataset.
predict phat
7. What is the predicted probability of visiting the doctor for a 45-year-old woman above the poverty level? Below the poverty level?
. lincom _cons + age*45 + female + nopov
( 1) [doctor]nopov + [doctor]female + 45*[doctor]age + [doctor]_cons = 0
------------------------------------------------------------------------------
doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 2.361251 .2244562 10.52 0.000 1.921325 2.801177
------------------------------------------------------------------------------
. lincom _cons + age*45 + female + nopov*0
( 1) [doctor]female + 45*[doctor]age + [doctor]_cons = 0
------------------------------------------------------------------------------
doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 1.372953 .3101936 4.43 0.000 .7649849 1.980921
------------------------------------------------------------------------------
. di invlogit(2.361251 )
.91382437
. di invlogit(1.372953)
.79785684
Above the poverty line: 91.4%
Below the poverty line: 79.8%
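The same predictions can be rebuilt directly from the fitted coefficients, which makes the role of each term in the linear predictor explicit. The sketch below is an illustration in Python, not part of the Stata session; the function name predicted_probability is ours.

```python
import math

# Coefficients copied from the logit output above.
coef = {"_cons": -0.8066067, "nopov": 0.9882978, "female": 1.334568, "age": 0.0187776}

def predicted_probability(nopov, female, age):
    """Inverse logit of the linear predictor for one covariate pattern."""
    xb = (coef["_cons"] + coef["nopov"] * nopov
          + coef["female"] * female + coef["age"] * age)
    return 1.0 / (1.0 + math.exp(-xb))

# 45-year-old woman, above and below the poverty line.
p_above = predicted_probability(nopov=1, female=1, age=45)
p_below = predicted_probability(nopov=0, female=1, age=45)
print(f"above: {p_above:.3f}, below: {p_below:.3f}")
```

The two results should agree with the invlogit values displayed by Stata, up to rounding of the coefficients.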
Recap + Model Fit
Open the chis healthdisparities.dta dataset.
1. After fitting and considering several models, what are our conclusions about the relationship between poverty and visiting the doctor in the past 12 months?
Those below the poverty line appear less likely to visit the doctor in the past 12 months.
2. Compare the fit of these models.
There are several options for assessing the fit of a logistic regression model. We don’t have time to look at all of them (if you are interested, look at Hosmer-Lemeshow and deviance). But, to relate back to week 3, let’s look at the ROC curve.
Fit the logistic regression model with doctor as the outcome and nopov, female, and age as covariates.
We choose a cut-off c and construct a classification table (Stata classifies an observation as + when its predicted probability is at least c):
               Yi = 1     Yi = 0
  pi >= c      Correct    False +
  pi < c       False -    Correct
For example, when c = 0.8:
. estat classification, cutoff(0.8)
Logistic model for doctor
-------- True --------
Classified | D ~D | Total
-----------+--------------------------+-----------
+ | 243 25 | 268
- | 159 73 | 232
-----------+--------------------------+-----------
Total | 402 98 | 500
Classified + if predicted Pr(D) >= .8
True D defined as doctor != 0
--------------------------------------------------
Sensitivity Pr( +| D) 60.45%
Specificity Pr( -|~D) 74.49%
Positive predictive value Pr( D| +) 90.67%
Negative predictive value Pr(~D| -) 31.47%
--------------------------------------------------
False + rate for true ~D Pr( +|~D) 25.51%
False - rate for true D Pr( -| D) 39.55%
False + rate for classified + Pr(~D| +) 9.33%
False - rate for classified - Pr( D| -) 68.53%
--------------------------------------------------
Correctly classified 63.20%
--------------------------------------------------
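Every percentage in the classification output is a ratio of cells in the 2x2 table. As a check, this minimal Python sketch recomputes the main ones from the four counts above (the variable names are ours):

```python
# Cell counts from the classification table above (cutoff 0.8).
tp, fp = 243, 25   # classified +: true D, true ~D
fn, tn = 159, 73   # classified -: true D, true ~D

sensitivity = tp / (tp + fn)                 # Pr(+ | D)
specificity = tn / (tn + fp)                 # Pr(- | ~D)
ppv = tp / (tp + fp)                         # Pr(D | +)
npv = tn / (tn + fn)                         # Pr(~D | -)
accuracy = (tp + tn) / (tp + fp + fn + tn)   # correctly classified

for name, value in [("Sensitivity", sensitivity), ("Specificity", specificity),
                    ("PPV", ppv), ("NPV", npv), ("Correctly classified", accuracy)]:
    print(f"{name}: {value:.2%}")
```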
To get the full ROC curve (and the area under the ROC curve), try lroc.
Plot the ROC curve for the three models above to visualize the improved classification of the more complex models. We could likely add more covariates to further improve the discriminatory ability of the model.
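The area under the ROC curve has a direct interpretation: it is the probability that a randomly chosen person who visited the doctor has a higher predicted probability than a randomly chosen person who did not, with ties counted as one half. The Python sketch below illustrates that definition on a small hypothetical dataset (not the course data); the function name auc is ours, and in Stata lroc reports this quantity for the fitted model.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Each (case, non-case) pair scores 1 if the case ranks higher, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes and predicted probabilities, for illustration only.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
print(round(auc(labels, scores), 3))
```

An area of 0.5 corresponds to a model that discriminates no better than chance, and 1.0 to perfect discrimination.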
PH207X Fall 2012 Survey Data Demo Page 1 of 11
Objectives for Survey Results Module 1 – Basic Statistics
Number of respondents at baseline and follow-up
Number of participants in longitudinal dataset
Differences between baseline and longitudinal dataset
I. Number of respondents at baseline and follow-up
a. Baseline survey – 9157 respondents
b. Follow-up survey – 3700 respondents
II. Number of participants in longitudinal dataset
a. 596 participants provided unique identifiers in both the baseline and follow-up survey
III. Differences between baseline and longitudinal dataset
The tables below compare those who responded at the baseline survey with those who were included in the longitudinal dataset, for some selected variables which we will be using later in the demo.

                     Baseline (n=9157)    Longitudinal (n=596)
Sex
  female             3984 (44%)           282 (47%)
  male               4521 (49%)           310 (52%)
  missing            652 (7%)             4 (0.7%)
Age                  32±10.1              33.78±10.3
Computer
  Mac                1429 (16%)           84 (14%)
  PC                 7080 (77%)           509 (85%)
  missing            648 (7%)             3 (0.5%)
Aptitude
  math               3773 (41%)           287 (48%)
  verbal             4735 (52%)           305 (51%)
  missing            649 (7%)             4 (0.7%)
Your Handedness
  righty             7582 (83%)           531 (89%)
  lefty              547 (6%)             44 (7%)
  ambidextrous       351 (4%)             18 (3%)
Objectives for Survey Results Module 2 – Factors Related to Mac vs PC Use
Choose a study design to examine the association between math and verbal aptitude and Mac/PC use.
Calculate the appropriate measure of association comparing math versus verbal aptitude and Mac/PC use.
Construct your own analysis to study the association between handedness and Mac/PC use.
I. Choose a study design
a. Exposure: Math and Verbal aptitude
b. Outcome: Mac/PC Use
c. Study design: Cross-sectional study
II. Calculate the appropriate measure of association comparing the math and verbal aptitude and Mac/PC use.
a. Measure of association – Risk Ratio or Odds Ratio
b. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: macpc
   iii. Exposed variable: aptitude
   iv. On the options tab, check box for “Report odds ratio”
   v. Submit
c. Command Window Syntax: cs macpc aptitude, or

                 | aptitude               |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |        39          45  |         84
        Noncases |       248         258  |        506
-----------------+------------------------+------------
           Total |       287         303  |        590
                 |                        |
            Risk |  .1358885    .1485149  |   .1423729
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |        -.0126263       |   -.0689729    .0437202
      Risk ratio |         .9149826       |    .6150247    1.361235
 Prev. frac. ex. |         .0850174       |   -.3612351    .3849753
 Prev. frac. pop |         .0413559       |
      Odds ratio |         .9016129       |    .5690484    1.428626 (Cornfield)
                 +-------------------------------------------------
                               chi2(1) =     0.19  Pr>chi2 = 0.6609

People with stronger math abilities were about 9% less likely to use a Mac compared to people with stronger verbal abilities (risk ratio 0.91). The 95% confidence interval for the risk ratio, 0.62 to 1.36, includes 1, so this difference is not statistically significant.
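The point estimates in the cs output follow directly from the four cell counts. The Python sketch below recomputes the risks, risk ratio and odds ratio, and a large-sample 95% confidence interval for the risk ratio on the log scale. The log-scale interval is our assumption about the method; it appears to reproduce the limits Stata prints for this table. The Cornfield interval for the odds ratio is not reproduced here.

```python
import math
from statistics import NormalDist

# 2x2 counts from the cs output (exposed = math aptitude, case = Mac user).
a, b = 39, 45      # cases: exposed, unexposed
c, d = 248, 258    # noncases: exposed, unexposed

risk_exposed = a / (a + c)
risk_unexposed = b / (b + d)
risk_ratio = risk_exposed / risk_unexposed
odds_ratio = (a * d) / (b * c)

# Large-sample standard error of log(RR), then a 95% CI back-transformed by exp().
se_log_rr = math.sqrt(1 / a - 1 / (a + c) + 1 / b - 1 / (b + d))
z = NormalDist().inv_cdf(0.975)
rr_low = math.exp(math.log(risk_ratio) - z * se_log_rr)
rr_high = math.exp(math.log(risk_ratio) + z * se_log_rr)

print(f"RR = {risk_ratio:.4f} ({rr_low:.4f}, {rr_high:.4f}), OR = {odds_ratio:.4f}")
```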
III. Construct your own analysis to study the association between Mac/PC use and handedness.
a. Measure of association – Risk Ratio or Odds Ratio
b. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: macpc
   iii. Exposed variable: lefty
   iv. On the options tab, check box for “Report odds ratio”
   v. Submit
c. Command Window Syntax: cs macpc lefty, or

                 | lefty                  |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |         6          77  |         83
        Noncases |        38         452  |        490
-----------------+------------------------+------------
           Total |        44         529  |        573
                 |                        |
            Risk |  .1363636    .1455577  |   .1448517
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         -.009194       |   -.1149534    .0965653
      Risk ratio |         .9368359       |    .4330182    2.026847
 Prev. frac. ex. |         .0631641       |   -1.026847    .5669818
 Prev. frac. pop |         .0048503       |
      Odds ratio |         .9268626       |    .3889418    2.214373 (Cornfield)
                 +-------------------------------------------------
                               chi2(1) =     0.03  Pr>chi2 = 0.8678

The risk ratio for this study was 0.94 and the odds ratio was 0.93. In this sample, people who are left-handed were slightly less likely to use a Mac than people who are right-handed, but both confidence intervals include 1 (p = 0.87), so there is no evidence of an association.
Objectives for Survey Results Module 3 – Risk Factors for Sleep Difficulties
Choose a study design to examine the association between tea/coffee consumption before bed and sleep difficulties.
Calculate the appropriate measure of association comparing tea/coffee consumption before bed and sleep difficulties.
Consider confounding and effect modification by sex.
Consider confounding and effect modification by age.
Construct your own analysis to study the association between handedness and sleep difficulties. Consider confounding and effect modification by sex.
I. Choose a study design to examine the association between tea/coffee consumption before bed and sleep difficulties.
a. Study design: Cohort study
b. Exposure: Tea and coffee consumption two hours before bed
c. Outcome: Sleep difficulties that night
II. Calculate the appropriate measure of association comparing tea/coffee consumption before bed and sleep difficulties.
a. Measure of association: Risk ratio
b. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: sleepdiff
   iii. Exposed variable: caff2hrb4
   iv. Submit.
c. Command Window Syntax: cs sleepdiff caff2hrb4

                 | caff2hrb4              |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |        19          81  |        100
        Noncases |       102         389  |        491
-----------------+------------------------+------------
           Total |       121         470  |        591
                 |                        |
            Risk |  .1570248    .1723404  |   .1692047
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |        -.0153156       |   -.0885836    .0579524
      Risk ratio |         .9111315       |    .5763827    1.440294
 Prev. frac. ex. |         .0888685       |   -.4402942    .4236173
 Prev. frac. pop |         .0181947       |
                 +-------------------------------------------------
                               chi2(1) =     0.16  Pr>chi2 = 0.6886

Those who drank tea or coffee before bed had 0.91 times the risk of sleep difficulties compared to those who did not drink tea or coffee (95% CI 0.58 to 1.44).
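The chi2(1) line at the bottom of the cs output can be checked from the same four counts. The Python sketch below assumes it is the uncorrected Pearson chi-squared statistic for a 2x2 table, which matches the printed value here; the variable names are ours.

```python
import math

# 2x2 counts from the cs output (exposed = caffeine before bed, case = sleep difficulty).
a, b = 19, 81      # cases: exposed, unexposed
c, d = 102, 389    # noncases: exposed, unexposed
n = a + b + c + d

# Pearson chi-squared for a 2x2 table: N (ad - bc)^2 / (product of the four margins).
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Upper-tail probability of a chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2(1) = {chi2:.2f}, p = {p_value:.4f}")
```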
III. Consider confounding and effect modification by sex.
a. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: sleepdiff
   iii. Exposed variable: caff2hrb4
   iv. Go to the “Options” tab; click the box next to “stratify on variables”; use the dropdown menu to select “male”. Note: Under “Within-stratum weights” the button next to “Use Mantel-Haenszel” should be automatically selected.
   v. Submit.
b. Command Window Syntax: cs sleepdiff caff2hrb4, by(male)
            male |       RR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
              no |   .6689266    .3338178     1.34044     9.448399
             yes |   1.241935    .6697457    2.302969     7.068404
-----------------+-------------------------------------------------
           Crude |   .9111315    .5763827    1.440294
    M-H combined |   .9141471    .5777317    1.446458
-------------------------------------------------------------------
Test of homogeneity (M-H)      chi2(1) =     1.721  Pr>chi2 = 0.1895

The crude risk ratio is 0.911 while the Mantel-Haenszel adjusted risk ratio is 0.914. Since the crude and adjusted risk ratios are so similar, we can conclude that there is not strong confounding by sex in our study. Although the risk ratios for males and females seem different, the test of homogeneity (p = 0.19) gives no evidence of statistically significant effect modification by sex.
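The Mantel-Haenszel combined risk ratio is a weighted average of the stratum-specific risk ratios, using the weights shown in the M-H Weight column. The stratum cell counts are not printed, so the Python sketch below works only from the reported risk ratios and weights; the variable names are ours.

```python
# Stratum-specific risk ratios and M-H weights, copied from the stratified output.
strata = {
    "male = no":  (0.6689266, 9.448399),
    "male = yes": (1.241935, 7.068404),
}

# M-H combined RR = sum(weight * RR) / sum(weight).
weighted_sum = sum(rr * w for rr, w in strata.values())
total_weight = sum(w for _, w in strata.values())
rr_mh = weighted_sum / total_weight

print(f"M-H combined RR = {rr_mh:.4f}")
```

Because the female stratum carries the larger weight, the combined estimate sits somewhat closer to its risk ratio than a simple average of the two would.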
IV. Consider confounding and effect modification by age.
a. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: sleepdiff
   iii. Exposed variable: caff2hrb4
   iv. Go to the “Options” tab; click the box next to “stratify on variables”; use the dropdown menu to select “agecat”. Note: Under “Within-stratum weights” the button next to “Use Mantel-Haenszel” should be automatically selected.
   v. Submit.
b. Command Window Syntax: cs sleepdiff caff2hrb4, by(agecat)
          agecat |       RR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
   18-29 yrs old |   1.202553     .654413    2.209817        7.208
   30-39 yrs old |   .5083612    .1870783    1.381406     6.040404
   40-49 yrs old |   .8452381    .2120323    3.369427     1.976471
    >=50 yrs old |   1.305556    .3431283    4.967457     1.309091
-----------------+-------------------------------------------------
           Crude |   .9111315    .5763827    1.440294
    M-H combined |   .9143836    .5795498    1.442667
-------------------------------------------------------------------
Test of homogeneity (M-H)      chi2(3) =     2.389  Pr>chi2 = 0.4957

The crude risk ratio is 0.911 while the Mantel-Haenszel adjusted risk ratio is 0.914. Since the crude and adjusted risk ratios are so similar, there is not strong confounding by age category in our study. Despite the differences in the risk ratios by age category, there is no evidence of statistically significant effect modification by age category.
V. Construct your own analysis to study the association between handedness and sleep difficulties. Consider confounding and effect modification by sex.
a. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   ii. Case variable: sleepdiff
   iii. Exposed variable: lefty
   iv. Submit
b. Command Window Syntax: cs sleepdiff lefty
                 | lefty                  |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |         8          89  |         97
        Noncases |        36         440  |        476
-----------------+------------------------+------------
           Total |        44         529  |        573
                 |                        |
            Risk |  .1818182     .168242  |   .1692845
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         .0135762       |   -.1047616     .131914
      Risk ratio |         1.080695       |    .5614644    2.080098
 Attr. frac. ex. |         .0746692       |   -.7810567    .5192533
 Attr. frac. pop |         .0061583       |
                 +-------------------------------------------------
                               chi2(1) =     0.05  Pr>chi2 = 0.8175

People who are left-handed have a slightly higher (1.08-fold) risk of sleep difficulties compared to people who are right-handed, but the 95% confidence interval (0.56 to 2.08) includes 1.
c. Dropdown:
   vi. Statistics > Epidemiology and Related > Tables for Epidemiologists > Cohort study risk-ratio etc.
   vii. Case variable: sleepdiff
   viii. Exposed variable: lefty
   ix. Go to the “Options” tab; click the box next to “stratify on variables”; use the dropdown menu to select “male”. Note: Under “Within-stratum weights” the button next to “Use Mantel-Haenszel” should be automatically selected.
   x. Submit.
d. Command Window Syntax: cs sleepdiff lefty, by(male)
            male |       RR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
          female |   .9338374    .3691631    2.362241     3.918519
            male |   1.333333     .531464    3.345058          2.8
-----------------+-------------------------------------------------
           Crude |   1.080695    .5614644    2.080098
    M-H combined |   1.100331    .5725662    2.114564
-------------------------------------------------------------------
Test of homogeneity (M-H)      chi2(1) =     0.288  Pr>chi2 = 0.5918

Our results stratified by gender show slightly different risk ratios among males and females. Left-handed males have 1.33 times the risk of sleep difficulties compared to right-handed males, while left-handed females have 0.93 times the risk of sleep difficulties compared to right-handed females. The test of homogeneity (p = 0.59) gives no evidence of effect modification by sex, and the crude (1.08) and Mantel-Haenszel adjusted (1.10) risk ratios are similar, so there is little evidence of confounding by sex.
Objectives for Survey Results Module 4 – Risk Factors for Left and Right Handedness
Choose a study design to examine the association between mother’s age at birth of PH207x participant and handedness of the participant.
Calculate the appropriate measure of association comparing the mother’s age among those who are left-handed to those who are right-handed.
Construct your own analysis to study the association between having a left-handed parent and child’s handedness.
I. Choose a study design
a. Exposure: Mother’s age at birth of PH207x participant
b. Outcome: Handedness of PH207x participant
c. Study design: Case-control
II. Calculate the appropriate measure of association comparing the mother’s age among those who are left-handed to those who are right-handed.
a. Measure of association – Odds ratio
b. Calculating the odds ratio in Stata
c. Dropdown:
   i. Statistics > Epidemiology and Related > Tables for Epidemiologists > Case control odds ratio.
   ii. Case variable: lefty
   iii. Exposed variable: momagecat
   iv. Submit
d. Command Window Syntax: cc lefty momagecat
                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |         9          35  |         44       0.2045
        Controls |        57         474  |        531       0.1073
-----------------+------------------------+------------------------
           Total |        66         509  |        575       0.1148
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         2.138346       |    .8576737    4.829077 (exact)
 Attr. frac. ex. |         .5323488       |   -.1659446    .7929211 (exact)
 Attr. frac. pop |         .1088895       |
                 +-------------------------------------------------
                               chi2(1) =     3.78  Pr>chi2 = 0.0519
Mothers who are 35 years of age or older at the time of their child’s birth have 2.14 times the odds of having a left-handed child compared to mothers who were younger than 35 at the time of their child’s birth. We are 95% confident that the true odds ratio ranges from 0.86 to 4.83.
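The odds ratio and the two attributable fractions in the cc output can be recovered from the 2x2 counts. The Python sketch below uses the usual case-control formulas, with the odds ratio standing in for the relative risk; these formulas are our reading of the output, and they reproduce the printed point estimates. The exact confidence limits are not recomputed.

```python
# 2x2 counts from the cc output (exposed = mother aged >= 35, case = left-handed).
a, b = 9, 35       # cases: exposed, unexposed
c, d = 57, 474     # controls: exposed, unexposed

# Cross-product odds ratio.
odds_ratio = (a * d) / (b * c)

# Attributable fraction among the exposed: (OR - 1) / OR.
af_exposed = (odds_ratio - 1) / odds_ratio

# Attributable fraction in the population: AF among exposed x proportion of cases exposed.
af_population = af_exposed * a / (a + b)

print(f"OR = {odds_ratio:.3f}, AF exposed = {af_exposed:.3f}, AF population = {af_population:.3f}")
```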
III. Construct your own analysis to study the association between having at least one left-handed parent and child’s handedness.
. cc lefty parentlefty

                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |         6          38  |         44       0.1364
        Controls |        42         487  |        529       0.0794
-----------------+------------------------+------------------------
           Total |        48         525  |        573       0.0838
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         1.830827       |     .597326    4.708769 (exact)
 Attr. frac. ex. |         .4537988       |   -.6741276    .7876302 (exact)
 Attr. frac. pop |         .0618817       |
                 +-------------------------------------------------
                               chi2(1) =     1.72  Pr>chi2 = 0.1900
People with at least one left-handed parent have 1.83 times the odds of being left-handed compared to those without a left-handed parent.
Conclusions
The appropriate measure of association depends on the type of exposure and outcome of interest, the type of data available and the study design used to obtain the data.
In survey studies, one must always be concerned about issues of selection bias and generalizability.
Data Dictionary for Survey Dataset
Variable | Description | Values
board | In the past two weeks how often did you use the chat room for this course to post a question | "0", "2-3 times", "4 or more times", "Never", "Once"
male | Sex | 0 no (female); 1 yes (male); . missing
degree | Highest level of education | 1 pre-college / university degree; 2 bachelor degree; 3 masters degree; 4 doctoral degree
precollege | Highest level of education is Pre-College/University Degree | 0 no; 1 yes; . missing
masters | Highest level of education is Masters Degree | 0 no; 1 yes; . missing
doctorate | Highest level of education is Doctoral Degree | 0 no; 1 yes; . missing
macpc | Which type of computer do you use most of the time? | 0 pc; 1 mac; . missing
aptitude | Which is stronger, your math aptitude or your verbal aptitude? | 0 verbal; 1 math; . missing
caff2hrb4 | Did you drink coffee or tea within two hours of bedtime yesterday? | 0 no; 1 yes; . missing
sleepdiff | Did you have trouble sleeping last night? | 0 no; 1 yes; . missing
shower | Do you face the shower head? | 0 no; 1 yes; . missing
longhair | Do you consider your hair to be long? | 0 no; 1 yes; . missing
facialhair | If you are a man, do you have a beard, mustache, or goatee? | 0 no; 1 yes; . missing
agecat | Age of participant | 1 18-29 yrs old; 2 30-39 yrs old; 3 40-49 yrs old; 4 >=50 yrs old
momagecat | How old was your mother at your birth? | 0 <35 yrs old; 1 >=35 yrs old
lefty | Are you left-handed? | 0 righty; 1 lefty; . missing
dadlefty | Is your father left-handed? | 0 righty; 1 lefty; . missing
momlefty | Is your mother left-handed? | 0 righty; 1 lefty; . missing
parentlefty | Is one (or both) of your parents left-handed? | 0 No left-handed parents; 1 Left-handed parent; . missing
allhourscat | On average, how many hours per week did you spend on all aspects of this course? | 0 0-7 hours; 1 >=8 hours
hwkhourscat | On average, how many hours per week did you spend working on the homework assignments for this course? | 0 0-2 hours; 1 >=3 hours
comphrscat | For how many hours did you use your computer last night | 0 0-1 hours; 1 >=2 hours
Tutorial: Survival Analysis in Stata
In this tutorial, we use data from the Digitalis Investigation Group (DIG). Recall that the DIG trial was a randomized, double-blind, multicenter trial designed to examine the safety and efficacy of Digoxin in treating patients with congestive heart failure. In this trial, patients were randomized to either Digoxin or placebo. The log-rank test was used to compare overall mortality between the two groups.
To begin, open the dig.dta data set. Before we can do any analyses, we must first tell Stata that we are working with survival data (analogous to how we had to svyset our data and tell Stata that we were working with survey data). You can do this using the stset command. The command for this dataset is stset deathday, failure(death==1). This command tells Stata that our time-to-death variable is deathday; and a value of 1 for the death variable means that person died while any other value (in this case 0) means that person was censored. For survival data, we need at least two variables: 1) a variable for the time to the event and 2) a variable to indicate if the observation is censored or not.
1. Graph the Kaplan-Meier estimates of the survival curves for each treatment group.
. sts graph, by(trtmt)
Note: you can also list the values of the survival function using the sts list, by(trtmt) command.
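To see what the Kaplan-Meier estimates represent, it can help to compute them once by hand. The Python sketch below applies the product-limit formula, S(t) = product over event times of (1 - deaths / number at risk), to a small hypothetical dataset (not the DIG data). The function name kaplan_meier and the toy values are ours; in Stata, sts list produces the corresponding table.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time; events: 1 = death, 0 = censored."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censorings at t
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up times (days) and death indicators, for illustration only.
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(f"t = {t:>2}: S(t) = {s:.3f}")
```

Censored observations never lower the curve themselves; they only shrink the number at risk for later event times.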
2. In the New England Journal of Medicine paper (see handout NEJM_DIG), the authors plotted 1 – S(t) in Figure 1. Graph 1 – S(t) for each treatment group.
. sts graph, failure by(trtmt)
[Figure: Kaplan-Meier survival estimates by treatment group (trtmt = 0, trtmt = 1); x-axis: analysis time, 0 to 2000; y-axis: 0.00 to 1.00]
3. Conduct a log-rank test at the 0.05 level of significance to test the hypothesis that the survival distribution is the same in the two treatment groups. Use the following command:
. sts test trtmt, logrank
a. What are your null and alternative hypotheses? The null hypothesis is that the two groups have the same distribution of survival times. The alternative is that they do not.
b. What is the value of your test statistic? 0.00
c. What distribution does your test statistic have under the null hypothesis? Chi-squared distribution with 1 degree of freedom
d. What is your p-value?
0.9616. Note: using all 6800 observations yields a p-value of 0.8013. This is the p-value the authors reported in Figure 1.
[Figure: Kaplan-Meier failure estimates by treatment group (trtmt = 0, trtmt = 1); x-axis: analysis time, 0 to 2000; y-axis: 0.00 to 1.00]
e. What is your conclusion?
Since our p-value is greater than 0.05, we fail to reject the null hypothesis. Thus, we conclude that we do not have evidence that the distribution of survival times is different between the Digoxin group and the placebo group.
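For readers who want to see what the log-rank test computes, the Python sketch below implements the standard two-group version: at each death time it compares the observed deaths in one group with the number expected if both groups shared the same hazard, then refers the squared difference divided by its variance to a chi-squared distribution with 1 degree of freedom. The data are hypothetical (not the DIG data), the function name logrank_test is ours, and this is an illustration rather than a reproduction of the sts test output.

```python
import math

def logrank_test(times, events, groups):
    """Two-group log-rank chi-squared statistic (1 df) and p-value; groups coded 0/1."""
    observed = expected = variance = 0.0
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in death_times:
        n = sum(1 for ti in times if ti >= t)                              # at risk, both groups
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 1)  # at risk, group 1
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)   # deaths at t
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 1)                         # deaths in group 1
        observed += d1
        expected += d * n1 / n
        if n > 1:
            # Hypergeometric variance of the group 1 death count at time t.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = (observed - expected) ** 2 / variance
    # Upper tail of a chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical follow-up times, death indicators and group labels, for illustration only.
times = [6, 7, 10, 15, 19, 25]
events = [1, 0, 1, 1, 0, 1]
groups = [1, 1, 1, 0, 0, 0]
chi2, p_value = logrank_test(times, events, groups)
print(f"chi2(1) = {chi2:.3f}, p = {p_value:.3f}")
```

Swapping the two group labels leaves the statistic unchanged, which is why the test addresses only whether the survival distributions differ, not in which direction.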